Add detailed failure attributes to exporter send_failed metrics #14247
Conversation
Force-pushed from 82b21b3 to 17eb1c3
CodSpeed Performance Report: Merging #14247 will not alter performance.
Codecov Report ❌ Patch coverage is
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
- Coverage   92.17%   92.17%   -0.01%
==========================================
  Files         668      668
  Lines       41467    41523      +56
==========================================
+ Hits        38221    38272      +51
- Misses       2213     2215       +2
- Partials     1033     1036       +3
axw left a comment
@iblancasa I think this is a good direction, left a handful of suggestions
if strings.Contains(err.Error(), "no more retries left") {
	return "retries_exhausted"
}
Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.
Yes, but you want to know when that point is reached, so you can count it, right? With that, you can create alerts based on the number of exhausted retries.
Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.
Based on your collector configuration, you can infer that those would have been retried; and as you noted in the linked comment, the metric will only be updated after all retries are exhausted.
Let's consider two scenarios:
- Export fails due to an authentication error, which gets classified as a permanent error.
- Export fails due to a network error, which gets classified as a temporary error.
In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.
In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
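To make the two scenarios concrete, here is a minimal sketch of how an export error could be mapped to those attributes. The package and function names are made up for illustration; only the error.type and failure.permanent attribute names and consumererror.IsPermanent come from the discussion and the collector codebase:

```go
// Illustrative sketch only: derive error.type and failure.permanent from an
// export error, along the lines of the two scenarios above.
package failureattrs // hypothetical package name

import (
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"google.golang.org/grpc/status"
)

// failureAttributes returns the detailed attributes for a failed export.
func failureAttributes(err error) []attribute.KeyValue {
	// error.type: use the gRPC status code name when the error carries one,
	// e.g. "Unauthenticated" (scenario 1) or "Unavailable" (scenario 2).
	errType := "unknown"
	if s, ok := status.FromError(err); ok {
		errType = s.Code().String()
	}
	return []attribute.KeyValue{
		attribute.String("error.type", errType),
		// Permanent errors are not retried, so scenario 1 reports true and
		// scenario 2 reports false even after retries are exhausted.
		attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
	}
}
```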
Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.
To distinguish these, users need to know:
- Which error types are retryable (there is no clear consensus; see "Allow configuring retryable status codes" #14228)
- Whether retries were actually attempted (this can't be told from the error type alone)
Without this context, users can't tell whether failure.permanent=false means "retries were attempted and exhausted" or "the error wasn't retryable, so no retries were attempted."
I added a new failure.retries_exhausted attribute. With this, we have all the situations covered. What do you think?
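For illustration, a minimal sketch of how such an attribute could be computed, assuming (as in the snippet under review above) that retry exhaustion is signalled through the error message; the package and function names are hypothetical:

```go
package failureattrs // hypothetical package name

import (
	"strings"

	"go.opentelemetry.io/otel/attribute"
)

// retriesExhaustedAttr reports, as a metric attribute, whether the export
// failed only after all configured retries were used up, so that
// failure.permanent=false can be told apart from "never retried at all".
func retriesExhaustedAttr(err error) attribute.KeyValue {
	exhausted := err != nil && strings.Contains(err.Error(), "no more retries left")
	return attribute.Bool("failure.retries_exhausted", exhausted)
}
```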
Signed-off-by: Israel Blancas <[email protected]>
Sorry for the force-push, but I had some issues with the CI after merging.
Signed-off-by: Israel Blancas <[email protected]>
axw left a comment
Sorry for the delay; I thought I had hit send already.
type httpStatusCoder interface {
	HTTPStatusCode() int
}
What implements this? I think mostly we go the other way, propagate gRPC status through errors, and convert them to HTTP status codes.
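For reference, a small sketch of that direction: pull the gRPC status out of the error and map it to an HTTP status code. The mapping is deliberately partial and illustrative, not the collector's actual conversion table, and the package and function names are hypothetical:

```go
package failureattrs // hypothetical package name

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromError converts the gRPC status attached to err into an HTTP
// status code; only a handful of codes are mapped here for illustration.
func httpStatusFromError(err error) int {
	s, ok := status.FromError(err)
	if !ok || s == nil {
		return http.StatusInternalServerError
	}
	switch s.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}
```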
Signed-off-by: Israel Blancas <[email protected]>
Description
Adds failure.reason and failure.permanent attributes in detailed mode to the otelcol_exporter_send_failed_<signal> metrics.
Suggested here: #13957 (comment)
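For illustration, a rough sketch of what this implies when recording the counter, written directly against the OpenTelemetry Go metric API; the collector's real telemetry code is generated, and the function and parameter names below are assumptions:

```go
package failureattrs // hypothetical package name

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed adds to the send-failed counter (e.g.
// otelcol_exporter_send_failed_spans) and, when detailed metrics are enabled,
// attaches the failure.reason and failure.permanent attributes.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, failed int64, reason string, permanent, detailed bool) {
	opts := []metric.AddOption{}
	if detailed {
		opts = append(opts, metric.WithAttributes(
			attribute.String("failure.reason", reason),
			attribute.Bool("failure.permanent", permanent),
		))
	}
	counter.Add(ctx, failed, opts...)
}
```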
Link to tracking issue
Fixes #13956