
Conversation

@iblancasa
Contributor

Description

  • Added failure.reason and failure.permanent attributes in detailed mode to otelcol_exporter_send_failed_<signal> metrics

Suggested here #13957 (comment)

Link to tracking issue

Fixes #13956
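
For illustration, a minimal sketch of what recording these attributes could look like with the OpenTelemetry Go metrics API. The attribute names and counter name come from the description above; the helper, the way the counter is obtained, and the omission of the detailed-mode check are assumptions, not the actual exporterhelper wiring.

// Hypothetical sketch: attribute names from the PR description; the helper
// and counter setup are stand-ins, not the real exporterhelper code.
// The description says the attributes are only added in detailed mode;
// that check is omitted here.
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func recordSendFailedSpans(ctx context.Context, numSpans int64, reason string, permanent bool) {
	meter := otel.Meter("exporterhelper-sketch")
	counter, _ := meter.Int64Counter("otelcol_exporter_send_failed_spans")
	counter.Add(ctx, numSpans, metric.WithAttributes(
		attribute.String("failure.reason", reason),     // e.g. "Unavailable"
		attribute.Bool("failure.permanent", permanent), // true for non-retryable errors
	))
}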

@codspeed-hq

codspeed-hq bot commented Dec 2, 2025

CodSpeed Performance Report

Merging #14247 will not alter performance

Comparing iblancasa:13956-2 (bff292f) with main (a330ae2)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

✅ 59 untouched
⏩ 20 skipped [1]

Footnotes

  1. 20 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@codecov

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 95.58824% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.17%. Comparing base (a0cbea7) to head (bff292f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
...orter/exporterhelper/internal/obs_report_sender.go | 94.11% | 1 Missing and 1 Partial ⚠️
exporter/exporterhelper/internal/retry_sender.go | 90.90% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
- Coverage   92.17%   92.17%   -0.01%     
==========================================
  Files         668      668              
  Lines       41467    41523      +56     
==========================================
+ Hits        38221    38272      +51     
- Misses       2213     2215       +2     
- Partials     1033     1036       +3     

☔ View full report in Codecov by Sentry.

Contributor

@axw axw left a comment


@iblancasa I think this is a good direction, left a handful of suggestions

Comment on lines 183 to 185
if strings.Contains(err.Error(), "no more retries left") {
return "retries_exhausted"
}
Contributor


Suggested change (remove these lines):
if strings.Contains(err.Error(), "no more retries left") {
return "retries_exhausted"
}

Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.

Contributor Author


Yes, but you want to know when that point is reached, so you can count it, right? With that, you can create alerts based on the number of exhausted retries.

Contributor


Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.

Based on your collector configuration, you can infer that those would have been retried; and, as you noted in the linked comment, the metric will only be updated after all retries are exhausted.

Let's consider two scenarios:

  1. Export fails due to an authentication error, which gets classified as a permanent error.
  2. Export fails due to a network error, which gets classified as a temporary error.

In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.

In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
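
To make the two scenarios concrete, here is a minimal sketch, assuming the export error carries a gRPC status and that consumererror.IsPermanent is what decides failure.permanent; the helper itself is hypothetical, not the PR's code.

package sketch

import (
	"go.opentelemetry.io/collector/consumer/consumererror"
	"google.golang.org/grpc/status"
)

// failureAttributes derives the attribute values discussed above from an export error.
func failureAttributes(err error) (errorType string, permanent bool) {
	errorType = "unknown"
	if st, ok := status.FromError(err); ok && st != nil {
		errorType = st.Code().String() // e.g. "Unauthenticated", "Unavailable"
	}
	// Scenario 1: an auth error wrapped as permanent -> a single attempt,
	// failure.permanent=true. Scenario 2: a network error -> retried, and
	// only after retries are exhausted is the metric updated, with
	// failure.permanent=false.
	permanent = consumererror.IsPermanent(err)
	return errorType, permanent
}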

Contributor Author


Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.

To distinguish these, users need to know:

  1. Which error types are retryable (there is no clear consensus; see "Allow configuring retryable status codes", #14228)
  2. Whether retries were actually attempted (can't tell from just the error type)

Without this context, users can't tell whether failure.permanent=false means "retries were attempted and exhausted" or "the error wasn't retryable, so no retries were attempted."

Contributor Author


I added a new failure.retries_exhausted attribute. With this, all the situations are covered. What do you think?
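
For reference, a rough sketch of how the three outcomes could then be distinguished once that attribute exists; the attribute names follow the discussion, but the helper is hypothetical and may not match the final implementation.

package sketch

import "go.opentelemetry.io/otel/attribute"

// failureAttrs is a hypothetical helper; the combinations it produces are:
//   permanent=true,  retries_exhausted=false -> permanent error, single attempt
//   permanent=false, retries_exhausted=true  -> retried until the configured limit
//   permanent=false, retries_exhausted=false -> not retried (e.g. retry disabled or not retryable)
func failureAttrs(permanent, retriesExhausted bool) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.Bool("failure.permanent", permanent),
		attribute.Bool("failure.retries_exhausted", retriesExhausted),
	}
}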

@iblancasa
Contributor Author

Sorry for the force-push, but I ran into some CI issues after merging main into my branch and noticed some of them were related to the merge.

Signed-off-by: Israel Blancas <[email protected]>
Contributor

@axw axw left a comment


Sorry for the delay, I thought I had hit send already

Comment on lines 202 to 204
type httpStatusCoder interface {
HTTPStatusCode() int
}
Contributor


What implements this? I think we mostly go the other way: propagate gRPC status through errors and convert them to HTTP status codes.
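
A sketch of the direction described here, assuming errors carry a gRPC status: extract the status from the error and, only if an HTTP-facing value is needed, map the code to an HTTP status. The helper name and the mapping shown are illustrative only, not an existing API.

package sketch

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromErr is a hypothetical helper mapping a gRPC status carried by
// err to an HTTP status code, rather than relying on an httpStatusCoder interface.
func httpStatusFromErr(err error) int {
	if err == nil {
		return http.StatusOK
	}
	st, _ := status.FromError(err) // falls back to codes.Unknown for plain errors
	switch st.Code() {
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	default:
		return http.StatusInternalServerError
	}
}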




Development

Successfully merging this pull request may close these issues.

[exporter/exporterhelper] Add exporter retry-drop metrics
