Add detailed failure attributes to exporter send_failed metrics #14247
Conversation
Force-pushed from 82b21b3 to 17eb1c3
CodSpeed Performance Report: Merging #14247 will not alter performance.
Codecov Report ❌ Patch coverage is
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
- Coverage   92.17%   92.17%   -0.01%
==========================================
  Files         668      668
  Lines       41467    41523      +56
==========================================
+ Hits        38221    38272      +51
- Misses       2213     2215       +2
- Partials     1033     1036       +3
axw left a comment
@iblancasa I think this is a good direction, left a handful of suggestions
if strings.Contains(err.Error(), "no more retries left") {
	return "retries_exhausted"
}
Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.
Yes, but you want to know when that point is reached, so you can count it, right? With that, you can create alerts based on the number of exhausted retries.
Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.
Based on your collector configuration, you can infer that those would have been retried; and as you noted in the linked comment, the metric will only be updated after all retries are exhausted.
Let's consider two scenarios:
- Export fails due to an authentication error, which gets classified as a permanent error.
- Export fails due to a network error, which gets classified as a temporary error.
In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.
In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
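To make the two scenarios concrete, here is a minimal sketch of how an export error could be mapped to those attributes. The package and function names are made up for illustration; only the error.type and failure.permanent attribute names and consumererror.IsPermanent come from the discussion and the collector codebase:

```go
// Illustrative sketch only: derive error.type and failure.permanent from an
// export error, along the lines of the two scenarios above.
package failureattrs // hypothetical package name

import (
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"google.golang.org/grpc/status"
)

// failureAttributes returns the detailed attributes for a failed export.
func failureAttributes(err error) []attribute.KeyValue {
	// error.type: use the gRPC status code name when the error carries one,
	// e.g. "Unauthenticated" (scenario 1) or "Unavailable" (scenario 2).
	errType := "unknown"
	if s, ok := status.FromError(err); ok {
		errType = s.Code().String()
	}
	return []attribute.KeyValue{
		attribute.String("error.type", errType),
		// Permanent errors are not retried, so scenario 1 reports true and
		// scenario 2 reports false even after retries are exhausted.
		attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
	}
}
```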
Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.
To distinguish these, users need to know:
- Which error types are retryable (there is no clear consensus; see "Allow configuring retryable status codes" #14228)
- Whether retries were actually attempted (this can't be told from the error type alone)
Without this context, users can't tell whether failure.permanent=false means "retries were attempted and exhausted" or "the error wasn't retryable, so no retries were attempted."
I added a new failure.retries_exhausted attribute. With this, we have all the situations covered. What do you think?
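For illustration, a minimal sketch of how such an attribute could be computed, assuming (as in the snippet under review above) that retry exhaustion is signalled through the error message; the package and function names are hypothetical:

```go
package failureattrs // hypothetical package name

import (
	"strings"

	"go.opentelemetry.io/otel/attribute"
)

// retriesExhaustedAttr reports, as a metric attribute, whether the export
// failed only after all configured retries were used up, so that
// failure.permanent=false can be told apart from "never retried at all".
func retriesExhaustedAttr(err error) attribute.KeyValue {
	exhausted := err != nil && strings.Contains(err.Error(), "no more retries left")
	return attribute.Bool("failure.retries_exhausted", exhausted)
}
```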
Signed-off-by: Israel Blancas <[email protected]>
Sorry for the force-push, but I had some issues with the CI after merging.
Signed-off-by: Israel Blancas <[email protected]>
axw left a comment
Sorry for the delay; I thought I had hit send already.
type httpStatusCoder interface {
	HTTPStatusCode() int
}
What implements this? I think mostly we go the other way, propagate gRPC status through errors, and convert them to HTTP status codes.
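For reference, a small sketch of that direction: pull the gRPC status out of the error and map it to an HTTP status code. The mapping is deliberately partial and illustrative, not the collector's actual conversion table, and the package and function names are hypothetical:

```go
package failureattrs // hypothetical package name

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromError converts the gRPC status attached to err into an HTTP
// status code; only a handful of codes are mapped here for illustration.
func httpStatusFromError(err error) int {
	s, ok := status.FromError(err)
	if !ok || s == nil {
		return http.StatusInternalServerError
	}
	switch s.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}
```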
Signed-off-by: Israel Blancas <[email protected]>
Description
Adds failure.reason and failure.permanent attributes in detailed mode to the otelcol_exporter_send_failed_<signal> metrics.
Suggested here: #13957 (comment)
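For illustration, a rough sketch of what this implies when recording the counter, written directly against the OpenTelemetry Go metric API; the collector's real telemetry code is generated, and the function and parameter names below are assumptions:

```go
package failureattrs // hypothetical package name

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed adds to the send-failed counter (e.g.
// otelcol_exporter_send_failed_spans) and, when detailed metrics are enabled,
// attaches the failure.reason and failure.permanent attributes.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, failed int64, reason string, permanent, detailed bool) {
	opts := []metric.AddOption{}
	if detailed {
		opts = append(opts, metric.WithAttributes(
			attribute.String("failure.reason", reason),
			attribute.Bool("failure.permanent", permanent),
		))
	}
	counter.Add(ctx, failed, opts...)
}
```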
Link to tracking issue
Fixes #13956