Description
Component(s)
exporter/exporterhelper
Is your feature request related to a problem? Please describe.
Currently, configuring OTel exporters (e.g., otlphttp, otlpgrpc, elasticsearch, loki) to be resilient without overwhelming downstream services is a significant challenge. Users must manually tune static concurrency limits, typically sending_queue.num_consumers.
This creates a "vicious loop" for operators:
- Set concurrency too high: The collector can easily overwhelm a downstream service (like Elasticsearch or a custom OTLP receiver), leading to HTTP 429 (Too Many Requests) / gRPC RESOURCE_EXHAUSTED errors, dropped data, and potential cascading failures.
- Set concurrency too low: The collector under-utilizes the downstream service's capacity, leading to wasted resources, increased buffer usage (high memory/disk), and higher end-to-end latency.
This static limit is a "blunt instrument" for a dynamic problem. The optimal concurrency rate is not static; it changes constantly based on:
- The number of collector instances being deployed (e.g., in a Kubernetes HPA).
- The current capacity of the downstream service (e.g., an Elasticsearch cluster scaling up or down).
- The real-time volume of telemetry data being sent.
Operators are forced to "chase the dragon" by constantly re-tuning this static value, or they must provision backends to handle a worst-case scenario that may rarely occur, which is expensive.
Describe the solution you'd like
I propose implementing an Adaptive Request Concurrency (ARC) mechanism within the exporterhelper to support both HTTP and gRPC-based exporters.
This feature would dynamically and automatically adjust the exporter's concurrency level (sending_queue.num_consumers) based on real-time feedback from the downstream service. The mechanism would be inspired by TCP congestion control algorithms (AIMD - Additive Increase, Multiplicative Decrease).
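As a rough illustration of the AIMD idea, here is a minimal Go sketch of the core limit-update rule. Nothing here exists in exporterhelper today; the aimdLimiter name, fields, and defaults are assumptions for illustration only.

```go
package arc

import "math"

// aimdLimiter is a hypothetical controller holding the current concurrency
// limit (the effective num_consumers). Names and defaults are illustrative.
type aimdLimiter struct {
	limit         float64 // current concurrency limit
	minLimit      float64 // floor, e.g. min_concurrency: 1
	maxLimit      float64 // ceiling, e.g. max_concurrency: 100
	decreaseRatio float64 // multiplicative decrease factor, e.g. decrease_ratio: 0.9
}

// onSuccess is the additive-increase step: grow the limit linearly by one.
func (l *aimdLimiter) onSuccess() {
	l.limit = math.Min(l.limit+1, l.maxLimit)
}

// onBackpressure is the multiplicative-decrease step: shrink the limit by a factor.
func (l *aimdLimiter) onBackpressure() {
	l.limit = math.Max(l.limit*l.decreaseRatio, l.minLimit)
}
```

The protocol-specific pieces below would only decide when to apply these two steps.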
The core logic would be tailored to the protocol:
For HTTP Exporters (e.g., otlphttp, elasticsearch)
- Monitor key signals:
  - Round-Trip Time (RTT) of requests. An Exponentially Weighted Moving Average (EWMA) could be used to establish a baseline RTT.
  - HTTP Response Codes: Specifically looking for success (2xx) vs. backpressure signals (429, 503, or other 5xx errors).
- Implement AIMD Logic (a sketch follows this list):
  - Additive Increase: If RTT is stable or decreasing AND HTTP responses are consistently successful (2xx), the collector should linearly increase its concurrency limit.
  - Multiplicative Decrease: If RTT starts to increase significantly (e.g., current_rtt > baseline_rtt * rtt_threshold_ratio) OR the exporter receives backpressure signals (429, 503), the collector should multiplicatively decrease its concurrency limit.
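A sketch of how the HTTP-side feedback could be classified, assuming a hypothetical EWMA baseline and the rtt_threshold_ratio knob from the proposed config; the httpFeedback type and its fields are assumptions, not existing exporterhelper API.

```go
package arc

import (
	"net/http"
	"time"
)

// httpFeedback keeps an EWMA baseline of request RTT and classifies each
// completed request as an "increase" or "decrease" signal for the AIMD
// controller. Field names mirror the proposed config knobs.
type httpFeedback struct {
	ewmaRTT      time.Duration // smoothed baseline RTT
	alpha        float64       // EWMA smoothing factor, e.g. 0.2
	rttThreshold float64       // rtt_threshold_ratio, e.g. 1.1
}

// observe records one request and reports whether the limiter should back off
// (multiplicative decrease); otherwise it may increase additively.
func (f *httpFeedback) observe(statusCode int, rtt time.Duration) (decrease bool) {
	// Explicit backpressure (429, 503, or other 5xx) always triggers a decrease.
	if statusCode == http.StatusTooManyRequests || statusCode >= http.StatusInternalServerError {
		return true
	}
	// Latency signal: decrease when RTT exceeds the EWMA baseline by the threshold ratio.
	if f.ewmaRTT > 0 && float64(rtt) > float64(f.ewmaRTT)*f.rttThreshold {
		decrease = true
	}
	// Fold the new sample into the EWMA baseline.
	if f.ewmaRTT == 0 {
		f.ewmaRTT = rtt
	} else {
		f.ewmaRTT = time.Duration(f.alpha*float64(rtt) + (1-f.alpha)*float64(f.ewmaRTT))
	}
	return decrease
}
```

A decrease signal here would drive the onBackpressure() step of the limiter sketched earlier; otherwise onSuccess() applies.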
For gRPC Exporters (e.g., otlpgrpc)
gRPC (built on HTTP/2) has native flow control for network-level backpressure, but this proposal addresses application-level backpressure (e.g., the receiving server's application logic is overwhelmed). The signals for this are explicit gRPC status codes.
- Monitor key signals:
  - gRPC Status Codes: This is the primary signal.
    - Success: OK (Code 0).
    - Backpressure Signals: RESOURCE_EXHAUSTED (Code 8, the gRPC equivalent of HTTP 429) and UNAVAILABLE (Code 14, the gRPC equivalent of HTTP 503).
- Implement AIMD Logic (a sketch follows this list):
  - Additive Increase: On consistent OK responses, the collector should linearly increase its concurrency limit (the number of concurrent streams, controlled by num_consumers).
  - Multiplicative Decrease: On receiving RESOURCE_EXHAUSTED or UNAVAILABLE status codes, the collector should multiplicatively decrease its concurrency limit.
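For the gRPC path, a sketch of the signal classification using google.golang.org/grpc/codes and status; the surrounding controller and the classifyGRPC helper are assumptions for illustration.

```go
package arc

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyGRPC maps the error returned by an export RPC to an AIMD signal:
// true means backpressure (multiplicative decrease); false means no
// backpressure, and consistent false results allow additive increase.
func classifyGRPC(err error) (decrease bool) {
	if err == nil {
		// codes.OK: a success signal.
		return false
	}
	switch status.Code(err) {
	case codes.ResourceExhausted, codes.Unavailable:
		// Application-level backpressure: the gRPC equivalents of HTTP 429 and 503.
		return true
	default:
		// Other errors (e.g. InvalidArgument) say nothing about downstream
		// capacity, so this sketch does not treat them as backpressure.
		return false
	}
}
```

The same limiter from the earlier sketch would be reused; only the source of the increase/decrease signal differs between HTTP and gRPC.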
This combined approach creates a feedback loop that automatically "finds" the optimal concurrency level that the downstream service can handle at any given moment, maximizing throughput while ensuring reliability for all major OTLP exporters.
Proposed Configuration
This feature could be added to the sending_queue settings, where it would be leveraged by any exporter using the queue (both gRPC and HTTP).
Example 1: Simple toggle
```yaml
exporters:
  otlphttp:
    endpoint: "http://my-backend:4318"
    sending_queue:
      enabled: true
      queue_size: 1000
      num_consumers: adaptive # New "adaptive" keyword
```

Example 2: Detailed configuration block (preferred)
This would allow users to set boundaries and tune the algorithm if needed, while num_consumers would be the static alternative. This single config structure would work for both otlphttp and otlpgrpc.
```yaml
exporters:
  otlphttp:
    endpoint: "http://my-backend:4318"
    sending_queue:
      enabled: true
      queue_size: 1000
      # num_consumers: 10 # This would be ignored if adaptive_concurrency is enabled
      adaptive_concurrency:
        enabled: true
        min_concurrency: 1   # Optional: The floor for concurrency
        max_concurrency: 100 # Optional: The ceiling for concurrency
        # Optional: Algorithm tuning parameters with sane defaults
        # decrease_ratio: 0.9      # Factor to multiply by on "decrease" signal
        # rtt_threshold_ratio: 1.1 # e.g., trigger decrease if RTT > 110% of baseline (HTTP only)
```
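For reference, a minimal sketch of a Go settings struct this block could map to inside exporterhelper; the struct, its fields, and the mapstructure tags are assumptions, not existing API.

```go
package arc

// AdaptiveConcurrencySettings is a hypothetical Go counterpart of the proposed
// adaptive_concurrency block; none of these fields exist in exporterhelper today.
type AdaptiveConcurrencySettings struct {
	Enabled           bool    `mapstructure:"enabled"`
	MinConcurrency    int     `mapstructure:"min_concurrency"`     // optional floor, e.g. 1
	MaxConcurrency    int     `mapstructure:"max_concurrency"`     // optional ceiling, e.g. 100
	DecreaseRatio     float64 `mapstructure:"decrease_ratio"`      // factor applied on a decrease signal, e.g. 0.9
	RTTThresholdRatio float64 `mapstructure:"rtt_threshold_ratio"` // HTTP only, e.g. 1.1
}
```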
Describe alternatives you've considered
The alternative is the current state: manual, static tuning of num_consumers. This is inefficient, error-prone, and adds significant operational overhead, as described in the problem statement.
Additional context
This proposal is heavily inspired by Vector's "Adaptive Request Concurrency" (ARC) feature, which solves this exact problem for its HTTP-based sinks. Vector's implementation (itself inspired by work done at Netflix) has proven to be extremely effective at improving reliability and performance.
By adopting a similar pattern, the OTel Collector would become a "better infrastructure citizen" out-of-the-box, reducing the tuning burden on users and making OTel-based pipelines more resilient to downstream slowdowns or failures.