Implement Adaptive Request Concurrency (ARC) for HTTP and gRPC Exporters #14080

@raghu999

Description

Component(s)

exporter/exporterhelper

Is your feature request related to a problem? Please describe.

Currently, configuring OTel exporters (e.g., otlphttp, otlpgrpc, elasticsearch, loki) to be resilient without overwhelming downstream services is a significant challenge. Users must manually tune static concurrency limits, typically sending_queue.num_consumers.

This creates a "vicious cycle" for operators:

  • Set concurrency too high: The collector can easily overwhelm a downstream service (like Elasticsearch or a custom OTLP receiver), leading to HTTP 429 (Too Many Requests) / gRPC RESOURCE_EXHAUSTED errors, dropped data, and potential cascading failures.
  • Set concurrency too low: The collector under-utilizes the downstream service's capacity, leading to wasted resources, increased buffer usage (high memory/disk), and higher end-to-end latency.

This static limit is a "blunt instrument" for a dynamic problem. The optimal concurrency level is not static; it changes constantly based on:

  1. The number of collector instances deployed (e.g., when scaled by a Kubernetes HPA).
  2. The current capacity of the downstream service (e.g., an Elasticsearch cluster scaling up or down).
  3. The real-time volume of telemetry data being sent.

Operators are forced to chase a moving target, constantly re-tuning this static value, or they must provision backends for a worst-case scenario that may rarely occur, which is expensive.

Describe the solution you'd like

I propose implementing an Adaptive Request Concurrency (ARC) mechanism within the exporterhelper to support both HTTP and gRPC-based exporters.

This feature would dynamically and automatically adjust the exporter's concurrency level (sending_queue.num_consumers) based on real-time feedback from the downstream service. The mechanism would be inspired by TCP congestion control algorithms (AIMD - Additive Increase, Multiplicative Decrease).
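
At its core, AIMD reduces to a small update rule. A minimal sketch in Go (the aimdLimiter type and all of its names are illustrative, not an existing exporterhelper API):

package arc

import "math"

// aimdLimiter sketches the core AIMD rule: add a small constant on success,
// multiply by a ratio below 1.0 on a backpressure signal.
type aimdLimiter struct {
    limit         float64 // current effective concurrency limit
    minLimit      float64 // floor (min_concurrency)
    maxLimit      float64 // ceiling (max_concurrency)
    decreaseRatio float64 // e.g., 0.9
}

// onSuccess is called after a request completes without a backpressure signal.
func (l *aimdLimiter) onSuccess() {
    l.limit = math.Min(l.limit+1, l.maxLimit) // additive increase
}

// onBackpressure is called on 429/503 (HTTP) or RESOURCE_EXHAUSTED/UNAVAILABLE (gRPC).
func (l *aimdLimiter) onBackpressure() {
    l.limit = math.Max(l.limit*l.decreaseRatio, l.minLimit) // multiplicative decrease
}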

The core logic would be tailored to the protocol:


For HTTP Exporters (e.g., otlphttp, elasticsearch)

  • Monitor key signals:
    • Round-Trip Time (RTT) of requests. An Exponentially Weighted Moving Average (EWMA) could be used to establish a baseline RTT.
    • HTTP Response Codes: Specifically looking for success (2xx) vs. backpressure signals (429, 503, or other 5xx errors).
  • Implement AIMD Logic:
    • Additive Increase: If RTT is stable or decreasing AND HTTP responses are consistently successful (2xx), the collector should linearly increase its concurrency limit.
    • Multiplicative Decrease: If RTT increases significantly (e.g., current_rtt > baseline_rtt * rtt_threshold_ratio) OR the exporter receives backpressure signals (429, 503), the collector should multiplicatively decrease its concurrency limit (see the sketch after this list).
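
The two HTTP signals above could feed the decrease decision roughly as follows; this is a sketch only, and the smoothing factor, tracker type, and shouldDecrease function are hypothetical names:

package arc

import (
    "net/http"
    "time"
)

// rttTracker keeps an EWMA of observed round-trip times as the baseline RTT.
type rttTracker struct {
    alpha    float64       // smoothing factor, e.g., 0.1
    baseline time.Duration // EWMA of past RTTs
}

func (t *rttTracker) observe(rtt time.Duration) {
    if t.baseline == 0 {
        t.baseline = rtt
        return
    }
    t.baseline = time.Duration(t.alpha*float64(rtt) + (1-t.alpha)*float64(t.baseline))
}

// shouldDecrease combines the two signals from the list above: an explicit
// backpressure status code (429 or 5xx), or RTT degrading past the baseline.
func shouldDecrease(statusCode int, rtt time.Duration, t *rttTracker, rttThresholdRatio float64) bool {
    backpressure := statusCode == http.StatusTooManyRequests || statusCode >= 500
    rttDegraded := float64(rtt) > float64(t.baseline)*rttThresholdRatio
    return backpressure || rttDegraded
}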

For gRPC Exporters (e.g., otlpgrpc)

gRPC (built on HTTP/2) has native flow control for network-level backpressure, but this proposal addresses application-level backpressure (e.g., the receiving server's application logic is overwhelmed). The signals for this are explicit gRPC status codes.

  • Monitor key signals:
    • gRPC Status Codes: This is the primary signal.
      • Success: OK (Code 0)
      • Backpressure Signals: RESOURCE_EXHAUSTED (Code 8, the gRPC equivalent of HTTP 429) and UNAVAILABLE (Code 14, the gRPC equivalent of HTTP 503).
  • Implement AIMD Logic:
    • Additive Increase: On consistent OK responses, the collector should linearly increase its concurrency limit (the number of concurrent streams, controlled by num_consumers).
    • Multiplicative Decrease: On receiving RESOURCE_EXHAUSTED or UNAVAILABLE status codes, the collector should multiplicatively decrease its concurrency limit (see the sketch after this list).
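
These status codes map cleanly onto the AIMD signals. A sketch using the standard grpc-go status and codes packages (only the classify function itself is hypothetical):

package arc

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// classify maps a gRPC call result onto the AIMD feedback signals: OK feeds
// additive increase, RESOURCE_EXHAUSTED/UNAVAILABLE feed multiplicative
// decrease, and anything else is treated as neutral in this sketch.
func classify(err error) (increase, decrease bool) {
    switch status.Code(err) {
    case codes.OK:
        return true, false
    case codes.ResourceExhausted, codes.Unavailable:
        return false, true
    default:
        return false, false
    }
}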

This combined approach creates a feedback loop that automatically "finds" the optimal concurrency level that the downstream service can handle at any given moment, maximizing throughput while ensuring reliability for all major OTLP exporters.

Proposed Configuration

This feature could be added to the sending_queue settings, where it would be leveraged by any exporter using the queue (both gRPC and HTTP).

Example 1: Simple toggle

exporters:
  otlphttp:
    endpoint: "http://my-backend:4318"
    sending_queue:
      enabled: true
      queue_size: 1000
      num_consumers: adaptive # New "adaptive" keyword

Example 2: Detailed configuration block (preferred)

This would allow users to set boundaries and tune the algorithm if needed, while num_consumers would remain the static alternative. This single config structure would work for both otlphttp and otlpgrpc; a sketch of the Go struct this block could map to follows the example.

exporters:
  otlphttp:
    endpoint: "http://my-backend:4318"
    sending_queue:
      enabled: true
      queue_size: 1000
      # num_consumers: 10 # This would be ignored if adaptive_concurrency is enabled
      adaptive_concurrency:
        enabled: true
        min_concurrency: 1      # Optional: The floor for concurrency
        max_concurrency: 100    # Optional: The ceiling for concurrency
        # Optional: Algorithm tuning parameters with sane defaults
        # decrease_ratio: 0.9       # Factor to multiply by on "decrease" signal
        # rtt_threshold_ratio: 1.1  # e.g., trigger decrease if RTT > 110% of baseline (HTTP only)
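
For illustration, the adaptive_concurrency block above might map onto a Go config struct along these lines (the struct and field names are hypothetical, not part of the current exporterhelper API; the tags follow the collector's usual mapstructure convention):

package arc

// AdaptiveConcurrencySettings mirrors the proposed YAML block above.
type AdaptiveConcurrencySettings struct {
    Enabled           bool    `mapstructure:"enabled"`
    MinConcurrency    int     `mapstructure:"min_concurrency"`
    MaxConcurrency    int     `mapstructure:"max_concurrency"`
    DecreaseRatio     float64 `mapstructure:"decrease_ratio"`
    RTTThresholdRatio float64 `mapstructure:"rtt_threshold_ratio"` // HTTP only
}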

Describe alternatives you've considered

The alternative is the current state: manual, static tuning of num_consumers. This is inefficient, error-prone, and adds significant operational overhead, as described in the problem statement.

Additional context

This proposal is heavily inspired by Vector's "Adaptive Request Concurrency" (ARC) feature, which solves this exact problem for its HTTP-based sinks. Vector's implementation (itself inspired by work done at Netflix) has proven to be extremely effective at improving reliability and performance.

By adopting a similar pattern, the OTel Collector would become a "better infrastructure citizen" out-of-the-box, reducing the tuning burden on users and making OTel-based pipelines more resilient to downstream slowdowns or failures.

