FYI: SEV-lite — retry-induced tail amplification during upstream 5xx burst #1404

@Sasha-main

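Brief tail-latency instability observed during upstream turbulence: a short p95/p99 expansion alongside transient 5xx bursts, with retries amplifying queue length. The retry loop is likely self-exciting during provider error spikes, since each failed request adds retry load to an upstream that is already failing.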

Observed via public status at 2025-11-01 18:53 IST (13:23 UTC, 14:23 CET/Berlin), which is when I first detected the problem.

Minimal rollback guardrail: patch the Helm/gateway config with jittered exponential backoff, tighter upstream timeouts on 5xx/429, and a lower max in-flight limit per tenant. Add a rollback trigger that fires when p95 latency or error rate breaches its threshold over a short moving window.
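To make the backoff part of the guardrail concrete, here is a minimal sketch of capped exponential backoff with full jitter. It is not the actual patch: the names `backoff_delays`, `call_with_retries`, and `RETRYABLE_STATUS`, the base/cap values, and the set of retryable status codes are all assumptions for illustration, and `send` stands in for whatever performs the upstream call and returns an HTTP status code.

```python
import random
import time

# Assumed set of retryable upstream statuses (hypothetical, tune per provider).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delays(base=0.2, cap=5.0, max_retries=4, rng=None):
    """Yield one sleep duration per retry: capped exponential with full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        # The ceiling doubles each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so retries de-synchronize.
        yield rng.uniform(0.0, ceiling)


def call_with_retries(send, max_retries=4, sleep=time.sleep, rng=None):
    """Call send() and retry on retryable statuses, sleeping between attempts."""
    status = send()
    for delay in backoff_delays(max_retries=max_retries, rng=rng):
        if status not in RETRYABLE_STATUS:
            break
        sleep(delay)
        status = send()
    return status
```

Randomizing the full delay rather than adding a small offset spreads retries from many clients across the whole interval, which is what keeps a provider error spike from being met with synchronized retry waves.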

The goal is to contain queue growth under provider wobble, suppress tail amplification, and preserve autoscaler stability under intermittent upstream failures.
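The moving-window rollback trigger from the guardrail could look roughly like the sketch below. The class name `RollbackTrigger`, the window length, both thresholds, and the minimum sample count are hypothetical values chosen for illustration; the issue does not specify them. p95 is computed with the nearest-rank method over the samples currently inside the window.

```python
import math
from collections import deque


class RollbackTrigger:
    """Signal a rollback when p95 latency or error rate breaches a limit
    over a short moving time window. All defaults are illustrative."""

    def __init__(self, window_seconds=60.0, p95_limit_ms=800.0,
                 error_rate_limit=0.05, min_samples=20):
        self.window_seconds = window_seconds
        self.p95_limit_ms = p95_limit_ms
        self.error_rate_limit = error_rate_limit
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp, latency_ms, is_error)

    def record(self, timestamp, latency_ms, is_error):
        """Add one request sample and evict samples older than the window."""
        self._samples.append((timestamp, latency_ms, is_error))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def should_rollback(self):
        """Return True if either metric breaches its limit in the window."""
        n = len(self._samples)
        if n < self.min_samples:
            return False  # too few samples to judge reliably
        latencies = sorted(s[1] for s in self._samples)
        rank = max(1, math.ceil(0.95 * n))  # nearest-rank percentile
        p95 = latencies[rank - 1]
        error_rate = sum(1 for s in self._samples if s[2]) / n
        return p95 > self.p95_limit_ms or error_rate > self.error_rate_limit
```

The `min_samples` floor is there so that a handful of slow requests during a quiet period does not trigger a rollback on its own.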

For the tiny patch diff or a single before/after chart, reach me at [email protected]. cc @sk-portkey @VisargD. If not actionable, feel free to close; sharing in case it helps on-call.
