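Brief tail-latency instability observed during upstream provider turbulence: a short p95/p99 expansion with transient 5xx bursts, plus queue-length amplification driven by retries. The retry loop looks self-exciting during provider error spikes, since each failed request adds retry traffic to a queue that is already growing.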
Detected via public status at 2025-11-01 18:53 IST (13:23 UTC, 14:23 CET/Berlin).
Minimal rollback guardrail: patch the Helm/gateway config with jittered exponential backoff for retries on 5xx/429, tighter upstream timeouts, and a lower max in-flight limit per tenant. Add a rollback trigger that fires when p95 latency or error rate breaches its threshold over a short moving window.
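To make the backoff part concrete, here is a minimal sketch of "full jitter" exponential backoff. It is not the actual patch: the names `backoff_delay` and `should_retry`, the retryable status set, and the `base`/`cap`/`max_attempts` defaults are all assumptions chosen for illustration.

```python
import random

# Assumed set of retryable upstream statuses (illustrative, not from the patch).
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base=0.2, cap=5.0, rng=random):
    """Return a full-jitter delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt and is clamped to `cap`; the
    actual delay is drawn uniformly from [0, ceiling] so that clients
    retrying after the same failure do not wake up in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def should_retry(status, attempt, max_attempts=3):
    """Retry only retryable statuses, and only while attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < max_attempts
```

Spreading retries across the whole interval, rather than adding a small jitter on top of a fixed delay, is what keeps a provider error spike from turning into synchronized retry waves against the queue.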
Goal is to contain queue expansion under provider wobble, suppress tail amplification, and preserve autoscaler stability under intermittent upstream failures.
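The rollback trigger could look roughly like the sketch below: a sliding time window over recent requests that reports a breach when either the window's p95 latency or its error rate exceeds a limit. The class name `RollbackTrigger`, the 60 s window, the thresholds, and the `min_samples` floor are hypothetical values for illustration, not the ones in the patch.

```python
import math
from collections import deque


class RollbackTrigger:
    """Sliding-window breach detector (illustrative; all limits are assumed)."""

    def __init__(self, window_s=60.0, p95_limit_ms=2000.0,
                 error_rate_limit=0.05, min_samples=20):
        self.window_s = window_s
        self.p95_limit_ms = p95_limit_ms
        self.error_rate_limit = error_rate_limit
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp_s, latency_ms, is_error)

    def record(self, now, latency_ms, is_error):
        """Add one finished request and drop samples that left the window."""
        self._samples.append((now, latency_ms, is_error))
        self._evict(now)

    def _evict(self, now):
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()

    def should_rollback(self, now):
        """True if the window's p95 or error rate exceeds its limit."""
        self._evict(now)
        n = len(self._samples)
        if n < self.min_samples:
            return False  # too little traffic to judge
        latencies = sorted(s[1] for s in self._samples)
        rank = max(1, math.ceil(0.95 * n))  # nearest-rank percentile
        p95 = latencies[rank - 1]
        error_rate = sum(1 for s in self._samples if s[2]) / n
        return p95 > self.p95_limit_ms or error_rate > self.error_rate_limit
```

The `min_samples` floor is a deliberate choice: without it, a single slow request during a quiet period would be enough to trigger a rollback.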
For the tiny patch diff or a single before/after chart, reach me at [email protected]. cc @sk-portkey @VisargD. If not actionable, feel free to close; sharing in case it helps on-call.