Description
vLLM supports offloading the KV cache from GPU VRAM to CPU memory (and, in future, to fast storage) to avoid recomputing shared prefixes. We should benchmark this feature and incorporate it into capacity planning.
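As a starting point, here is a minimal sketch of one way to exercise this with vLLM's offline API. `swap_space` (CPU memory, in GiB, that vLLM can use for swapped-out KV blocks) and `enable_prefix_caching` are existing engine arguments, but they cover preemption-driven swapping and prefix reuse rather than full KV cache offloading, and newer offloading paths (e.g. KV connectors) vary by vLLM version, so this should be verified against the release we target:

```python
# Sketch: run a prefix-heavy workload with CPU swap space available for
# KV blocks and prefix caching enabled. Model name and prompts are
# placeholders; flag availability varies by vLLM version.
from vllm import LLM, SamplingParams

prompts = [
    "Shared system prompt ... user question 1",
    "Shared system prompt ... user question 2",
]
params = SamplingParams(max_tokens=128)

llm = LLM(
    model="facebook/opt-125m",     # placeholder model
    swap_space=8,                  # GiB of CPU memory for swapped KV blocks
    enable_prefix_caching=True,    # reuse KV cache for shared prefixes
)
outputs = llm.generate(prompts, params)
```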
Acceptance Criteria
- Test vLLM KV cache offloading to CPU memory
- Measure impact on TTFT, TPOT, and throughput (see the measurement sketch after this list)
- Evaluate cost/performance tradeoffs (reduced GPU memory footprint vs. potential latency increase)
- Update benchmark data to include KV cache configuration dimensions
- Document when to recommend KV cache offloading
- Add configuration options to Deployment Engine
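For the TTFT/TPOT criterion, a sketch of one way to measure both against a vLLM OpenAI-compatible server; the base URL, model name, and prompt are placeholders:

```python
# Sketch: measure TTFT (time to first token) and TPOT (mean time per
# output token) from a streaming completion request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.completions.create(
    model="facebook/opt-125m",  # placeholder model
    prompt="Explain KV cache offloading in one paragraph.",
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if first_token_at is None:
        first_token_at = now  # first streamed token marks TTFT
    token_times.append(now)

ttft = first_token_at - start
# TPOT: mean inter-token gap after the first token.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
tpot = sum(gaps) / len(gaps) if gaps else float("nan")
print(f"TTFT: {ttft * 1000:.1f} ms  TPOT: {tpot * 1000:.1f} ms")
```

Run once with offloading disabled and once enabled, holding model, GPU, and workload fixed, to isolate the latency impact.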
Notes
- KV cache offloading can reduce GPU memory requirements for certain workloads
- May enable running larger models on smaller GPUs
- Benchmarks should capture the latency impact of offloading; a sketch of the configuration dimensions each benchmark record could carry follows this list
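A hypothetical schema for those benchmark dimensions, illustrating what the data and the Deployment Engine options might need to record. Every field name here is illustrative, not an existing API:

```python
# Hypothetical record layout for benchmarks with KV cache configuration
# dimensions; field names are assumptions for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class KVCacheConfig:
    offload_enabled: bool       # KV cache offloading on/off
    offload_target: str         # e.g. "cpu" now, "storage" later
    cpu_swap_space_gib: float   # CPU memory reserved for offloaded blocks
    prefix_caching: bool        # whether prefix caching is enabled

@dataclass(frozen=True)
class BenchmarkRecord:
    model: str
    gpu: str
    kv_cache: KVCacheConfig
    ttft_ms: float              # time to first token
    tpot_ms: float              # time per output token
    throughput_tok_s: float     # aggregate tokens/second
```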