Description
vLLM supports offloading the KV cache from GPU VRAM to CPU memory (and, in future, to fast storage) to avoid recomputing shared prefixes. We should benchmark this feature and incorporate it into capacity planning.
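As a starting point, here is a minimal sketch of one way to exercise this with vLLM's offline API. `swap_space` (CPU memory, in GiB, that vLLM can use for swapped-out KV blocks) and `enable_prefix_caching` are existing engine arguments, but they cover preemption-driven swapping and prefix reuse rather than full KV cache offloading, and newer offloading paths (e.g. KV connectors) vary by vLLM version, so this should be verified against the release we target:

```python
# Sketch: run a prefix-heavy workload with CPU swap space available for
# KV blocks and prefix caching enabled. Model name and prompts are
# placeholders; flag availability varies by vLLM version.
from vllm import LLM, SamplingParams

prompts = [
    "Shared system prompt ... user question 1",
    "Shared system prompt ... user question 2",
]
params = SamplingParams(max_tokens=128)

llm = LLM(
    model="facebook/opt-125m",     # placeholder model
    swap_space=8,                  # GiB of CPU memory for swapped KV blocks
    enable_prefix_caching=True,    # reuse KV cache for shared prefixes
)
outputs = llm.generate(prompts, params)
```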
Acceptance Criteria
- Test vLLM KV cache offloading to CPU memory
- Measure impact on TTFT, TPOT, and throughput (see the measurement sketch after this list)
- Evaluate cost/performance tradeoffs (reduced GPU memory footprint vs. potential latency increase)
- Update benchmark data to include KV cache configuration dimensions
- Document when to recommend KV cache offloading
- Add configuration options to Deployment Engine
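For the TTFT/TPOT criterion, a sketch of one way to measure both against a vLLM OpenAI-compatible server; the base URL, model name, and prompt are placeholders:

```python
# Sketch: measure TTFT (time to first token) and TPOT (mean time per
# output token) from a streaming completion request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.completions.create(
    model="facebook/opt-125m",  # placeholder model
    prompt="Explain KV cache offloading in one paragraph.",
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if first_token_at is None:
        first_token_at = now  # first streamed token marks TTFT
    token_times.append(now)

ttft = first_token_at - start
# TPOT: mean inter-token gap after the first token.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
tpot = sum(gaps) / len(gaps) if gaps else float("nan")
print(f"TTFT: {ttft * 1000:.1f} ms  TPOT: {tpot * 1000:.1f} ms")
```

Run once with offloading disabled and once enabled, holding model, GPU, and workload fixed, to isolate the latency impact.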
Notes
- KV cache offloading can reduce GPU memory requirements for certain workloads
- May enable running larger models on smaller GPUs
- Benchmarks should capture the latency impact of offloading; a sketch of the configuration dimensions each benchmark record could carry follows this list
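A hypothetical schema for those benchmark dimensions, illustrating what the data and the Deployment Engine options might need to record. Every field name here is illustrative, not an existing API:

```python
# Hypothetical record layout for benchmarks with KV cache configuration
# dimensions; field names are assumptions for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class KVCacheConfig:
    offload_enabled: bool       # KV cache offloading on/off
    offload_target: str         # e.g. "cpu" now, "storage" later
    cpu_swap_space_gib: float   # CPU memory reserved for offloaded blocks
    prefix_caching: bool        # whether prefix caching is enabled

@dataclass(frozen=True)
class BenchmarkRecord:
    model: str
    gpu: str
    kv_cache: KVCacheConfig
    ttft_ms: float              # time to first token
    tpot_ms: float              # time per output token
    throughput_tok_s: float     # aggregate tokens/second
```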