
Evaluate and benchmark vLLM KV cache offloading options #15

@anfredette

Description

vLLM supports offloading KV cache from GPU VRAM to CPU memory (and, in the future, to fast storage) to avoid recomputing shared prefixes. We should benchmark these features and incorporate the results into capacity planning.

Acceptance Criteria

  • Test vLLM KV cache offloading to CPU memory
  • Measure impact on TTFT (time to first token), TPOT (time per output token), and throughput
  • Evaluate cost/performance tradeoffs (lower GPU memory use vs. potential latency increase)
  • Update benchmark data to include KV cache configuration dimensions
  • Document when to recommend KV cache offloading
  • Add configuration options to Deployment Engine
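As a starting point for the CPU-offloading test, a sketch of an offline vLLM configuration is below. Argument names (`swap_space`, `enable_prefix_caching`, `gpu_memory_utilization`) follow recent vLLM releases and should be verified against the version under test; the model name is only an example, and the exact offload mechanism (engine swap space vs. an external KV connector such as LMCache) is an open question this benchmark should settle.

```python
# Sketch of a vLLM engine configuration for the CPU-offload benchmark run.
# Requires a GPU host with vLLM installed; argument names should be checked
# against the deployed vLLM version. The model is only an example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.90,
    swap_space=16,                # GiB of CPU memory per GPU for swapped-out KV blocks
    enable_prefix_caching=True,   # reuse cached prefixes instead of recomputing them
)

outputs = llm.generate(
    ["Summarize the benefits of KV cache offloading."],
    SamplingParams(max_tokens=64),
)
```

Sweeping `swap_space` (including 0 as the no-offload baseline) while holding the workload fixed would give the GPU-memory-vs-latency curve the acceptance criteria ask for.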

Notes

  • KV cache offloading can reduce GPU memory requirements for certain workloads
  • May enable running larger models on smaller GPUs
  • Benchmarks should capture latency impact of offloading
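For the latency measurements above, TTFT and TPOT can be derived from per-token arrival timestamps collected by a streaming client. A minimal, framework-agnostic sketch (the helper name and timestamp format are our own, not a vLLM API):

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and TPOT from a request start time and per-token
    arrival timestamps (all in seconds on the same clock).

    TTFT is the latency of the first token; TPOT is the mean
    inter-token gap over the remaining tokens.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft": ttft, "tpot": tpot}

# Example: request sent at t=0.0 s, first token at 0.5 s,
# then one token every 0.1 s -> TTFT 0.5 s, TPOT ~0.1 s
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
```

Recording these per request, with and without offloading, makes the latency impact directly comparable across KV cache configurations.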
