Benchmarking every combination of (model, GPU, traffic profile) produces a matrix whose size is the product of the three dimensions, so each added model, GPU, or profile multiplies the number of benchmark runs. We should investigate whether we can interpolate or extrapolate the missing combinations, or build predictive models, to reduce the number of explicit benchmarks needed.
Acceptance Criteria
- Research the feasibility of estimating results between measured traffic profiles (e.g., interpolating time to first token (TTFT) for intermediate prompt lengths)
- Investigate component-level benchmarking to build GPU+LLM performance models
- Prototype an extrapolation or modeling approach
- Validate prediction accuracy against ground-truth benchmarks that were not used to build the model
- Document approach and limitations
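
As a starting point for the prototype and validation criteria, here is a minimal sketch of one possible approach: piecewise-linear interpolation of TTFT over prompt length, checked with a leave-one-out comparison against measured points. The function names (`interpolate_ttft`, `leave_one_out_errors`) and all numbers are hypothetical placeholders, not real benchmark results, and linear interpolation is only an assumption about how TTFT behaves between measured prompt lengths.

```python
from bisect import bisect_right

# Hypothetical placeholder measurements for one (model, GPU) pair.
# These are NOT real benchmark results.
PROMPT_TOKENS = [128, 512, 2048, 8192]   # measured prompt lengths, ascending
TTFT_MS = [20.0, 45.0, 150.0, 640.0]     # measured TTFT in milliseconds


def interpolate_ttft(query_tokens, known_tokens, known_ttft):
    """Piecewise-linear TTFT estimate for a prompt length inside the measured range."""
    if not known_tokens[0] <= query_tokens <= known_tokens[-1]:
        # Outside the measured range we would be extrapolating, so refuse.
        raise ValueError("query is outside the measured range")
    if query_tokens == known_tokens[-1]:
        return known_ttft[-1]
    hi = bisect_right(known_tokens, query_tokens)  # first point above the query
    lo = hi - 1
    frac = (query_tokens - known_tokens[lo]) / (known_tokens[hi] - known_tokens[lo])
    return known_ttft[lo] + frac * (known_ttft[hi] - known_ttft[lo])


def leave_one_out_errors(known_tokens, known_ttft):
    """Hold out each interior point, predict it from the rest, return relative errors."""
    errors = {}
    # Endpoints are skipped: predicting them would require extrapolation.
    for i in range(1, len(known_tokens) - 1):
        rest_tokens = known_tokens[:i] + known_tokens[i + 1:]
        rest_ttft = known_ttft[:i] + known_ttft[i + 1:]
        predicted = interpolate_ttft(known_tokens[i], rest_tokens, rest_ttft)
        errors[known_tokens[i]] = abs(predicted - known_ttft[i]) / known_ttft[i]
    return errors


if __name__ == "__main__":
    print(interpolate_ttft(320, PROMPT_TOKENS, TTFT_MS))
    print(leave_one_out_errors(PROMPT_TOKENS, TTFT_MS))
```

The leave-one-out errors give a rough indication of how much accuracy is lost when a benchmark point is dropped; an acceptable error threshold would still need to be agreed on as part of this task.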
Notes
- This is an exploratory/research task
- Goal: Reduce benchmarking cost while maintaining recommendation accuracy
- May inform Phase 2 architecture