Description
vLLM exposes many configuration knobs (e.g., max_num_seqs, gpu_memory_utilization, enable_prefix_caching) that directly affect throughput, latency, and memory footprint. We should expose the most relevant of these parameters for user customization.
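For reference, these knobs are ordinary engine arguments in vLLM's Python API (with matching `--`-style flags on the server). A minimal sketch, with the model name as a placeholder:

```python
from vllm import LLM

# The three knobs named above, set via vLLM's Python API.
# (On the server these are --max-num-seqs, --gpu-memory-utilization,
# and --enable-prefix-caching.)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_num_seqs=256,                # max sequences scheduled concurrently
    gpu_memory_utilization=0.90,     # fraction of GPU memory vLLM may claim
    enable_prefix_caching=True,      # reuse KV cache for shared prompt prefixes
)
```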
Acceptance Criteria
- Identify most impactful vLLM configuration parameters
- Add fields to DeploymentIntent schema for vLLM config
- Update KServe InferenceService template to include vLLM args (see the sketch after this list)
- Provide sensible defaults based on use case and traffic profile
- Add UI controls for adjusting vLLM parameters (advanced mode)
- Document each parameter and its impact on performance
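A rough sketch of how the schema fields and template update could fit together; the `VLLMConfig` model, the Pydantic usage, and the `to_vllm_args` helper are all assumptions for illustration, not existing code (the flag names themselves are real vLLM server flags):

```python
from pydantic import BaseModel, Field

class VLLMConfig(BaseModel):
    """Hypothetical vLLM tuning fields to embed in the DeploymentIntent schema."""
    max_num_seqs: int = Field(256, ge=1)
    gpu_memory_utilization: float = Field(0.90, gt=0.0, le=1.0)
    enable_prefix_caching: bool = True

def to_vllm_args(cfg: VLLMConfig) -> list[str]:
    """Render the config as CLI args for the vLLM container in the
    KServe InferenceService template."""
    args = [
        f"--max-num-seqs={cfg.max_num_seqs}",
        f"--gpu-memory-utilization={cfg.gpu_memory_utilization}",
    ]
    if cfg.enable_prefix_caching:
        args.append("--enable-prefix-caching")
    return args
```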
Notes
- Start with most critical parameters (e.g., KV cache settings, memory utilization)
- Advanced users may want fine-grained control
- Consider auto-tuning parameters based on traffic profile in the future (see the sketch below)
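One possible shape for profile-driven defaults, which would also give future auto-tuning a seam to plug into; the profile names and values below are illustrative assumptions, not benchmarked recommendations:

```python
# Illustrative defaults keyed by traffic profile; values are assumptions,
# not benchmarked recommendations.
PROFILE_DEFAULTS: dict[str, dict] = {
    # Many short chat requests: favor concurrency and prefix reuse.
    "chatty": {
        "max_num_seqs": 256,
        "gpu_memory_utilization": 0.90,
        "enable_prefix_caching": True,
    },
    # Fewer long-context requests: leave more headroom for the KV cache.
    "long_context": {
        "max_num_seqs": 64,
        "gpu_memory_utilization": 0.95,
        "enable_prefix_caching": False,
    },
}

def defaults_for(profile: str) -> dict:
    """Fall back to the chatty profile for unknown traffic profiles."""
    return PROFILE_DEFAULTS.get(profile, PROFILE_DEFAULTS["chatty"])
```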