[Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform #30212
base: main
Conversation
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Code Review
This pull request refactors the attention backend selection mechanism by introducing an AttentionSelectorConfig NamedTuple. This new configuration object encapsulates various attention parameters such as head size, data type, KV cache data type, block size, MLA usage, sink token presence, sparse attention usage, and attention type. The get_attn_backend and _cached_get_attn_backend functions in vllm/attention/selector.py are updated to create and pass this single config object instead of multiple individual arguments. This change propagates throughout the platform-specific attention backend selection logic in vllm/platforms/cpu.py, vllm/platforms/cuda.py, and vllm/platforms/interface.py, where relevant methods like get_attn_backend_cls and get_valid_backends are modified to accept and utilize the AttentionSelectorConfig object, simplifying their signatures and improving parameter management. Additionally, logging for attention configurations is updated to use the __repr__ method of the new config object.
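As a rough, hedged sketch of the pattern this review describes (field and parameter names are inferred from the summary above, not taken from the diff), the refactor bundles the per-feature flags into one object so a platform hook only reads what it needs:

```python
from typing import NamedTuple, Optional

import torch


class AttentionSelectorConfig(NamedTuple):
    # Field names are illustrative, based on the review summary; the actual
    # PR may name, type, or order them differently.
    head_size: int
    dtype: torch.dtype
    kv_cache_dtype: Optional[str] = None
    block_size: int = 16
    use_mla: bool = False
    has_sink: bool = False
    use_sparse: bool = False
    attn_type: str = "decoder"


class MyOOTPlatform:
    @classmethod
    def get_attn_backend_cls(cls, selected_backend,
                             attn_config: AttentionSelectorConfig) -> str:
        # An out-of-tree platform reads only the fields it cares about, so a
        # newly added field (e.g. use_sparse) no longer forces a signature
        # change on every platform.
        if attn_config.use_mla:
            raise NotImplementedError("MLA is not supported on this platform")
        return "my_plugin.attention.MyAttentionBackend"
```

Because the config is a NamedTuple, adding a new field with a default value leaves existing call sites and platform overrides untouched.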
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
/gemini review
Code Review
This pull request refactors the attention backend selection by introducing AttentionSelectorConfig to encapsulate the configuration parameters. This is a great improvement as it simplifies the function signatures in platform-specific modules and makes the interface more stable for out-of-tree platforms. The changes are applied consistently across all relevant files. I've found one minor issue with an incomplete __repr__ implementation which could affect debugging.
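To make the flagged __repr__ concern concrete, here is a hypothetical example (not the code from the PR) of how a hand-written __repr__ on such a config can silently drop fields from debug logs, and the straightforward fix of listing every field:

```python
from typing import NamedTuple


class AttentionSelectorConfig(NamedTuple):
    # Reduced, hypothetical field set for illustration only.
    head_size: int
    use_mla: bool = False
    use_sparse: bool = False

    def __repr__(self) -> str:
        # If a hand-written __repr__ omits newer fields (e.g. use_sparse),
        # log lines built from it lose that information; including every
        # field keeps the attention-configuration logs useful for debugging.
        return (f"AttentionSelectorConfig(head_size={self.head_size}, "
                f"use_mla={self.use_mla}, use_sparse={self.use_sparse})")


print(AttentionSelectorConfig(head_size=128, use_sparse=True))
```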
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Isotr0py <[email protected]>
Hi @Isotr0py, the pre-commit checks have failed. Please run:
    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files
Then, commit the changes and push to your branch.
Purpose
Currently, many individual arguments are passed to get_attn_backend_cls, while not all platforms use all of them. This also easily breaks OOT platforms whenever a new attention feature introduces a new argument such as use_mla or use_sink. This PR encapsulates these arguments into AttentionSelectorConfig, so that platforms can use them on demand and no longer need to update the interface for each feature update.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, such as updating supported_models.md and examples for a new model.