Lora MoE Align Improvements #29257
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces several significant improvements to the MoE LoRA alignment kernels. The refactoring to unify LoRA and non-LoRA logic by using common __device__ functions is a great step towards reducing code duplication and improving maintainability. The introduction of a global memory variant for moe_lora_align_sum_kernel and the use of a separate CUDA stream for parallel execution are solid performance optimizations. However, I've found a critical issue in csrc/moe/moe_align_sum_kernels.cu where a data type mismatch for token_lora_mapping could lead to incorrect memory access and undefined behavior. This needs to be addressed.
Most tests in
Force-pushed from e0cce68 to ddc47bf
These should be fixed now.
csrc/moe/moe_align_sum_kernels.cu (outdated)

```cpp
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

const int32_t num_thread = max((int32_t)num_experts, 128);  // WARP_SIZE,
TORCH_CHECK(num_thread <= 1024,
```
This check should verify that `num_experts` is less than or equal to 1024.
We also need to add more values in https://github.com/vllm-project/vllm/blob/main/tests/lora/test_moe_lora_align_sum.py#L35
Good catch. I just updated this check and added test cases for larger `num_experts`.
LGTM, @yewentao256 could you please take a look?

@gnovack All LoRA tests are failing
yewentao256
left a comment
Thanks for the work! Please fix the unit tests; I think they're all related.
No problem! I think I've found the issue that's causing these test failures, so I should have a fix out in the next few hours.

This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: gnovack <[email protected]>
Force-pushed from 0b5e995 to 8d847ab
jeejeelee
left a comment
Thank you for your contribution and patience
Signed-off-by: gnovack <[email protected]>
Signed-off-by: mayoohee <[email protected]>
Purpose
This PR includes the following changes to the MoE LoRA Align kernel:
- Adds a global memory variant of `moe_lora_align_sum_kernel` (following the `moe_align_block_size_kernel` implementation). Previously only the shared memory variant was implemented, which led to worse performance for models with larger `num_experts`.
- Uses two CUDA streams to execute `moe_lora_align_sum_kernel` and `moe_align_block_size_kernel` in parallel.
- Refactors `moe_lora_align_sum_kernel` and `moe_align_block_size_kernel` to reduce duplicate logic between LoRA and non-LoRA cases (e.g. `moe_align_block_size_kernel` and `moe_lora_align_block_size_kernel` now call a common `_moe_align_block_size` function which supports both LoRA and non-LoRA cases).

FIX #30026
Test Plan
Test Result
Benchmark results w/ gpt-oss-120b before vs. after this change
Before:
After:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.