
Conversation

@gnovack
Contributor

@gnovack gnovack commented Nov 23, 2025

Purpose

This PR includes the following changes to the MoE LoRA Align kernel:

  • Adds a global memory variant of moe_lora_align_sum_kernel (following the moe_align_block_size_kernel implementation). Previously only the shared memory variant was implemented, which led to worse performance for models with a larger num_experts.
  • Uses two CUDA streams to execute moe_lora_align_sum_kernel and moe_align_block_size_kernel in parallel (see the sketch after this list).
  • Refactors moe_lora_align_sum_kernel and moe_align_block_size_kernel to reduce duplicated logic between the LoRA and non-LoRA cases (e.g. moe_align_block_size_kernel and moe_lora_align_block_size_kernel now call a common _moe_align_block_size function that supports both cases).
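
As a rough sketch of the two-stream idea (a standalone illustration only; kernel_a and kernel_b below are placeholders for moe_align_block_size_kernel and moe_lora_align_sum_kernel, and the stream and event names are made up rather than taken from this PR):

  #include <cuda_runtime.h>
  #include <cstdio>

  __global__ void kernel_a(int* out, int n) {   // stand-in for moe_align_block_size_kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
  }

  __global__ void kernel_b(int* out, int n) {   // stand-in for moe_lora_align_sum_kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
  }

  int main() {
    const int n = 1 << 20;
    int *a, *b;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));

    // A second stream lets the two alignment kernels run concurrently.
    cudaStream_t main_stream, lora_stream;
    cudaStreamCreate(&main_stream);
    cudaStreamCreate(&lora_stream);

    kernel_a<<<(n + 255) / 256, 256, 0, main_stream>>>(a, n);
    kernel_b<<<(n + 255) / 256, 256, 0, lora_stream>>>(b, n);

    // Downstream work that consumes both outputs must be ordered after the side stream.
    cudaEvent_t lora_done;
    cudaEventCreate(&lora_done);
    cudaEventRecord(lora_done, lora_stream);
    cudaStreamWaitEvent(main_stream, lora_done, 0);

    cudaStreamSynchronize(main_stream);
    printf("both alignment kernels finished\n");

    cudaEventDestroy(lora_done);
    cudaStreamDestroy(main_stream);
    cudaStreamDestroy(lora_stream);
    cudaFree(a);
    cudaFree(b);
    return 0;
  }

The key detail is the event-based synchronization: whatever later consumes both alignment results must wait for the side stream, otherwise the overlap would introduce a race.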

FIX #30026

Test Plan

  • Ran existing LoRA and moe_align_sum test cases

Test Result

Benchmark results with gpt-oss-120b before vs. after this change:

Before:

============ Serving Benchmark Result ============
Successful requests:                     400       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  500.09    
Total input tokens:                      641602    
Total generated tokens:                  240183    
Request throughput (req/s):              0.80      
Output token throughput (tok/s):         480.28    
Peak output token throughput (tok/s):    576.00    
Peak concurrent requests:                13.00     
Total Token throughput (tok/s):          1763.27   
---------------Time to First Token----------------
Mean TTFT (ms):                          155.95    
Median TTFT (ms):                        142.50    
P99 TTFT (ms):                           582.98    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.38     
Median TPOT (ms):                        16.42     
P99 TPOT (ms):                           17.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.38     
Median ITL (ms):                         15.15     
P99 ITL (ms):                            121.57    
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     400       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  480.35    
Total input tokens:                      641602    
Total generated tokens:                  240183    
Request throughput (req/s):              0.83      
Output token throughput (tok/s):         500.01    
Peak output token throughput (tok/s):    648.00    
Peak concurrent requests:                13.00     
Total Token throughput (tok/s):          1835.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          166.84    
Median TTFT (ms):                        149.03    
P99 TTFT (ms):                           631.58    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.71     
Median TPOT (ms):                        15.73     
P99 TPOT (ms):                           16.71     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.71     
Median ITL (ms):                         14.38     
P99 ITL (ms):                            129.20    
==================================================


@mergify

mergify bot commented Nov 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gnovack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 23, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several significant improvements to the MoE LoRA alignment kernels. The refactoring to unify LoRA and non-LoRA logic by using common __device__ functions is a great step towards reducing code duplication and improving maintainability. The introduction of a global memory variant for moe_lora_align_sum_kernel and the use of a separate CUDA stream for parallel execution are solid performance optimizations. However, I've found a critical issue in csrc/moe/moe_align_sum_kernels.cu where a data type mismatch for token_lora_mapping could lead to incorrect memory access and undefined behavior. This needs to be addressed.
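
As a standalone illustration of why such a mismatch is dangerous (this is not the PR's code; it only shows a 64-bit buffer being read through a 32-bit pointer):

  #include <cstdint>
  #include <cstdio>

  int main() {
    // Suppose a mapping tensor holds 64-bit values...
    const int64_t mapping64[4] = {0, 1, 2, 3};
    // ...but is read through a 32-bit pointer, as a mismatched kernel argument would be.
    const int32_t* as32 = reinterpret_cast<const int32_t*>(mapping64);
    // On a little-endian machine as32[1] is the upper half of mapping64[0], so every
    // element after the first is read from the wrong offset (and the aliasing itself
    // is undefined behavior).
    printf("mapping64[1] = %lld, as32[1] = %d\n", (long long)mapping64[1], as32[1]);
    return 0;
  }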


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@ApostaC
Collaborator

ApostaC commented Nov 24, 2025

cc @tylertitsworth

@jeejeelee
Collaborator

Most tests in test_moe_lora_align_sum.py are now failing.

@gnovack gnovack force-pushed the lora-align-refactor branch from e0cce68 to ddc47bf Compare December 3, 2025 18:38
@mergify mergify bot removed the needs-rebase label Dec 3, 2025
@gnovack
Contributor Author

gnovack commented Dec 3, 2025

Most tests in test_moe_lora_align_sum.py are now failing.

These should be fixed now

const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

const int32_t num_thread = max((int32_t)num_experts, 128); // WARP_SIZE,
TORCH_CHECK(num_thread <= 1024,
Collaborator


This check should verify that num_experts is less than or equal to 1024.
We also need to add larger num_experts values in https://github.com/vllm-project/vllm/blob/main/tests/lora/test_moe_lora_align_sum.py#L35
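
For illustration, the suggested check could look roughly like the following (the error-message wording is a placeholder, not proposed text):

  const int32_t num_thread = max((int32_t)num_experts, 128);  // WARP_SIZE,
  TORCH_CHECK(num_experts <= 1024,
              "moe_lora_align_block_size supports at most 1024 experts, got ",
              num_experts);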

Contributor Author


Good catch. I just updated this check and added test cases for larger num_experts.

@jeejeelee jeejeelee added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 4, 2025
@jeejeelee
Collaborator

LGTM, @yewentao256 could you please take a look?

@jeejeelee
Collaborator

@gnovack All LoRA tests are failing

Member

@yewentao256 yewentao256 left a comment


Thanks for the work! Please fix the unit tests; I think the failures are all related.

@gnovack
Contributor Author

gnovack commented Dec 5, 2025

Thanks for the work! Please fix the unit tests; I think the failures are all related.

No problem! I think I've found the issue that is causing these test failures, so I should have a fix out in the next few hours.

@mergify

mergify bot commented Dec 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gnovack.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the new-model, performance, qwen, gpt-oss, and nvidia labels Dec 8, 2025
@mergify mergify bot added the rocm Related to AMD ROCm label Dec 8, 2025
@mergify mergify bot added the kv-connector label Dec 8, 2025
@gnovack gnovack force-pushed the lora-align-refactor branch from 0b5e995 to 8d847ab Compare December 8, 2025 23:03
@mergify mergify bot removed the tpu and needs-rebase labels Dec 8, 2025
Collaborator

@jeejeelee jeejeelee left a comment


Thank you for your contribution and patience.

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Dec 9, 2025
@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 9, 2025
@jeejeelee jeejeelee merged commit ea657f2 into vllm-project:main Dec 9, 2025
95 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 9, 2025
mayoohee pushed a commit to mayoohee/vllm that referenced this pull request Dec 9, 2025
