[Performance] Fuse DeepSeek shared experts and gate operations #28540
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces significant performance optimizations for DeepSeek models by fusing shared experts and gate operations. The changes are well-structured, introducing new environment variables to control the features and adding corresponding benchmarks and tests. However, I've identified a critical logic error in the dispatch logic of the new CUDA kernel csrc/moe/moe_fused_gate.cu that makes some of the templated kernel specializations unreachable. This needs to be fixed to ensure correctness for all supported configurations.
💡 Codex Review
Here are some automated review suggestions for this pull request.
    assert routed_scaling_factor is not None, \
        "With num_fused_shared_experts>0" \
        ", routed_scaling_factor need to be provided"
    topk_ids[:, -1] = torch.randint(low=num_experts,
                                    high=num_experts + num_fused_shared_experts,
                                    size=(topk_ids.size(0), ),
                                    dtype=topk_ids.dtype,
                                    device=topk_ids.device)
    topk_weights[:, -1] = topk_weights[:, :-1].sum(dim=-1) / routed_scaling_factor
    if renormalize:
-       topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
+       if num_fused_shared_experts == 0:
+           topk_weights_sum = topk_weights.sum(dim=-1, keepdim=True)
+       else:
+           topk_weights_sum = topk_weights[:, :-1].sum(dim=-1, keepdim=True)
+       topk_weights = topk_weights / topk_weights_sum
Renormalization ignores fused shared expert weight
When shared experts are fused the code overwrites the last top‑k slot with topk_weights[:, :-1].sum()/routed_scaling_factor and then divides the entire vector by the sum of only the non‑shared weights. This makes the non‑shared weights sum to 1 but leaves the shared weight at 1/routed_scaling_factor, so the probabilities add up to 1 + num_fused_shared_experts/routed_scaling_factor and are no longer scaled by routed_scaling_factor as the non‑fused path does. Because DeepseekV2MoE disables external scaling whenever this feature is enabled, tokens routed through fused shared experts will be over‑weighted, producing larger activations than the unfused path. Consider including the shared weight in the normalization or renormalizing after inserting the shared expert so the weights still form a properly scaled distribution.
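To make the arithmetic concrete, here is a toy check of the behaviour described above. The values are illustrative and not taken from the PR; 2.5 is the routed_scaling_factor used in the DeepSeek-V3 config.

```python
import torch

routed_scaling_factor = 2.5
# 4 routed weights already normalized to 1, plus one slot reserved for the fused shared expert
topk_weights = torch.tensor([0.4, 0.3, 0.2, 0.1, 0.0])

# What the fused path does: fill the shared slot, then renormalize by the non-shared sum only
topk_weights[-1] = topk_weights[:-1].sum() / routed_scaling_factor
topk_weights = topk_weights / topk_weights[:-1].sum()

print(topk_weights.sum().item())  # ≈ 1.4 == 1 + 1/routed_scaling_factor
# Non-fused reference: routed weights sum to 1, the MoE output is multiplied by
# routed_scaling_factor afterwards, and the shared expert is added with weight 1.
# With external scaling disabled in the fused path, the two paths diverge.
```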
cc @bnellnm, cc @alexm-redhat for RFC
    expert_load_view=self.expert_load_view,
    logical_to_physical_map=self.logical_to_physical_map,
    logical_replica_count=self.logical_replica_count,
    enable_fused_moe_router=self.enable_fused_moe_router,
Since enable_fused_moe_router is a FusedMoE layer attribute and apply gets the layer as a parameter, can we avoid passing it around to all the apply methods?
I've updated the way the enable_fused_moe_router parameter is passed: it is now read from an env variable only inside the select_experts function. This has made the code more readable.
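For reference, the shape of that change is roughly the following. This is only a sketch: the select_experts signature and the way the flag is read here (plain os.environ) are assumptions, not the exact code in the PR.

```python
import os

def select_experts(hidden_states, router_logits, top_k, **kwargs):
    # Read the feature flag once at the routing entry point instead of
    # threading enable_fused_moe_router through every quantized apply().
    enable_fused_moe_router = os.environ.get("VLLM_USE_FUSED_MOE_ROUTER", "0") == "1"
    if enable_fused_moe_router:
        # dispatch to the fused moe_fused_gate kernel
        ...
    else:
        # fall back to the unfused grouped top-k path
        ...
```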
This pull request has merge conflicts that must be resolved before it can be merged.
Also CC @tlrmchlsmth
This pull request has merge conflicts that must be resolved before it can be merged.
alexm-redhat left a comment:
Thanks for the PR! I have verified its correctness and did some initial performance verification; all looks good. I left some comments. The PR needs a rebase, and then it should be ready to be merged.
    @@ -0,0 +1,484 @@
    #include <ATen/cuda/CUDAContext.h>
It would be good to add a comment that it is taken from sglang.
I've added it.
        get_compressed_expert_map(self.expert_map),
    )
    if self.num_fused_shared_experts > 0:
        logger.warning(
Does it work with EP? If not, then maybe we should assert some conditions here.
    assert topk_group is not None
    assert num_expert_group is not None
    if rocm_aiter_ops.is_fused_moe_enabled():
    if hidden_states.shape[0] == 0:
Why is this shape[0] == 0 case necessary? It would be good to have a quick doc comment explaining this piece of code.
    and e_score_correction_bias is not None
    and is_power_of_two(e_score_correction_bias.shape[0])
    ):
    # The fused kernel can only work with 128/256 experts
can we add a check for E=128 or E=256 in the if statement as well?
Maybe this comment was a bit misleading, so I've deleted it. The kernel works with routed experts whose count is a power of 2. So, if the shape of e_score_correction_bias is a power of two, I suppose it will also be true for the number of experts.
        apply_routed_scaling_factor_on_output=False,
    )
    else:
        topk_weights, topk_ids = grouped_topk_impl(
In which case is moe_fused_gate faster than the usual grouped_topk_impl?
At the moment it's not faster on its own anymore, but it provides a boost when used together with fused_experts, as the default grouped_topk isn't adapted to it.
    return states

-   if self.shared_experts is not None:
+   if self.shared_experts is not None and self.num_fused_shared_experts == 0:
How is it possible to have self.shared_experts != None and self.num_fused_shared_experts > 0? It seems like if self.num_fused_shared_experts > 0, then self.shared_experts must be None.
Yeah, you're right, fixed it.
    # aiter applies routed_scaling_factor internally
    routed_scaling_factor=1.0
-   if not self.is_rocm_aiter_moe_enabled
+   if not used_inside_scaling
Could you add some documentation about how "used_inside_scaling" changes the functionality of the scaling application?
Purpose
This PR contains optimizations from two other PRs that we aligned with current main:
This effort is driven by Nebius AI, as part of ongoing optimization work.
Command to run the optimized version:
Results
Our measurements show that these optimizations improve performance. Specifically, they give a significant boost to the TTFT metric. Results on the ITL metric are generally positive, with slight performance drawdowns observed only in rare configurations.
Fused shared expert (PR#15502)
This has already been implemented for AMD by PR#24097. Our PR makes it available on CUDA as well.
Description
Currently, we use an env variable (VLLM_USE_CUDA_FUSION_SHARED_EXPERTS) to enable or disable the feature.
When this feature is enabled, we internally set num_fused_shared_experts for the FusedMoE module. During the weight loading stage, we clone the shared expert weights into the experts (using the loading code from PR#24097). Shared expert ids and expert weights are manually assigned to ensure that the experts stay balanced and accuracy is preserved.
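Putting the routing side of this together, the sketch below mirrors the diff hunk quoted earlier in the review. The helper name is hypothetical; the PR does this inline in the top-k routing code.

```python
import torch

def append_fused_shared_expert(topk_weights, topk_ids, num_experts,
                               num_fused_shared_experts, routed_scaling_factor):
    """Fill the extra top-k slot reserved for the fused shared expert (sketch)."""
    # Shared-expert ids live right after the routed expert ids; picking one at
    # random keeps the replicated shared experts balanced across tokens.
    topk_ids[:, -1] = torch.randint(low=num_experts,
                                    high=num_experts + num_fused_shared_experts,
                                    size=(topk_ids.size(0),),
                                    dtype=topk_ids.dtype,
                                    device=topk_ids.device)
    # Weight the shared expert relative to the routed weights, compensating for
    # the routed_scaling_factor that is no longer applied outside the MoE layer.
    topk_weights[:, -1] = topk_weights[:, :-1].sum(dim=-1) / routed_scaling_factor
    return topk_weights, topk_ids
```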
Fused operation for gating (PR#21107)
This fuses the DeepSeek-style grouped top-k expert selection process (three top-k ops plus assorted bells and whistles) into one custom kernel.
The kernel is imported from sglang (sgl-project/sglang#4530).
Currently we use an env variable (VLLM_USE_FUSED_MOE_ROUTER) to enable or disable the feature.
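For context, the unfused path that this kernel replaces boils down to roughly the following sequence of top-k operations. This is a simplified reference of DeepSeek-style grouped routing, not the exact vLLM implementation.

```python
import torch

def grouped_topk_reference(scores, e_score_correction_bias,
                           num_expert_group, topk_group, top_k):
    """Simplified DeepSeek-style grouped top-k selection (reference sketch).

    scores: [num_tokens, num_experts] sigmoid router scores
    e_score_correction_bias: [num_experts]
    """
    num_tokens, num_experts = scores.shape
    biased = (scores + e_score_correction_bias).view(num_tokens, num_expert_group, -1)
    # top-k #1: score each group by the sum of its top-2 biased expert scores
    group_scores = biased.topk(2, dim=-1).values.sum(dim=-1)
    # top-k #2: keep only the best `topk_group` groups
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_scores.topk(topk_group, dim=-1).indices, 1.0)
    masked = biased.masked_fill(group_mask.unsqueeze(-1) == 0, float("-inf"))
    # top-k #3: pick top_k experts over the surviving groups; ids come from the
    # biased scores, while the returned weights use the original scores
    topk_ids = masked.view(num_tokens, num_experts).topk(top_k, dim=-1).indices
    topk_weights = scores.gather(1, topk_ids)
    return topk_weights, topk_ids
```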
Notes
When these optimizations are enabled, we can’t use the fused grouped_topk kernel for MoE (#23274), which is now the default.
Tests
All tests have been performed on DeepSeek V3 on an H200.
gsm8k results
Original version:
Optimized version:
vLLM benchmark results
We tested releases/v0.11.1 original vs releases/v0.11.1 + our optimizations.

Our benchmark results
You can find our benchmark results in the doc.