[2/N] Support DeepSeek-R1 w4a8 low latency deepep #8464
Conversation
Summary of Changes
Hello @ayrnb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Mixture of Experts (MoE) layer's capabilities by integrating a new low-latency mode for DeepEP, specifically optimized for w4a8 quantized models like DeepSeek-R1. The changes involve introducing new CUDA and Triton kernels to handle specialized data layouts and computations, alongside updates to the MoE forward pass and token dispatching logic to support this new operational mode. The goal is to improve performance and efficiency for these specific model configurations.
Highlights
- DeepEP Low Latency Mode Support: Introduced specialized support for DeepEP (Expert Parallelism) low-latency mode, specifically for DeepSeek-R1 w4a8 models. This enables more efficient processing for certain Mixture of Experts (MoE) configurations.
- New W4A8 Cutlass MoE Functionality: Added a new `forward_cutlass_w4a8_masked` method within the MoE layer, which leverages the `cutlass_w4a8_moe` function with specific parameters tailored for the DeepEP low-latency mode. This includes new data preprocessing and post-processing logic.
- CUDA Kernel Enhancements for 3D Inputs: Modified underlying CUDA kernels to support 3D input tensors for activation data (`a_tensors`) in the w4a8 grouped GEMM operations. This is crucial for handling the different data layouts required by the new low-latency mode.
- Dynamic FP8 Quantization Control: Updated the token dispatcher to conditionally enable or disable FP8 quantization based on the `SGLANG_USE_W4A8` environment variable, providing more flexibility for quantization schemes.
- New Triton Kernels for Problem Size Calculation: Implemented new Triton kernels (`compute_problem_sizes_w4a8_kernel`, `compute_expert_offsets_w4a8_kernel`) and a Python wrapper (`deepep_ll_get_cutlass_w4a8_moe_mm_data`) to dynamically calculate problem sizes and expert offsets for the DeepEP low-latency mode, optimizing memory access and computation (a plain-PyTorch sketch of what this wrapper computes follows the list).
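To make the wrapper's role concrete, here is a hypothetical plain-PyTorch reconstruction of what `deepep_ll_get_cutlass_w4a8_moe_mm_data` computes: per-expert grouped-GEMM problem shapes and row offsets derived from the masked token counts that DeepEP low-latency dispatch returns. The actual PR does this on-device with Triton; the signature and shapes below are assumptions for illustration only.

```python
import torch

def get_w4a8_moe_mm_data_sketch(masked_m: torch.Tensor, n: int, k: int):
    """Hypothetical sketch of deepep_ll_get_cutlass_w4a8_moe_mm_data.

    masked_m: [num_local_experts] count of valid tokens per expert, as
              returned by DeepEP low-latency dispatch alongside a 3D
              activation buffer of shape [num_local_experts, max_tokens, k].
    """
    num_experts = masked_m.shape[0]
    # One grouped-GEMM problem (m_e, n, k) per local expert; experts with
    # masked_m == 0 become empty problems the grouped GEMM can skip.
    problem_sizes = torch.stack(
        [masked_m, torch.full_like(masked_m, n), torch.full_like(masked_m, k)],
        dim=1,
    )
    # Exclusive prefix sum of counts -> start offset of each expert's rows.
    expert_offsets = torch.zeros(
        num_experts + 1, dtype=masked_m.dtype, device=masked_m.device
    )
    expert_offsets[1:] = torch.cumsum(masked_m, dim=0)
    return problem_sizes, expert_offsets
```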
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for DeepEP low-latency mode for w4a8 models. The changes involve adding new Triton kernels, modifying existing MoE layers to handle a new execution path (deepep_ll), and updating C++ kernels to support 3D activation tensors. The review identified several critical issues, including incorrect parallel prefix-sum logic in a Triton kernel, incorrect tensor allocations and logic in the main MoE function, and a call to a function with a missing argument that would lead to a crash. There are also some medium-severity issues, such as commented-out code and disabled checks, that should be addressed.
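On the prefix-sum finding: a race-free way to produce expert offsets in Triton is to have a single program perform the whole scan, e.g. via `tl.cumsum`, rather than accumulating across independent programs. The sketch below is illustrative only; the kernel name, shapes, and launch are assumptions, not the PR's `compute_expert_offsets_w4a8_kernel`.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def expert_offsets_scan_kernel(counts_ptr, offsets_ptr, num_experts, BLOCK: tl.constexpr):
    # One program owns the whole scan, so there is no cross-program race:
    # load all per-expert counts, take an inclusive cumsum, then shift it
    # into an exclusive prefix sum by storing at position i + 1.
    offs = tl.arange(0, BLOCK)
    mask = offs < num_experts
    counts = tl.load(counts_ptr + offs, mask=mask, other=0)
    csum = tl.cumsum(counts, axis=0)
    tl.store(offsets_ptr + 1 + offs, csum, mask=mask)
    tl.store(offsets_ptr, 0)

counts = torch.tensor([3, 0, 5, 2], dtype=torch.int32, device="cuda")
offsets = torch.empty(counts.numel() + 1, dtype=torch.int32, device="cuda")
expert_offsets_scan_kernel[(1,)](counts, offsets, counts.numel(), BLOCK=8)
print(offsets)  # tensor([0, 3, 3, 8, 10], ...)
```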
Force-pushed from ca7f89b to 1b10d19.
The review thread below refers to this hunk:

```python
    k,
    BLOCK_SIZE=512,
)
if ep_mode == "ep":
```
How about using MoeA2ABackend and ep_size to check ep_mode?
But how do we distinguish between DeepEP normal and low latency? 🤔
> How about using MoeA2ABackend and ep_size to check ep_mode?

Sorry, I didn't notice that --enable-deepep-moe is deprecated. 🥵
> How about using MoeA2ABackend and ep_size to check ep_mode?

I updated the code and used dispatch_output.format to check ep_mode.
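For readers following along, branching on the dispatch output's format (rather than on a server flag) might look like the sketch below. The enum and method names are assumptions inferred from this thread, not verified SGLang APIs.

```python
from enum import Enum, auto

class DispatchOutputFormat(Enum):
    # Stand-in for the dispatcher's real format enum (assumption).
    STANDARD = auto()
    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()

def moe_forward(layer, dispatch_output):
    # Route on what the dispatcher actually produced: low-latency dispatch
    # yields masked 3D activations, normal dispatch yields flat 2D ones.
    fmt = dispatch_output.format
    if fmt == DispatchOutputFormat.DEEPEP_LL:
        return layer.forward_cutlass_w4a8_masked(dispatch_output)
    if fmt == DispatchOutputFormat.DEEPEP_NORMAL:
        return layer.forward_deepep_normal(dispatch_output)
    return layer.forward_standard(dispatch_output)
```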
Force-pushed from 9cb7f38 to ab92496.
Thanks for the commit! May I ask if the implementation is complete? I noticed you implemented a compute_expert_offsets_w4a8 function, but I don't seem to find where it's called. I tried calling it after deepep_ll_get_cutlass_w4a8_moe_mm_data(), but the output appears garbled.
Hello~ I tried to test this PR (https://github.com/bytedance-iaas/sglang/tree/feat/w4a8_support_ll_deepep) with low-latency DeepEP, but I ran into the following error:

It seems this PR has some compatibility issues. Could you please rebase it onto the latest sglang main branch? Thanks a lot!
@qhsc Hi, I ran into the same issue, and finally resolved it with the following modification.
fzyzcjy left a comment:
LGTM, only some nits
Hi @ayrnb, could you merge the latest main and fix the conflicts? Thanks!
Force-pushed from 3900aff to fc33a13.
Done!
ch-wan left a comment:
It looks good to me in general. I left some comments. Also, we are gradually adopting the new MoE framework and reimplementing existing code; the roadmap is here: #8715. Most of the code in this PR should be reorganized in a future PR.
Motivation
Follow-up to #8247 and #7762. Based on #8311.
Support DeepEP low-latency mode for the DeepSeek-R1 w4a8 model.
Modifications
Add forward_cutlass_w4a8_masked for DeepEP low-latency mode; CUDA graph capture can be enabled on this path. A sketch of the masked tensor layout this method consumes follows.
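To illustrate why this path is CUDA-graph friendly, the sketch below shows the shapes involved in the masked layout. All concrete sizes here are assumptions for illustration, not values taken from the PR.

```python
import torch

# Hypothetical shapes for the low-latency (masked) MoE path. DeepEP's
# low-latency dispatch hands each rank a fixed-capacity 3D buffer plus a
# per-expert count of how many rows are actually valid.
num_local_experts = 32   # experts owned by this rank (assumption)
max_tokens = 128         # dispatch capacity per expert (assumption)
hidden = 7168            # DeepSeek-R1 hidden size

hidden_states = torch.randn(
    num_local_experts, max_tokens, hidden,
    dtype=torch.bfloat16, device="cuda",
)
masked_m = torch.randint(
    0, max_tokens + 1, (num_local_experts,),
    dtype=torch.int32, device="cuda",
)

# The masked forward only computes the first masked_m[e] rows of expert e;
# the rest of the fixed-capacity buffer is padding the grouped GEMM skips.
# Because every tensor shape is static, the whole forward can be captured
# in a CUDA graph and replayed without re-allocation.
```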
Usage:

```bash
SGLANG_DEEPEP_BF16_DISPATCH=1 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server \
  --model-path /data/models/DeepSeek-R1-W4AFP8 --tp 8 --trust-remote-code \
  --host 0.0.0.0 --port 8000 --context-length 4096 --moe-dense-tp-size 1 \
  --mem-fraction-static 0.70 --cuda-graph-max-bs 64 --cuda-graph-bs 1 2 4 8 16 20 32 64 \
  --max-running-requests 128 --disable-radix-cache --moe-a2a-backend deepep \
  --deepep-mode low_latency --watchdog-timeout 1000000 --moe-runner-backend cutlass \
  >& log_low_latency.log 2>&1
```

Accuracy Test
Checklist