[2/N] Support DeepSeek-R1 w4a8 low latency deepep #8464
Conversation
Summary of Changes
Hello @ayrnb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Mixture of Experts (MoE) layer's capabilities by integrating a new low-latency mode for DeepEP, specifically optimized for w4a8 quantized models like DeepSeek-R1. The changes involve introducing new CUDA and Triton kernels to handle specialized data layouts and computations, alongside updates to the MoE forward pass and token dispatching logic to support this new operational mode. The goal is to improve performance and efficiency for these specific model configurations.
Highlights
- DeepEP Low Latency Mode Support: Introduced specialized support for DeepEP (Expert Parallelism) low-latency mode, specifically for DeepSeek-R1 w4a8 models. This enables more efficient processing for certain Mixture of Experts (MoE) configurations.
- New W4A8 Cutlass MoE Functionality: Added a new `forward_cutlass_w4a8_masked` method within the MoE layer, which leverages the `cutlass_w4a8_moe` function with specific parameters tailored for the DeepEP low-latency mode. This includes new data preprocessing and post-processing logic.
- CUDA Kernel Enhancements for 3D Inputs: Modified underlying CUDA kernels to support 3D input tensors for activation data (`a_tensors`) in the w4a8 grouped GEMM operations. This is crucial for handling the different data layouts required by the new low-latency mode.
- Dynamic FP8 Quantization Control: Updated the token dispatcher to conditionally enable or disable FP8 quantization based on the `SGLANG_USE_W4A8` environment variable, providing more flexibility for quantization schemes.
- New Triton Kernels for Problem Size Calculation: Implemented new Triton kernels (`compute_problem_sizes_w4a8_kernel`, `compute_expert_offsets_w4a8_kernel`) and a Python wrapper (`deepep_ll_get_cutlass_w4a8_moe_mm_data`) to dynamically calculate problem sizes and expert offsets for the DeepEP low-latency mode, optimizing memory access and computation (a plain-PyTorch sketch of what this wrapper computes follows the list).
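To make the wrapper's role concrete, here is a hypothetical plain-PyTorch reconstruction of what `deepep_ll_get_cutlass_w4a8_moe_mm_data` computes: per-expert grouped-GEMM problem shapes and row offsets derived from the masked token counts that DeepEP low-latency dispatch returns. The actual PR does this on-device with Triton; the signature and shapes below are assumptions for illustration only.

```python
import torch

def get_w4a8_moe_mm_data_sketch(masked_m: torch.Tensor, n: int, k: int):
    """Hypothetical sketch of deepep_ll_get_cutlass_w4a8_moe_mm_data.

    masked_m: [num_local_experts] count of valid tokens per expert, as
              returned by DeepEP low-latency dispatch alongside a 3D
              activation buffer of shape [num_local_experts, max_tokens, k].
    """
    num_experts = masked_m.shape[0]
    # One grouped-GEMM problem (m_e, n, k) per local expert; experts with
    # masked_m == 0 become empty problems the grouped GEMM can skip.
    problem_sizes = torch.stack(
        [masked_m, torch.full_like(masked_m, n), torch.full_like(masked_m, k)],
        dim=1,
    )
    # Exclusive prefix sum of counts -> start offset of each expert's rows.
    expert_offsets = torch.zeros(
        num_experts + 1, dtype=masked_m.dtype, device=masked_m.device
    )
    expert_offsets[1:] = torch.cumsum(masked_m, dim=0)
    return problem_sizes, expert_offsets
```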
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for DeepEP low-latency mode for w4a8 models. The changes involve adding new Triton kernels, modifying existing MoE layers to handle a new execution path (deepep_ll), and updating C++ kernels to support 3D activation tensors. The review identified several critical issues, including incorrect parallel prefix-sum logic in a Triton kernel, incorrect tensor allocations and logic in the main MoE function, and a call to a function with a missing argument that would lead to a crash. There are also some medium-severity issues, such as commented-out code and disabled checks, that should be addressed.
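On the prefix-sum finding: a race-free way to produce expert offsets in Triton is to have a single program perform the whole scan, e.g. via `tl.cumsum`, rather than accumulating across independent programs. The sketch below is illustrative only; the kernel name, shapes, and launch are assumptions, not the PR's `compute_expert_offsets_w4a8_kernel`.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def expert_offsets_scan_kernel(counts_ptr, offsets_ptr, num_experts, BLOCK: tl.constexpr):
    # One program owns the whole scan, so there is no cross-program race:
    # load all per-expert counts, take an inclusive cumsum, then shift it
    # into an exclusive prefix sum by storing at position i + 1.
    offs = tl.arange(0, BLOCK)
    mask = offs < num_experts
    counts = tl.load(counts_ptr + offs, mask=mask, other=0)
    csum = tl.cumsum(counts, axis=0)
    tl.store(offsets_ptr + 1 + offs, csum, mask=mask)
    tl.store(offsets_ptr, 0)

counts = torch.tensor([3, 0, 5, 2], dtype=torch.int32, device="cuda")
offsets = torch.empty(counts.numel() + 1, dtype=torch.int32, device="cuda")
expert_offsets_scan_kernel[(1,)](counts, offsets, counts.numel(), BLOCK=8)
print(offsets)  # tensor([0, 3, 3, 8, 10], ...)
```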
Force-pushed from ca7f89b to 1b10d19.
The review thread below refers to this hunk:

```python
    k,
    BLOCK_SIZE=512,
)
if ep_mode == "ep":
```
How about using MoeA2ABackend and ep_size to check ep_mode?
But how do we distinguish between DeepEP normal and low latency? 🤔
> How about using MoeA2ABackend and ep_size to check ep_mode?

Sorry, I didn't notice that --enable-deepep-moe is deprecated. 🥵
> How about using MoeA2ABackend and ep_size to check ep_mode?

I updated the code and used dispatch_output.format to check ep_mode.
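For readers following along, branching on the dispatch output's format (rather than on a server flag) might look like the sketch below. The enum and method names are assumptions inferred from this thread, not verified SGLang APIs.

```python
from enum import Enum, auto

class DispatchOutputFormat(Enum):
    # Stand-in for the dispatcher's real format enum (assumption).
    STANDARD = auto()
    DEEPEP_NORMAL = auto()
    DEEPEP_LL = auto()

def moe_forward(layer, dispatch_output):
    # Route on what the dispatcher actually produced: low-latency dispatch
    # yields masked 3D activations, normal dispatch yields flat 2D ones.
    fmt = dispatch_output.format
    if fmt == DispatchOutputFormat.DEEPEP_LL:
        return layer.forward_cutlass_w4a8_masked(dispatch_output)
    if fmt == DispatchOutputFormat.DEEPEP_NORMAL:
        return layer.forward_deepep_normal(dispatch_output)
    return layer.forward_standard(dispatch_output)
```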
Force-pushed from 9cb7f38 to ab92496.
Thanks for the commit! May I ask if the implementation is complete? I noticed you implemented a compute_expert_offsets_w4a8 function, but I don't seem to find where it's called. I tried calling it after deepep_ll_get_cutlass_w4a8_moe_mm_data(), but the output appears garbled.
Hello~ I tried to test this PR (https://github.com/bytedance-iaas/sglang/tree/feat/w4a8_support_ll_deepep) with low-latency DeepEP, but I ran into the following error:

It seems this PR has some compatibility issues. Could you please rebase it onto the latest sglang main branch? Thanks a lot!
@qhsc Hi, I ran into the same issue, and finally resolved it with the following modification.
fzyzcjy left a comment:
LGTM, only some nits
Hi @ayrnb, could you merge the latest main and fix the conflicts? Thanks!
Force-pushed from 3900aff to fc33a13.
Done!
ch-wan left a comment:
It looks good to me in general. I left some comments. Also, we are gradually adopting the new MoE framework and reimplementing existing code; the roadmap is here: #8715. Most of the code in this PR should be reorganized in a future PR.
Motivation
Follow-up to #8247 and #7762. Based on #8311.
Support DeepEP low-latency mode for the DeepSeek-R1 w4a8 model.
Modifications
Add forward_cutlass_w4a8_masked for DeepEP low-latency mode; CUDA graph capture can be enabled on this path. A sketch of the masked tensor layout this method consumes follows.
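To illustrate why this path is CUDA-graph friendly, the sketch below shows the shapes involved in the masked layout. All concrete sizes here are assumptions for illustration, not values taken from the PR.

```python
import torch

# Hypothetical shapes for the low-latency (masked) MoE path. DeepEP's
# low-latency dispatch hands each rank a fixed-capacity 3D buffer plus a
# per-expert count of how many rows are actually valid.
num_local_experts = 32   # experts owned by this rank (assumption)
max_tokens = 128         # dispatch capacity per expert (assumption)
hidden = 7168            # DeepSeek-R1 hidden size

hidden_states = torch.randn(
    num_local_experts, max_tokens, hidden,
    dtype=torch.bfloat16, device="cuda",
)
masked_m = torch.randint(
    0, max_tokens + 1, (num_local_experts,),
    dtype=torch.int32, device="cuda",
)

# The masked forward only computes the first masked_m[e] rows of expert e;
# the rest of the fixed-capacity buffer is padding the grouped GEMM skips.
# Because every tensor shape is static, the whole forward can be captured
# in a CUDA graph and replayed without re-allocation.
```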
Usage:

```bash
SGLANG_DEEPEP_BF16_DISPATCH=1 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server \
  --model-path /data/models/DeepSeek-R1-W4AFP8 --tp 8 --trust-remote-code \
  --host 0.0.0.0 --port 8000 --context-length 4096 --moe-dense-tp-size 1 \
  --mem-fraction-static 0.70 --cuda-graph-max-bs 64 --cuda-graph-bs 1 2 4 8 16 20 32 64 \
  --max-running-requests 128 --disable-radix-cache --moe-a2a-backend deepep \
  --deepep-mode low_latency --watchdog-timeout 1000000 --moe-runner-backend cutlass \
  >& log_low_latency.log 2>&1
```

Accuracy Test
Checklist