@ayrnb (Collaborator) commented on Jul 28, 2025

Motivation

Follow-up to #8247 and #7762. Based on #8311.
Support DeepEP low-latency mode for the DeepSeek-R1 w4a8 model.

Modifications

Add forward_cutlass_w4a8_masked for DeepEP low-latency mode; CUDA graph can be enabled on this path.

Usage:

```bash
SGLANG_DEEPEP_BF16_DISPATCH=1 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server \
  --model-path /data/models/DeepSeek-R1-W4AFP8 --tp 8 --trust-remote-code \
  --host 0.0.0.0 --port 8000 --context-length 4096 --moe-dense-tp-size 1 \
  --mem-fraction-static 0.70 --cuda-graph-max-bs 64 --cuda-graph-bs 1 2 4 8 16 20 32 64 \
  --max-running-requests 128 --disable-radix-cache --moe-a2a-backend deepep \
  --deepep-mode low_latency --watchdog-timeout 1000000 --moe-runner-backend cutlass \
  > log_low_latency.log 2>&1
```

Accuracy Test

[screenshot: accuracy test results]


@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @ayrnb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Mixture of Experts (MoE) layer's capabilities by integrating a new low-latency mode for DeepEP, specifically optimized for w4a8 quantized models like DeepSeek-R1. The changes involve introducing new CUDA and Triton kernels to handle specialized data layouts and computations, alongside updates to the MoE forward pass and token dispatching logic to support this new operational mode. The goal is to improve performance and efficiency for these specific model configurations.

Highlights

  • DeepEP Low Latency Mode Support: Introduced specialized support for DeepEP (Expert Parallelism) low latency mode, specifically for DeepSeek-R1 w4a8 models. This enables more efficient processing for certain Mixture of Experts (MoE) configurations.
  • New W4A8 Cutlass MoE Functionality: Added a new forward_cutlass_w4a8_masked method within the MoE layer, which leverages the cutlass_w4a8_moe function with specific parameters tailored for the DeepEP low latency mode. This includes new data preprocessing and post-processing logic.
  • CUDA Kernel Enhancements for 3D Inputs: Modified underlying CUDA kernels to support 3D input tensors for activation data (a_tensors) in the w4a8 grouped GEMM operations. This is crucial for handling the different data layouts required by the new low-latency mode.
  • Dynamic FP8 Quantization Control: Updated the token dispatcher to conditionally enable or disable FP8 quantization based on the SGLANG_USE_W4A8 environment variable, providing more flexibility for quantization schemes.
  • New Triton Kernels for Problem Size Calculation: Implemented new Triton kernels (compute_problem_sizes_w4a8_kernel, compute_expert_offsets_w4a8_kernel) and a Python wrapper (deepep_ll_get_cutlass_w4a8_moe_mm_data) to dynamically calculate problem sizes and expert offsets for the DeepEP low latency mode, optimizing memory access and computation.
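The kernels themselves live in the diff rather than this thread, but as orientation for the last highlight, here is a plain-PyTorch sketch of the quantities `deepep_ll_get_cutlass_w4a8_moe_mm_data` is described as producing. The tensor names, shapes, and the masked-layout convention are assumptions inferred from the summary above, not the PR's actual implementation.

```python
# Hypothetical reference in plain PyTorch (not the PR's Triton kernels):
# per-expert grouped-GEMM problem sizes and offsets for a masked layout.
import torch

def w4a8_moe_mm_data_ref(masked_m: torch.Tensor, n: int, k: int):
    """masked_m[e] = number of valid tokens DeepEP delivered to local expert e."""
    num_experts = masked_m.numel()
    # One (m, n, k) problem per expert; m is that expert's masked token count.
    problem_sizes = torch.stack(
        [
            masked_m,
            masked_m.new_full((num_experts,), n),
            masked_m.new_full((num_experts,), k),
        ],
        dim=1,
    )
    # An exclusive prefix sum over the counts yields each expert's start row.
    expert_offsets = torch.zeros(
        num_experts + 1, dtype=torch.int64, device=masked_m.device
    )
    expert_offsets[1:] = torch.cumsum(masked_m, dim=0)
    return problem_sizes, expert_offsets
```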

@ayrnb changed the title from "Feat/w4a8 support ll deepep" to "Support DeepSeek-R1 w4a8 low latency deepep" on Jul 28, 2025.
@gemini-code-assist (bot) left a comment:

Code Review

This pull request adds support for DeepEP low latency mode for w4a8 models. The changes involve adding new Triton kernels, modifying existing MoE layers to handle a new execution path (deepep_ll), and updating C++ kernels to support 3D activation tensors. The review identified several critical issues, including incorrect parallel prefix sum logic in a Triton kernel, incorrect tensor allocations and logic in the main MoE function, and a call to a function with a missing argument that would lead to a crash. There are also some medium-severity issues like commented-out code and disabled checks that should be addressed.
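For context on the prefix-sum issue: turning per-expert token counts into expert offsets is an exclusive scan. A minimal single-program Triton sketch of a correct version follows; this is not the PR's kernel, and it assumes the expert count fits in one block.

```python
import triton
import triton.language as tl

@triton.jit
def expert_offsets_kernel(counts_ptr, offsets_ptr, num_experts, BLOCK: tl.constexpr):
    # One program instance loads every per-expert count, computes an
    # inclusive cumsum, and subtracts the counts to get the exclusive scan.
    idx = tl.arange(0, BLOCK)
    mask = idx < num_experts
    counts = tl.load(counts_ptr + idx, mask=mask, other=0)
    inclusive = tl.cumsum(counts, axis=0)
    exclusive = inclusive - counts
    tl.store(offsets_ptr + idx, exclusive, mask=mask)
```

Launched as, e.g., `expert_offsets_kernel[(1,)](counts, offsets, num_experts, BLOCK=512)` for up to 512 experts.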

@ayrnb marked this pull request as draft on July 29, 2025.
@ayrnb force-pushed the feat/w4a8_support_ll_deepep branch from ca7f89b to 1b10d19 on July 29, 2025.
@ayrnb marked this pull request as ready for review on July 29, 2025.
@ayrnb requested a review from kushanam as a code owner on July 29, 2025.
@ayrnb changed the title from "Support DeepSeek-R1 w4a8 low latency deepep" to "[2/N] Support DeepSeek-R1 w4a8 low latency deepep" on July 29, 2025.
```python
        k,
        BLOCK_SIZE=512,
    )
    if ep_mode == "ep":
```
A collaborator commented on this hunk:

How about using MoeA2ABackend and ep_size to check ep_mode?

@ayrnb (Author) replied:

But how do we distinguish between DeepEP normal and low-latency modes? 🤔

@ayrnb (Author) replied:

> How about using MoeA2ABackend and ep_size to check ep_mode?

Sorry, I didn't notice that --enable-deepep-moe is deprecated. 🥵

@ayrnb (Author) replied:

> How about using MoeA2ABackend and ep_size to check ep_mode?

I updated the code and used dispatch_output.format to check ep_mode.
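To make the shape of that check concrete, here is a hedged sketch; the enum members and method names are illustrative assumptions, not the exact sglang API.

```python
# Illustrative only: names below are assumptions mirroring the
# dispatch_output.format idea, not sglang's actual definitions.
from enum import Enum, auto

class DispatchOutputFormat(Enum):
    DEEPEP_NORMAL = auto()  # contiguous token layout from normal dispatch
    DEEPEP_LL = auto()      # masked 3D layout from low-latency dispatch

def run_moe(layer, dispatch_output):
    if dispatch_output.format == DispatchOutputFormat.DEEPEP_LL:
        # Masked grouped-GEMM path added in this PR.
        return layer.forward_cutlass_w4a8_masked(dispatch_output)
    return layer.forward_cutlass_w4a8(dispatch_output)
```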

@ayrnb force-pushed the feat/w4a8_support_ll_deepep branch from 9cb7f38 to ab92496 on August 4, 2025.
@zhyncs self-assigned this on Aug 7, 2025.
@bird2426 commented:
Thanks for the commit! May I ask if the implementation is complete? I noticed you implemented a compute_expert_offsets_w4a8 function, but I don't seem to find where it's called. I tried calling it after deepep_ll_get_cutlass_w4a8_moe_mm_data(), but the output appears garbled.

@ayrnb (Author) commented on Aug 16, 2025:

> Thanks for the commit! May I ask if the implementation is complete? I noticed you implemented a compute_expert_offsets_w4a8 function, but I don't seem to find where it's called. I tried calling it after deepep_ll_get_cutlass_w4a8_moe_mm_data(), but the output appears garbled.

The compute_expert_offsets_w4a8 function was part of an earlier experimental implementation, but it is not actually used in the current version. I will remove it.

@qhsc (Contributor) commented on Aug 19, 2025:

Hello~

I tried to test this PR (https://github.com/bytedance-iaas/sglang/tree/feat/w4a8_support_ll_deepep) with low-latency DeepEP, but I ran into the following error:

```
AttributeError: 'DeepEPMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'?
```

It seems this PR has some compatibility issues. Could you please rebase it with the latest sglang main branch? Thanks a lot!

@pandengyao commented:
@qhsc Hi, I ran into the same issue and finally resolved it with the following modification. [screenshots: Snipaste_2025-08-20_10-45-24, Snipaste_2025-08-20_10-45-53]
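The screenshots do not survive this export, so the exact change is unknown. One plausible shape for this kind of attribute-name mismatch, purely illustrative and assuming the w4a8 checkpoint registers its scales under the `_inv` suffix, is:

```python
# Illustrative only: the real fix is in the screenshots above.
# Tolerate both attribute spellings when reading the fused gate/up scales.
w13_scale = getattr(layer, "w13_weight_scale", None)
if w13_scale is None:
    w13_scale = layer.w13_weight_scale_inv
```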

@ayrnb (Author) commented on Aug 20, 2025:

> Hello~
>
> I tried to test this PR (https://github.com/bytedance-iaas/sglang/tree/feat/w4a8_support_ll_deepep) with low-latency DeepEP, but I ran into the following error:
>
> AttributeError: 'DeepEPMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'?
>
> It seems this PR has some compatibility issues. Could you please rebase it with the latest sglang main branch? Thanks a lot!

OK, I am rebasing it now.

@ayrnb (Author) commented on Aug 20, 2025:

> @qhsc Hi, I ran into the same issue, and finally resolved it with the following modification. [screenshots]

This PR depends on the DeepEP normal-mode support in #8247; I will rebase both on the latest branch.

@fzyzcjy (Collaborator) left a comment:

LGTM, only some nits

@zhyncs (Member) commented on Oct 16, 2025:

Hi @ayrnb, could you merge the latest main and fix the conflicts? Thanks!

@ayrnb and others added 9 commits on October 17, 2025:

  • clean code
  • clean code
  • clean code
  • fix
  • fix (rebase main)
  • update
  • clean code
  • clean code (code clean, bugfix)
  • lint (bugfix)
@ayrnb force-pushed the feat/w4a8_support_ll_deepep branch from 3900aff to fc33a13 on October 17, 2025.
@ayrnb (Author) commented on Oct 17, 2025:

> Hi @ayrnb, could you merge the latest main and fix the conflicts? Thanks!

Done!

@ch-wan (Collaborator) left a comment:

It looks good to me in general. I left some comments. Also, we are gradually adopting the new MoE framework and reimplementing existing code; the roadmap is here: #8715. Most of the code in this PR should be reorganized in a future PR.

@ch-wan added the run-ci label on Oct 21, 2025.
@ch-wan merged commit 13bf565 into sgl-project:main on Oct 25, 2025; 126 of 131 checks passed.
