[Optimization] Add Fused Triton Kernel for GPT-OSS Router #29237
base: main
Conversation
Code Review
This pull request introduces a fused Triton kernel for MoE routing in GPT-OSS models to optimize performance. The changes include a new Triton kernel and its integration into the model. My review identified a critical correctness issue: the fused kernel implementation and its usage in gpt_oss.py completely ignore the bias term of the router's linear layer, which will lead to incorrect model outputs. Additionally, I've identified two high-severity issues in the new Triton kernel: the block sizes for the kernel are hardcoded, which can lead to suboptimal performance, and the comments explaining the GEMM logic within the kernel are confusing and contain inaccuracies, which impacts maintainability.
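The review's main finding is that the fused path drops the router's bias. For reference, an unfused router forward pass that does keep the bias might look like the sketch below; the function name, the Linear-style weight layout, and the top-k-then-softmax ordering are assumptions for illustration, not taken from the PR diff.

```python
import torch

def reference_router(hidden: torch.Tensor, weight: torch.Tensor,
                     bias: torch.Tensor, top_k: int):
    """Unfused reference routing: biased linear layer, then top-k selection,
    then softmax over the selected logits. Dropping `bias` here would change
    which experts are selected and their weights."""
    # hidden: (tokens, hidden_dim), weight: (num_experts, hidden_dim)
    logits = hidden @ weight.t() + bias        # bias must not be ignored
    topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
    routing_weights = torch.softmax(topk_vals, dim=-1)
    return routing_weights, topk_idx
```

A fused kernel must reproduce this computation exactly, including the `+ bias` term, to match the unfused model outputs.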
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed 4ef122b to f3c61a9
ZJY0516 left a comment:
Could you please add a test for this?
Definitely.
Force-pushed f3c61a9 to dc3f820
Signed-off-by: ijpq <[email protected]>
Force-pushed dc3f820 to fca484b
I updated the benchmark test. Any advice? Since my local hardware setup has changed, the roofline analysis will take a few days. But intuitively speaking, a compiler-generated kernel is unlikely to outperform a hand-optimized CUDA kernel.
Thank you for the correction. @shaginhekvs I just ran the roofline report. This kernel still has a lot of room for optimization. I'll take another look at Triton, and I should be able to provide the optimized results today.
- split into two kernels, for the renorm and non-renorm cases
- add online softmax
- unroll along M

Signed-off-by: ijpq <[email protected]>
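The "online softmax" mentioned in this commit refers to the single-pass formulation that maintains a running maximum and a rescaled running sum, so the normalizer is built in one sweep over the logits instead of two. A minimal scalar sketch of the idea (illustrative only, not the Triton kernel's code):

```python
import math

def online_softmax(xs):
    """One-pass softmax: track the running max `m` and the running sum `s`
    of exp(x - m), rescaling `s` whenever the max is updated."""
    m = float("-inf")  # running maximum
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

In a Triton kernel the same recurrence runs per block, which avoids a separate max-reduction pass while keeping the computation numerically stable.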
Force-pushed 5af77ad to 66e6711
- delete the unnecessary split kernel
- add skipif in the unit test
- ACK reviews

Signed-off-by: ijpq <[email protected]>
Signed-off-by: ijpq <[email protected]>
Signed-off-by: ijpq <[email protected]>
Force-pushed a404e55 to c7d7a22
Signed-off-by: ijpq <[email protected]>
Force-pushed c7d7a22 to eacf1cd


The output of `python collect_env.py`

Purpose
Resolves: #28986
Test Plan
Compare results with the torch implementation and the intact (unfused) implementation.
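The comparison against the torch path can be sketched as a small consistency check like the one below; the helper name and the `fused_fn` callable are hypothetical stand-ins for the PR's actual fused entry point, and the tolerances are illustrative.

```python
import torch

def assert_routing_consistent(hidden, weight, bias, top_k, fused_fn,
                              atol=1e-3, rtol=1e-3):
    """Run the plain torch routing path and a fused implementation on the
    same inputs and compare expert indices and routing weights.
    `fused_fn` is a placeholder for the fused kernel's Python wrapper."""
    logits = hidden @ weight.t() + bias
    ref_vals, ref_idx = torch.topk(logits, top_k, dim=-1)
    ref_weights = torch.softmax(ref_vals, dim=-1)
    fused_weights, fused_idx = fused_fn(hidden, weight, bias, top_k)
    assert torch.equal(fused_idx, ref_idx)
    assert torch.allclose(fused_weights, ref_weights, atol=atol, rtol=rtol)
```

A real test would additionally run on CUDA inputs (with a skipif guard when no GPU is present, as a later commit in this PR adds).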
Test Result
compare results
The output of `pytest test_gpt_oss_fused_router.py`

The output of `pytest test_routing_consistency.py`

benchmark:
baseline:
fused:
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.