[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights #29273
Conversation
@mgoin this is the last piece needed to enable gpt-oss in BF16 on Arm CPUs, could you please have a look?
Code Review
This pull request correctly fixes a bug in the CPU implementation of swigluoai_and_mul by using an implementation that supports interleaved gate-up weights. The change replaces a local function with a shared SwigluOAIAndMul class from the activation layers, which improves code reuse. However, I've identified a performance issue where this new class is instantiated inside a loop, which should be addressed.
```diff
 gate_up = layer.gate_up_linear[i](tokens_for_this_expert)
 if activation == "swigluoai":
-    gate_up = swigluoai_and_mul(gate_up)
+    gate_up = SwigluOAIAndMul().forward_native(gate_up)
```
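A minimal sketch of the hoisting the review asks for, using a toy stand-in for the activation class (the stand-in omits vLLM's clamping details, and the expert loop and tensor shapes here are assumptions, not the actual CPU MoE code):

```python
import torch
import torch.nn as nn

class SwigluOAIAndMul(nn.Module):
    # Toy stand-in for vLLM's activation class; the real version lives in
    # the activation layers module and also clamps the gate/up values.
    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = x[..., ::2], x[..., 1::2]  # interleaved gate/up pairs
        return (up + 1) * gate * torch.sigmoid(1.702 * gate)

# Construct the activation once, outside the per-expert loop, instead of
# instantiating SwigluOAIAndMul() on every iteration as in the diff above.
act = SwigluOAIAndMul()
gate_up_linear = [nn.Linear(16, 32) for _ in range(4)]  # toy experts
tokens_for_this_expert = torch.randn(8, 16)
for expert in gate_up_linear:
    gate_up = expert(tokens_for_this_expert)
    gate_up = act.forward_native(gate_up)  # no per-iteration construction
```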
good idea, done!
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
 if activation == "swigluoai":
-    gate_up = swigluoai_and_mul(gate_up)
+    gate_up = SwigluOAIAndMul().forward_native(gate_up)
```
SwigluOAI activation pairs gate/up values incorrectly on CPU
On the CPU fallback path, `SwigluOAIAndMul().forward_native` is now fed the gate/up output directly from `gate_up_linear`, but `forward_native` expects the gate and up features to be interleaved element-wise (it uses `x[..., ::2]` / `x[..., 1::2]`). The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py lines 984-1010), and in the CPUFusedMOE path those weights are used without any reordering (unquantized_fused_moe_method.py lines 238-267). With the new call the gate tensor mixes gate and up values (e.g., gate0 with gate1 instead of gate0 with up0), producing incorrect activations whenever swigluoai is used on CPU and the SGL kernel is unavailable. This regression will distort expert outputs for models such as GPT-OSS running on CPU.
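A self-contained illustration of the layout mismatch described above (values and shapes are arbitrary):

```python
import torch

n_tokens, d = 4, 8
gate = torch.arange(0.0, 8.0).repeat(n_tokens, 1)    # stand-in gate features
up = torch.arange(100.0, 108.0).repeat(n_tokens, 1)  # stand-in up features

# Interleaved layout, which forward_native expects: [g0, u0, g1, u1, ...]
interleaved = torch.stack([gate, up], dim=-1).flatten(-2)
assert torch.equal(interleaved[..., ::2], gate)
assert torch.equal(interleaved[..., 1::2], up)

# Contiguous-halves layout produced by the CPU loader: [g0..g7, u0..u7]
contiguous = torch.cat([gate, up], dim=-1)
# Applying the interleaved slicing to this layout pairs gate0 with gate1
# (not gate0 with up0), which is the incorrect pairing Codex flags.
assert not torch.equal(contiguous[..., ::2], gate)
```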
> The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py)

I wish ... if that was the case we wouldn't be here.
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights

Current impl of `swigluoai_and_mul` for CPU assumes that gate-up weights have been de-interleaved at load time, which is not the case. The new impl we dispatch to is the same one used for the BF16 path on GPU and handles interleaved gate-up.

Signed-off-by: Fadi Arafeh <[email protected]>
Force-pushed from da1644d to e343b14
CI is green!
BF16 loading/de-interleaving of gpt-oss's gate-up needs to be addressed in general for both the GPU and CPU paths. Any future PR addressing this will have to consequently update all SwigluOAI impls introduced in #22951, including `SwigluOAI.forward_native`, since it's used as the reference impl for GPUs. Since this PR uses the same SwigluOAI impl as the GPU BF16 path, any changes to loading in that path will work (correctly) OOB on the CPU backend.
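If load-time de-interleaving is eventually done, it could look roughly like the following hypothetical transform (the function name, the assumption that the output dimension is dim 0, and the even/odd row convention are all illustrative, not vLLM's actual loader code):

```python
import torch

def deinterleave_gate_up(w: torch.Tensor) -> torch.Tensor:
    # Hypothetical: reorder rows from interleaved [g0, u0, g1, u1, ...]
    # into contiguous halves [g0, g1, ..., u0, u1, ...] along dim 0.
    return torch.cat([w[0::2], w[1::2]], dim=0)

w = torch.arange(8.0).reshape(8, 1)       # rows: g0,u0,g1,u1,g2,u2,g3,u3
print(deinterleave_gate_up(w).flatten())  # tensor([0., 2., 4., 6., 1., 3., 5., 7.])
```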
cc @mgoin
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights
Purpose
Current impl of `swigluoai_and_mul` for CPU assumes that gate-up weights have been de-interleaved at load time, which is not the case, and as a result gpt-oss generates garbage. The new impl we dispatch to in this PR is the same one used for the BF16 path on GPU and handles interleaved gate-up.
See comments on #27024 for full context.
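For context, the shared implementation being dispatched to reads gate and up from alternating positions along the last dimension, roughly as below (a sketch, not the verbatim vLLM code; the `alpha`/`limit` defaults are assumptions here):

```python
import torch

def swigluoai_and_mul_native(x: torch.Tensor,
                             alpha: float = 1.702,
                             limit: float = 7.0) -> torch.Tensor:
    # Gate/up features are interleaved element-wise along the last dim,
    # matching how gpt-oss checkpoints store the gate_up projection.
    gate, up = x[..., ::2], x[..., 1::2]
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    glu = gate * torch.sigmoid(gate * alpha)
    return (up + 1) * glu
```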
Test Plan
Ran gpt-oss-20b end to end on a few prompts.
Test Result
Generations with this fix are decent and very similar to what we get with HF transformers.
Without this fix, generations are garbage.