
Conversation


@fadara01 fadara01 commented Nov 23, 2025

[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights

Purpose

The current impl of swigluoai_and_mul for CPU assumes that gate-up weights have been de-interleaved at load time. That is not the case, so gpt-oss generates garbage.
The new impl this PR dispatches to is the same one used for the BF16 path on GPU, and it handles interleaved gate-up weights.

See comments on #27024 for full context.
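
For context, the interleaved variant of the activation can be sketched as follows. This is a hedged numpy sketch, not vLLM's code: the function name is invented here, and the clamping constants (alpha=1.702, limit=7.0) are assumptions based on gpt-oss's published SwiGLU variant rather than values copied from this PR.

```python
import numpy as np

def swiglu_oai_interleaved(gate_up: np.ndarray,
                           alpha: float = 1.702,
                           limit: float = 7.0) -> np.ndarray:
    """Sketch of SwigluOAI over an interleaved gate-up projection."""
    # Gate and up values alternate element-wise along the last dimension:
    # [g0, u0, g1, u1, ...] rather than two contiguous halves [g..., u...].
    gate = gate_up[..., ::2]
    up = gate_up[..., 1::2]
    # Assumed gpt-oss-style clamping: gate clipped from above, up symmetrically.
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    # gate * sigmoid(alpha * gate), then multiplied with the shifted up branch.
    glu = gate * (1.0 / (1.0 + np.exp(-alpha * gate)))
    return (up + 1.0) * glu
```

A de-interleaved kernel would instead split the last dimension into two contiguous halves, which is exactly the mismatch the PR description points at.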

Test Plan

Ran gpt-oss-20b end to end on a few prompts.

Test Result

Generations with this fix are decent and very similar to what we get with HF transformers.
Without this fix, generations are garbage.


Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@fadara01
Contributor Author

@mgoin this is the last piece needed to enable gpt-oss in BF16 on Arm CPUs, could you please have a look?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug in the CPU implementation of swigluoai_and_mul by using an implementation that supports interleaved gate-up weights. The change replaces a local function with a shared SwigluOAIAndMul class from the activation layers, which improves code reuse. However, I've identified a performance issue where this new class is instantiated inside a loop, which should be addressed.

      gate_up = layer.gate_up_linear[i](tokens_for_this_expert)
      if activation == "swigluoai":
    -     gate_up = swigluoai_and_mul(gate_up)
    +     gate_up = SwigluOAIAndMul().forward_native(gate_up)

Severity: high

For performance reasons, SwigluOAIAndMul should be instantiated only once, outside of this loop. Creating a new instance on every iteration for each expert introduces unnecessary overhead. Consider creating the instance before the for loop on line 268 and reusing it here.
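
The suggested hoist can be illustrated with a self-contained stub; ActStub is a hypothetical stand-in for SwigluOAIAndMul (the activation body is omitted since only the per-iteration construction cost is at issue):

```python
class ActStub:
    """Hypothetical stand-in for SwigluOAIAndMul that counts constructions."""
    constructed = 0

    def __init__(self):
        ActStub.constructed += 1

    def forward_native(self, x):
        return x  # real activation omitted; instantiation count is the point

def run_experts_hoisted(num_experts):
    act = ActStub()  # built once, before the per-expert loop
    return [act.forward_native(i) for i in range(num_experts)]
```

With the instance hoisted, eight experts trigger a single construction instead of eight.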


@fadara01 fadara01 Nov 23, 2025


good idea, done!


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 275 to 276
      if activation == "swigluoai":
    -     gate_up = swigluoai_and_mul(gate_up)
    +     gate_up = SwigluOAIAndMul().forward_native(gate_up)


P1: SwigluOAI activation pairs gate/up values incorrectly on CPU

On the CPU fallback path, SwigluOAIAndMul().forward_native is now fed the gate/up output directly from gate_up_linear, but forward_native expects the gate and up features to be interleaved element‑wise (uses x[..., ::2]/x[..., 1::2]). The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py lines 984‑1010), and in the CPUFusedMOE path those weights are used without any reordering (unquantized_fused_moe_method.py lines 238‑267). With the new call the gate tensor mixes gate and up values (e.g., gate0 with gate1 instead of gate0 with up0), producing incorrect activations whenever swigluoai is used on CPU when the SGL kernel is unavailable. This regression will distort expert outputs for models such as GPT‑OSS running on CPU.



@fadara01 fadara01 Nov 23, 2025


The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py

I wish ... if that was the case we wouldn't be here.
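
The disagreement above comes down to two incompatible layouts for the fused gate-up output. A minimal numpy sketch with toy values (illustrative only, not vLLM code) shows how each layout pairs gate with up:

```python
import numpy as np

# A fused gate-up output of width 4; values chosen to make the pairing visible.
gate_up = np.array([10.0, 11.0, 20.0, 21.0])

# Contiguous-halves reading: first half is gate, second half is up.
gate_halves, up_halves = np.split(gate_up, 2)   # gate=[10, 11], up=[20, 21]

# Interleaved reading: gate and up alternate element-wise.
gate_inter, up_inter = gate_up[::2], gate_up[1::2]  # gate=[10, 20], up=[11, 21]
```

If the weights are stored interleaved (as the PR author maintains), only the strided `[::2]` / `[1::2]` reading pairs each gate value with its matching up value; a contiguous-halves kernel would mix unrelated elements.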

[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights

Current impl of `swigluoai_and_mul` for CPU assumes that gate-up weights
have been de-interleaved at load time, which is not the case.
The new impl we dispatch to is the same one used for the BF16 path on
GPU and handles interleaved gate-up.

Signed-off-by: Fadi Arafeh <[email protected]>
@fadara01 fadara01 force-pushed the enable_swigluoai_with_interleaved_gate_up branch from da1644d to e343b14 Compare November 23, 2025 17:52
@mgoin mgoin added labels: ready (ONLY add when PR is ready to merge/full CI is needed), cpu (Related to CPU backends), gpt-oss (Related to GPT-OSS models) on Nov 23, 2025
@fadara01
Contributor Author

CI is green!

@fadara01
Contributor Author

BF16 loading/de-interleaving of gpt-oss's gate-up needs to be addressed in general for both the GPU and CPU paths. Any future PR addressing this will consequently have to update all SwigluOAI impls introduced in #22951, including SwigluOAI.forward_native, since it's used as the reference impl for GPUs. Because this PR uses the same SwigluOAI impl as the GPU BF16 path, any changes to loading in that path will work correctly out of the box on the CPU backend.


ApostaC commented Nov 24, 2025

cc @mgoin

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Nov 25, 2025
@bigPYJ1151 bigPYJ1151 merged commit 98caead into vllm-project:main Nov 25, 2025
55 checks passed