Commit da1644d
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights
The current implementation of `swigluoai_and_mul` for CPU assumes that the gate-up weights have been de-interleaved at load time, which is not the case. The new implementation we dispatch to is the same one used for the BF16 path on GPU and handles interleaved gate-up weights.

Signed-off-by: Fadi Arafeh <[email protected]>
1 parent 6fb0215 commit da1644d
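To illustrate the layout mismatch the commit message describes, here is a minimal sketch (tensor and variable names are illustrative, not taken from the patch): the removed helper split the fused gate-up activation into two contiguous halves, while the checkpoint actually stores the gate and up projections interleaved along the last dimension, assuming they alternate element by element.

import torch

# Illustrative fused gate-up activation for one token, intermediate size 4.
# De-interleaved layout (what the removed swigluoai_and_mul assumed):
#   [g0, g1, g2, g3, u0, u1, u2, u3]
# Interleaved layout (what the checkpoint actually provides, per the commit message):
#   [g0, u0, g1, u1, g2, u2, g3, u3]
x = torch.randn(8)

# Old CPU helper: split into contiguous halves.
d = x.shape[-1] // 2
gate_halved, up_halved = x[..., :d], x[..., d:]

# Interleaved-aware split (assuming gate and up alternate along the last
# dimension, as the BF16 GPU path interprets the same tensor).
gate_strided, up_strided = x[..., ::2], x[..., 1::2]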

1 file changed: 2 additions, 13 deletions

vllm/model_executor/layers/fused_moe/cpu_fused_moe.py
@@ -6,24 +6,13 @@
 from torch.nn import functional as F
 
 from vllm import _custom_ops as ops
+from vllm.model_executor.layers.activation import SwigluOAIAndMul
 
 
 def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
     d = x.shape[-1] // 2
     return F.silu(x[..., :d]) * x[..., d:]
 
-
-def swigluoai_and_mul(
-    x: torch.Tensor, alpha: float = 1.702, limit: float = 7.0
-) -> torch.Tensor:
-    d = x.shape[-1] // 2
-    gate, up = x[..., :d], x[..., d:]
-    gate = gate.clamp(max=limit)
-    up = up.clamp(min=-limit, max=limit)
-    glu = gate * torch.sigmoid(alpha * gate)
-    return (up + 1) * glu
-
-
 def grouped_topk(
     hidden_states: torch.Tensor,
     gating_output: torch.Tensor,
@@ -284,7 +273,7 @@ def __call__(
 
         gate_up = layer.gate_up_linear[i](tokens_for_this_expert)
         if activation == "swigluoai":
-            gate_up = swigluoai_and_mul(gate_up)
+            gate_up = SwigluOAIAndMul().forward_native(gate_up)
         else:
             gate_up = silu_and_mul(gate_up)
         expert_out = layer.down_linear[i](gate_up)
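For reference, a rough sketch of an interleaved-aware SwiGLU-OAI activation like the one the CPU path now dispatches to. This only approximates what `SwigluOAIAndMul.forward_native` in vllm/model_executor/layers/activation.py does, by combining the removed helper's clamp/sigmoid math with a strided gate/up split; it is not a verbatim copy of that method.

import torch


def swigluoai_interleaved(
    x: torch.Tensor, alpha: float = 1.702, limit: float = 7.0
) -> torch.Tensor:
    # Gate and up projections are assumed to alternate along the last
    # dimension (g0, u0, g1, u1, ...), so split by stride instead of halving.
    gate, up = x[..., ::2], x[..., 1::2]
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    glu = gate * torch.sigmoid(alpha * gate)
    return (up + 1) * glu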
