
Conversation

Collaborator

@Tcc0403 Tcc0403 commented Nov 28, 2025

Summary

This PR aims to fix #956, plus some refactoring, including:

  • adds SwiGLU and GeGLU MLP patching for the Qwen3-VL series: SwiGLU for the text model, GeGLU for the vision model
    • geglu defaults to False because its output is not numerically close to torch's implementation, which causes the convergence test to fail.
  • switches qk-norm patching to LigerRMSNorm(row_mode=True)
  • adds LayerNorm patching for the vision model

Note that moe layers aren't patched since there will be a major change in transformers v5, see #958.
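For readers unfamiliar with the two gated MLP variants being patched, the underlying math can be sketched in plain Python (illustrative only; Liger-Kernel's actual implementations are fused Triton kernels operating on tensors, not scalar functions):

```python
import math

# SwiGLU gates the up-projection with SiLU (x * sigmoid(x)); GeGLU gates it
# with GELU (exact erf form). These scalar helpers are for illustration only.
def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swiglu(gate: float, up: float) -> float:
    return silu(gate) * up

def geglu(gate: float, up: float) -> float:
    return gelu(gate) * up
```

The convergence-test failure mentioned above is plausible with GELU because implementations differ (exact erf vs. tanh approximation), so small numerical gaps accumulate over training steps.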

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@Tcc0403 Tcc0403 changed the title Add mlp support for qwen3vl and little refactor Add mlp support for qwen3vl series and little refactor Nov 28, 2025
@thad0ctor

thad0ctor commented Dec 2, 2025


I commented on #897 about a regression in FLCE training speed. I applied this patch locally to change the CUDA syncing and make FLCE FSDP2 shard-aware, and it seems to fix the regression and squeeze out a little more t/s. It may be worth pulling this thread further: f22ce38

@Tcc0403
Collaborator Author

Tcc0403 commented Dec 2, 2025

I commented on #897 about a regression in FLCE training speed. I applied this patch locally to change the CUDA syncing and make FLCE FSDP2 shard-aware, and it seems to fix the regression and squeeze out a little more t/s. It may be worth pulling this thread further: f22ce38

Thank you, that's an interesting fix! We are planning the 2026 Q1 roadmap, which includes FSDP2 (multi-GPU) aware testing, optimizations, and more. Feel free to open a PR so we can discuss how to integrate your work in line with our roadmap!

Regarding the .item() optimization, have you tried reading the value directly as a tensor inside the Triton kernel instead of converting it to a Python value as we currently do? I wonder how that approach compares with yours.
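The two patterns being compared can be sketched with a small torch example (function names are hypothetical, not Liger-Kernel's API; on CPU both paths give identical results, and the difference only matters on CUDA, where .item() forces a device-to-host copy and a synchronization point):

```python
import torch

# Pattern A: materialize the scalar on the host. On a CUDA tensor, .item()
# triggers a device-to-host transfer and blocks until the GPU catches up.
def scale_with_item(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    return x * s.item()  # host round-trip

# Pattern B: keep the scalar as a 0-dim tensor. A Triton kernel could read it
# with a tl.load on the scalar's pointer, avoiding the sync entirely.
def scale_with_tensor(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    return x * s  # value stays on device

x = torch.arange(4.0)
s = torch.tensor(2.0)
assert torch.equal(scale_with_item(x, s), scale_with_tensor(x, s))
```

Passing the 0-dim tensor's pointer into the kernel trades a guaranteed host sync for one extra global-memory load per program instance, which is usually the better deal.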



Successfully merging this pull request may close these issues:

[Feature Request] Add SwiGLU support for Qwen3-VL models