Add MLP support for qwen3vl series and a little refactor #957
base: main
Conversation
Signed-off-by: Tcc0403 <[email protected]>
I commented on #897 about a regression in FLCE affecting training speed. I applied this patch locally to change the CUDA syncing and make FLCE FSDP2 shard-aware, and it seems to fix the regression and squeeze out a little more t/s. It may be worth pulling the thread on this further: f22ce38
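For context, here is a minimal sketch of the two ideas mentioned above, assuming FLCE refers to the fused linear cross entropy path: FSDP2 wraps parameters in `DTensor`, so a shard-aware implementation may need to unwrap them, and implicit CUDA syncs typically come from host-side scalar reads. This is not the contents of commit f22ce38; the helper names below are hypothetical.

```python
# Hypothetical sketch, not the actual patch in f22ce38: unwrap FSDP2-sharded
# weights and avoid an implicit CUDA sync when accumulating the loss.
import torch

try:
    # Public location in recent PyTorch releases.
    from torch.distributed.tensor import DTensor
except ImportError:
    # Older releases expose the same class under a private path.
    from torch.distributed._tensor import DTensor


def unwrap_fsdp2_weight(weight: torch.Tensor) -> torch.Tensor:
    """Return the local shard if `weight` is an FSDP2 DTensor, else pass through."""
    if isinstance(weight, DTensor):
        return weight.to_local()
    return weight


def accumulate_loss(total: torch.Tensor, chunk_loss: torch.Tensor) -> torch.Tensor:
    # Keeping the accumulator on-device avoids the host<->device sync that
    # something like `total += chunk_loss.item()` would force on every chunk.
    return total + chunk_loss.detach()
```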
Thank you, that's an interesting fix! We are planning our 2026 Q1 roadmap, which includes FSDP2 (multi-GPU) aware testing, optimizations, and so on. Feel free to open a PR so we can discuss how to integrate your work in line with our roadmap! Regarding …
Summary
This PR aims to fix #956 and includes some refactoring.
Note that MoE layers aren't patched, since a major change is coming in transformers v5; see #958.
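As a rough illustration of what patching the MLP typically means in Liger-Kernel, below is a hedged sketch of the usual monkey-patch pattern: swapping the Hugging Face SwiGLU MLP class for Liger's fused `LigerSwiGLUMLP`. The `qwen3_vl` module path and `Qwen3VLTextMLP` class name are assumptions for illustration, not necessarily the exact identifiers in this PR.

```python
# Hedged sketch of the usual Liger-Kernel patching pattern for dense MLPs.
# The Qwen3-VL module/class names are assumptions, not taken from this PR.
from liger_kernel.transformers.swiglu import LigerSwiGLUMLP


def apply_liger_mlp_to_qwen3_vl():
    # Assumed transformers module path for the Qwen3-VL text model.
    from transformers.models.qwen3_vl import modeling_qwen3_vl

    # Replace the stock SwiGLU MLP with Liger's fused implementation so that
    # newly constructed models pick up the fused kernel.
    modeling_qwen3_vl.Qwen3VLTextMLP = LigerSwiGLUMLP  # assumed class name
```

Patching the class attribute on the modeling module, rather than individual model instances, is the pattern Liger-Kernel generally uses so the swap applies before model construction.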
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence