Replies: 3 comments 8 replies
@avtc Interesting work! Let me know how it works out. Some points:
@Qubitium Full dataset passed to each expert, damp 0.01. S1 with Kilo, Code mode: "Write super mario bros clone using html, css, js". Compared with the routed-dataset quant, damp=0.025. Sampling params: t=1.0, min_p=0, top_p=0.95, top_k=40.
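For reference, those sampling settings map onto a config like the one below. This is just a sketch assuming a vLLM backend, which the thread does not actually state; the model path is hypothetical:

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=1.0,  # t=1.0
    min_p=0.0,        # minp=0
    top_p=0.95,       # topp=0.95
    top_k=40,         # k=40
)

# llm = LLM(model="path/to/the-quantized-checkpoint")  # hypothetical path
# outputs = llm.generate(["Write super mario bros clone using html,css,js"], sampling)
```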
@avtc Based on your test I think we should definitely add this as an option for gpt-qmodel users, with MoE now more prevalent than ever and MoE routing being a thorn in quantization. Conceptually, it should not work as well as it does, but reality trumps theory. =) Do you want to work up a PR?
@Qubitium Hi!
I have been thinking about how to improve the quantization of MoE models, since the calibration dataset size is limited by VRAM (usually of a single GPU, which also has to hold the modules). I have several ideas, and I tried to implement one of them with the help of Antigravity/gemini 3 pro/claude 4.5.
The idea is to pass the whole dataset (and subsequent inputs) to each expert, not only the tokens routed to it.
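Conceptually, it amounts to something like the sketch below. This is illustrative only, not the code from the commit; `moe_block.experts` and `add_batch` are stand-in names, not GPTQModel API:

```python
import torch

@torch.no_grad()
def calibrate_moe_layer(moe_block, hidden_states, add_batch):
    """hidden_states: [num_tokens, hidden_dim] for the whole calibration set.

    add_batch(expert_idx, inputs) is a stand-in for whatever records the GPTQ
    Hessian statistics for that expert's linear modules.
    """
    for expert_idx, expert in enumerate(moe_block.experts):
        # Routed-only calibration would first mask hidden_states down to the
        # tokens the gate assigned to this expert; here every expert sees the
        # full token set, so each one gets the same amount of calibration data.
        add_batch(expert_idx, hidden_states)
        _ = expert(hidden_states)  # forward hooks can also capture inputs here
```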
I have squashed the implementation into a single commit for easier review, but since it is based not on recent code but on the pre-data-parallel branch, and since I lack the knowledge to verify the correctness of the behavior, I am not ready to propose it as a PR yet.
avtc@9d96430
I am still testing it and will share results later.
What I see is that each expert receives more calibration data (just like a non-MoE module), the forward pass becomes 2.5-3 times longer, and the loss becomes ~10-100 times lower. These numbers are from the first layer of Minimax-M2, with 1536 samples and 0.05 damp.
Can you suggest the best damp setting to use with this approach?
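For context on what damp controls: in the standard GPTQ recipe it scales the mean of the Hessian diagonal and is added back onto the diagonal to keep the inversion stable, roughly as in the sketch below (GPTQModel's exact internals may differ):

```python
import torch

def dampen_hessian(H: torch.Tensor, damp: float = 0.01) -> torch.Tensor:
    # Add damp * mean(diag(H)) to the diagonal so the Hessian stays
    # well-conditioned for the Cholesky/inverse step of GPTQ.
    H = H.clone()
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp * torch.mean(torch.diag(H))
    return H
```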