Replies: 3 comments 8 replies
@avtc Interesting work! Let me know how it works out. Some points:
@Qubitium Full dataset passed to each expert, damp 0.01. S1 with Kilo, Code mode: "Write super mario bros clone using html, css, js". Compared with the routed-dataset quant, damp=0.025. Sampling params: t=1.0, min_p=0, top_p=0.95, top_k=40.
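For reference, those sampling settings map onto a config like the one below. This is just a sketch assuming a vLLM backend, which the thread does not actually state; the model path is hypothetical:

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=1.0,  # t=1.0
    min_p=0.0,        # minp=0
    top_p=0.95,       # topp=0.95
    top_k=40,         # k=40
)

# llm = LLM(model="path/to/the-quantized-checkpoint")  # hypothetical path
# outputs = llm.generate(["Write super mario bros clone using html,css,js"], sampling)
```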
@avtc Based on your test I think we should definitely add this as an option for gpt-qmodel users, with MoE now more prevalent than ever and MoE routing being a thorn in quantization. Conceptually, it should not work as well as it does, but reality trumps theory. =) Do you want to work up a PR?
@Qubitium Hi!
I have been thinking about how to improve the quantization of MoE models, since the calibration dataset size is limited by VRAM (usually of a single GPU, which also has to hold the modules). I have several ideas, and I tried to implement one of them with the help of Antigravity/gemini 3 pro/claude 4.5.
The idea is to pass the whole dataset (and subsequent inputs) to each expert, not only the tokens routed to it.
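Conceptually, it amounts to something like the sketch below. This is illustrative only, not the code from the commit; `moe_block.experts` and `add_batch` are stand-in names, not GPTQModel API:

```python
import torch

@torch.no_grad()
def calibrate_moe_layer(moe_block, hidden_states, add_batch):
    """hidden_states: [num_tokens, hidden_dim] for the whole calibration set.

    add_batch(expert_idx, inputs) is a stand-in for whatever records the GPTQ
    Hessian statistics for that expert's linear modules.
    """
    for expert_idx, expert in enumerate(moe_block.experts):
        # Routed-only calibration would first mask hidden_states down to the
        # tokens the gate assigned to this expert; here every expert sees the
        # full token set, so each one gets the same amount of calibration data.
        add_batch(expert_idx, hidden_states)
        _ = expert(hidden_states)  # forward hooks can also capture inputs here
```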
I have squashed the implementation into a single commit for easier review, but since it is based not on recent code but on the pre-data-parallel branch, and since I lack the knowledge to verify the correctness of the behavior, I am not ready to propose it as a PR yet.
avtc@9d96430
I am still testing it and will share results later.
What I see is that each expert receives more calibration data (just like a non-MoE module), the forward pass becomes 2.5-3 times longer, and the loss becomes ~10-100 times lower. These numbers are from the first layer of Minimax-M2, with 1536 samples and 0.05 damp.
Can you suggest the best damp setting to use with this approach?
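For context on what damp controls: in the standard GPTQ recipe it scales the mean of the Hessian diagonal and is added back onto the diagonal to keep the inversion stable, roughly as in the sketch below (GPTQModel's exact internals may differ):

```python
import torch

def dampen_hessian(H: torch.Tensor, damp: float = 0.01) -> torch.Tensor:
    # Add damp * mean(diag(H)) to the diagonal so the Hessian stays
    # well-conditioned for the Cholesky/inverse step of GPTQ.
    H = H.clone()
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp * torch.mean(torch.diag(H))
    return H
```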