Thank you for your work.
May I ask what the differences are between the open-source code on GitHub and the version described in the paper?
I tested DeepSeek-V2-Lite-Chat on BIG-bench, and the per-token latency is around 280 ms, whereas the paper reports 155 ms. I'd like to know where the difference comes from.
The test command is:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path /app/data/DeepSeek-V2-Lite-Chat --offload_dir /root/moe-infinity --device_memory_ratio 0.75 --out_len 32
```
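For clarity, by per-token latency I mean total generation wall time divided by the number of newly generated tokens, roughly as in the minimal sketch below. This assumes a HuggingFace-style `generate()` call; `per_token_latency_ms` is an illustrative helper I wrote for this issue, not part of `interface_example.py`:

```python
# Minimal timing sketch (illustrative; not taken from interface_example.py).
# Assumes a HuggingFace-style model.generate and a CUDA device.
import time
import torch

def per_token_latency_ms(generate_fn, input_ids, max_new_tokens=32):
    """Average wall-clock decode time per generated token, in ms."""
    torch.cuda.synchronize()          # make sure prior GPU work has finished
    start = time.perf_counter()
    output_ids = generate_fn(input_ids=input_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()          # wait for generation to complete
    elapsed = time.perf_counter() - start
    new_tokens = output_ids.shape[1] - input_ids.shape[1]
    return elapsed / new_tokens * 1000.0

# Usage:
# latency = per_token_latency_ms(model.generate, inputs["input_ids"], max_new_tokens=32)
```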
Many thanks!