Description
Prerequisites
- I have searched existing issues and reviewed documentation.
Problem Description
I want to measure the throughput of MoE-Infinity with DeepSeek-V2-Lite-Chat on an RTX 4080 Super (16 GB); the code I used is shown below. However, the average throughput is about 2.935 tokens/s, which is slower than llama.cpp (in my test its decode throughput is 3.99 tokens/s). Is something wrong with my test?
I have read your paper, where MoE-Infinity is reported to be much faster, so why do I get a slower result?
Note that device_memory_ratio is set to 0.7 because I encounter a CUDA error if I use a value greater than 0.7.
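For context, here is a minimal sketch (assuming torch.cuda.mem_get_info is available in this PyTorch build) of the check I use to see how much device memory headroom is left when picking device_memory_ratio:

import torch

# Minimal sketch: report free/total device memory so the
# device_memory_ratio value can be sanity-checked before loading.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")
print(f"free/total ratio: {free_bytes / total_bytes:.2f}")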
Proposed Solution
Here is my inference code of MoE-infinity:
import torch
import time
import os
from transformers import AutoTokenizer, TextStreamer
from moe_infinity import MoE

os.environ['CUDA_VISIBLE_DEVICES'] = '2'

user_home = os.path.expanduser('~')
checkpoint = "/share-data/wzk-1/model/deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Chat")
tokenizer.pad_token = tokenizer.eos_token

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.7,
}
model = MoE(checkpoint, config)
streamer = TextStreamer(tokenizer)

input_texts = [
    "Tell me a story begin with: Once upon a time",
    "Give me an introduction of Bitcoin",
    "Translate 'I love you' into at least 10 languages",
    "write a C++ program of QuickSort",
]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to("cuda:0")
attention_mask = inputs["attention_mask"].to("cuda:0")

total_time = 0
total_tokens = 0
for i in range(len(input_texts)):
    # Time one generate call per prompt.
    start_time = time.time()
    output_ids = model.generate(
        input_ids=input_ids[i].unsqueeze(0),
        attention_mask=attention_mask[i].unsqueeze(0),
        streamer=streamer,
        max_new_tokens=256,
    )
    end_time = time.time()

    elapsed_time = end_time - start_time
    total_time += elapsed_time
    # Count only newly generated tokens (output contains the prompt as well).
    generated_tokens = len(output_ids[0]) - len(input_ids[i])
    total_tokens += generated_tokens
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    decode_throughput = generated_tokens / elapsed_time
    # print(f"Output {i+1}: {output_text}")
    print(f"generated {generated_tokens} using {elapsed_time:.3f} seconds, decode throughput is {decode_throughput:.3f} token/s")
    print("-" * 60)

throughput = total_tokens / total_time
print(f"Total time: {total_time:.3f} seconds")
print(f"Total tokens generated: {total_tokens}")
print(f"Throughput: {throughput:.3f} tokens/second")
Alternatives Considered
No response
Additional Context
Here is one of my outputs using MoE-Infinity. I have tried many inputs, but the decode throughput stays around 2.746 tokens/s.
Translate 'I love you' into at least 10 languages
1. Spanish: Te amo
2. French: Je t'aime
3. German: Ich liebe dich
4. Italian: Ti amo
5. Portuguese: Eu te amo
6. Russian: Я тебя люблю (Ya tebya lyublyu)
7. Chinese (Simplified): 我爱你 (Wǒ ài nǐ)
8. Japanese: 愛してる (Aishiteru)
9. Hindi: मैं तुमसे प्यार करता हूँ (Main tumse pyar karta hoon)
10. Arabic: أحبك (Uhibbuka)<|end▁of▁sentence|>
generated 163 using 55.541 seconds, decode throughput is 2.935 token/s
Here is the output of llama.cpp, which uses '.gguf' files converted from the original model files.
llama_perf_sampler_print: sampling time = 26.48 ms / 349 runs ( 0.08 ms per token, 13182.25 tokens per second)
llama_perf_context_print: load time = 5984.14 ms
llama_perf_context_print: prompt eval time = 2251.78 ms / 39 tokens ( 57.74 ms per token, 17.32 tokens per second)
llama_perf_context_print: eval time = 122985.89 ms / 491 runs ( 250.48 ms per token, 3.99 tokens per second)
llama_perf_context_print: total time = 144492.51 ms / 530 tokens
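Note that llama.cpp reports prompt eval and eval (decode) separately, while my loop folds prefill time into a single number. A minimal sketch of estimating a decode-only figure for one prompt, assuming a max_new_tokens=1 call is a reasonable proxy for the prefill cost:

# Minimal sketch (assumption: a 1-token generate approximates prefill),
# to get a decode-only number closer to llama.cpp's "eval time".
start = time.time()
model.generate(input_ids=input_ids[0].unsqueeze(0),
               attention_mask=attention_mask[0].unsqueeze(0),
               max_new_tokens=1)
prefill_time = time.time() - start

start = time.time()
out = model.generate(input_ids=input_ids[0].unsqueeze(0),
                     attention_mask=attention_mask[0].unsqueeze(0),
                     max_new_tokens=256)
full_time = time.time() - start

new_tokens = len(out[0]) - len(input_ids[0])
print(f"decode-only throughput: {new_tokens / (full_time - prefill_time):.3f} tokens/s")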
Importance
Nice to have
Usage Statistics (Optional)
No response