
[Feature Request] How to measure the generation throughput (token/s)? #57


Description

@wuooo339

Prerequisites

  • I have searched existing issues and reviewed documentation.

Problem Description

I want to measure the throughput of DeepSeek-V2-Lite-Chat running on MoE-Infinity with an RTX 4080 Super (16 GB); the code I used is below. The average throughput is about 2.935 tokens/s, which is slower than llama.cpp (in my test its decode throughput is 3.99 tokens/s). Is something wrong with my test?
I have read your paper, where MoE-Infinity is much faster, so why did I get a slower result?
Note that device_memory_ratio is 0.7 because I encounter a CUDA error with any value above 0.7.

Proposed Solution

Here is my inference code for MoE-Infinity:

import torch
import time
import os
from transformers import AutoTokenizer, TextStreamer
from moe_infinity import MoE
os.environ['CUDA_VISIBLE_DEVICES'] = '2'  # expose only physical GPU 2; it appears as cuda:0
user_home = os.path.expanduser('~')
checkpoint = "/share-data/wzk-1/model/deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Chat")
tokenizer.pad_token = tokenizer.eos_token
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.7,  
}
model = MoE(checkpoint, config)
streamer = TextStreamer(tokenizer)


input_texts = [
    "Tell me a story begin with: Once upon a time",
    "Give me an introduction of Bitcon",
    "Translate 'I love you' into at least 10 languages",
    "write a C++ program of QuickSort"
]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to("cuda:0")
attention_mask = inputs["attention_mask"].to("cuda:0")

total_time = 0
total_tokens = 0
# time each prompt end-to-end; note that elapsed_time includes prefill, not only decode
for i in range(len(input_texts)):
    start_time = time.time()
    output_ids = model.generate(
        input_ids=input_ids[i].unsqueeze(0),
        attention_mask=attention_mask[i].unsqueeze(0),
        streamer=streamer,
        max_new_tokens=256
    )
    end_time = time.time()
    elapsed_time = end_time - start_time
    total_time += elapsed_time
    generated_tokens = len(output_ids[0]) - len(input_ids[i])  # new tokens only; prompts are padded to equal length
    total_tokens += generated_tokens
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    decode_throughput = generated_tokens / elapsed_time
    # print(f"Output {i+1}: {output_text}")
    print(f"generated {generated_tokens} using {elapsed_time:.3f} seconds, decode throughput is {decode_throughput:.3f} token/s")
    print("-" * 60)
throughput = total_tokens / total_time
print(f"Total time: {total_time:.3f} seconds")
print(f"Total tokens generated: {total_tokens}")
print(f"Throughput: {throughput:.3f} tokens/second")

Alternatives Considered

No response

Additional Context

Here is one of my outputs using MoE-Infinity. I have tried many inputs, but the decode throughput stays around 2.746 tokens/s.

Translate 'I love you' into at least 10 languages
1. Spanish: Te amo
2. French: Je t'aime
3. German: Ich liebe dich
4. Italian: Ti amo
5. Portuguese: Eu te amo
6. Russian: Я тебя люблю (Ya tebya lyublyu)
7. Chinese (Simplified): 我爱你 (Wǒ ài nǐ)
8. Japanese: 愛してる (Aishiteru)
9. Hindi: मैं तुमसे प्यार करता हूँ (Main tumse pyar karta hoon)
10. Arabic: أحبك (Uhibbuka)<|end▁of▁sentence|>
generated 163 using 55.541 seconds, decode throughput is 2.935 token/s

Here is the output of llama.cpp, which uses '.gguf' files converted from the original model files.

llama_perf_sampler_print:    sampling time =      26.48 ms /   349 runs   (    0.08 ms per token, 13182.25 tokens per second)
llama_perf_context_print:        load time =    5984.14 ms
llama_perf_context_print: prompt eval time =    2251.78 ms /    39 tokens (   57.74 ms per token,    17.32 tokens per second)
llama_perf_context_print:        eval time =  122985.89 ms /   491 runs   (  250.48 ms per token,     3.99 tokens per second)
llama_perf_context_print:       total time =  144492.51 ms /   530 tokens
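
For what it's worth, the 3.99 tokens/s above is llama.cpp's decode-only rate ("eval time"), while the Python loop measures end-to-end wall time. Recomputing an end-to-end rate from the same log makes the two numbers more comparable (figures copied from the lines above; a rough sanity check only, since llama.cpp's "total time" also covers prompt eval and sampling):

eval_tokens, eval_ms = 491, 122985.89    # decode ("eval time") only
total_ms = 144492.51                     # prompt eval + decode + sampling

decode_tps = eval_tokens / (eval_ms / 1000)        # ~3.99 tokens/s
end_to_end_tps = eval_tokens / (total_ms / 1000)   # ~3.40 tokens/s, closer to what
                                                   # the MoE-Infinity loop reports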

Importance

Nice to have

Usage Statistics (Optional)

No response
