Description
Prerequisites
- I have searched existing issues and reviewed documentation.
Problem Description
I want to measure the throughput of MoE-Infinity with DeepSeek-V2-Lite-Chat on an RTX 4080 Super (16 GB); the code I used is shown below. However, the average throughput is about 2.935 tokens/s, which is slower than llama.cpp (in my test its decode throughput is 3.99 tokens/s). Is something wrong with my test?
I have read your paper, where MoE-Infinity is reported to be much faster, so why do I get a slower result?
Note that device_memory_ratio is set to 0.7 because I encounter a CUDA error if I use a value greater than 0.7.
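For context, here is a minimal sketch (assuming torch.cuda.mem_get_info is available in this PyTorch build) of the check I use to see how much device memory headroom is left when picking device_memory_ratio:

import torch

# Minimal sketch: report free/total device memory so the
# device_memory_ratio value can be sanity-checked before loading.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")
print(f"free/total ratio: {free_bytes / total_bytes:.2f}")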
Proposed Solution
Here is my inference code of MoE-infinity:
import torch
import time
import os
from transformers import AutoTokenizer, TextStreamer
from moe_infinity import MoE

os.environ['CUDA_VISIBLE_DEVICES'] = '2'

user_home = os.path.expanduser('~')
checkpoint = "/share-data/wzk-1/model/deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Chat")
tokenizer.pad_token = tokenizer.eos_token

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.7,
}
model = MoE(checkpoint, config)
streamer = TextStreamer(tokenizer)

input_texts = [
    "Tell me a story begin with: Once upon a time",
    "Give me an introduction of Bitcoin",
    "Translate 'I love you' into at least 10 languages",
    "write a C++ program of QuickSort",
]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to("cuda:0")
attention_mask = inputs["attention_mask"].to("cuda:0")

total_time = 0
total_tokens = 0
for i in range(len(input_texts)):
    # Time one generate call per prompt.
    start_time = time.time()
    output_ids = model.generate(
        input_ids=input_ids[i].unsqueeze(0),
        attention_mask=attention_mask[i].unsqueeze(0),
        streamer=streamer,
        max_new_tokens=256,
    )
    end_time = time.time()

    elapsed_time = end_time - start_time
    total_time += elapsed_time
    # Count only newly generated tokens (output contains the prompt as well).
    generated_tokens = len(output_ids[0]) - len(input_ids[i])
    total_tokens += generated_tokens
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    decode_throughput = generated_tokens / elapsed_time
    # print(f"Output {i+1}: {output_text}")
    print(f"generated {generated_tokens} using {elapsed_time:.3f} seconds, decode throughput is {decode_throughput:.3f} token/s")
    print("-" * 60)

throughput = total_tokens / total_time
print(f"Total time: {total_time:.3f} seconds")
print(f"Total tokens generated: {total_tokens}")
print(f"Throughput: {throughput:.3f} tokens/second")
Alternatives Considered
No response
Additional Context
Here is one of my outputs using MoE-Infinity. I have tried many inputs, but the decode throughput stays around 2.746 tokens/s.
Translate 'I love you' into at least 10 languages
1. Spanish: Te amo
2. French: Je t'aime
3. German: Ich liebe dich
4. Italian: Ti amo
5. Portuguese: Eu te amo
6. Russian: Я тебя люблю (Ya tebya lyublyu)
7. Chinese (Simplified): 我爱你 (Wǒ ài nǐ)
8. Japanese: 愛してる (Aishiteru)
9. Hindi: मैं तुमसे प्यार करता हूँ (Main tumse pyar karta hoon)
10. Arabic: أحبك (Uhibbuka)<|end▁of▁sentence|>
generated 163 using 55.541 seconds, decode throughput is 2.935 token/s
Here is the output of llama.cpp, which uses '.gguf' files converted from the original model files.
llama_perf_sampler_print: sampling time = 26.48 ms / 349 runs ( 0.08 ms per token, 13182.25 tokens per second)
llama_perf_context_print: load time = 5984.14 ms
llama_perf_context_print: prompt eval time = 2251.78 ms / 39 tokens ( 57.74 ms per token, 17.32 tokens per second)
llama_perf_context_print: eval time = 122985.89 ms / 491 runs ( 250.48 ms per token, 3.99 tokens per second)
llama_perf_context_print: total time = 144492.51 ms / 530 tokens
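Note that llama.cpp reports prompt eval and eval (decode) separately, while my loop folds prefill time into a single number. A minimal sketch of estimating a decode-only figure for one prompt, assuming a max_new_tokens=1 call is a reasonable proxy for the prefill cost:

# Minimal sketch (assumption: a 1-token generate approximates prefill),
# to get a decode-only number closer to llama.cpp's "eval time".
start = time.time()
model.generate(input_ids=input_ids[0].unsqueeze(0),
               attention_mask=attention_mask[0].unsqueeze(0),
               max_new_tokens=1)
prefill_time = time.time() - start

start = time.time()
out = model.generate(input_ids=input_ids[0].unsqueeze(0),
                     attention_mask=attention_mask[0].unsqueeze(0),
                     max_new_tokens=256)
full_time = time.time() - start

new_tokens = len(out[0]) - len(input_ids[0])
print(f"decode-only throughput: {new_tokens / (full_time - prefill_time):.3f} tokens/s")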
Importance
Nice to have
Usage Statistics (Optional)
No response