
Conversation


@PatchouliTIS PatchouliTIS commented Nov 21, 2025

Purpose

This PR builds on PR #24799; it implements a GPU version of ngram speculative decoding and makes it compatible with the async scheduler.

Test Plan

  • Async Scheduler + NGram + Qwen3-1.7B
    Test config:
# dataset is CMU-DoG, which is an input-grounded dataset.
python3.12 -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--max-num-seqs 128 \
--max-model-len 2048 \
--model Qwen/Qwen3-1.7B \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype bfloat16  \
--enable-chunked-prefill \
--disable-log-requests \
--async-scheduling \
--speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'

Test Device: NVIDIA H20

Test Result

Performance

| num_prompts | async_ngram (tok/s) | sync_ngram (tok/s) | speedup |
| --- | --- | --- | --- |
| 2 | 466 | 357 | 30.5% |
| 8 | 1378 | 988 | 39.4% |
| 16 | 2082 | 1726 | 20.6% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

hl475 and others added 4 commits November 24, 2025 10:58
…rs (vllm-project#29111)

Signed-off-by: Huamin Li <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>

@ZJY0516 ZJY0516 (Contributor) commented Nov 27, 2025

cc @njhill

# Honors opt-outs such as CompilationMode.NONE or VLLM_DISABLE_COMPILE_CACHE.
disable_cache = not is_compile_cache_enabled(self.inductor_config)

# TODO(patchy): ngram gpu kernel will cause vllm torch compile cache errors.
Collaborator

Why? Can this be fixed?

Author

When I enabled torch.compile for the ngram GPU kernel, the computational graph of the ngram operator would hit a precompiled graph cached for the main model, leading to mismatched compilation results. Therefore I disabled the compile cache here. I tested this locally, and disabling the cache had no impact on performance.
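For readers following the thread, here is a minimal, self-contained sketch of the workaround being described. The helper name and arguments are illustrative; in the PR the same decision feeds the existing `disable_cache` computation shown in the code anchor above.

```python
def should_disable_compile_cache(cache_enabled: bool, using_ngram_gpu: bool) -> bool:
    # Honor the existing opt-outs (CompilationMode.NONE, VLLM_DISABLE_COMPILE_CACHE).
    disable_cache = not cache_enabled
    # Workaround discussed above: the ngram GPU operator's compiled graph can
    # collide with a graph cached for the main model, so force the cache off.
    if using_ngram_gpu:
        disable_cache = True
    return disable_cache
```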

Collaborator

I assume disabling the compile cache would lead to longer startup time? I'm not an expert here but maybe it's possible to add an identifier to the compile cache to avoid extraneous cache hits?

Author

Yes, the startup time increases a little. I tried adding extra input parameters and other member variables to the nn.Module's forward method decorated with @support_torch_compile to achieve cache isolation, but none of them worked; I suspect this is related to the internal implementation of @support_torch_compile in vLLM. As things stand, though, disabling the torch.compile cache only affects performance during the initial startup phase of the inference service.

PatchouliTaisa added 6 commits December 2, 2025 15:49
Signed-off-by: PatchouliTaisa <[email protected]>
@PatchouliTIS PatchouliTIS left a comment

Changed the code according to the review comments.

for i, num_tokens in enumerate(num_accepted_tokens):
    self.input_batch.num_accepted_tokens_cpu[i] = num_tokens

def _update_ngram_gpu_tensors(self, scheduler_output: "SchedulerOutput") -> None:
Author

This separate processing path for the ngram GPU input avoids a full copy on every step. It performs incremental updates to the GPU buffer based on the previous prev_req_id_to_index and the current self.input_batch.req_id_to_index, which avoids extensive copy operations.
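A rough, self-contained sketch of that incremental-update idea (the tensor, mapping, and argument names below are illustrative, not the PR's actual code):

```python
import torch

def update_ngram_gpu_buffer(
    token_ids_gpu: torch.Tensor,               # (max_num_seqs, max_model_len) persistent GPU buffer
    prev_req_id_to_index: dict[str, int],      # request -> row mapping from the previous step
    req_id_to_index: dict[str, int],           # request -> row mapping for the current step
    new_request_tokens: dict[str, list[int]],  # token ids only for requests not seen before
) -> None:
    """Only rows for new or moved requests are written; requests that kept
    their row are left untouched, so there is no full copy every step."""
    # Snapshot so rows that move do not overwrite each other mid-loop; a real
    # implementation could avoid this full device-side copy (e.g. index_select).
    old_rows = token_ids_gpu.clone()
    for req_id, new_idx in req_id_to_index.items():
        old_idx = prev_req_id_to_index.get(req_id)
        if old_idx is None:
            # New request: one host-to-device copy of its token ids.
            tokens = torch.tensor(new_request_tokens[req_id], dtype=token_ids_gpu.dtype)
            token_ids_gpu[new_idx, : tokens.numel()] = tokens.to(
                token_ids_gpu.device, non_blocking=True
            )
        elif old_idx != new_idx:
            # Existing request that changed rows: device-side move, no H2D copy.
            token_ids_gpu[new_idx] = old_rows[old_idx]
```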


pin_memory=False,
)
self.token_ids_cpu = self.token_ids_cpu_tensor.numpy()
self.token_ids_gpu_tensor = torch.zeros(
Author

Storing token IDs per request with more sophisticated logic would require significant changes, including to how ngram_gpu consumes its token ID input. For this PR I propose moving this buffer allocation into gpu_model_runner, so the buffer is only allocated when both the async scheduler and ngram_gpu are enabled. Users can also reduce the buffer's VRAM footprint by lowering max_num_seqs and max_model_len. Further optimization can be addressed in a separate PR.
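A minimal sketch of the proposed conditional allocation (function and flag names are assumed for illustration, and the int32 dtype is an assumption, not taken from the diff):

```python
import torch

def maybe_allocate_ngram_gpu_buffer(
    async_scheduling: bool,
    ngram_gpu_enabled: bool,
    max_num_seqs: int,
    max_model_len: int,
    device: torch.device,
) -> torch.Tensor | None:
    """Allocate the (max_num_seqs, max_model_len) token-id buffer only when
    both the async scheduler and the ngram GPU drafter are in use, so other
    configurations pay no extra VRAM cost."""
    if not (async_scheduling and ngram_gpu_enabled):
        return None
    return torch.zeros((max_num_seqs, max_model_len), dtype=torch.int32, device=device)
```

With the launch configuration from the test plan above (--max-num-seqs 128, --max-model-len 2048) and the int32 dtype assumed here, the buffer is 128 × 2048 × 4 bytes ≈ 1 MiB of VRAM, which is why lowering those two settings directly shrinks the footprint.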

all_token_ids = prompt_token_ids + req_state.output_token_ids
num_tokens = len(all_token_ids)
# Copy to GPU tensor
self.input_batch.token_ids_gpu_tensor[idx, :num_tokens].copy_(
Author

I see, I'll fix that in the actual async implementation.

),
non_blocking=True,
)
self.input_batch.num_tokens_no_spec_gpu[idx] = num_tokens
Author

Moved the update of the token_ids_gpu_tensor used for ngram GPU into _update_states, reusing the num_tokens_no_spec values already maintained in input_batch.

@support_torch_compile(
    dynamic_arg_dims={
        "num_tokens_no_spec": 0,
        "token_ids_gpu": [0, 1],
Author

I have removed the second-dim flag for token_ids_gpu; now only the batch_size dimension of the inputs is marked as dynamic. The second dimension is seq_len, and since token_ids_gpu_tensor is a fixed-size buffer, the input always has a fixed size in that dimension.
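For illustration, the shape of the revised decorator described above. The class below is a stand-in with an assumed name and a minimal constructor, not the PR's actual proposer class; the real forward signature and init arguments may differ.

```python
from torch import nn

from vllm.compilation.decorators import support_torch_compile
from vllm.config import VllmConfig

@support_torch_compile(
    dynamic_arg_dims={
        "num_tokens_no_spec": 0,  # batch dimension stays dynamic
        "token_ids_gpu": 0,       # dim 1 (seq_len) is fixed by the preallocated buffer
    }
)
class NgramGpuProposerStub(nn.Module):
    """Illustrative stand-in for the decorated ngram GPU module."""

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()

    def forward(self, num_tokens_no_spec, token_ids_gpu):
        ...
```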

@Neo9061 Neo9061 commented Dec 2, 2025

> Thanks for enabling this feature! Question: will the gain be different at higher batch sizes? And which benchmark datasets did you use?

> Maybe you mean the performance at higher batch sizes? For now there is performance degradation at higher batch sizes, because the scheduler pre-allocates num_spec_tokens draft token positions for the next step even when the draft token IDs are invalid, so there is a lot of redundant model forward computation and sampling at higher batch sizes. This can be solved, and I'm still working on it. The dataset is here: https://github.com/festvox/datasets-CMU_DoG

Yes! I meant higher batch sizes. We observed something similar in #27379 (we also used the Blazedit dataset to show n-gram effectiveness).

Do you by chance have a timeline for when you plan to resolve the higher-batch-size issue for n-gram? For context, I want to use n-gram in the context of hybrid decoding (#24344), and since EAGLE now supports async scheduling, your PR would be very useful for making n-gram compatible with hybrid decoding.
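For a sense of the scale of the pre-allocation overhead described in the quoted reply above, a back-of-the-envelope illustration; the batch size and draft length below are assumed example values, not measurements from this PR:

```python
# If the scheduler reserves num_spec_tokens draft positions per request even
# when the drafts are invalid, the target model processes batch * (1 + k)
# token positions per verification step instead of batch positions.
batch_size, num_spec_tokens = 128, 3
with_spec = batch_size * (1 + num_spec_tokens)  # 512 positions
without_spec = batch_size                       # 128 positions
print(with_spec / without_spec)                 # 4.0x positions, largely wasted when drafts are empty
```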

@PatchouliTIS
Author

> Do you by chance have a timeline for when you plan to resolve the higher-batch-size issue for n-gram? […]

I came up with a new implementation and will push it this week. I ran some benchmarks with it on the Blazedit dataset with Qwen3-8B; here are the results:

async + ngram gpu (bs24): [benchmark screenshot]
origin ngram cpu (bs24): [benchmark screenshot]
baseline (bs24): [benchmark screenshot]
async + ngram gpu (bs96): [benchmark screenshot]
origin ngram cpu (bs96): [benchmark screenshot]
baseline (bs96): [benchmark screenshot]

It appears that the new implementation yields some benefits at large batch sizes.

@PatchouliTIS
Author

During local benchmarking I also noticed that the mean draft-token acceptance rate is approximately 37%–41% on the Blazedit dataset, which may not be enough for a clear performance improvement.
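A rough way to see why ~40% might be marginal, assuming, purely for illustration, an independent per-position acceptance probability (the reported mean rate only approximates this):

```python
# Expected tokens emitted per target-model step with k draft tokens and an
# (assumed i.i.d.) per-token acceptance probability a: 1 + a + a^2 + ... + a^k.
# This ignores drafting and verification overhead, so it is an upper bound.
a, k = 0.40, 3
expected_per_step = 1 + sum(a**i for i in range(1, k + 1))
print(round(expected_per_step, 2))  # ~1.62 tokens/step vs 1.0 without speculation
```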

@ArmageddonKnight

@benchislett @njhill We have addressed the feedback. Could you please kindly review this PR again? Thank you.
