[Core] NGram GPU Implementation compatible with Async Scheduler #29184
base: main
Conversation
…rs (vllm-project#29111) Signed-off-by: Huamin Li <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Force-pushed from 6dea7df to 4534c88
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
…vllm into patchy/async_ngram
Signed-off-by: PatchouliTaisa <[email protected]>
cc @njhill
# Honors opt-outs such as CompilationMode.NONE or VLLM_DISABLE_COMPILE_CACHE.
disable_cache = not is_compile_cache_enabled(self.inductor_config)

# TODO(patchy): ngram gpu kernel will cause vllm torch compile cache errors.
Why? Can this be fixed?
When I enabled torch compile for the ngram GPU kernel, the ngram operator's computational graph would hit a precompiled graph cached for the main model, leading to mismatched computation results. Therefore, I disabled the compile cache here directly. I tested this locally, and disabling the cache had no impact on performance.
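For clarity, here is a minimal sketch of the workaround described above (the helper and parameter names are illustrative, not the exact PR code): the existing opt-out check is kept, and the cache is additionally force-disabled whenever the ngram GPU drafter shares the compile path with the main model.

```python
def should_disable_compile_cache(
    cache_enabled_by_config: bool, ngram_gpu_drafter_enabled: bool
) -> bool:
    """Hypothetical helper mirroring the logic discussed in this thread.

    cache_enabled_by_config: result of the existing is_compile_cache_enabled()
    check, which honors CompilationMode.NONE and VLLM_DISABLE_COMPILE_CACHE.
    ngram_gpu_drafter_enabled: whether the ngram GPU drafter is active.
    """
    disable_cache = not cache_enabled_by_config
    # The drafter's compiled graph can hit a graph cached for the main model
    # and produce mismatched results, so caching is forced off in that case.
    if ngram_gpu_drafter_enabled:
        disable_cache = True
    return disable_cache
```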
I assume disabling the compile cache would lead to longer startup time? I'm not an expert here but maybe it's possible to add an identifier to the compile cache to avoid extraneous cache hits?
Yes, the startup time will increase a little. I attempted to add extra input parameters and member variables to the nn.Module forward method decorated with @support_torch_compile to achieve cache isolation, but none of them worked. I suspect this is related to the internal implementation of @support_torch_compile within vLLM. However, as things stand, disabling the torch compile cache only affects the performance of the inference service during the initial startup phase.
Force-pushed from a7654b8 to 52d75e8
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
Force-pushed from d7ba6df to bcf454f
Signed-off-by: PatchouliTaisa <[email protected]>
Signed-off-by: PatchouliTaisa <[email protected]>
PatchouliTIS left a comment
Changed the code according to the review comments.
vllm/v1/worker/gpu_model_runner.py
Outdated
for i, num_tokens in enumerate(num_accepted_tokens):
    self.input_batch.num_accepted_tokens_cpu[i] = num_tokens

def _update_ngram_gpu_tensors(self, scheduler_output: "SchedulerOutput") -> None:
This separate processing path for the ngram GPU input avoids a full copy on every step. It performs incremental updates to the GPU buffer based on the previous prev_req_id_to_index and the current self.input_batch.req_id_to_index, thereby avoiding extensive copy operations.
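To make the incremental-update idea concrete, here is a hedged sketch (the helper name, argument layout, and data structures are assumptions, not the PR's exact code) of how the previous and current request-to-index mappings can be compared so that only new or relocated requests touch the GPU buffer:

```python
import torch

def update_ngram_token_buffer(
    token_ids_gpu: torch.Tensor,               # [max_num_seqs, max_model_len] buffer
    prev_req_id_to_index: dict[str, int],      # batch layout from the previous step
    req_id_to_index: dict[str, int],           # batch layout for the current step
    full_token_ids_cpu: dict[str, torch.Tensor],  # per-request token ids on CPU
) -> None:
    # Note: a real implementation must order or stage the row moves so that a
    # relocated row is not overwritten before it has been read.
    for req_id, idx in req_id_to_index.items():
        prev_idx = prev_req_id_to_index.get(req_id)
        if prev_idx is None:
            # Newly scheduled request: one full host-to-device copy of its tokens.
            tokens = full_token_ids_cpu[req_id]
            token_ids_gpu[idx, : tokens.numel()].copy_(tokens, non_blocking=True)
        elif prev_idx != idx:
            # Request merely moved to another batch slot: shuffle the existing
            # GPU row instead of re-copying prompt + output tokens from the CPU.
            token_ids_gpu[idx].copy_(token_ids_gpu[prev_idx])
        # Requests that keep their slot need no bulk copy here; newly sampled
        # tokens are appended by the regular per-step update path.
```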
vllm/v1/worker/gpu_input_batch.py
Outdated
    pin_memory=False,
)
self.token_ids_cpu = self.token_ids_cpu_tensor.numpy()
self.token_ids_gpu_tensor = torch.zeros(
If this section required more sophisticated logic to store token IDs for each request, it would need significant modifications, including changes to how ngram_gpu consumes token ID input. For this PR, I propose moving this buffer allocation logic to gpu_model_runner, so the buffer is only allocated when both the async scheduler and ngram_gpu are enabled. Users can also set max_num_seqs and max_model_len manually to reduce the VRAM footprint of this buffer. Further optimizations could be addressed in a separate PR.
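As a rough illustration of that proposal (names and shapes are assumptions, not the final PR code), the model runner could allocate the buffer only when both features are enabled, sized by max_num_seqs × max_model_len:

```python
import torch

def maybe_allocate_ngram_token_buffer(
    async_scheduling_enabled: bool,
    ngram_gpu_enabled: bool,
    max_num_seqs: int,
    max_model_len: int,
    device: torch.device,
) -> torch.Tensor | None:
    """Hypothetical helper: allocate the token-id buffer only when needed."""
    if not (async_scheduling_enabled and ngram_gpu_enabled):
        # No extra VRAM cost for users who do not combine these two features.
        return None
    # Users can lower max_num_seqs / max_model_len to shrink this footprint.
    return torch.zeros(
        (max_num_seqs, max_model_len), dtype=torch.int32, device=device
    )
```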
vllm/v1/worker/gpu_model_runner.py
Outdated
all_token_ids = prompt_token_ids + req_state.output_token_ids
num_tokens = len(all_token_ids)
# Copy to GPU tensor
self.input_batch.token_ids_gpu_tensor[idx, :num_tokens].copy_(
I see, I'll fix that in the actual async implementation.
vllm/v1/worker/gpu_model_runner.py
Outdated
    ),
    non_blocking=True,
)
self.input_batch.num_tokens_no_spec_gpu[idx] = num_tokens
Moved the update of the token_ids_gpu_tensor used for ngram GPU into _update_states, reusing the num_tokens_no_spec result already maintained in input_batch.
@support_torch_compile(
    dynamic_arg_dims={
        "num_tokens_no_spec": 0,
        "token_ids_gpu": [0, 1],
I have removed the second dim flag for token_ids_gpu; now only the batch_size dim of each input is marked as dynamic. The second dim is seq_len, and since I use a fixed-size token_ids_gpu_tensor buffer, the input always has a fixed size in that dim.
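As an illustration of the resulting decorator usage (the class name, constructor, and forward body below are schematic placeholders, not the PR's actual module), only dim 0 is marked dynamic for both inputs, because the second dimension of token_ids_gpu comes from the fixed-size buffer:

```python
import torch
import torch.nn as nn

from vllm.compilation.decorators import support_torch_compile
from vllm.config import VllmConfig

@support_torch_compile(
    dynamic_arg_dims={
        # Only the batch dimension varies between calls.
        "num_tokens_no_spec": 0,
        # The seq-len dimension is fixed by the [max_num_seqs, max_model_len]
        # buffer, so it is intentionally not marked dynamic.
        "token_ids_gpu": 0,
    }
)
class NgramDrafterSketch(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()

    def forward(
        self, num_tokens_no_spec: torch.Tensor, token_ids_gpu: torch.Tensor
    ) -> torch.Tensor:
        # Placeholder body; the real kernel proposes draft tokens here.
        return token_ids_gpu
```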
Yes! I meant higher batch sizes. We observed something similar in #27379 (we also used the Blazedit dataset to show n-gram effectiveness). Do you by chance have a timeline for resolving the higher-batch issue for n-gram? For context, I want to use n-gram for hybrid decoding (#24344), and since EAGLE now has async scheduling, your PR would be very useful for making n-gram compatible with hybrid mode.
I came up with a new implementation and will push it this week. Based on it, I ran some benchmarks on the Blazedit dataset with Qwen3-8B; here are the results. It appears the new implementation yields some benefit at large batch sizes.
During local benchmarking I also noticed that the mean draft-token acceptance rate is approximately 37%–41% on the Blazedit dataset, which may not be enough for a clear performance improvement.
@benchislett @njhill We have addressed the feedback. Could you please review this PR again? Thank you.
Signed-off-by: PatchouliTaisa <[email protected]>
…vllm into patchy/async_ngram
Signed-off-by: PatchouliTaisa <[email protected]>
…vllm into patchy/async_ngram
Signed-off-by: PatchouliTaisa <[email protected]>
…vllm into patchy/async_ngram
Signed-off-by: PatchouliTaisa <[email protected]>






Purpose
This PR is based on PR #24799. It aims to implement a GPU version of ngram speculative decoding and make it compatible with the Async Scheduler.
Test Plan
Test config:
Test Device: NVIDIA H20
Test Result
Performance
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.