
Conversation


@koush koush commented Dec 2, 2025

Update block size assignments for consumer Blackwell sm120 architecture.

Motivation

Kimi K2 Thinking crashes after the first request on 8x RTX Pro 6000:

sglang-kimi-k2  | [2025-12-02 18:40:16 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 2212, token usage: 0.02, #running-req: 0, #queue-req: 0, 
sglang-kimi-k2  | [2025-12-02 18:40:17 TP7] Scheduler hit an exception: Traceback (most recent call last):
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2677, in run_scheduler_process
sglang-kimi-k2  |     scheduler.event_loop_overlap()
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
sglang-kimi-k2  |     return func(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1036, in event_loop_overlap
sglang-kimi-k2  |     batch_result = self.run_batch(batch)
sglang-kimi-k2  |                    ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_batch
sglang-kimi-k2  |     batch_result = self.model_worker.forward_batch_generation(
sglang-kimi-k2  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 392, in forward_batch_generation
sglang-kimi-k2  |     logits_output, can_run_cuda_graph = self.model_runner.forward(
sglang-kimi-k2  |                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2622, in forward
sglang-kimi-k2  |     output = self._forward_raw(
sglang-kimi-k2  |              ^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2681, in _forward_raw
sglang-kimi-k2  |     ret = self.forward_extend(
sglang-kimi-k2  |           ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2567, in forward_extend
sglang-kimi-k2  |     return self.model.forward(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
sglang-kimi-k2  |     return func(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3480, in forward
sglang-kimi-k2  |     hidden_states = self.model(
sglang-kimi-k2  |                     ^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3291, in forward
sglang-kimi-k2  |     hidden_states, residual = layer(
sglang-kimi-k2  |                               ^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3004, in forward
sglang-kimi-k2  |     hidden_states = self.self_attn(
sglang-kimi-k2  |                     ^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1474, in forward
sglang-kimi-k2  |     return self.forward_core(s)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1573, in forward_core
sglang-kimi-k2  |     return self.forward_absorb_core(*inner_state)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1979, in forward_absorb_core
sglang-kimi-k2  |     attn_output = self.attn_mqa(
sglang-kimi-k2  |                   ^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 123, in forward
sglang-kimi-k2  |     return forward_batch.attn_backend.forward(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 113, in forward
sglang-kimi-k2  |     return self.forward_extend(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 838, in forward_extend
sglang-kimi-k2  |     self.extend_attention_fwd(
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
sglang-kimi-k2  |     return fn(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 597, in extend_attention_fwd
sglang-kimi-k2  |     _fwd_kernel[grid](
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 419, in <lambda>
sglang-kimi-k2  |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
sglang-kimi-k2  |                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 756, in run
sglang-kimi-k2  |     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
sglang-kimi-k2  |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in launch_metadata
sglang-kimi-k2  |     self._init_handles()
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 464, in _init_handles
sglang-kimi-k2  |     raise_(OutOfResources(self.metadata.shared, max_shared, "shared memory"))
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 456, in raise_
sglang-kimi-k2  |     raise err
sglang-kimi-k2  | triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 106496, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
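
For context, the arithmetic behind that error (not part of the patch, just the values Triton reports): the default tile sizes request about 104 KiB of shared memory, while a single thread block on these cards is capped at 99 KiB.

```python
# Values taken directly from the OutOfResources message above.
required_smem = 106496   # bytes requested by the default extend-attention tiles
sm120_limit = 101376     # bytes available per block on sm120 (99 KiB)

print(required_smem / 1024, sm120_limit / 1024)  # 104.0 KiB vs 99.0 KiB
assert required_smem > sm120_limit               # hence the crash
```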

Modifications

I updated the block size assignments to account for the amount of shared memory (smem) actually available on these cards. The fix reuses the same configuration already used for older-generation GPUs that likewise expose roughly 100 KB of shared memory.
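
A minimal sketch of the shape of the change, assuming the CUDA_CAPABILITY[0] == 12 gate and the three tile shapes described in the review summary below; the Lq cutoffs here are illustrative, see the diff for the exact conditions:

```python
# Simplified sketch (not the exact diff) of the new branch added to
# _get_block_sizes_for_extend_attention in
# python/sglang/srt/layers/attention/triton_ops/extend_attention.py.
def sm120_block_sizes(Lq: int) -> tuple[int, int]:
    """Pick (BLOCK_M, BLOCK_N) for consumer Blackwell (sm120), which exposes
    only ~99 KiB of shared memory per block, like older 100 KB-class GPUs.
    The Lq thresholds below are illustrative placeholders."""
    if Lq <= 128:
        return 64, 128
    if Lq <= 256:
        return 64, 64
    return 32, 32
```

Smaller tiles shrink the buffers the kernel stages in shared memory, which is what brings the requirement back under the 99 KiB limit.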

Accuracy Tests

Benchmarking and Profiling

Checklist

Update block size assignments for consumer Blackwell sm120 architecture.
@gemini-code-assist
Contributor

Summary of Changes

Hello @koush, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical resource allocation issue on consumer Blackwell sm120 GPUs by refining the block size calculations for the attention mechanism. The changes ensure that the system operates within the hardware's shared memory constraints, preventing runtime failures and improving stability on these specific architectures.

Highlights

  • Shared Memory OutOfResources Error Fix: This pull request resolves an OutOfResources error that occurred on consumer Blackwell sm120 architecture GPUs (e.g., RTX Pro 6000) due to insufficient shared memory for the attention mechanism.
  • Block Size Adjustment for sm120: New block size assignments have been introduced specifically for the sm120 architecture within the _get_block_sizes_for_extend_attention function to accommodate its smaller shared memory limit (100K).
  • Conditional Block Size Logic: The block sizes (64, 128), (64, 64), and (32, 32) are now conditionally applied based on the Lq parameter when CUDA_CAPABILITY[0] is 12, ensuring proper resource utilization.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the block size assignments for the sm120 (Blackwell) architecture to resolve an 'out of resource: shared memory' error. The changes introduce specific block sizes for this architecture, which are appropriate given its shared memory constraints. The logic is correct. My review includes a suggestion to refactor the conditional logic for checking CUDA capabilities to improve code structure and maintainability.
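
One way the suggested refactor could look (purely illustrative, not necessarily the reviewer's exact proposal): name the capability check once instead of comparing CUDA_CAPABILITY[0] inline wherever block sizes are chosen.

```python
# Purely illustrative: centralize the architecture check so each branch
# reads as a named condition rather than a raw capability comparison.
import torch

def is_sm120() -> bool:
    """True on consumer Blackwell (compute capability 12.x)."""
    major, _minor = torch.cuda.get_device_capability()
    return major == 12
```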

@koush koush changed the title Update block size logic for sm120 architectures [Fix] add block size logic to fix sm120 crash Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash [Fix] add block size logic to fix sm120 crash (fix https://github.com/sgl-project/sglang/issues/14322) Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash (fix https://github.com/sgl-project/sglang/issues/14322) [Fix] add block size logic to fix sm120 crash (fix #14322) Dec 3, 2025
@koush
Author

koush commented Dec 3, 2025

Fixes #14322

@koush koush changed the title [Fix] add block size logic to fix sm120 crash (fix #14322) [Fix] add block size logic to fix sm120 crash Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash [Fix] add block size logic for sm120 smem size Dec 3, 2025