
Conversation


@koush koush commented Dec 2, 2025

Update block size assignments for consumer Blackwell sm120 architecture.

Motivation

Kimi K2 Thinking crashes after the first request on 8x RTX Pro 6000:

sglang-kimi-k2  | [2025-12-02 18:40:16 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 2212, token usage: 0.02, #running-req: 0, #queue-req: 0, 
sglang-kimi-k2  | [2025-12-02 18:40:17 TP7] Scheduler hit an exception: Traceback (most recent call last):
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2677, in run_scheduler_process
sglang-kimi-k2  |     scheduler.event_loop_overlap()
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
sglang-kimi-k2  |     return func(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1036, in event_loop_overlap
sglang-kimi-k2  |     batch_result = self.run_batch(batch)
sglang-kimi-k2  |                    ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_batch
sglang-kimi-k2  |     batch_result = self.model_worker.forward_batch_generation(
sglang-kimi-k2  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 392, in forward_batch_generation
sglang-kimi-k2  |     logits_output, can_run_cuda_graph = self.model_runner.forward(
sglang-kimi-k2  |                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2622, in forward
sglang-kimi-k2  |     output = self._forward_raw(
sglang-kimi-k2  |              ^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2681, in _forward_raw
sglang-kimi-k2  |     ret = self.forward_extend(
sglang-kimi-k2  |           ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2567, in forward_extend
sglang-kimi-k2  |     return self.model.forward(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
sglang-kimi-k2  |     return func(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3480, in forward
sglang-kimi-k2  |     hidden_states = self.model(
sglang-kimi-k2  |                     ^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3291, in forward
sglang-kimi-k2  |     hidden_states, residual = layer(
sglang-kimi-k2  |                               ^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 3004, in forward
sglang-kimi-k2  |     hidden_states = self.self_attn(
sglang-kimi-k2  |                     ^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1474, in forward
sglang-kimi-k2  |     return self.forward_core(s)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1573, in forward_core
sglang-kimi-k2  |     return self.forward_absorb_core(*inner_state)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1979, in forward_absorb_core
sglang-kimi-k2  |     attn_output = self.attn_mqa(
sglang-kimi-k2  |                   ^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
sglang-kimi-k2  |     return self._call_impl(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
sglang-kimi-k2  |     return forward_call(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 123, in forward
sglang-kimi-k2  |     return forward_batch.attn_backend.forward(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 113, in forward
sglang-kimi-k2  |     return self.forward_extend(
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 838, in forward_extend
sglang-kimi-k2  |     self.extend_attention_fwd(
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
sglang-kimi-k2  |     return fn(*args, **kwargs)
sglang-kimi-k2  |            ^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 597, in extend_attention_fwd
sglang-kimi-k2  |     _fwd_kernel[grid](
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 419, in <lambda>
sglang-kimi-k2  |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
sglang-kimi-k2  |                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 756, in run
sglang-kimi-k2  |     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
sglang-kimi-k2  |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in launch_metadata
sglang-kimi-k2  |     self._init_handles()
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 464, in _init_handles
sglang-kimi-k2  |     raise_(OutOfResources(self.metadata.shared, max_shared, "shared memory"))
sglang-kimi-k2  |   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 456, in raise_
sglang-kimi-k2  |     raise err
sglang-kimi-k2  | triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 106496, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
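
For context, the arithmetic behind that error (not part of the patch, just the values Triton reports): the default tile sizes request about 104 KiB of shared memory, while a single thread block on these cards is capped at 99 KiB.

```python
# Values taken directly from the OutOfResources message above.
required_smem = 106496   # bytes requested by the default extend-attention tiles
sm120_limit = 101376     # bytes available per block on sm120 (99 KiB)

print(required_smem / 1024, sm120_limit / 1024)  # 104.0 KiB vs 99.0 KiB
assert required_smem > sm120_limit               # hence the crash
```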

Modifications

I updated the block size assignments to account for the amount of shared memory (smem) actually available on these cards. The fix reuses the same configuration already used for older-generation GPUs that likewise expose roughly 100 KB of shared memory.
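
A minimal sketch of the shape of the change, assuming the CUDA_CAPABILITY[0] == 12 gate and the three tile shapes described in the review summary below; the Lq cutoffs here are illustrative, see the diff for the exact conditions:

```python
# Simplified sketch (not the exact diff) of the new branch added to
# _get_block_sizes_for_extend_attention in
# python/sglang/srt/layers/attention/triton_ops/extend_attention.py.
def sm120_block_sizes(Lq: int) -> tuple[int, int]:
    """Pick (BLOCK_M, BLOCK_N) for consumer Blackwell (sm120), which exposes
    only ~99 KiB of shared memory per block, like older 100 KB-class GPUs.
    The Lq thresholds below are illustrative placeholders."""
    if Lq <= 128:
        return 64, 128
    if Lq <= 256:
        return 64, 64
    return 32, 32
```

Smaller tiles shrink the buffers the kernel stages in shared memory, which is what brings the requirement back under the 99 KiB limit.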

Accuracy Tests

Benchmarking and Profiling

Checklist

Update block size assignments for consumer Blackwell sm120 architecture.
@gemini-code-assist
Contributor

Summary of Changes

Hello @koush, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical resource allocation issue on consumer Blackwell sm120 GPUs by refining the block size calculations for the attention mechanism. The changes ensure that the system operates within the hardware's shared memory constraints, preventing runtime failures and improving stability on these specific architectures.

Highlights

  • Shared Memory OutOfResources Error Fix: This pull request resolves an OutOfResources error that occurred on consumer Blackwell sm120 architecture GPUs (e.g., RTX Pro 6000) due to insufficient shared memory for the attention mechanism.
  • Block Size Adjustment for sm120: New block size assignments have been introduced specifically for the sm120 architecture within the _get_block_sizes_for_extend_attention function to accommodate its smaller shared memory limit (100K).
  • Conditional Block Size Logic: The block sizes (64, 128), (64, 64), and (32, 32) are now conditionally applied based on the Lq parameter when CUDA_CAPABILITY[0] is 12, ensuring proper resource utilization.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the block size assignments for the sm120 (Blackwell) architecture to resolve an 'out of resource: shared memory' error. The changes introduce specific block sizes for this architecture, which are appropriate given its shared memory constraints. The logic is correct. My review includes a suggestion to refactor the conditional logic for checking CUDA capabilities to improve code structure and maintainability.
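
One way the suggested refactor could look (purely illustrative, not necessarily the reviewer's exact proposal): name the capability check once instead of comparing CUDA_CAPABILITY[0] inline wherever block sizes are chosen.

```python
# Purely illustrative: centralize the architecture check so each branch
# reads as a named condition rather than a raw capability comparison.
import torch

def is_sm120() -> bool:
    """True on consumer Blackwell (compute capability 12.x)."""
    major, _minor = torch.cuda.get_device_capability()
    return major == 12
```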

@koush koush changed the title Update block size logic for sm120 architectures [Fix] add block size logic to fix sm120 crash Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash [Fix] add block size logic to fix sm120 crash (fix https://github.com/sgl-project/sglang/issues/14322) Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash (fix https://github.com/sgl-project/sglang/issues/14322) [Fix] add block size logic to fix sm120 crash (fix #14322) Dec 3, 2025
@koush
Author

koush commented Dec 3, 2025

Fixes #14322

@koush koush changed the title [Fix] add block size logic to fix sm120 crash (fix #14322) [Fix] add block size logic to fix sm120 crash Dec 3, 2025
@koush koush changed the title [Fix] add block size logic to fix sm120 crash [Fix] add block size logic for sm120 smem size Dec 3, 2025