[V1][Spec Decode][Feature] support tree attention in flash-linear-attention #29846
+559
−78
Purpose
Tree attention, which verifies a tree structure of draft tokens, is not supported in flash-linear-attention for the Qwen3-Next model. To support it, this PR adds a new parameter, `retrieve_parent_token`, to the GDN Triton kernel. The parameter records the parent node of each draft token in a 2D tensor of shape `(batch_size, max_spec_len + 1)`. The feature is enabled only when `retrieve_parent_token` is provided; otherwise the kernel's behavior is unchanged. A sketch of the parent-index encoding follows.
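As a rough illustration (not the PR's actual kernel code), a parent-index tensor of this shape can drive a per-token recurrence. Only the tensor shape and parameter name come from this PR; the example tree, the self-referencing root convention, and the Python loop are illustrative assumptions:

```python
import torch

batch_size, max_spec_len, hidden = 1, 3, 4

# Draft-token tree for the single sequence in the batch:
#   0 (root: last accepted token)
#   |-- 1
#   |   `-- 3
#   `-- 2
# Entry i holds the index of token i's parent; here the root points to itself.
retrieve_parent_token = torch.tensor([[0, 0, 0, 1]])  # (batch_size, max_spec_len + 1)

# Recurrent state produced after processing each draft token.
states = torch.zeros(batch_size, max_spec_len + 1, hidden)
states[:, 0] = torch.randn(batch_size, hidden)  # state after the root token

# With a tree, token i must continue from its *parent's* state instead of
# the state of token i - 1 (the linear/sequential case):
for b in range(batch_size):
    for i in range(1, max_spec_len + 1):
        parent = retrieve_parent_token[b, i]
        parent_state = states[b, parent]
        # Placeholder for the recurrence; a real kernel would apply the
        # gated-delta (GDN) update to parent_state here.
        states[b, i] = parent_state
```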
Two tests are added:

- `tests/v1/spec_decode/test_causal_conv1d_with_eagle_tree.py`: tests the `causal_conv1d_update` function in speculative decoding, with and without `retrieve_parent_token`.
- `tests/v1/spec_decode/test_recurrent_gated_delta_with_eagle_tree.py`: tests the `fused_recurrent_gated_delta_rule` function in speculative decoding, with and without `retrieve_parent_token`.

Test Plan
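One way to run the added tests locally (assuming pytest and a working vLLM development install with GPU support for the Triton kernels):

```python
# Invoke the two new test files directly via pytest's Python entry point.
import pytest

pytest.main([
    "tests/v1/spec_decode/test_causal_conv1d_with_eagle_tree.py",
    "tests/v1/spec_decode/test_recurrent_gated_delta_with_eagle_tree.py",
])
```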
Test Result
The added test cases pass.


TODO