
Conversation


@menggeliu1205 menggeliu1205 commented Dec 2, 2025

Purpose

flash-linear-attention does not support tree attention, which is needed to verify a tree of draft tokens for the Qwen3-Next model. To support it, this PR adds a new parameter, retrieve_parent_token, to the GDN Triton kernels. The parameter records the parent node of each draft token in a 2D (batch_size, max_spec_len + 1) tensor. The feature is only active when retrieve_parent_token is provided; otherwise behavior is unchanged.
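
For illustration only (this helper is not part of the PR), here is a minimal sketch of how such a retrieve_parent_token tensor could be packed from per-request parent lists. The helper name and the padding/root conventions (unused positions and the root token point at themselves) are assumptions made for this sketch:

```python
import torch

def build_retrieve_parent_token(parent_lists, max_spec_len):
    """Hypothetical helper: pack per-request parent indices into the
    (batch_size, max_spec_len + 1) layout described above.

    parent_lists[b][i] is the index, within request b's draft sequence, of the
    parent of draft token i; the root token is assumed to point at itself, and
    unused trailing positions are padded with their own index.
    """
    batch_size = len(parent_lists)
    out = torch.arange(max_spec_len + 1).repeat(batch_size, 1)
    for b, parents in enumerate(parent_lists):
        out[b, : len(parents)] = torch.tensor(parents, dtype=torch.long)
    return out

# Example: one request whose draft tree is
#   token0 -+- token1 -- token3
#           +- token2
# i.e. parents = [0, 0, 0, 1] (token0 is the root and points at itself).
print(build_retrieve_parent_token([[0, 0, 0, 1]], max_spec_len=4))
# tensor([[0, 0, 0, 1, 4]])
```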

Two new tests are added (a minimal sketch of the tree-aware recurrence they verify follows this list):

  1. tests/v1/spec_decode/test_causal_conv1d_with_eagle_tree.py: tests causal_conv1d_update in speculative decoding, with and without retrieve_parent_token.
  2. tests/v1/spec_decode/test_recurrent_gated_delta_with_eagle_tree.py: tests fused_recurrent_gated_delta_rule in speculative decoding, with and without retrieve_parent_token.
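
Conceptually, both tests check that each draft token's state is derived from its parent's state rather than the previous token's, and that a chain-shaped tree reproduces the existing sequential path. Below is an illustrative reference of that recurrence, with a generic step function standing in for the actual kernels (an assumption for this sketch, not the PR's test code):

```python
import torch

def reference_tree_recurrence(step_fn, inputs, parents, init_state):
    """Illustrative reference only (not this PR's kernels): run a per-token
    recurrence over a draft tree.  Each token's state is computed from its
    PARENT's state, which is what retrieve_parent_token lets the fused
    kernels do.

    step_fn(state, x) -> new_state is any recurrent update (e.g. a gated
    delta step); parents[i] is the parent index of token i, with the root
    pointing at itself.
    """
    states = [None] * len(inputs)
    states[0] = step_fn(init_state, inputs[0])
    for i in range(1, len(inputs)):
        states[i] = step_fn(states[parents[i]], inputs[i])
    return states

# With a chain-shaped tree (each token's parent is the previous token),
# the result matches an ordinary sequential scan:
xs = torch.arange(1, 5, dtype=torch.float32)
out = reference_tree_recurrence(lambda s, x: s + x, xs, [0, 0, 1, 2], torch.tensor(0.0))
print([o.item() for o in out])  # [1.0, 3.0, 6.0, 10.0]
```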

Test Plan

Test Result

The added test cases pass.
[Screenshot: 2025-12-02 14:07:38]
[Screenshot (copy): 2025-12-02 14:26:30]

TODO

  • Based on 22752, support speculative decoding with an EAGLE draft tree for the Qwen3-Next model.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for tree-structured speculative decoding in flash-linear-attention by introducing a retrieve_parent_token parameter to the Triton kernels. The changes primarily affect fused_recurrent.py and causal_conv1d.py, and new tests are included to validate this functionality. My review identifies critical performance and correctness issues in the new Triton kernel implementations. Specifically, both kernels use an inefficient pattern for indexing parent tokens, and one of them contains a potential bug due to using an incorrect variable for indexing. I've provided suggestions to fix these issues, which should improve performance and ensure correctness.
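
The exact kernel code is not quoted in this thread, so the following is only a generic Triton sketch of the underlying idea: each program loads its parent index once as a scalar from a retrieve_parent_token-style tensor and reuses it in the state pointer arithmetic. Tensor layout, names, and strides are assumptions for this sketch, not the PR's implementation or the reviewer's suggested fix:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gather_parent_state_kernel(
    state_ptr,    # (B, T, D) per-token recurrent states, contiguous (assumed layout)
    parent_ptr,   # (B, T) integer parent indices
    out_ptr,      # (B, T, D) gathered parent states
    T, D,
    BLOCK_D: tl.constexpr,
):
    # One program per (batch, token): load the parent index ONCE as a scalar,
    # then use it to offset the state pointer for the whole feature block.
    b = tl.program_id(0)
    t = tl.program_id(1)
    parent = tl.load(parent_ptr + b * T + t)

    offs = tl.arange(0, BLOCK_D)
    mask = offs < D
    src = tl.load(state_ptr + (b * T + parent) * D + offs, mask=mask)
    tl.store(out_ptr + (b * T + t) * D + offs, src, mask=mask)

# Usage (requires a CUDA device; shapes are illustrative):
# B, T, D = 2, 5, 64
# state = torch.randn(B, T, D, device="cuda")
# parents = torch.zeros(B, T, dtype=torch.int64, device="cuda")
# out = torch.empty_like(state)
# gather_parent_state_kernel[(B, T)](state, parents, out, T, D, BLOCK_D=64)
```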

@menggeliu1205 menggeliu1205 changed the title from "support tree state in flash-linear-attenton" to "feat: support tree state in flash-linear-attenton" on Dec 2, 2025
@menggeliu1205 menggeliu1205 reopened this Dec 2, 2025
@menggeliu1205 menggeliu1205 force-pushed the support_tree_attention_in_gdn branch from 19af8e0 to b47f50b on December 2, 2025 at 07:34
@menggeliu1205 menggeliu1205 changed the title from "feat: support tree state in flash-linear-attenton" to "feat: support tree attention in flash-linear-attenton" on Dec 2, 2025
@menggeliu1205 menggeliu1205 force-pushed the support_tree_attention_in_gdn branch 3 times, most recently from 46e4136 to e811cb5 on December 2, 2025 at 09:06
Signed-off-by: liumengge1205 <[email protected]>
Signed-off-by: liumengge1205 <[email protected]>
@menggeliu1205 menggeliu1205 force-pushed the support_tree_attention_in_gdn branch from 8639589 to 88cd6da on December 2, 2025 at 11:59
@menggeliu1205 menggeliu1205 changed the title from "feat: support tree attention in flash-linear-attenton" to "[V1][Spec Decode][Feature] support tree attention in flash-linear-attenton" on Dec 3, 2025