
Conversation

@dudugong-gitch

I have implemented support for attention sinks in FlashAttention-2 and validated it on attention-sink models such as GPT-OSS-20B. Below is a side-by-side speed comparison between the original FA-2 ("old") and the attention-sink-enabled FA-2 ("new"):
| causal | headdim | batch_size | seqlen | TFLOPs/s (old FA-2) | TFLOPs/s (new, w/ sinks) | old / new |
|--------|---------|------------|--------|---------------------|--------------------------|-----------|
| False  | 128     | 1          | 16384  | 80.97               | 81.31                    | 0.9958    |
| False  | 128     | 16         | 1024   | 72.67               | 72.86                    | 0.9974    |
| False  | 128     | 2          | 8192   | 80.34               | 80.70                    | 0.9955    |
| False  | 128     | 32         | 512    | 67.15               | 66.18                    | 1.0147    |
| False  | 128     | 4          | 4096   | 79.16               | 79.61                    | 0.9943    |
| False  | 128     | 8          | 2048   | 76.68               | 76.92                    | 0.9969    |
| False  | 64      | 1          | 16384  | 86.70               | 87.16                    | 0.9947    |
| False  | 64      | 16         | 1024   | 80.04               | 79.68                    | 1.0045    |
| False  | 64      | 2          | 8192   | 86.30               | 85.53                    | 1.0090    |
| False  | 64      | 32         | 512    | 76.18               | 73.69                    | 1.0338    |
| False  | 64      | 4          | 4096   | 85.14               | 83.17                    | 1.0237    |
| False  | 64      | 8          | 2048   | 83.69               | 81.77                    | 1.0235    |
| True   | 128     | 1          | 16384  | 78.55               | 77.32                    | 1.0159    |
| True   | 128     | 16         | 1024   | 57.83               | 56.75                    | 1.0190    |
| True   | 128     | 2          | 8192   | 76.22               | 76.31                    | 0.9988    |
| True   | 128     | 32         | 512    | 45.38               | 45.03                    | 1.0078    |
| True   | 128     | 4          | 4096   | 72.41               | 72.09                    | 1.0044    |
| True   | 128     | 8          | 2048   | 65.28               | 66.34                    | 0.9840    |
| True   | 64      | 1          | 16384  | 83.39               | 83.30                    | 1.0011    |
| True   | 64      | 16         | 1024   | 63.20               | 63.68                    | 0.9925    |
| True   | 64      | 2          | 8192   | 81.37               | 81.66                    | 0.9964    |
| True   | 64      | 32         | 512    | 48.91               | 49.13                    | 0.9955    |
| True   | 64      | 4          | 4096   | 77.93               | 77.89                    | 1.0005    |
| True   | 64      | 8          | 2048   | 72.07               | 72.16                    | 0.9988    |
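
For context, here is a minimal sketch of how TFLOPs/s numbers like those above are typically measured, assuming the sink-enabled kernel is reached through `flash_attn_func` with a hypothetical `sinks=` keyword argument (one learned sink logit per head, as in GPT-OSS). The actual argument name and entry point in this branch may differ, and the table above compares two separate builds rather than two call paths within one build:

```python
import torch
from flash_attn import flash_attn_func

def bench_tflops(batch, seqlen, nheads, headdim, causal, sinks=None, iters=30):
    q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                           device="cuda", dtype=torch.float16) for _ in range(3))
    kwargs = {} if sinks is None else {"sinks": sinks}  # hypothetical kwarg name

    # Warm-up, then time with CUDA events.
    for _ in range(5):
        flash_attn_func(q, k, v, causal=causal, **kwargs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        flash_attn_func(q, k, v, causal=causal, **kwargs)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters

    # Attention FLOPs: two GEMMs (Q @ K^T and P @ V), halved for causal masking.
    flops = 4 * batch * nheads * seqlen * seqlen * headdim * (0.5 if causal else 1.0)
    return flops / (ms * 1e-3) / 1e12  # TFLOPs/s

nheads, headdim = 16, 128
sinks = torch.randn(nheads, device="cuda", dtype=torch.float32)  # one sink logit per head (assumption)
tflops_old = bench_tflops(2, 8192, nheads, headdim, causal=True)               # plain FA-2 path
tflops_new = bench_tflops(2, 8192, nheads, headdim, causal=True, sinks=sinks)  # sink-enabled path
print(f"TFLOPs/s_old={tflops_old:.2f}  TFLOPs/s_new={tflops_new:.2f}  "
      f"ratio={tflops_old / tflops_new:.4f}")
```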

@dudugong-gitch force-pushed the vllm_flash_attn_attention_sinks_for_fa2 branch from acde2f8 to 0ff951c on October 19, 2025 at 15:03.
@LucasWilkinson
Collaborator

Can you please open the vLLM-side PR with accuracy numbers for GPT-OSS? I'd like to be able to run the CI against it. For now, you can just point https://github.com/vllm-project/vllm/blob/8a81d776ce87224768f3a20eeb57605af726977f/cmake/external_projects/vllm_flash_attn.cmake#L40-L44 to a commit on this branch.

