
Conversation

@LuFinch
Contributor

@LuFinch LuFinch commented Nov 12, 2025

This PR moves the sycltla kernels in pytorch/pytorch#167056 into torch-xpu-ops.

This PR is based on #2030. When the build PR merges, I will rebase this PR.

Contributor

@EikanWang EikanWang left a comment


TBH, I cannot quite understand the detailed implementation. I need to take more time to understand the logic.


file(GLOB xpu_cpp "xpu/*.cpp")
-file(GLOB xpu_native_cpp "native/xpu/*.cpp" "native/sparse/*.cpp" "native/sparse/xpu/*.cpp" "native/nested/*.cpp" "native/nested/xpu/*.cpp" "native/transformers/*.cpp" "native/quantized/*.cpp")
+file(GLOB xpu_native_cpp "native/xpu/*.cpp" "native/sparse/*.cpp" "native/sparse/xpu/*.cpp" "native/nested/*.cpp" "native/nested/xpu/*.cpp" "native/transformers/*.cpp" "native/quantized/*.cpp" "native/transformers/xpu/flash_attn/*.cpp")
Contributor


Nit: I think we should install the header files under flash_attn into PyTorch, as is done at line 42.

Contributor Author


May I know what the purpose of installing the header files is?

Contributor


It gives users a chance to use them in a C++ extension.
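
For context, once the headers are installed into PyTorch's include tree, a C++ extension can include them and call the kernels directly. A minimal sketch, assuming a hypothetical installed header path and using only the public SDPA API (the flash_attn header name and any sycltla entry point are not confirmed by this PR):

```cpp
// ext.cpp -- minimal sketch of a PyTorch C++ extension that could consume the
// installed flash_attn headers; the commented include path is a hypothetical placeholder.
#include <torch/extension.h>
// #include <ATen/native/transformers/xpu/flash_attn/...>  // hypothetical installed header

at::Tensor my_sdpa(const at::Tensor& q, const at::Tensor& k, const at::Tensor& v) {
  // For now this dispatches through the public SDPA API; with installed headers,
  // a flash-attention entry point could be called here directly instead.
  return at::scaled_dot_product_attention(q, k, v);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("my_sdpa", &my_sdpa, "SDPA wrapper (sketch)");
}
```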

Contributor Author


I see.

Contributor Author


Done.

Contributor


@guangyey, I think PyTorch does not expose flash_attn because it is the underlying logic of sdpa, which is exposed as a backend. Meanwhile, I don't believe users invoke PyTorch's flash_attn directly, because dao/flash_attn is a better choice.

Contributor


Meanwhile, the namespace of these functions is sycltla. It is weird to let users invoke sycl-tla-specific functions.

Copilot AI review requested due to automatic review settings November 13, 2025 05:52


@LuFinch LuFinch force-pushed the lfq/flash_attention branch from 770035a to 442c445 Compare November 13, 2025 05:55
out = at::empty({batch_size, numhead_qo, seqlen_qo, headsize_vo}, opts);
} else if (layout == ATTN_TENSOR_LAYOUT::BSHD) {
out = at::empty({batch_size, seqlen_qo, numhead_qo, headsize_vo}, opts)
.permute({0, 2, 1, 3});


Why is the permute needed here?

Contributor Author


The output is initialized as BSHD-contiguous, but the shape should be BHSD for SDPA. Hence it needs to permute the seqlen and numhead dimensions.
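
For illustration, a minimal standalone ATen sketch of this pattern (not the PR's exact code; names are illustrative): allocate BSHD-contiguous storage, then expose it as a BHSD view.

```cpp
#include <ATen/ATen.h>

// Allocate the output with BSHD-contiguous storage, then permute so the logical
// shape is BHSD ([batch, num_heads, seq_len, head_size]) as SDPA expects, while
// the underlying memory the kernel writes into stays laid out as BSHD.
at::Tensor alloc_bhsd_view(int64_t batch, int64_t seqlen, int64_t numhead,
                           int64_t headsize, const at::TensorOptions& opts) {
  at::Tensor out = at::empty({batch, seqlen, numhead, headsize}, opts)  // BSHD storage
                       .permute({0, 2, 1, 3});                          // BHSD view
  // out.sizes() == {batch, numhead, seqlen, headsize}; out.is_contiguous() == false.
  return out;
}
```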

@LuFinch LuFinch force-pushed the lfq/flash_attention branch from 2eb4cd9 to 95f9c65 Compare November 17, 2025 03:04
@EikanWang
Contributor

@LuFinch, should we land this PR now?

@LuFinch
Contributor Author

LuFinch commented Nov 17, 2025

@EikanWang No. CI failed at the build stage. Checking whether it is a driver issue...

InvalidModule: Invalid SPIR-V module: input SPIR-V module uses unknown extension 'SPV_INTEL_2d_block_io'
 Undefined function _Z45intel_sub_group_2d_block_prefetch_16b_4r16x2cPU3AS1viiiDv2_i found in ... This may result in runtime errors.

@LuFinch
Contributor Author

LuFinch commented Nov 17, 2025

The CD docker's driver from rhel-8.8 is too old and can't find the Intel 2D block load symbol. We need to upgrade the driver to rhel-8.10.

#pragma once

#include "cutlass/cutlass.h"
#include "cutlass/fp8_to_fp16.h"


I suppose the fp8-related feature is not part of this PR?

Contributor Author


Yes. I just copied the kernel file from sycltla directly; no code cleanup has been done yet.

for (int k_tile = 0; k_tile < k_tile_count; ++k_tile) {
copy(params.gmem_tiled_copy_q, tQgQ(_, _, _, k_tile), tQrQ);
copy(params.gmem_tiled_copy_k, tKgK(_, _, _, k_tile), tKrK);
if constexpr (is_fp8_v<ElementQ> && is_fp8_v<ElementK>) {


This PR is quite large; should fp8 be covered by it?

auto kv_head_coord = q_head_coord / q_group_size;
int offset_q = 0, offset_k = 0, offset_v = 0;

if constexpr (is_var_len) {


It seems that var length is not validated by the UT now? If it is not part of the goal of this PR, I suggest enabling it in another PR and adding the related UT.

Contributor Author


Currently I copy these collective and kernel files from sycltla to torch-xpu-ops directly, because they update the code frequently and copying entire files makes rebasing easier. I think we could merge the code into torch-xpu-ops first, then do the code cleanup once the sycltla code is stable.

@LuFinch LuFinch force-pushed the lfq/flash_attention branch from 95f9c65 to 89c6a49 Compare November 18, 2025 08:28
@github-actions

Performance outliers, please check!

• 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | pytorch_unet | 1.040893 | 0.705059 |
