The performance gap was found in #2347.
We need to investigate the root cause of the performance drop in the column-major B matrix case.
In the runs below, Triton is roughly 1.4x slower in that case than in the row-major B matrix case (about 0.63 ms vs. 0.45 ms).
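For readers unfamiliar with the two layouts, here is a minimal sketch of how the strides differ. It assumes the column-major operand is a transposed view of a contiguous buffer, as the `A x B.T` label in the log suggests. The helper names and the small `K`, `N` values are hypothetical and not taken from the benchmark.

```python
def contiguous_strides(shape):
    """Element strides of a row-major (C-contiguous) array with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_2d(shape, strides):
    """Transpose a 2D view by swapping shape and strides; no data is moved."""
    return (shape[1], shape[0]), (strides[1], strides[0])


# Hypothetical problem size, chosen only for illustration.
K, N = 4, 8

# Row-major B of shape (K, N): consecutive elements of a row are adjacent in memory.
row_major_shape = (K, N)
row_major_strides = contiguous_strides(row_major_shape)

# Column-major B: a contiguous (N, K) buffer viewed as (K, N) through a transpose.
col_major_shape, col_major_strides = transpose_2d((N, K), contiguous_strides((N, K)))

print("row-major B:", row_major_shape, row_major_strides)
print("col-major B:", col_major_shape, col_major_strides)
```

Both operands have the same logical shape `(K, N)`. Only the strides differ, so the kernel reads B with unit stride along a different dimension in each case.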
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.31633758544921875 ms
Time for triton: 0.44517597556114197 ms
Compute A x B.T
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.3375360071659088 ms
Time for triton: 0.6348815560340881 ms
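The slowdown factors implied by the timings above can be computed with a short script. This is only a sketch: it assumes the first block of output is the row-major B run and the block after `Compute A x B.T` is the column-major B run. The dictionary keys are hypothetical labels.

```python
# Timings in milliseconds, copied from the log output above.
timings_ms = {
    "row_major_B": {"torch": 0.31633758544921875, "triton": 0.44517597556114197},
    "col_major_B": {"torch": 0.3375360071659088, "triton": 0.6348815560340881},
}

# Triton relative to Torch within each layout.
triton_vs_torch = {
    layout: t["triton"] / t["torch"] for layout, t in timings_ms.items()
}

# Triton column-major run relative to Triton row-major run.
col_vs_row_triton = (
    timings_ms["col_major_B"]["triton"] / timings_ms["row_major_B"]["triton"]
)

for layout, ratio in triton_vs_torch.items():
    print(f"{layout}: triton is {ratio:.2f}x the torch time")
print(f"triton col-major vs row-major: {col_vs_row_triton:.2f}x")
```

On these numbers, Triton takes about 1.41x the Torch time for row-major B and about 1.88x for column-major B. The Torch time barely changes between layouts, so the regression is specific to the Triton kernel.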