Skip to content

Conversation

@copybara-service
Copy link

[Mosaic GPU] Optimize the computation of tcgen05.mma matrix descriptors

Previously we used a simple approach of computing the descriptors entirely
using LLVM ops. This was convenient, but it turns out that there are two
problems with it:

  1. LLVM doesn't always fully constant fold properly and sometimes emits
    PTX that causes ptxas to generate lots of non-uniform operations.
  2. LLVM is quite aggressive to hoist descriptor computation outside of loops,
    which blows up the register pressure.

The alternative implemented here is to compute the descriptors in inline ptx,
with manual constant folding, and right before the MMA operations. This seems
to generate code that has extremely low register pressure and only very few
uniform operations on 32-bit quantities.

Previously we used a simple approach of computing the descriptors entirely
using LLVM ops. This was convenient, but it turns out that there are two
problems with it:
1. LLVM doesn't always fully constant fold properly and sometimes emits
   PTX that causes ptxas to generate lots of non-uniform operations.
2. LLVM is quite aggressive to hoist descriptor computation outside of loops,
   which blows up the register pressure.

The alternative implemented here is to compute the descriptors in inline ptx,
with manual constant folding, and right before the MMA operations. This seems
to generate code that has extremely low register pressure and only very few
uniform operations on 32-bit quantities.

PiperOrigin-RevId: 839224185
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant