@PaulZhang12 PaulZhang12 commented Oct 15, 2025

Stacked PRs:


Co-author: @yf225

Epilogue Subtiling

Epilogue subtiling is added as an opt-in feature for now, because complex epilogues (such as loading a bias and adding it to the accumulator) are difficult to handle and are not currently supported. Most kernels also do not need epilogue subtiling: it is mainly useful for GEMMs in which the accumulator lives in TMEM on B200.
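
As a rough illustration of the idea (not the code in this PR), epilogue subtiling splits the accumulator tile along N into smaller pieces and runs the epilogue and store once per piece, so only one piece has to be processed at a time. The pure-Python sketch below uses nested lists in place of device tiles; `store_tile_subtiled`, `epilogue`, and the list-based tile layout are hypothetical names chosen for this example.

```python
from typing import Callable, List

Tile = List[List[float]]


def store_tile_subtiled(
    acc: Tile,
    out: Tile,
    row0: int,
    col0: int,
    num_subtiles: int = 2,
    epilogue: Callable[[float], float] = float,
) -> None:
    """Write acc into out starting at (row0, col0), one column subtile at a time."""
    block_n = len(acc[0])
    if block_n % num_subtiles != 0:
        raise ValueError("BLOCK_N must be divisible by the subtile count")
    sub_n = block_n // num_subtiles
    for s in range(num_subtiles):
        lo, hi = s * sub_n, (s + 1) * sub_n
        for i, row in enumerate(acc):
            # Apply the epilogue to this subtile only, then store it.
            out[row0 + i][col0 + lo : col0 + hi] = [epilogue(v) for v in row[lo:hi]]


# A 2x4 accumulator tile stored as two 2x2 subtiles.
acc = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
out = [[0.0] * 4 for _ in range(2)]
store_tile_subtiled(acc, out, row0=0, col0=0, num_subtiles=2)
assert out == acc
```

With `num_subtiles=1` the function degenerates to a single full-tile store, which corresponds to the non-subtiled path.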

GEMM CI shows a ~4% gain (0.88x with subtiling versus 0.84x without), and epilogue_subtiling=[2] is often picked as the final config.

PaulZhang12 added a commit that referenced this pull request Oct 15, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 15, 2025
@jansel jansel left a comment

Does this help with matmul perf?

PaulZhang12 added a commit that referenced this pull request Oct 16, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
PaulZhang12 added a commit that referenced this pull request Oct 17, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
PaulZhang12 added a commit that referenced this pull request Oct 20, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
@PaulZhang12 PaulZhang12 changed the base branch from main to PaulZhang12/stack/16 October 20, 2025 19:20
@PaulZhang12 PaulZhang12 changed the base branch from PaulZhang12/stack/16 to main October 20, 2025 19:22
@PaulZhang12 PaulZhang12 changed the base branch from main to PaulZhang12/stack/16 October 20, 2025 19:22
@PaulZhang12 PaulZhang12 changed the base branch from PaulZhang12/stack/16 to main October 20, 2025 19:24
@PaulZhang12 PaulZhang12 changed the base branch from main to PaulZhang12/stack/16 October 20, 2025 19:25
@PaulZhang12 PaulZhang12 changed the base branch from PaulZhang12/stack/16 to main October 20, 2025 19:28
PaulZhang12 added a commit that referenced this pull request Oct 27, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
PaulZhang12 added a commit that referenced this pull request Oct 30, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
jansel commented Nov 2, 2025

Any perf data on this one?

PaulZhang12 added a commit that referenced this pull request Nov 3, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
PaulZhang12 added a commit that referenced this pull request Nov 5, 2025
stack-info: PR: #948, branch: PaulZhang12/stack/14
@jansel jansel left a comment

Let's turn allow_epilogue_subtiling on by default then do a full test run to shake out any issues (then turn it off by default again).


lowering = current.meta.get("lowering")
# Check if this is a pointwise operation with only one user
if isinstance(lowering, PointwiseLowering) and len(current.users) == 1:

Can you explain the users==1 requirement? Is this meant to ensure everything is contained in the same graph? Maybe we should check this constraint more directly.

Comment on lines +1402 to +1403
if current not in pointwise_nodes:
    pointwise_nodes[current] = None

Suggested change
- if current not in pointwise_nodes:
-     pointwise_nodes[current] = None
+ pointwise_nodes.setdefault(current)
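
A small sketch of why the two forms are interchangeable when a dict with `None` values serves as an insertion-ordered set. The node names are made up for the example.

```python
# A dict with None values used as an insertion-ordered set.
explicit: dict = {}
concise: dict = {}

for node in ["mul", "add", "mul", "relu"]:
    # Original form: guard, then assign.
    if node not in explicit:
        explicit[node] = None
    # Suggested form: setdefault inserts None only when the key is missing.
    concise.setdefault(node)

assert explicit == concise
assert list(concise) == ["mul", "add", "relu"]
```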


for node in graph.nodes:
    if node.op == "call_function" and node.target == store_api:
        stores.add(node)

Where is this used?

Comment on lines +1372 to +1375
# Register a tunable for epilogue subtile for all device stores
fragment = ListOf(
    EnumFragment(choices=VALID_EPILOGUE_SUBTILE_SIZES), length=store_count
)

Move the fragment definition to the config spec.


for node in graph.nodes:
    if node.op == "call_function" and node.target == store_api:
        stores.add(node)

Unused?

Comment on lines +31 to +38
def _use_epilogue_subtile() -> bool:
    from .compile_environment import CompileEnvironment

    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (10, 0)
        and CompileEnvironment.current().settings.allow_epilogue_subtiling
    )

Move this logic to CompileEnvironment and only compute it once.
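
One way to compute the flag once per environment is `functools.cached_property`. The sketch below is only an illustration: `EnvSketch` and `probe_capability` are hypothetical stand-ins (the real CompileEnvironment is not shown in this thread), and the device query is injected as a callable so the example runs without a GPU.

```python
from functools import cached_property
from typing import Callable, List, Tuple


class EnvSketch:
    """Hypothetical stand-in for a compile environment; not the project's class."""

    def __init__(
        self,
        allow_epilogue_subtiling: bool,
        probe_capability: Callable[[], Tuple[int, int]],
    ) -> None:
        self.allow_epilogue_subtiling = allow_epilogue_subtiling
        self._probe_capability = probe_capability

    @cached_property
    def use_epilogue_subtile(self) -> bool:
        # Evaluated on first access, then cached on the instance.
        return self.allow_epilogue_subtiling and self._probe_capability() >= (10, 0)


calls: List[int] = []


def fake_probe() -> Tuple[int, int]:
    # Records each invocation so the caching can be observed.
    calls.append(1)
    return (10, 0)


env = EnvSketch(allow_epilogue_subtiling=True, probe_capability=fake_probe)
assert env.use_epilogue_subtile and env.use_epilogue_subtile
assert len(calls) == 1  # the device query ran only once
```

Callers would then read `env.use_epilogue_subtile` instead of re-running the CUDA checks on every call.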


The changes in this file seem to duplicate a lot of codegen logic. IMO it would be cleaner to frame this as a graph transformation rather than trying to cram so much logic into the handling of store codegen.

return not (block_n_hint % 2 != 0 or block_size <= 16)


def _get_accumulator_subtiles(

Do we need to special-case accumulators? I think the concept of subtiling is more generic.

return output_shape


def _can_epilogue_subtile_with_output_shape(

This check should happen before we add the option to the configspec.
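
A minimal sketch of that ordering, under stated assumptions: `StoreInfo`, `register_subtile_tunable`, the dict-based spec, and `ASSUMED_SUBTILE_CHOICES` are all hypothetical, and the eligibility predicate is borrowed from the `return not (...)` line quoted earlier in this review rather than from `_can_epilogue_subtile_with_output_shape`, whose body is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed choices for illustration; the PR's VALID_EPILOGUE_SUBTILE_SIZES is not shown here.
ASSUMED_SUBTILE_CHOICES: Tuple[int, ...] = (1, 2)


@dataclass
class StoreInfo:
    """Hypothetical summary of one device store."""

    block_n_hint: int
    block_size: int


def can_subtile(store: StoreInfo) -> bool:
    # Same condition as `not (block_n_hint % 2 != 0 or block_size <= 16)`,
    # with the negation distributed.
    return store.block_n_hint % 2 == 0 and store.block_size > 16


def register_subtile_tunable(spec: Dict[str, dict], stores: List[StoreInfo]) -> None:
    # Filter first, so the tunable only covers stores that can actually be subtiled.
    eligible = [s for s in stores if can_subtile(s)]
    if eligible:
        spec["epilogue_subtiling"] = {
            "choices": ASSUMED_SUBTILE_CHOICES,
            "length": len(eligible),
        }


spec: Dict[str, dict] = {}
register_subtile_tunable(spec, [StoreInfo(64, 64), StoreInfo(63, 64), StoreInfo(16, 16)])
assert spec["epilogue_subtiling"]["length"] == 1
```

Checking first keeps the search space free of options that would be rejected later during codegen.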

value=store_value,
)

def codegen_store_subtile(

This looks like a lot of duplicated code. We should refactor things to share more.
