-
Notifications
You must be signed in to change notification settings - Fork 68
Add epilogue subtiling #948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
stack-info: PR: #948, branch: PaulZhang12/stack/14
cf439ac to
fcc7492
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
fcc7492 to
cdbedf6
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
cdbedf6 to
58496fb
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
58496fb to
965b193
Compare
jansel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this help with matmul perf?
stack-info: PR: #948, branch: PaulZhang12/stack/14
965b193 to
2bc36d0
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
2bc36d0 to
1c1e282
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
1c1e282 to
cccb0af
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
cccb0af to
a6dd082
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
a6dd082 to
88d46a8
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
88d46a8 to
48eed82
Compare
0c3d607 to
bdf0793
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
bdf0793 to
9856699
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
9856699 to
3ae89e1
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
3ae89e1 to
0ef154f
Compare
|
Any perf data on this one? |
stack-info: PR: #948, branch: PaulZhang12/stack/14
0ef154f to
4e19822
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
4e19822 to
7e8b05e
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
7e8b05e to
a8c83b6
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
a8c83b6 to
4ebc4f1
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
4ebc4f1 to
5b75ab2
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
5b75ab2 to
f48a0a3
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
f48a0a3 to
95d9ef0
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
95d9ef0 to
a9d2372
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
a9d2372 to
a56a3b7
Compare
stack-info: PR: #948, branch: PaulZhang12/stack/14
a56a3b7 to
cd3553e
Compare
jansel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's turn allow_epilogue_subtiling on by default then do a full test run to shake out any issues (then turn it off by default again).
|
|
||
| lowering = current.meta.get("lowering") | ||
| # Check if this is a pointwise operation with only one user | ||
| if isinstance(lowering, PointwiseLowering) and len(current.users) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the users==1 requirement? Is this meant to ensure everything is contained in the same graph? Maybe we should check this constraint more directly.
| if current not in pointwise_nodes: | ||
| pointwise_nodes[current] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if current not in pointwise_nodes: | |
| pointwise_nodes[current] = None | |
| pointwise_nodes.setdefault(current) |
|
|
||
| for node in graph.nodes: | ||
| if node.op == "call_function" and node.target == store_api: | ||
| stores.add(node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this used?
| # Register a tunable for epilogue subtile for all device stores | ||
| fragment = ListOf( | ||
| EnumFragment(choices=VALID_EPILOGUE_SUBTILE_SIZES), length=store_count | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the fragment defnition to config spec.
|
|
||
| for node in graph.nodes: | ||
| if node.op == "call_function" and node.target == store_api: | ||
| stores.add(node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused?
| def _use_epilogue_subtile() -> bool: | ||
| from .compile_environment import CompileEnvironment | ||
|
|
||
| return ( | ||
| torch.cuda.is_available() | ||
| and torch.cuda.get_device_capability() >= (10, 0) | ||
| and CompileEnvironment.current().settings.allow_epilogue_subtiling | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this logic to CompileEnvironment and only compute it once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this file seem to duplicate a lot of codegen logic. IMO it would be cleaner to frame this as a graph transformation rather than trying cram so much logic inside the handling of store codegen.
| return not (block_n_hint % 2 != 0 or block_size <= 16) | ||
|
|
||
|
|
||
| def _get_accumulator_subtiles( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to special case accumulators? I think the concept of subtiling is more generic.
| return output_shape | ||
|
|
||
|
|
||
| def _can_epilogue_subtile_with_output_shape( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should happen before we add the option to the configspec.
| value=store_value, | ||
| ) | ||
|
|
||
| def codegen_store_subtile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a lot of duplicated code. We should refactor things to share more.
Stacked PRs:
Co-author: @yf225
Epilogue Subtiling
Add it as an opt-in feature currently, as support for complex epilogues (such as loading a bias + adding to accumulator) is difficult and not currently supported. Furthermore, most kernels do not require epilogue subtiling, as it is generally useful for GEMMs in which the accumulator lives in TMEM for B200.
GEMM CI exhibits ~4% gain, epilogue_subtiling=[2] is often picked as the final config, 0.88x with subtiling, 0.84x without
