
Commit 0a2cebd

TroyGarden authored and meta-codesync[bot] committed
add CUDA 12.9 unit test (#3592)
Summary:
Pull Request resolved: #3592

# context
* Previously CUDA 12.9 wasn't added to the unit tests due to an fbgemm compatibility issue.
* Now that fbgemm has the support, we added it to our test suite.
* For PRs we only run CUDA 12.9 with Python 3.13.

# issue fix
* The previous torchrec GitHub workflow for GPU unit tests was failing due to missing A100 support in fbgemm:
```
>>> a=torch.empty(3, device='cuda').int()
>>> torch.ops.fbgemm.permute_2D_sparse_data(a,b,a)
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    torch.ops.fbgemm.permute_2D_sparse_data(a,b,a)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/home/hhy/.conda/envs/ci/lib/python3.13/site-packages/torch/_ops.py", line 1237, in __call__
    return self._op(*args, **kwargs)
           ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hhy/.conda/envs/ci/lib/python3.13/site-packages/torch/_library/autograd.py", line 112, in autograd_impl
    result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
  File "/home/hhy/.conda/envs/ci/lib/python3.13/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad
    result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
  File "/home/hhy/.conda/envs/ci/lib/python3.13/site-packages/torch/_ops.py", line 822, in redispatch
    return self._handle.redispatch_boxed(keyset, *args, **kwargs)  # type: ignore[return-value]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/sparse_ops/sparse_permute_2d.cu(113:98)] [(permute_2D_lengths_kernel<index_t>)] CUDA Error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
* env setup
```
$ conda create -yn fbgemm python=3.13
$ conda activate fbgemm
$ pip install torch --index-url https://download.pytorch.org/whl/nightly/cu128
$ pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu128
$ python -i
```
* python cli
```
>>> import torch
>>> torch.ops.import_module("fbgemm_gpu.sparse_ops")
>>> a=torch.empty(3, device='cuda').int()
>>> b=torch.empty((3,3), device='cuda').long()
>>> torch.ops.fbgemm.permute_2D_sparse_data(a,b,a)
```
* resolved
> The issue is you're running on an A100 and fbgemm removed sm80 earlier in nightly; it was recently added back. So if you uninstall fbgemm-gpu and re-install it from today's release, it should fix your issue.

Reviewed By: aporialiao

Differential Revision: D88400551

fbshipit-source-id: 6da61d3841edcb3dcb51bbb8156822337be1ca78
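Following the resolution quoted above, here is a minimal verification sketch (not part of the original commit message; the compute-capability check is an added assumption) for confirming that a reinstalled nightly fbgemm-gpu wheel ships sm80 kernels again:

```
import torch

# An A100 reports compute capability (8, 0); the failing fbgemm nightly had dropped sm80 kernels.
print(torch.cuda.get_device_capability())  # expect (8, 0) on A100

# Register the fbgemm sparse ops, as in the repro above.
torch.ops.import_module("fbgemm_gpu.sparse_ops")

a = torch.empty(3, device="cuda").int()
b = torch.empty((3, 3), device="cuda").long()

# With an sm80-enabled fbgemm-gpu build this should no longer raise
# "no kernel image is available for execution on the device".
torch.ops.fbgemm.permute_2D_sparse_data(a, b, a)
```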
1 parent deead45 commit 0a2cebd

File tree

1 file changed (+6, −4 lines)


.github/workflows/unittest_ci.yml

Lines changed: 6 additions & 4 deletions
```
@@ -36,7 +36,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        cuda-tag: ["cu126", "cu128"]
+        cuda-tag: ["cu126", "cu128", "cu129"]
         os:
           - linux.g5.12xlarge.nvidia.gpu
         python:
@@ -55,18 +55,20 @@ jobs:
           cuda-tag: "cu126"
         - is_pr: true
           cuda-tag: "cu128"
+        - is_pr: true
+          cuda-tag: "cu129"
           python:
             version: "3.9"
         - is_pr: true
-          cuda-tag: "cu128"
+          cuda-tag: "cu129"
           python:
             version: "3.10"
         - is_pr: true
-          cuda-tag: "cu128"
+          cuda-tag: "cu129"
           python:
             version: "3.11"
         - is_pr: true
-          cuda-tag: "cu128"
+          cuda-tag: "cu129"
           python:
             version: "3.12"
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
```
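Since the hunks above only show a few context lines, here is a rough sketch of what the touched matrix section of `unittest_ci.yml` could look like after this change. The `exclude:` key, the assumption that `is_pr` is a matrix flag set on pull-request runs, the Python version list, and the exact indentation are inferred from the visible context, not copied from the file:

```
# Sketch only: structure inferred from the diff context, not the full workflow.
strategy:
  fail-fast: false
  matrix:
    cuda-tag: ["cu126", "cu128", "cu129"]
    os:
      - linux.g5.12xlarge.nvidia.gpu
    # python versions 3.9 .. 3.13 assumed from the visible entries
    exclude:
      - is_pr: true
        cuda-tag: "cu126"      # drop all cu126 jobs on PRs
      - is_pr: true
        cuda-tag: "cu128"      # drop all cu128 jobs on PRs
      - is_pr: true
        cuda-tag: "cu129"
        python:
          version: "3.9"       # keep only cu129 + 3.13 on PRs ...
      - is_pr: true
        cuda-tag: "cu129"
        python:
          version: "3.10"
      - is_pr: true
        cuda-tag: "cu129"
        python:
          version: "3.11"
      - is_pr: true
        cuda-tag: "cu129"
        python:
          version: "3.12"      # ... by excluding every other Python version
```

Read this way, a pull request only keeps the cu129 + Python 3.13 job, while the full cuda-tag x Python matrix still runs outside of PRs, matching the "for PRs we only run CUDA 12.9 with Python 3.13" note in the commit message.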
