Description
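I ran the example in https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm and want to save the model with torch.jit.script, but the call to torch.jit.script(quantize_model) fails with a NotImplementedError raised from torchrec/distributed/embedding_types.py (in shardings).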
Command:
export LEARNING_RATE=0.5;
torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py -- --batch_size 2048 --learning_rate $LEARNING_RATE --dataset_name criteo_kaggle --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 --embedding_dim 128 --over_arch_layer_sizes 1024,1024,512,256,1 --dense_arch_layer_sizes 512,256,128 --epochs 1 --validation_freq_within_epoch 12802
Logs:
torchx 2025-11-19 06:46:19 INFO Tracker configurations: {}
torchx 2025-11-19 06:46:19 INFO Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-11-19 06:46:19 INFO Log directory is: /tmp/torchx_z2d00ny6
local_cwd://torchx/dlrm_main-vm9krtsx5bpnjd
torchx 2025-11-19 06:46:19 INFO Waiting for the app to finish...
dlrm_main/0 [0]:PARAMS: (lr, batch_size, warmup_steps, decay_start, decay_steps): (0.5, 2048, 0, 0, 0)
dlrm_main/0 [0]:/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:860: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
dlrm_main/0 [0]: warnings.warn(
dlrm_main/0 [0]:
dlrm_main/0 [0]:Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]dlrm_main/0 [0]:
dlrm_main/0 [0]:Epoch 0: 10%|█ | 1/10 [00:00<00:03, 3.00it/s]dlrm_main/0 [0]:
dlrm_main/0 [0]:Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 25.97it/s]
dlrm_main/0 [0]:
dlrm_main/0 [0]:Evaluating val set: 0%| | 0/10 [00:00<?, ?it/s]dlrm_main/0 [0]:Total number of iterations: 10
dlrm_main/0 [0]:
dlrm_main/0 [0]:Evaluating val set: 50%|█████ | 5/10 [00:00<00:00, 48.80it/s]/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
dlrm_main/0 [0]: warnings.warn( # warn only once
dlrm_main/0 [0]:
dlrm_main/0 [0]:Evaluating val set: 100%|██████████| 10/10 [00:00<00:00, 75.09it/s]
dlrm_main/0 [0]:
dlrm_main/0 [0]:Evaluating test set: 0%| | 0/10 [00:00<?, ?it/s]dlrm_main/0 [0]:AUROC over val set: 0.5073344707489014.
dlrm_main/0 [0]:Number of val samples: 20480
dlrm_main/0 [0]:
dlrm_main/0 [0]:Evaluating test set: 100%|██████████| 10/10 [00:00<00:00, 192.46it/s]
dlrm_main/0 [0]:[rank0]: Traceback (most recent call last):
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/torchrec_dlrm/dlrm_main.py", line 737, in <module>
dlrm_main/0 [0]:[rank0]: invoke_main() # pragma: no cover
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/torchrec_dlrm/dlrm_main.py", line 733, in invoke_main
dlrm_main/0 [0]:[rank0]: main(sys.argv[1:])
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/torchrec_dlrm/dlrm_main.py", line 727, in main
dlrm_main/0 [0]:[rank0]: script_model = torch.jit.script(quantize_model)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_script.py", line 1443, in script
dlrm_main/0 [0]:[rank0]: ret = _script_impl(
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_script.py", line 1152, in _script_impl
dlrm_main/0 [0]:[rank0]: return torch.jit._recursive.create_script_module(
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 554, in create_script_module
dlrm_main/0 [0]:[rank0]: concrete_type = get_module_concrete_type(nn_module, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 285, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: sub_concrete_type = get_module_concrete_type(item, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 285, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: sub_concrete_type = get_module_concrete_type(item, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 285, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: sub_concrete_type = get_module_concrete_type(item, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 285, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: sub_concrete_type = get_module_concrete_type(item, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 285, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: sub_concrete_type = get_module_concrete_type(item, share_types)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 503, in get_module_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type = concrete_type_store.get_or_create_concrete_type(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 435, in get_or_create_concrete_type
dlrm_main/0 [0]:[rank0]: concrete_type_builder = infer_concrete_type_builder(nn_module)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 338, in infer_concrete_type_builder
dlrm_main/0 [0]:[rank0]: get_overload_annotations(nn_module, ignored_properties)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/jit/_recursive.py", line 741, in get_overload_annotations
dlrm_main/0 [0]:[rank0]: item = getattr(mod, name, None)
dlrm_main/0 [0]:[rank0]: File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torchrec/distributed/embedding_types.py", line 345, in shardings
dlrm_main/0 [0]:[rank0]: raise NotImplementedError
dlrm_main/0 [0]:[rank0]: NotImplementedError
dlrm_main/0 [0]:AUROC over test set: 0.49990183115005493.
dlrm_main/0 [0]:Number of test samples: 20480
dlrm_main/0 [0]:[rank0]:[W1119 06:47:21.666698363 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
dlrm_main/0 E1119 06:47:23.065000 28249 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 28318) of binary: /workspace/dlrm/.venv/bin/python
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0 File "/workspace/dlrm/.venv/bin/torchrun", line 8, in <module>
dlrm_main/0 sys.exit(main())
dlrm_main/0 File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
dlrm_main/0 return f(*args, **kwargs)
dlrm_main/0 File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
dlrm_main/0 run(args)
dlrm_main/0 File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
dlrm_main/0 elastic_launch(
dlrm_main/0 File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
dlrm_main/0 return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0 File "/workspace/dlrm/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
dlrm_main/0 raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 <NO_OTHER_FAILURES>
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0 time : 2025-11-19_06:47:22
dlrm_main/0 host : pytorch-6df7c88674-6sh5d
dlrm_main/0 rank : 0 (local_rank: 0)
dlrm_main/0 exitcode : 1 (pid: 28318)
dlrm_main/0 error_file: <N/A>
dlrm_main/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2025-11-19 06:47:23 INFO Job finished: FAILED
torchx 2025-11-19 06:47:23 ERROR AppStatus:
msg: <NONE>
num_restarts: 0
roles: []
state: FAILED (5)
structured_error_msg: <NONE>
ui_url: file:///tmp/torchx_z2d00ny6/torchx/dlrm_main-vm9krtsx5bpnjd
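My reading of the traceback, as a minimal sketch without torch or torchrec: `get_overload_annotations` calls `getattr(mod, name, None)`, and the default value of `getattr` only covers `AttributeError`. If `shardings` is a property whose body raises `NotImplementedError` (which the last frame suggests), the exception propagates instead of being replaced by `None`. The class `ShardingMixInSketch` and the helper `probe` below are hypothetical stand-ins I made up for illustration; they are not TorchRec or PyTorch code.

```python
class ShardingMixInSketch:
    """Hypothetical stand-in for a base class with an unimplemented property."""

    @property
    def shardings(self):
        # Assumption: mirrors the `raise NotImplementedError` seen in the last frame.
        raise NotImplementedError


def probe(obj, name):
    """Mimic the getattr(mod, name, None) call shown in the traceback."""
    try:
        # The default (None) is only used when AttributeError is raised.
        return "ok", getattr(obj, name, None)
    except NotImplementedError:
        # Any other exception from the property body escapes getattr.
        return "raised NotImplementedError", None


mod = ShardingMixInSketch()
print(probe(mod, "missing_attr"))  # attribute does not exist, default is returned
print(probe(mod, "shardings"))     # property body runs and raises
```

If this reading is right, the failure would not depend on the model weights or the dataset, only on the scripted model containing a module that inherits the unimplemented `shardings` property.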