Description
We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e., on a machine that has multiple GPUs available, and it also works fine on a single GPU, but it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be an issue with the master address/port configuration, where it's trying to connect to the GPU but keeps waiting.
Run command:
```shell
ALLOW_DOWNLOADS=true WORLD_SIZE=2 RANK=0 MASTER_ADDR=localhost MASTER_PORT=25590 python3 run_peft_tuning.py PROMPT_TUNING --dataset "glue/rte" --model_name google/flan-t5-xl --num_epochs 1 --verbose --prompt_tuning_init TEXT --output_dir prompt_prefixes/flan_t5_xl_1_epoch_rte_16_batch_1_acc_hf_trainer --learning_rate 0.3 --batch_size=16 --accumulate_steps 1 --max_target_length 512 --max_source_length 2048 --torch_dtype bfloat16
```
Relevant code to launch the training:
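The original snippet was not captured here, so the block below is only a minimal sketch of how a single-node elastic launch is typically wired up with `torch.distributed.launcher.api`; the `train` entry point and the `run_id` are placeholders, and WORLD_SIZE/MASTER_ADDR/MASTER_PORT are read from the environment as in the run command above.

```python
# Minimal sketch (not the exact code from run_peft_tuning.py) of a
# single-node elastic launch; `train` and the run_id are illustrative.
import os

from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train():
    # The actual PEFT prompt-tuning training loop would run here,
    # one copy per spawned worker process.
    ...


if __name__ == "__main__":
    master_addr = os.getenv("MASTER_ADDR", "localhost")
    master_port = os.getenv("MASTER_PORT", "29500")
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=world_size,                     # one worker per GPU on this node
        rdzv_backend="c10d",                           # rendezvous over TCP
        rdzv_endpoint=f"{master_addr}:{master_port}",  # where workers meet
        run_id="peft-tuning",
        max_restarts=0,
    )
    # elastic_launch blocks until all worker processes finish; if the
    # rendezvous on MASTER_ADDR:MASTER_PORT never completes, it waits here.
    elastic_launch(config, train)()
```

With a configuration like this, a hang right at startup generally means the workers never complete the rendezvous on MASTER_ADDR:MASTER_PORT, which matches the behavior described above.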