Description
We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e., on a machine that has multiple GPUs available, and it also works fine on a single GPU, but it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be an issue with the master address/port configuration, where it's trying to connect to the GPU but keeps waiting.
Run command:
```shell
ALLOW_DOWNLOADS=true WORLD_SIZE=2 RANK=0 MASTER_ADDR=localhost MASTER_PORT=25590 python3 run_peft_tuning.py PROMPT_TUNING --dataset "glue/rte" --model_name google/flan-t5-xl --num_epochs 1 --verbose --prompt_tuning_init TEXT --output_dir prompt_prefixes/flan_t5_xl_1_epoch_rte_16_batch_1_acc_hf_trainer --learning_rate 0.3 --batch_size=16 --accumulate_steps 1 --max_target_length 512 --max_source_length 2048 --torch_dtype bfloat16
```
Relevant code to launch the training:
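The original snippet was not captured here, so the block below is only a minimal sketch of how a single-node elastic launch is typically wired up with `torch.distributed.launcher.api`; the `train` entry point and the `run_id` are placeholders, and WORLD_SIZE/MASTER_ADDR/MASTER_PORT are read from the environment as in the run command above.

```python
# Minimal sketch (not the exact code from run_peft_tuning.py) of a
# single-node elastic launch; `train` and the run_id are illustrative.
import os

from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train():
    # The actual PEFT prompt-tuning training loop would run here,
    # one copy per spawned worker process.
    ...


if __name__ == "__main__":
    master_addr = os.getenv("MASTER_ADDR", "localhost")
    master_port = os.getenv("MASTER_PORT", "29500")
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=world_size,                     # one worker per GPU on this node
        rdzv_backend="c10d",                           # rendezvous over TCP
        rdzv_endpoint=f"{master_addr}:{master_port}",  # where workers meet
        run_id="peft-tuning",
        max_restarts=0,
    )
    # elastic_launch blocks until all worker processes finish; if the
    # rendezvous on MASTER_ADDR:MASTER_PORT never completes, it waits here.
    elastic_launch(config, train)()
```

With a configuration like this, a hang right at startup generally means the workers never complete the rendezvous on MASTER_ADDR:MASTER_PORT, which matches the behavior described above.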