
Multi-gpu prompt tuning hanging when running in kube cluster #271

@gkumbhat

Description

We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e., on a machine that has multiple GPUs available, and it also works fine on a single GPU, but it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be some issue with the master address/port configuration, where it is trying to connect with the GPU but keeps waiting.

Run command:

ALLOW_DOWNLOADS=true  WORLD_SIZE=2 RANK=0 MASTER_ADDR=localhost MASTER_PORT=25590  python3 run_peft_tuning.py PROMPT_TUNING --dataset "glue/rte"  --model_name google/flan-t5-xl --num_epochs 1 --verbose --prompt_tuning_init TEXT  --output_dir prompt_prefixes/flan_t5_xl_1_epoch_rte_16_batch_1_acc_hf_trainer --learning_rate 0.3 --batch_size=16 --accumulate_steps 1 --max_target_length 512 --max_source_length 2048 --torch_dtype bfloat16

Relevant code to launch the training:
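The actual launch snippet from run_peft_tuning.py was not captured here. For context, below is a minimal sketch of what an elastic launch along these lines typically looks like, using torch.distributed.launcher.api. The worker function, run_id, and argument values are hypothetical placeholders, not the project's real entrypoint; only the MASTER_ADDR/MASTER_PORT environment variables mirror the run command above.

```python
# Minimal sketch of a single-node, multi-GPU elastic launch (assumptions noted above).
import os

import torch
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train_worker(model_name: str, output_dir: str) -> None:
    """Hypothetical per-worker entrypoint; the launcher sets RANK,
    LOCAL_RANK, and WORLD_SIZE in each worker's environment."""
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # ... build the model/trainer and run prompt tuning here ...


def main() -> None:
    # Rendezvous endpoint built from the same env vars passed in the run command.
    master_addr = os.getenv("MASTER_ADDR", "localhost")
    master_port = os.getenv("MASTER_PORT", "29500")

    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=torch.cuda.device_count(),
        rdzv_backend="c10d",
        rdzv_endpoint=f"{master_addr}:{master_port}",
        run_id="peft_prompt_tuning",  # placeholder run id
        max_restarts=0,
    )
    # elastic_launch returns a callable; invoking it starts one process
    # per GPU, each of which runs train_worker(*args).
    elastic_launch(config, train_worker)("google/flan-t5-xl", "prompt_prefixes/out")


if __name__ == "__main__":
    main()
```

With this kind of setup, the rendezvous endpoint (MASTER_ADDR:MASTER_PORT) is what every worker must be able to reach; if it cannot, the launch blocks at rendezvous, which matches the hang described above.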
