System Info
- `Accelerate` version: 1.12.0.dev0
- Platform: Linux-5.15.182.1-1.cm2-x86_64-with-glibc2.35
- `accelerate` bash location: /home/jobuser/.local/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version: 2.8.0.3+cu128
- PyTorch accelerator: CUDA
- System RAM: 2267.29 GB
- GPU type: NVIDIA H100 80GB HBM3
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Use accelerate to launch the training script with an accelerate config file:
accelerate launch --config_file accelerate_fsdp2_fp8.conf \
--rdzv_conf "join_timeout=900" \
--num_machines $NUM_NODES \
--num_processes $WORLD_SIZE \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank $RANK \
training.py \
--data_path "$DATA_PATH" \
--logging_dir "$LOG_FULL_PATH" \
--model_path "$LOCAL_MODEL_PATH" \
--dataset_text_field "prompt" \
--max_length 64 \
--bf16 False \
...
where accelerate_fsdp2_fp8.conf looks like:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_version: 2
main_training_function: main
mixed_precision: fp8
fp8_config:
  backend: AO
num_machines: 1
num_processes: 1
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Problem: If you instrument your own training loop, it's easy to initialize the Accelerator object with whatever plugins/kwargs handlers you need (see https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/fsdp2_fp8.py). However, when using the transformers Trainer/SFTTrainer, the Trainer initializes the accelerator object itself and instantiates plugins from environment variables set during `accelerate launch --config_file acc_config.yaml`. There is currently no support for passing torchao parameters this way, and some of them may be hard to express in a config file.
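For comparison, here is a minimal sketch of the loop-level approach, adapted from the fsdp2_fp8.py example linked above; `AORecipeKwargs`, `FullyShardedDataParallelPlugin(fsdp_version=2)`, and torchao's `Float8LinearConfig` knobs are taken from that example and may differ slightly across versions:

```python
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs, FullyShardedDataParallelPlugin
from torchao.float8 import Float8LinearConfig

# fp8 recipe passed explicitly via a kwargs handler -- exactly what the Trainer
# path currently has no way to provide.
fp8_config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,  # all-gather params in fp8, upcast afterwards
)

accelerator = Accelerator(
    mixed_precision="fp8",
    fsdp_plugin=FullyShardedDataParallelPlugin(fsdp_version=2),
    kwargs_handlers=[AORecipeKwargs(config=fp8_config)],
)

# model/optimizer would then go through accelerator.prepare(...) as usual
```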
With the Trainer, this fails with the error below because no default `ao_recipe_handler.config` is initialized:
[rank0]: return inner_training_loop(
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1545, in prepare
[rank0]: args = self._prepare_ao(*args)
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2013, in _prepare_ao
[rank0]: if self.is_fsdp2 and len(optimizers) > 0 and self.ao_recipe_handler.config.enable_fsdp_float8_all_gather:
[rank0]: AttributeError: 'NoneType' object has no attribute 'enable_fsdp_float8_all_gather'
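One possible local workaround (a sketch only; the class name is made up here, and it assumes `Trainer.create_accelerator_and_postprocess` and the accelerator's `ao_recipe_handler` attribute behave as the traceback above suggests) is to fill in a default config after the Trainer builds its accelerator:

```python
from torchao.float8 import Float8LinearConfig
from transformers import Trainer

class AODefaultsTrainer(Trainer):  # hypothetical subclass, not an official API
    def create_accelerator_and_postprocess(self):
        super().create_accelerator_and_postprocess()
        # ao_recipe_handler exists but its config is None, which is what crashes
        # _prepare_ao above; give it a default so prepare() can proceed.
        handler = getattr(self.accelerator, "ao_recipe_handler", None)
        if handler is not None and getattr(handler, "config", None) is None:
            handler.config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
```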
After fixing this, the following error also appears:
[rank0]: tensor_out = addmm_float8_unwrapped(
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/torchao/float8/float8_ops.py", line 69, in addmm_float8_unwrapped
[rank0]: output = torch._scaled_mm(
[rank0]: RuntimeError: Expected trailing dimension of mat1 to be divisible by 16 but got mat1 shape: (3072x282).
Expected behavior
If users are already using the config-file-based approach with the Trainer class, then we should set some accelerator defaults that make it work for most cases (e.g. filter out incompatible linear layers, enable fp8 all-gather, pad the inner dim) when the config simply contains:
mixed_precision: fp8
fp8_config:
  backend: AO
We can also add support for setting torchao configs in `TrainingArguments` to override the default accelerate configs.