Torchao fp8 fails if using accelerate config file with Trainer #3830

@shimizust

Description

System Info

- `Accelerate` version: 1.12.0.dev0
- Platform: Linux-5.15.182.1-1.cm2-x86_64-with-glibc2.35
- `accelerate` bash location: /home/jobuser/.local/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version: 2.8.0.3+cu128
- PyTorch accelerator: CUDA
- System RAM: 2267.29 GB
- GPU type: NVIDIA H100 80GB HBM3

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Use accelerate to launch training script using accelerate config file

accelerate launch --config_file accelerate_fsdp2_fp8.conf \
    --rdzv_conf "join_timeout=900" \
    --num_machines $NUM_NODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank $RANK \
    training.py \
      --data_path "$DATA_PATH" \
      --logging_dir "$LOG_FULL_PATH" \
      --model_path "$LOCAL_MODEL_PATH" \
      --dataset_text_field "prompt" \
      --max_length 64 \
      --bf16 False \
      ...

where accelerate_fsdp2_fp8.conf looks like:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_version: 2
main_training_function: main
mixed_precision: fp8
fp8_config:
  backend: AO
num_machines: 1
num_processes: 1
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Problem: If you write your own training loop, it's easy to initialize the Accelerator object with whatever plugins/kwargs handlers you need (see https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/fsdp2_fp8.py). However, when using the transformers Trainer/SFTTrainer, the trainer initializes the accelerator object itself and instantiates plugins from environment variables set during accelerate launch --config_file acc_config.yaml. There is currently no support for passing any torchao parameters this way, and some of them may be hard to set via a config file.
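
For context, a rough sketch of the custom-loop path mentioned above, based on the linked fsdp2_fp8.py example (the exact AORecipeKwargs/Float8LinearConfig fields are assumptions and may vary by accelerate/torchao version):

# Sketch: building the Accelerator by hand with a torchao fp8 handler,
# instead of relying on the env vars set by `accelerate launch --config_file`.
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs
from torchao.float8 import Float8LinearConfig

fp8_config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,  # fp8 all-gather for the FSDP2 path
)

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[AORecipeKwargs(config=fp8_config)],
)
# model, optimizer = accelerator.prepare(model, optimizer)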

This produces the following error, because no default ao_recipe_handler.config is initialized:

[rank0]:     return inner_training_loop(
[rank0]:   File "/home/jobuser/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:   File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1545, in prepare
[rank0]:     args = self._prepare_ao(*args)
[rank0]:   File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2013, in _prepare_ao
[rank0]:     if self.is_fsdp2 and len(optimizers) > 0 and self.ao_recipe_handler.config.enable_fsdp_float8_all_gather:
[rank0]: AttributeError: 'NoneType' object has no attribute 'enable_fsdp_float8_all_gather'

After working around that, this error appears as well:

[rank0]: tensor_out = addmm_float8_unwrapped(
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/torchao/float8/float8_ops.py", line 69, in addmm_float8_unwrapped
[rank0]: output = torch._scaled_mm(
[rank0]: RuntimeError: Expected trailing dimension of mat1 to be divisible by 16 but got mat1 shape: (3072x282).
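
This second failure comes from torch._scaled_mm requiring the inner (K) dimension to be a multiple of 16, and 282 here is not. Below is a hedged sketch of how one could avoid it with torchao's public API, assuming Float8LinearConfig's pad_inner_dim option and convert_to_float8_training's module_filter_fn are available in the installed version:

# Sketch: work around the "divisible by 16" requirement of torch._scaled_mm
# either by skipping incompatible linear layers or by padding the inner dim.
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

def skip_unaligned_linears(module: nn.Module, fqn: str) -> bool:
    # Convert only nn.Linear layers whose dims satisfy the divisibility rule.
    return (
        isinstance(module, nn.Linear)
        and module.in_features % 16 == 0
        and module.out_features % 16 == 0
    )

# Toy model mirroring the shapes from the error above.
model = nn.Sequential(nn.Linear(3072, 282), nn.Linear(282, 3072))
config = Float8LinearConfig(pad_inner_dim=True)  # alternative: pad instead of skip
convert_to_float8_training(model, config=config, module_filter_fn=skip_unaligned_linears)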

Expected behavior

If users are already using the config-file-based approach with the Trainer class, Accelerate could set some defaults that make it work for most cases (e.g. filter linear layers, enable fp8 all-gather, pad the inner dim), so that this minimal config is enough:

mixed_precision: fp8
fp8_config:
  backend: AO

We could also add support for setting torchao configs in TrainingArguments to override the default accelerate configs.
