System Info
- `Accelerate` version: 1.12.0.dev0
- Platform: Linux-5.15.182.1-1.cm2-x86_64-with-glibc2.35
- `accelerate` bash location: /home/jobuser/.local/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version: 2.8.0.3+cu128
- PyTorch accelerator: CUDA
- System RAM: 2267.29 GB
- GPU type: NVIDIA H100 80GB HBM3
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Use accelerate to launch the training script with an accelerate config file:
accelerate launch --config_file accelerate_fsdp2_fp8.conf \
--rdzv_conf "join_timeout=900" \
--num_machines $NUM_NODES \
--num_processes $WORLD_SIZE \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank $RANK \
training.py \
--data_path "$DATA_PATH" \
--logging_dir "$LOG_FULL_PATH" \
--model_path "$LOCAL_MODEL_PATH" \
--dataset_text_field "prompt" \
--max_length 64 \
--bf16 False \
...
where accelerate_fsdp2_fp8.conf looks like:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_version: 2
main_training_function: main
mixed_precision: fp8
fp8_config:
  backend: AO
num_machines: 1
num_processes: 1
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Problem: If you instrument your own training loop, it's easy to initialize the Accelerator object with whatever plugins/kwargs handlers you need (see https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/fsdp2_fp8.py). However, when using the transformers Trainer/SFTTrainer, the Trainer initializes the accelerator object itself and instantiates plugins from environment variables set during `accelerate launch --config_file acc_config.yaml`. There is currently no support for passing torchao parameters this way, and some of them may be hard to express in a config file.
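For comparison, here is a minimal sketch of the loop-level approach, adapted from the fsdp2_fp8.py example linked above; `AORecipeKwargs`, `FullyShardedDataParallelPlugin(fsdp_version=2)`, and torchao's `Float8LinearConfig` knobs are taken from that example and may differ slightly across versions:

```python
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs, FullyShardedDataParallelPlugin
from torchao.float8 import Float8LinearConfig

# fp8 recipe passed explicitly via a kwargs handler -- exactly what the Trainer
# path currently has no way to provide.
fp8_config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,  # all-gather params in fp8, upcast afterwards
)

accelerator = Accelerator(
    mixed_precision="fp8",
    fsdp_plugin=FullyShardedDataParallelPlugin(fsdp_version=2),
    kwargs_handlers=[AORecipeKwargs(config=fp8_config)],
)

# model/optimizer would then go through accelerator.prepare(...) as usual
```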
With the Trainer, this fails with the error below because no default `ao_recipe_handler.config` is initialized:
[rank0]: return inner_training_loop(
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1545, in prepare
[rank0]: args = self._prepare_ao(*args)
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2013, in _prepare_ao
[rank0]: if self.is_fsdp2 and len(optimizers) > 0 and self.ao_recipe_handler.config.enable_fsdp_float8_all_gather:
[rank0]: AttributeError: 'NoneType' object has no attribute 'enable_fsdp_float8_all_gather'
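One possible local workaround (a sketch only; the class name is made up here, and it assumes `Trainer.create_accelerator_and_postprocess` and the accelerator's `ao_recipe_handler` attribute behave as the traceback above suggests) is to fill in a default config after the Trainer builds its accelerator:

```python
from torchao.float8 import Float8LinearConfig
from transformers import Trainer

class AODefaultsTrainer(Trainer):  # hypothetical subclass, not an official API
    def create_accelerator_and_postprocess(self):
        super().create_accelerator_and_postprocess()
        # ao_recipe_handler exists but its config is None, which is what crashes
        # _prepare_ao above; give it a default so prepare() can proceed.
        handler = getattr(self.accelerator, "ao_recipe_handler", None)
        if handler is not None and getattr(handler, "config", None) is None:
            handler.config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
```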
After fixing this, the following error also appears:
[rank0]: tensor_out = addmm_float8_unwrapped(
[rank0]: File "/home/jobuser/.local/lib/python3.10/site-packages/torchao/float8/float8_ops.py", line 69, in addmm_float8_unwrapped
[rank0]: output = torch._scaled_mm(
[rank0]: RuntimeError: Expected trailing dimension of mat1 to be divisible by 16 but got mat1 shape: (3072x282).
Expected behavior
If users are already using the config-file-based approach with the Trainer class, then we should set some accelerator defaults that make it work for most cases (e.g. filter out incompatible linear layers, enable fp8 all-gather, pad the inner dim) when the config simply contains:
mixed_precision: fp8
fp8_config:
  backend: AO
We can also add support for setting torchao configs in `TrainingArguments` to override the default accelerate configs.