TE MXFP8 support
We've added support for MXFP8 in our TransformerEngine integration. To use it, set `use_mxfp8_block_scaling` in `fp8_config`. See the NVIDIA docs [here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling).
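As a sketch, an accelerate config file enabling this might look like the following (field names other than `fp8_config` and `use_mxfp8_block_scaling` are assumptions; verify against the output of `accelerate config` for your version):

```yaml
# sketch of an accelerate config file enabling MXFP8 block scaling
# (requires the TransformerEngine backend and supported hardware)
mixed_precision: fp8
fp8_config:
  backend: TE                     # TransformerEngine
  use_mxfp8_block_scaling: true   # new in this release
```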
FP16/BF16 Training for MPS devices
BF16 and FP16 support for MPS devices is finally here. You can now pass `mixed_precision="fp16"` or `"bf16"` when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).
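For example, a minimal accelerate config file for a single-device MPS run could set (a sketch; check `accelerate config` output for your version):

```yaml
# minimal sketch for Apple Silicon (MPS) training
# bf16 needs torch >= 2.6; fp16 needs torch >= 2.8
mixed_precision: bf16   # or "fp16"
```

The same value can be passed directly as `Accelerator(mixed_precision="bf16")` or via `accelerate launch --mixed_precision bf16`.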
FSDP updates
The following PRs add support for `ignored_params` and `no_sync()`, respectively, to FSDP2:
- feat: add ignored_params support for fsdp2 by @kmehant in #3731
- fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in #3762
Mixed precision can now be passed as a dtype string via the accelerate CLI flag or `fsdp_config` in the accelerate config file.
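A rough sketch of what that could look like in the config file (the exact key name inside `fsdp_config` is an assumption here; verify it against `accelerate config` for your version):

```yaml
# sketch: mixed precision as a dtype string for FSDP2
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  mixed_precision_policy: bf16   # assumed key name; a plain dtype string now works
```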
Nd-parallel updates
Some minor updates concerning nd-parallelism.
- Context Parallelism docs typos fixed by @sergiopaniego in #3761
- Feat: add to_json by @S1ro1 in #3743
- make torch_native_parallelism examples device agnostic by @yao-matrix in #3759
- [ND Parallel] Update examples, cleanup by @S1ro1 in #3737
Bump to Python 3.10
We've dropped support for Python 3.9 as it reached end of life in October.
Lots of minor fixes:
- fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in #3740
- xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in #3756
- Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in #3744
- fix: specify device for process_tensor in example usage by @qgallouedec in #3755
- Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in #3776
- Fix (skip) cuda cache flush when origin device is `cpu` and offloaded to `meta` by @Qubitium in #3796
- Fix convert LayerNorm without bias to fp8 by @mjun0812 in #3725
- Add optional typing by @cyyever in #3769
- refactor: use a `with` statement within `Accelerator.autocast()` instead of `__enter__()` and `__exit__()` for more elegant style by @EquationWalker in #3767
- switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in #3773
- fix FSDP2 test case failure on XPU by @yao-matrix in #3771
- Fix tests by @SunMarc in #3722
- Protect import for device_mesh by @SunMarc in #3742
- Fix `SWANLAB_MODE` by @SunMarc in #3808
- Fix tracking swanlab by @SunMarc in #3810
- refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in #3815
- Remove deprecated FindTiedParametersResult by @cyyever in #3786
- remove mlflow from testing by @SunMarc in #3783
- enable 2 model hook ut cases on XPU by @yao-matrix in #3774
- Added Tip for better rendering by @sergiopaniego in #3781
- Fix typos by @cyyever in #3753
- fix: torch_npu import error in some envs by @yanyongyu in #3764
- Fix: typo makes tests fail by @S1ro1 in #3765
- fix multi-node "CUDA error: invalid device ordinal" (#3775) by @RicardoDominguez in #3779
- use reset_peak_memory_stats on xpu by @yao-matrix in #3772
New Contributors
- @mjun0812 made their first contribution in #3725
- @sergiopaniego made their first contribution in #3761
- @EquationWalker made their first contribution in #3762
- @yanyongyu made their first contribution in #3764
- @RicardoDominguez made their first contribution in #3779
- @SamuelBarryCS made their first contribution in #3776
- @Qubitium made their first contribution in #3796
Full Changelog: v1.10.1...v1.11.0