TE MXFP8 support
We've added support for MXFP8 in our TransformerEngine integration. To use it, set `use_mxfp8_block_scaling` in `fp8_config`. See the NVIDIA docs [here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling).
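As a sketch, an accelerate config file enabling this might look like the following (field names other than `fp8_config` and `use_mxfp8_block_scaling` are assumptions; verify against the output of `accelerate config` for your version):

```yaml
# sketch of an accelerate config file enabling MXFP8 block scaling
# (requires the TransformerEngine backend and supported hardware)
mixed_precision: fp8
fp8_config:
  backend: TE                     # TransformerEngine
  use_mxfp8_block_scaling: true   # new in this release
```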
FP16/BF16 Training for MPS devices
BF16 and FP16 support for MPS devices is finally here. You can now pass `mixed_precision="fp16"` or `"bf16"` when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).
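For example, a minimal accelerate config file for a single-device MPS run could set (a sketch; check `accelerate config` output for your version):

```yaml
# minimal sketch for Apple Silicon (MPS) training
# bf16 needs torch >= 2.6; fp16 needs torch >= 2.8
mixed_precision: bf16   # or "fp16"
```

The same value can be passed directly as `Accelerator(mixed_precision="bf16")` or via `accelerate launch --mixed_precision bf16`.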
FSDP updates
The following PRs add support for `ignored_params` and `no_sync()`, respectively, to FSDP2:
- feat: add ignored_params support for fsdp2 by @kmehant in #3731
- fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in #3762
Mixed precision can now be passed as a dtype string via the accelerate CLI flag or `fsdp_config` in the accelerate config file.
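A rough sketch of what that could look like in the config file (the exact key name inside `fsdp_config` is an assumption here; verify it against `accelerate config` for your version):

```yaml
# sketch: mixed precision as a dtype string for FSDP2
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  mixed_precision_policy: bf16   # assumed key name; a plain dtype string now works
```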
Nd-parallel updates
Some minor updates concerning nd-parallelism.
- Context Parallelism docs typos fixed by @sergiopaniego in #3761
- Feat: add to_json by @S1ro1 in #3743
- make torch_native_parallelism examples device agnostic by @yao-matrix in #3759
- [ND Parallel] Update examples, cleanup by @S1ro1 in #3737
Bump to Python 3.10
We've dropped support for Python 3.9 as it reached end of life in October.
Lots of minor fixes:
- fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in #3740
- xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in #3756
- Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in #3744
- fix: specify device for process_tensor in example usage by @qgallouedec in #3755
- Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in #3776
- Fix (skip) cuda cache flush when origin device is `cpu` and offloaded to `meta` by @Qubitium in #3796
- Fix convert LayerNorm without bias to fp8 by @mjun0812 in #3725
- Add optional typing by @cyyever in #3769
- refactor: use a `with` statement within `Accelerator.autocast()` instead of `__enter__()` and `__exit__()` for more elegant style by @EquationWalker in #3767
- switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in #3773
- fix FSDP2 test case failure on XPU by @yao-matrix in #3771
- Fix tests by @SunMarc in #3722
- Protect import for device_mesh by @SunMarc in #3742
- Fix `SWANLAB_MODE` by @SunMarc in #3808
- Fix tracking swanlab by @SunMarc in #3810
- refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in #3815
- Remove deprecated FindTiedParametersResult by @cyyever in #3786
- remove mlflow from testing by @SunMarc in #3783
- enable 2 model hook ut cases on XPU by @yao-matrix in #3774
- Added Tip for better rendering by @sergiopaniego in #3781
- Fix typos by @cyyever in #3753
- fix: torch_npu import error in some envs by @yanyongyu in #3764
- Fix: typo makes tests fail by @S1ro1 in #3765
- fix multi-node "CUDA error: invalid device ordinal" (#3775) by @RicardoDominguez in #3779
- use reset_peak_memory_stats on xpu by @yao-matrix in #3772
New Contributors
- @mjun0812 made their first contribution in #3725
- @sergiopaniego made their first contribution in #3761
- @EquationWalker made their first contribution in #3762
- @yanyongyu made their first contribution in #3764
- @RicardoDominguez made their first contribution in #3779
- @SamuelBarryCS made their first contribution in #3776
- @Qubitium made their first contribution in #3796
Full Changelog: v1.10.1...v1.11.0