v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation
FSDPv2 refactor + FP8 support
We've simplified how FSDP2 models are prepared: there were too many ways to compose FSDP2 with other features (FP8, torch.compile, activation checkpointing, etc.), so the setup is now more restrictive, but it leads to fewer errors and better performance. We've also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!
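In practice, everything now composes through `Accelerator` and a single `prepare()` call. Below is a minimal sketch, assuming the `FullyShardedDataParallelPlugin(fsdp_version=2)` plugin and the torchao-based `AORecipeKwargs` FP8 recipe; verify the exact options against the FSDP2 and FP8 docs for your version.

```python
# Minimal sketch of the simplified FSDP2 + FP8 flow (plugin/recipe names are
# our assumption of the documented API; check the docs for your version).
# Run with `accelerate launch`.
import torch
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(fsdp_version=2)  # opt into FSDP2
accelerator = Accelerator(
    fsdp_plugin=fsdp_plugin,
    mixed_precision="fp8",               # enable FP8 training
    kwargs_handlers=[AORecipeKwargs()],  # torchao-backed FP8 recipe
)

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() is now the single place where FSDP2 wrapping (and FP8 conversion)
# happens; compose features through the Accelerator instead of wrapping manually.
model, optimizer = accelerator.prepare(model, optimizer)
```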
Faster Distributed Training on Intel CPUs
We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance, with up to a 40% speed-up when training transformer models with tensor parallelism on 4th Gen Intel Xeon CPUs.
- Set ccl and KMP param in simple launch by @jiqing-feng in #3575
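For context, the variables involved look like the following. `accelerate launch` now sets them for you, so this sketch only matters if you drive torchrun/mpirun yourself; the values shown are typical oneCCL / Intel OpenMP settings, not necessarily the exact defaults the launcher applies.

```python
# Illustrative values only (typical oneCCL / Intel OpenMP settings); the exact
# defaults applied by `accelerate launch` may differ. Set these before torch
# spins up its thread pools.
import os

os.environ.setdefault("CCL_WORKER_COUNT", "1")  # oneCCL communication worker threads per rank
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # pin OpenMP threads to cores
os.environ.setdefault("KMP_BLOCKTIME", "1")     # let worker threads sleep quickly after parallel regions
```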
Regional Compilation for DeepSpeed
We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!
- Fix deepspeed regional compilation by @IlyasMoutawwakil in #3609
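To illustrate the distinction, here is a standalone PyTorch sketch (not the Accelerate/DeepSpeed internals) of the two compile styles and the regional approach.

```python
# Standalone PyTorch illustration (not Accelerate/DeepSpeed internals) of the
# two compile styles the fix reconciles, plus the "regional" approach.
import torch
import torch.nn as nn

def make_model() -> nn.Sequential:
    # Four identical transformer blocks: the typical target for regional compilation.
    return nn.Sequential(
        *[nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4)]
    )

# Out-of-place: torch.compile returns a new wrapper; the original module is untouched.
model = make_model()
compiled = torch.compile(model)

# In-place: nn.Module.compile() (what DeepSpeed's engine builds on) swaps the
# module's own forward, so existing references keep pointing at the compiled model.
model_inplace = make_model()
model_inplace.compile()

# Regional compilation: compile each repeated block in place instead of the whole
# model, so the compiled graph is reused across layers and cold-start compile time drops.
model_regional = make_model()
for block in model_regional:
    block.compile()
```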
ipex.optimize deprecation
ipex.optimize is being deprecated: most of its optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users on PyTorch versions earlier than 2.8, Accelerate will continue to rely on IPEX for now.
- remove ipex.optimize in accelerate by @yao-matrix in #3608
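In user terms, the change looks roughly like the sketch below. The PyTorch 2.8 gate is a hypothetical illustration of the note above, not Accelerate's internal check; `ipex.optimize` is the real IPEX call being phased out.

```python
# Hypothetical sketch of the migration; the 2.8 version gate mirrors the note
# above, not Accelerate's internal logic.
import torch
from packaging import version

model = torch.nn.Linear(512, 512)

if version.parse(torch.__version__) >= version.parse("2.8"):
    # The optimizations ipex.optimize used to apply now ship in stock PyTorch,
    # so nothing extra is needed (torch.compile remains optional on top).
    pass
else:
    # Older PyTorch: keep relying on IPEX when it is installed.
    import intel_extension_for_pytorch as ipex
    model = ipex.optimize(model, dtype=torch.bfloat16)
```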
Better XPU Support
We've greatly expanded and stabilized support for Intel XPUs:
- enable fsdp2 benchmark on XPU by @yao-matrix in #3590
- enable big_model_inference on xpu by @yao-matrix in #3595
- enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
- enable test_cli & test_example cases on XPU by @yao-matrix in #3578
- enable torchao and pippy test cases on XPU by @yao-matrix in #3599
- enable regional_compilation benchmark on xpu by @yao-matrix in #3592
- fix xpu 8bit value loading by @jiqing-feng in #3623
- add device-agnostic GradScaler by @yao-matrix in #3588
- add xpu support in TorchTensorParallelPlugin by @yao-matrix in #3627
Trackers
We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution! We also deferred all tracker initializations to prevent premature setup of distributed environments.
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
- Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in #3581
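Using it follows the same pattern as the other trackers. A minimal sketch, assuming `"swanlab"` is the `log_with` identifier (it follows the other backends' naming); see the tracking docs for SwanLab-specific init kwargs.

```python
# Minimal tracker sketch; "swanlab" as the log_with value is our assumption,
# matching how the other tracking backends are selected.
from accelerate import Accelerator

accelerator = Accelerator(log_with="swanlab")
accelerator.init_trackers(project_name="accelerate-demo")

for step in range(10):
    loss = 0.1 * (10 - step)  # placeholder metric
    accelerator.log({"train/loss": loss}, step=step)

accelerator.end_of_training()  # flushes and closes the tracker run
```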
What's Changed
- Fix bf16 training with TP by @SunMarc in #3610
- better handle FP8 with and without deepspeed by @IlyasMoutawwakil in #3611
- Update Gaudi Runners by @IlyasMoutawwakil in #3593
- goodbye torch_ccl by @yao-matrix in #3580
- Add support for standalone mode when default port is occupied on single node by @laitifranz in #3576
- Resolve logger warnings by @emmanuel-ferdman in #3582
- Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` by @luiz0992 in #3540
- [docs] no hard-coded cuda in the ddp documentation by @faaany in #3589
- change to use torch.device by @yao-matrix in #3594
- Fix: list object has no attribute keys by @S1ro1 in #3603
- Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in #3587
- Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in #3619
- Add fp8_e5m2 support in `dtype_byte_size` by @SunMarc in #3625
- [Deepspeed] deepspeed auto grad accum by @kashif in #3630
- Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in #3631
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
- Fix Typos in Documentation and Comments by @leopardracer in #3621
- feat: use datasets.IterableDataset shard if possible by @SunMarc in #3635
- [DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in #3632
- Feat: add cpu offload by @S1ro1 in #3636
- Fix: correct labels for fsdp2 examples by @S1ro1 in #3637
- fix grad acc deepspeed by @SunMarc in #3638
New Contributors
- @laitifranz made their first contribution in #3576
- @emmanuel-ferdman made their first contribution in #3582
- @yuanjua made their first contribution in #3581
- @sorgfresser made their first contribution in #3587
- @ShaohonChen made their first contribution in #3605
- @leopardracer made their first contribution in #3621
Full Changelog: v1.7.0...v1.8.0