v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation
FSDPv2 refactor + FP8 support
We've simplified how FSDP2 models are prepared: there were too many ways to compose FSDP2 with other features (FP8, torch.compile, activation checkpointing, etc.), so the setup is now more restrictive, but it leads to fewer errors and better performance. We've also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!
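In practice, everything now composes through `Accelerator` and a single `prepare()` call. Below is a minimal sketch, assuming the `FullyShardedDataParallelPlugin(fsdp_version=2)` plugin and the torchao-based `AORecipeKwargs` FP8 recipe; verify the exact options against the FSDP2 and FP8 docs for your version.

```python
# Minimal sketch of the simplified FSDP2 + FP8 flow (plugin/recipe names are
# our assumption of the documented API; check the docs for your version).
# Run with `accelerate launch`.
import torch
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(fsdp_version=2)  # opt into FSDP2
accelerator = Accelerator(
    fsdp_plugin=fsdp_plugin,
    mixed_precision="fp8",               # enable FP8 training
    kwargs_handlers=[AORecipeKwargs()],  # torchao-backed FP8 recipe
)

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() is now the single place where FSDP2 wrapping (and FP8 conversion)
# happens; compose features through the Accelerator instead of wrapping manually.
model, optimizer = accelerator.prepare(model, optimizer)
```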
Faster Distributed Training on Intel CPUs
We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance, with up to a 40% speed-up when training transformer models with tensor parallelism on 4th Gen Intel Xeon CPUs.
- Set ccl and KMP param in simple launch by @jiqing-feng in #3575
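For context, the variables involved look like the following. `accelerate launch` now sets them for you, so this sketch only matters if you drive torchrun/mpirun yourself; the values shown are typical oneCCL / Intel OpenMP settings, not necessarily the exact defaults the launcher applies.

```python
# Illustrative values only (typical oneCCL / Intel OpenMP settings); the exact
# defaults applied by `accelerate launch` may differ. Set these before torch
# spins up its thread pools.
import os

os.environ.setdefault("CCL_WORKER_COUNT", "1")  # oneCCL communication worker threads per rank
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # pin OpenMP threads to cores
os.environ.setdefault("KMP_BLOCKTIME", "1")     # let worker threads sleep quickly after parallel regions
```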
Regional Compilation for DeepSpeed
We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!
- Fix deepspeed regional compilation by @IlyasMoutawwakil in #3609
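To illustrate the distinction, here is a standalone PyTorch sketch (not the Accelerate/DeepSpeed internals) of the two compile styles and the regional approach.

```python
# Standalone PyTorch illustration (not Accelerate/DeepSpeed internals) of the
# two compile styles the fix reconciles, plus the "regional" approach.
import torch
import torch.nn as nn

def make_model() -> nn.Sequential:
    # Four identical transformer blocks: the typical target for regional compilation.
    return nn.Sequential(
        *[nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4)]
    )

# Out-of-place: torch.compile returns a new wrapper; the original module is untouched.
model = make_model()
compiled = torch.compile(model)

# In-place: nn.Module.compile() (what DeepSpeed's engine builds on) swaps the
# module's own forward, so existing references keep pointing at the compiled model.
model_inplace = make_model()
model_inplace.compile()

# Regional compilation: compile each repeated block in place instead of the whole
# model, so the compiled graph is reused across layers and cold-start compile time drops.
model_regional = make_model()
for block in model_regional:
    block.compile()
```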
ipex.optimize deprecation
ipex.optimize is being deprecated: most of its optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users on PyTorch versions earlier than 2.8, Accelerate will continue to rely on IPEX for now.
- remove ipex.optimize in accelerate by @yao-matrix in #3608
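In user terms, the change looks roughly like the sketch below. The PyTorch 2.8 gate is a hypothetical illustration of the note above, not Accelerate's internal check; `ipex.optimize` is the real IPEX call being phased out.

```python
# Hypothetical sketch of the migration; the 2.8 version gate mirrors the note
# above, not Accelerate's internal logic.
import torch
from packaging import version

model = torch.nn.Linear(512, 512)

if version.parse(torch.__version__) >= version.parse("2.8"):
    # The optimizations ipex.optimize used to apply now ship in stock PyTorch,
    # so nothing extra is needed (torch.compile remains optional on top).
    pass
else:
    # Older PyTorch: keep relying on IPEX when it is installed.
    import intel_extension_for_pytorch as ipex
    model = ipex.optimize(model, dtype=torch.bfloat16)
```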
Better XPU Support
We've greatly expanded and stabilized support for Intel XPUs:
- enable fsdp2 benchmark on XPU by @yao-matrix in #3590
- enable big_model_inference on xpu by @yao-matrix in #3595
- enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
- enable test_cli & test_example cases on XPU by @yao-matrix in #3578
- enable torchao and pippy test cases on XPU by @yao-matrix in #3599
- enable regional_compilation benchmark on xpu by @yao-matrix in #3592
- fix xpu 8bit value loading by @jiqing-feng in #3623
- add device-agnostic GradScaler by @yao-matrix in #3588
- add xpu support in TorchTensorParallelPlugin by @yao-matrix in #3627
Trackers
We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution! We also deferred all tracker initializations to prevent premature setup of distributed environments.
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
- Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in #3581
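Using it follows the same pattern as the other trackers. A minimal sketch, assuming `"swanlab"` is the `log_with` identifier (it follows the other backends' naming); see the tracking docs for SwanLab-specific init kwargs.

```python
# Minimal tracker sketch; "swanlab" as the log_with value is our assumption,
# matching how the other tracking backends are selected.
from accelerate import Accelerator

accelerator = Accelerator(log_with="swanlab")
accelerator.init_trackers(project_name="accelerate-demo")

for step in range(10):
    loss = 0.1 * (10 - step)  # placeholder metric
    accelerator.log({"train/loss": loss}, step=step)

accelerator.end_of_training()  # flushes and closes the tracker run
```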
What's Changed
- Fix bf16 training with TP by @SunMarc in #3610
- better handle FP8 with and without deepspeed by @IlyasMoutawwakil in #3611
- Update Gaudi Runners by @IlyasMoutawwakil in #3593
- goodbye torch_ccl by @yao-matrix in #3580
- Add support for standalone mode when default port is occupied on single node by @laitifranz in #3576
- Resolve logger warnings by @emmanuel-ferdman in #3582
- Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` by @luiz0992 in #3540
- [docs] no hard-coded cuda in the ddp documentation by @faaany in #3589
- change to use torch.device by @yao-matrix in #3594
- Fix: list object has no attribute keys by @S1ro1 in #3603
- Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in #3587
- Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in #3619
- Add fp8_e5m2 support in `dtype_byte_size` by @SunMarc in #3625
- [Deepspeed] deepspeed auto grad accum by @kashif in #3630
- Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in #3631
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
- Fix Typos in Documentation and Comments by @leopardracer in #3621
- feat: use datasets.IterableDataset shard if possible by @SunMarc in #3635
- [DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in #3632
- Feat: add cpu offload by @S1ro1 in #3636
- Fix: correct labels for fsdp2 examples by @S1ro1 in #3637
- fix grad acc deepspeed by @SunMarc in #3638
New Contributors
- @laitifranz made their first contribution in #3576
- @emmanuel-ferdman made their first contribution in #3582
- @yuanjua made their first contribution in #3581
- @sorgfresser made their first contribution in #3587
- @ShaohonChen made their first contribution in #3605
- @leopardracer made their first contribution in #3621
Full Changelog: v1.7.0...v1.8.0