Changes from all commits · 129 commits
cbccb38
[benchmark] add h200 bench (#1361)
asaiacai Jul 12, 2025
05e47c3
fixing dtype in flux eval (#1388)
kfirgoldberg Jul 13, 2025
2764a77
[float8] Fix module filter function (#1391)
danielvegamyhre Jul 14, 2025
7f5c3b6
added badges for pip and conda, and explicit installation instruction…
sarthakpati Jul 14, 2025
8908970
fix wrong b200 flops number (#1393)
samsja Jul 14, 2025
6204cdf
refactor ParallelDims and CheckpointManager (#1384)
tianyu-l Jul 14, 2025
db52d57
Add support for saving HF format tensors with DCP (#1351)
ankitageorge Jul 14, 2025
27e3ad8
Add Github workflow to build and publish wheel to PyTorch Index night…
joecummings Jul 15, 2025
53f6642
Validator integration with current metrics processor for logging (#1395)
wesleytruong Jul 15, 2025
23b8736
refactor FTManager (#1397)
tianyu-l Jul 15, 2025
c1c55ea
Lint (#1400)
H-Huang Jul 15, 2025
2906d8b
[DSV3] Add PP support for DSV3 (#1345)
H-Huang Jul 15, 2025
972ac9f
Add the missing field to NoColor (#1406)
fegin Jul 16, 2025
9a8cb98
Add option for selective op AC to filter mm shapes based on fqn (#1380)
soulitzer Jul 16, 2025
f062d48
[llama4] Change expert_bias and tokens_per_expert to non-persistent b…
wwwjn Jul 16, 2025
d69a737
create multipe outer optimizers for diloco (#1407)
tushar00jain Jul 16, 2025
4e5265e
[DSV3] Change sdpa interface to pass softmax_scale (#1394)
wwwjn Jul 18, 2025
c004dc4
separate outputs for ft replicas (#1410)
tushar00jain Jul 18, 2025
c924c44
allow specifying ft pg (#1411)
tushar00jain Jul 18, 2025
183f6fc
Remove flex+sac restriction (#1408)
drisspg Jul 20, 2025
beb29a1
add infra support for HF checkpoint conversion (#1404)
tianyu-l Jul 20, 2025
93a236c
[DSV3] Add normalization for topk router scores (#1419)
wwwjn Jul 21, 2025
4e73af3
update documentation for release (#1425)
tianyu-l Jul 21, 2025
415f834
Add Flex debug config (#1437)
drisspg Jul 22, 2025
ad7f644
[FT] Add torchft to CI (#1398)
H-Huang Jul 22, 2025
16273f8
[deepseek] fix FlexAttention + TP (#1440)
tianyu-l Jul 22, 2025
177b295
refactor so that ModelArgs does not depend on tokenizer (#1424)
tianyu-l Jul 23, 2025
171a883
Take job config out of checkpoint manager (#1433)
ebsmothers Jul 23, 2025
34d815c
[refactor] split JobConfig and ConfigManager into two files (#1442)
tianyu-l Jul 23, 2025
2e6ab37
add the forge folder (#1387)
tianyu-l Jul 23, 2025
2f1c814
add back torch nightly install instruction (#1444)
tianyu-l Jul 23, 2025
d282cf2
[refactor] moving gloabl dependence on JobConfig to fine-grained conf…
tianyu-l Jul 24, 2025
70592cb
added model definition conversion for llama3 (#1441)
wesleytruong Jul 24, 2025
58afc5f
Fix incorrect mapping of ffn_norm and attention_norm in HF Llama4 con…
raymin0223 Jul 24, 2025
38a9d30
publish instructions on adding a new model (#1451)
tianyu-l Jul 25, 2025
f3e2a75
make mxfp8 dim1 cast kernel configurable (#1427)
danielvegamyhre Jul 25, 2025
8a7b4aa
Fix a none pointer exception in checkpoint.py (#1465)
DNXie Jul 28, 2025
1fefaee
remove float8 force_recompute_fp8_weight_in_bwd flag (#1452)
vkuzo Jul 28, 2025
a44dff1
[checkpoint] let user specify `intial_load_path` and `initial_load_in…
tianyu-l Jul 28, 2025
f26179e
Re-enable pipeline parallel tests (#1477)
H-Huang Jul 29, 2025
83e6941
Improve reshard_after_forward logic (#1094)
tianyu-l Jul 29, 2025
942661c
Log total number of tokens seen (#1474)
runame Jul 29, 2025
5bab356
Temporarily Disable Memory Tracking Test for FSDP2 (#1480)
wwwjn Jul 29, 2025
8dd5a7e
Fix tokenizer error message (#1476)
H-Huang Jul 29, 2025
3aa09b9
log cuda driver version for debugging (#1479)
danielvegamyhre Jul 29, 2025
327a99c
Fixes the sd adapter in forge experiments (#1484)
allenwang28 Jul 29, 2025
881f0ca
Change `lr_min` to `min_lr_factor ` (#1471)
unlimblue Jul 30, 2025
f1c8c2c
guard against nvidia-smi command exit code 1 (#1496)
danielvegamyhre Jul 30, 2025
3c84ce0
Refactor PP splitting (#1416)
H-Huang Jul 30, 2025
be49c02
[deepseek] integrate 16B tokenizer to match 16B official model (#1497)
lessw2020 Jul 31, 2025
82b593e
remove dead code (#1501)
tushar00jain Jul 31, 2025
5961c75
fix creating leaf folder (#1502)
tushar00jain Jul 31, 2025
1080c8f
validation support for pipeline parallelism [WIP] (#1490)
wesleytruong Jul 31, 2025
ad9849c
Fix data_load_start position (#1481)
speed1313 Jul 31, 2025
b1dc330
Refactor script to use 'overwrites' variable for command-line argumen…
idoh Jul 31, 2025
cf30b29
Add logging for learning rates in MetricsProcessor (#1413)
idoh Jul 31, 2025
d655e16
Make token group alignment size configurable (#1503)
danielvegamyhre Aug 1, 2025
b109f7d
[DSV3] Add output.contiguous() in model to match llama3 (#1504) (#1513)
XilunWu Aug 1, 2025
43fa980
fix small deepseekv3 typo (#1514)
ruisizhang123 Aug 1, 2025
48d8dcd
make mx recipe name more generic (#1512)
vkuzo Aug 1, 2025
a0fdaa3
All-reduce `ntokens_seen` before logging (#1509)
runame Aug 1, 2025
2429e0b
Compute validation metrics at first step (#1508)
runame Aug 1, 2025
004162a
minor fix (#1494)
ShoufaChen Aug 2, 2025
ed288bc
[llama4] store expert weights such that we can transpose before group…
danielvegamyhre Aug 3, 2025
2844029
[llama4] add apply_compile for moe, where fullgraph=False for moe lay…
danielvegamyhre Aug 4, 2025
92bea07
[deepseek] update to 16b base tokenizer (#1499)
lessw2020 Aug 5, 2025
90cfba4
Add description for 16B model tokenizer for deepseek-v3 model (#1530)
wwwjn Aug 5, 2025
a204e31
Flux Validation (#1518)
wesleytruong Aug 5, 2025
3065a2a
model fragments for diloco (#1446)
tushar00jain Aug 6, 2025
cc55827
checkpoint.md (#1533)
wesleytruong Aug 6, 2025
f2830b6
Fix config manager directories (#1532)
AlirezaShamsoshoara Aug 6, 2025
a9aa506
unify moe implementation for llama4 and deepseek_v3 (#1534)
tianyu-l Aug 6, 2025
be211c8
separate out diloco configs (#1516)
tushar00jain Aug 6, 2025
36ec547
fix module import (#1537)
tushar00jain Aug 6, 2025
a1fdd7e
use logger in ft (#1539)
tushar00jain Aug 6, 2025
23e4dfc
fix: ep clipping with no ep grads (#1541)
garrett361 Aug 8, 2025
2c8b594
Reorder validate and checkpoint in train (#1542)
wesleytruong Aug 8, 2025
59e57a4
fix EP fsdp gradient divide factor (#1551)
tianyu-l Aug 11, 2025
fd5a87f
Better Support for Huggingface Asset Integration (#1526)
wesleytruong Aug 12, 2025
d14f1e3
Flux Batched Inference (#1548)
wesleytruong Aug 12, 2025
9c42b9b
[a2av] Add autograd support for token dispatch op (#1491)
kwen2501 Aug 12, 2025
cf4de26
[a2av] Add autograd support for token combine op (#1511)
kwen2501 Aug 12, 2025
a6972ae
Add state_dict converter for DeepSeekv3 in torchtitan (#1538)
wwwjn Aug 12, 2025
8bd8c93
Move fqn mapping logic to StateDictAdapter (#1557)
wesleytruong Aug 12, 2025
21416c4
Update .gitignore (#1560)
wesleytruong Aug 13, 2025
0c51d92
fix state dict adapter in forge engine (#1563)
ebsmothers Aug 13, 2025
48b6520
unit test for download_hf_assets script (#1556)
wesleytruong Aug 13, 2025
aeb3a4b
[EP] add support for ETP=1 (#1555)
tianyu-l Aug 13, 2025
6377dce
llama4: Avoid staticmethod nested graph break for MoE compile (#1565)
xmfan Aug 13, 2025
7354848
[MoE/EP] apply dim-1 FSDP sharding for routed experts and rewrite sha…
tianyu-l Aug 14, 2025
6fc499f
quick fix dsv3 fsdp (#1575)
tianyu-l Aug 15, 2025
e629fe5
Use PYTORCH_ALLOC_CONF as PYTORCH_CUDA_ALLOC_CONF is deprecated (#1577)
fegin Aug 15, 2025
803906b
Add DualPipeV (#1571)
H-Huang Aug 15, 2025
297a72a
Ignore tokenizer_path if it is an empty string (#1579)
fegin Aug 15, 2025
a59abea
added better guidance for if deprecated tokenizer path fails (#1568)
wesleytruong Aug 15, 2025
72b16b1
Added doc for Val/Eval and lm_eval integration (#1573)
wesleytruong Aug 17, 2025
9233d83
[EP] bug fixes (#1586)
tianyu-l Aug 18, 2025
0d1b80d
[EP] remove token split overhead from DTensor in TokenReorderer pre h…
tianyu-l Aug 18, 2025
f9e8897
Adding Qwen3 model to the experiments folder (#1429)
HosseinKaviani-H Aug 18, 2025
e4847c8
added example for bidirectional checkpoint testing (#1540)
wesleytruong Aug 19, 2025
a54725c
MoE explicit prefetching in FSDP (#1594)
tianyu-l Aug 19, 2025
9e24689
[DeepSeek] add torch.compile + async TP (#1588)
tianyu-l Aug 19, 2025
7f1fa48
[Qwen3] Switch to verified RoPE implementation + Add weight tying sup…
wwwjn Aug 19, 2025
9f47ceb
[dsv3] Remove dtype to avoid confusion (#1599)
wwwjn Aug 19, 2025
b5b7ffb
[HF] Deprecate `tokenizer_path` in Toml Files (#1592)
wesleytruong Aug 19, 2025
084d307
[doc] update DeepSeekV3ModelArgs doc string (#1598)
lckr Aug 19, 2025
9874e84
Change freq_cis from persistent buffer to non-persistent buffer (#1600)
wwwjn Aug 19, 2025
c0b2e5a
[HF] Model Definition Conversion Support for FLUX (#1582)
wesleytruong Aug 20, 2025
46a32e7
Deprecate Llama Conversion Script (#1603)
wesleytruong Aug 20, 2025
08b8b24
[refactor] support compile model and loss separately (#1608)
tianyu-l Aug 21, 2025
82d6c3b
[DSV3] Upgrade to DeepSeek-V3.1 (#1609)
wwwjn Aug 21, 2025
fd23080
Fix Typo (#1611)
wwwjn Aug 21, 2025
2bfcdd8
improve MoE bias update logic in optimizer (#1593)
rakkit Aug 22, 2025
255a6ab
fix qwen3 compile config in parallelize.py (#1623)
YangWang92 Aug 22, 2025
7d744b2
add model_parts ref to MetricsProcessor (#1578)
garrett361 Aug 22, 2025
8a749c6
Move the call to init_attention_mask to trainer (#1616)
fegin Aug 22, 2025
f738a03
Switch DeepSeekV3 to Use FlexAttention by Default (#1610)
fegin Aug 22, 2025
cab22e7
Centralize Async TP Enablement with maybe_enable_async_tp API (#1619)
fegin Aug 22, 2025
cd337db
[Cleanup] Miscellaneous Refactors (#1607)
wesleytruong Aug 22, 2025
2025abb
async tp minor fix (#1629)
tianyu-l Aug 25, 2025
9197908
fix(dataloader): Prevent RuntimeError from DataloaderStopIteration (#…
Lain810 Aug 25, 2025
030879f
[Qwen3] Fix weight tying for Qwen3 according to Huggingface configs (…
wwwjn Aug 25, 2025
ad06609
Fix variable name in NotImplementedError message (#1637)
wesleytruong Aug 25, 2025
4191def
Update torchft.md (#1596)
H-Huang Aug 26, 2025
e65ef30
Adding StateDictAdapter (#1601)
HosseinKaviani-H Aug 26, 2025
17ef753
add wandb team entity and run name options (#1643)
anana10c Aug 27, 2025
a481c26
Solving the validation hanging issue (#1634)
DNXie Aug 27, 2025
7156416
update warning message (#1648)
DNXie Aug 28, 2025
45647b3
Merge branch 'main' into whc/merge_autoparallel
wconstab Aug 28, 2025
1 change: 1 addition & 0 deletions .ci/docker/requirements.txt
@@ -7,3 +7,4 @@ wandb
fsspec
tyro
tokenizers >= 0.15.0
safetensors
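The new `safetensors` dependency backs the HF-format checkpoint saving added in #1351 above. A minimal, illustrative round-trip through the `safetensors` Python API — a sketch only, assuming `torch` and `safetensors` are installed:

```sh
python - <<'EOF'
# Write a dict of tensors to the safetensors format and read it back.
import torch
from safetensors.torch import load_file, save_file

save_file({"w": torch.ones(2, 2)}, "demo.safetensors")
print(load_file("demo.safetensors")["w"])
EOF
```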
11 changes: 11 additions & 0 deletions .github/scripts/update_version.sh
@@ -0,0 +1,11 @@
version_file="assets/version.txt"
init_file="torchtitan/__init__.py"
if [[ -n "$BUILD_VERSION" ]]; then
# Update the version in version.txt
echo "$BUILD_VERSION" > "$version_file"
# Create a variable named __version__ at the end of __init__.py
echo "__version__ = \"$BUILD_VERSION\"" >> "$init_file"
else
echo "Error: BUILD_VERSION environment variable is not set or empty."
exit 1
fi
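A quick way to exercise this pre-script locally, from the repo root (hypothetical version string):

```sh
BUILD_VERSION=0.1.0.dev20250715 bash .github/scripts/update_version.sh
cat assets/version.txt              # -> 0.1.0.dev20250715
tail -n 1 torchtitan/__init__.py    # -> __version__ = "0.1.0.dev20250715"
```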
40 changes: 40 additions & 0 deletions .github/workflows/build_whl_and_publish.yaml
@@ -0,0 +1,40 @@
name: Build nightly wheels and publish to PyTorch Index

on:
push:
branches:
- nightly
workflow_dispatch:

permissions:
id-token: write
contents: read

jobs:
generate-matrix:
if: github.repository_owner == 'pytorch'
uses: pytorch/test-infra/.github/workflows/generate_binary_build_matrix.yml@main
with:
package-type: wheel
os: linux
test-infra-repository: pytorch/test-infra
test-infra-ref: main
with-cuda: enable
with-rocm: enable
python-versions: '["3.10", "3.11", "3.12"]'
build:
needs: generate-matrix
name: ${{ matrix.repository }}
uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
strategy:
fail-fast: false
with:
repository: pytorch/torchtitan
ref: ""
test-infra-repository: pytorch/test-infra
test-infra-ref: main
package-name: torchtitan
build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
pre-script: .github/scripts/update_version.sh
trigger-event: ${{ github.event_name }}
build-platform: 'python-build-package'
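Besides pushes to the `nightly` branch, the `workflow_dispatch` trigger allows kicking off a build manually — for example with the GitHub CLI, assuming `gh` is installed and authenticated:

```sh
gh workflow run build_whl_and_publish.yaml --ref nightly
```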
6 changes: 5 additions & 1 deletion .github/workflows/integration_test_8gpu.yaml
@@ -39,11 +39,15 @@ jobs:
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
conda activate "${CONDA_ENV}"

# Log CUDA driver version for debugging.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION}"

pip config --user set global.progress_bar off

python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126

USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

mkdir artifacts-to-be-uploaded
python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
python -m tests.integration_tests artifacts-to-be-uploaded --ngpu 8
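The `|| true` in the driver-logging step is deliberate (see #1496 in the commit list above): CI job shells typically run with `-e` and `pipefail`, so a non-zero nvidia-smi exit would otherwise abort the whole script. A minimal sketch of the failure mode the guard prevents:

```sh
set -eo pipefail
# Without `|| true`, a failing nvidia-smi (e.g. driver not yet initialized)
# would kill the script at this assignment.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION:-unavailable}"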
4 changes: 4 additions & 0 deletions .github/workflows/integration_test_8gpu_flux.yaml
@@ -41,6 +41,10 @@ jobs:

pip config --user set global.progress_bar off

# Log CUDA driver version for debugging.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION}"

python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126

mkdir artifacts-to-be-uploaded
8 changes: 7 additions & 1 deletion .github/workflows/integration_test_8gpu_h100.yaml
@@ -40,11 +40,17 @@ jobs:
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
conda activate "${CONDA_ENV}"

# Log CUDA driver version for debugging.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION}"

pip config --user set global.progress_bar off

python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126

USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

mkdir artifacts-to-be-uploaded
python ./tests/integration_tests_h100.py artifacts-to-be-uploaded --ngpu 8

# Enable CPP stacktraces for debugging symmetric memory initialization errors.
TORCH_SHOW_CPP_STACKTRACES=1 python -m tests.integration_tests_h100 artifacts-to-be-uploaded --ngpu 8
4 changes: 4 additions & 0 deletions .github/workflows/integration_test_8gpu_simple_fsdp.yaml
@@ -38,6 +38,10 @@ jobs:
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
conda activate "${CONDA_ENV}"

# Log CUDA driver version for debugging.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION}"

pip config --user set global.progress_bar off

python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
57 changes: 57 additions & 0 deletions .github/workflows/integration_test_8gpu_torchft.yaml
@@ -0,0 +1,57 @@
name: TorchFT 8 GPU Integration Test

on:
push:
branches: [ main ]
paths:
- 'torchtitan/components/ft.py'
pull_request:
paths:
- 'torchtitan/components/ft.py'
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
concurrency:
group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
cancel-in-progress: true

defaults:
run:
shell: bash -l -eo pipefail {0}

jobs:
build-test:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
with:
runner: linux.g5.48xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.6"
# This image is faster to clone than the default, but it lacks CC needed by triton
# (1m25s vs 2m37s).
docker-image: torchtitan-ubuntu-20.04-clang12
repository: pytorch/torchtitan
upload-artifact: outputs
script: |
set -eux

# The generic Linux job chooses to use base env, not the one setup by the image
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
conda activate "${CONDA_ENV}"

# Log CUDA driver version for debugging.
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
echo "CUDA driver version: ${DRIVER_VERSION}"

pip config --user set global.progress_bar off

python -m pip install torchft-nightly
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

mkdir artifacts-to-be-uploaded
echo "torchft_lighthouse"
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000 > /dev/null 2>&1 &
echo "ft_integration_test"
# Getting error - Cuda failure 217 'peer access is not supported between these two devices'
python -m tests.integration_tests_ft artifacts-to-be-uploaded --ngpu 8
# pkill -9 torchft_lighthouse
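The same flow can be reproduced outside CI — a rough local sketch, assuming `torchft-nightly` is installed and 8 GPUs are visible:

```sh
# Start the lighthouse in the background, run the FT integration tests, then clean up.
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000 &
LIGHTHOUSE_PID=$!
python -m tests.integration_tests_ft ./outputs --ngpu 8
kill "$LIGHTHOUSE_PID"
```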
7 changes: 3 additions & 4 deletions .gitignore
@@ -15,10 +15,9 @@ wandb

torchtitan/datasets/**/*.model

# tokenizer models
assets/**/*.model
assets/**/*.json
assets/**/*.txt
# hf assets
assets/hf/*
assets/tokenizer/*
torchtitan/experiments/flux/assets/*

# temp files
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -51,7 +51,7 @@ Note: To accelerate contributions to and innovations around `torchtitan`, we are
- After the model change, it should still load the original checkpoint correctly.
- Document the reasons for the code change, similar to [composability.md](docs/composability.md).
- Keep code modularized, especially for [train.py](train.py), so that it remains easy to copy-paste into a minimal code example. If necessary:
- Introduce new config options/category in [config_manager.py](torchtitan/config_manager.py).
- Introduce new config options/category in [job_config.py](torchtitan/config/job_config.py).
- Create separate functions/files.

### Proof of Value
40 changes: 34 additions & 6 deletions README.md
@@ -6,9 +6,12 @@

[![integration tests](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
[![arXiv](https://img.shields.io/badge/arXiv-2410.06511-b31b1b.svg)](https://arxiv.org/abs/2410.06511)
[![ICLR](https://img.shields.io/badge/ICLR-2025-blue.svg)](https://iclr.cc/virtual/2025/poster/29620)
[![ICLR](https://img.shields.io/badge/ICLR-2025-violet.svg)](https://iclr.cc/virtual/2025/poster/29620)
[![forum](https://img.shields.io/badge/pytorch-forum-DE3412.svg)](https://discuss.pytorch.org/c/distributed/torchtitan/44)
[![license](https://img.shields.io/badge/license-BSD_3--Clause-lightgrey.svg)](./LICENSE)
[![pip](https://img.shields.io/pypi/v/torchtitan?color=blue)](https://pypi.org/project/torchtitan/)
[![conda](https://img.shields.io/conda/vn/conda-forge/torchtitan?color=green)](https://anaconda.org/conda-forge/torchtitan)


</div>

@@ -17,6 +20,8 @@ To use the latest features of `torchtitan`, we recommend using the most recent P


## Latest News
- [2025/07] We published [instructions](/torchtitan/models/README.md) on how to add a model to `torchtitan`.
- [2025/07] We released `torchtitan` [v0.1.0](https://github.com/pytorch/torchtitan/releases), and also set up nightly builds.
- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
- [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
- [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
@@ -33,7 +38,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P

Our mission is to accelerate innovation in the field of generative AI by empowering researchers and developers to explore new modeling architectures and infrastructure techniques.

The guiding principles when building `torchtitan`
The Guiding Principles when building `torchtitan`
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying multi-dimensional parallelism.
* Bias towards a clean, minimal codebase while providing basic reusable / swappable components.
@@ -86,25 +91,48 @@ You may want to see how the model is defined or how parallelism techniques are a

## Installation

One can choose to install `torchtitan` from a stable release, a nightly build, or directly run the source code. Please [install PyTorch](https://pytorch.org/get-started/locally/) before proceeding.

### Stable releases
One can install the latest [stable release](https://github.com/pytorch/torchtitan/releases) of `torchtitan` via `pip` or `conda`.
```sh
pip install torchtitan
```
```sh
conda install conda-forge::torchtitan
```
Note that each stable release pins the nightly versions of `torch` and `torchao`. Please see [release.md](docs/release.md) for more details.

### Nightly builds

This method requires the nightly build of PyTorch. You can replace `cu126` with a different CUDA version (e.g. `cu128`) or an AMD ROCm build (e.g. `rocm6.3`).

```sh
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
pip install --pre torchtitan --index-url https://download.pytorch.org/whl/nightly/cu126
```
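A quick sanity check for wheel installs — the packaging pre-script above appends a `__version__` attribute to `torchtitan/__init__.py` at build time:

```sh
python -c "import torchtitan; print(torchtitan.__version__)"
```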

### From source

This method requires the nightly build of PyTorch or the latest PyTorch built [from source](https://github.com/pytorch/pytorch?tab=readme-ov-file#from-source).

```bash
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
[For AMD GPU] pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3 --force-reinstall
```

### Downloading a tokenizer

`torchtitan` currently supports training Llama 3.1 (8B, 70B, 405B) out of the box. To get started training these models, we need to download a tokenizer.model. Follow the instructions on the official [meta-llama](https://huggingface.co/meta-llama/Llama-3.1-8B) repository to ensure you have access to the Llama model weights.
`torchtitan` currently supports training Llama 3.1 (8B, 70B, 405B) out of the box. To get started training these models, we need to download the tokenizer. Follow the instructions on the official [meta-llama](https://huggingface.co/meta-llama/Llama-3.1-8B) repository to ensure you have access to the Llama model weights.

Once you have confirmed access, you can run the following command to download the Llama 3.1 tokenizer to your local machine.

```bash
# Get your HF token from https://huggingface.co/settings/tokens

# Llama 3.1 tokenizer
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B --hf_token=...
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

### Start a training run
54 changes: 54 additions & 0 deletions benchmarks/llama3-8b_h200_202506_trainy-whitefiber.md
@@ -0,0 +1,54 @@
This benchmark was performed by the Trainy team on WhiteFiber in June 2025, to establish a baseline of Trainy platform performance on H200s across multiple hosts.

### Models

Llama 3.1 8B

### Hardware

Each host has

- 8 NVIDIA H200 GPUs connected via NVLink.
- Hosts are interconnected by a backend RDMA fabric providing 400 Gb/s (Mellanox CX-7) per GPU.

### Configuration

Runs were invoked with the following command, where `NUM_NODES` was `4` or `8`:
```
torchrun \
--nnodes $NUM_NODES \
--nproc_per_node 8 \
--rdzv_id 101 \
--rdzv_backend c10d \
--rdzv_endpoint "$MASTER_ADDR:29500" \
torchtitan/train.py \
--job.config-file torchtitan/models/llama3/train_configs/llama3_8b.toml \
--metrics.enable_wandb \
--training.local_batch_size=2 \
--training.compile \
--model.converters="float8" \
--float8.enable_fsdp_float8_all_gather \
--float8.precompute_float8_dynamic_scale_for_fsdp \
--float8.force_recompute_fp8_weight_in_bwd \
--profiling.profile_freq 1000000 \
--training.steps 2000
```

### Results

Detailed performance results and training configurations can be found in the table below and can be visualized in [this WandB report](https://api.wandb.ai/links/asaiacai/w4c46stp). `TPS` and `Memory(GiB)` are sampled arbitrarily at the 100th iteration:

| NUM_NODES | TPS/GPU | Memory(GiB) |
| ----- | ----: | ----: |
| 4 | 10938 | 47.96 |
| 8 | 10753 | 46.97 |
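
As a rough back-of-envelope check, assuming the standard `6 * N * TPS` training-FLOPs estimate (N ≈ 8.0e9 parameters) and an H200 dense BF16 peak of roughly 989 TFLOP/s:

```
6 × 8.0e9 × 10938 tokens/s ≈ 5.25e14 FLOP/s per GPU ≈ 53% MFU (BF16-referenced)
```

Since the runs use float8 matmuls with compile, utilization measured against the fp8 peak would be correspondingly lower.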


### Versions and Dates

| repo | commit | date |
| --- | --- | --- |
| torch | [2.8.0a0+5228986c39](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-05.html) | 2025/05/29 |
| torchao | [0afa4c1](https://github.com/pytorch/ao/commit/0afa4c1bd28c82921e360ddbd1b27c9d6da5b947) | 2025/06/13 |
| torchtitan | [e7c0cae](https://github.com/pytorch/torchtitan/commit/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc) | 2025/06/13 |