Skip to content

Conversation

@fegin
Copy link
Contributor

@fegin fegin commented Nov 19, 2025

Stack from ghstack (oldest at bottom):

This PR allows us to check if the loss is consistent across commits/PRs.

  1. This PR contains a pre-tested losses result file.
  2. This PR improve the loss_compare.py to add --import and --export options.
  3. In CI, uses --import to get the previous losses and compare them with the current PR. If anything mismatch (10 steps), the CI will fail.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 19, 2025
This allows us to check if the loss is consistent across commits/PRs


ghstack-source-id: c23ff6a
Pull-Request: #2063
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 19, 2025
@fegin fegin marked this pull request as draft November 19, 2025 20:49
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 19, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: 4715975
Pull-Request: #2063
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: d5265c3
Pull-Request: #2063
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: f4541cb
Pull-Request: #2063
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: 3faf77c
Pull-Request: #2063
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: a625db7
Pull-Request: #2063
@fegin fegin marked this pull request as ready for review November 20, 2025 01:57
@fegin fegin changed the title Add import and export options to loss_compare.py Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses Nov 20, 2025
fegin added a commit that referenced this pull request Nov 20, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing git checkout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put in https://github.com/pytorch/torchtitan/tree/main/tests/assets and just call it llama3_losses.txt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n00b q:Is this ground truth loss come from a single GPU run? Or is it FSDP only?

# Verify the accuracy.
echo "Checking FSDP4 v.s. HSDP2FSDP2TP2 accuracy parity"
echo "Checking FSDP4 v.s. HSDP2FSDP4 accuracy parity"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found HSDP2FSDP4 confusing.
Is this FSDP 8 vs. HSDP (4, 2)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can use HSDP (4, 2), I don't think we have a formal way to write HSDP, or do we, lol?

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025
This allows us to check if the loss is consistent across commits/PRs

ghstack-source-id: 212e08b
Pull-Request: #2063
@fegin fegin requested a review from tianyu-l November 21, 2025 07:06
Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move assets/ci_llama3_losses.txt to tests/assets/llama3_losses.txt.
If we want to extend this to more models, I would suggest creating a folder tests/assets/losses/llama3.txt

python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --baseline-ngpus=4 --test-ngpus=8 --steps=1
export test_options="--parallelism.data_parallel_replicate_degree=4"
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --steps=10 --import-result assets/ci_llama3_losses.txt
rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because you dump something to the folder, so that later runs will complain it's not empty? I think we should make this dump folder optional so that the first run doesn't use it at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants