Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses #2063
base: gh/fegin/46/base
Conversation
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing `git checkout`.
maybe put in https://github.com/pytorch/torchtitan/tree/main/tests/assets and just call it llama3_losses.txt?
n00b q: Does this ground-truth loss come from a single-GPU run, or is it FSDP only?
# Verify the accuracy.
echo "Checking FSDP4 v.s. HSDP2FSDP2TP2 accuracy parity"
echo "Checking FSDP4 v.s. HSDP2FSDP4 accuracy parity"
I found HSDP2FSDP4 confusing.
Is this FSDP 8 vs. HSDP (4, 2)?
Yes, we can use HSDP (4, 2). I don't think we have a formal way to write HSDP, or do we, lol? One possible spelling in flag form is sketched below.
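For concreteness, a hedged sketch of what "FSDP 8 vs. HSDP (4, 2)" could look like as config flags. `--parallelism.data_parallel_replicate_degree` appears in the CI snippet later in this thread; the `--parallelism.data_parallel_shard_degree` flag and the (replicate, shard) ordering are assumptions here, not taken from this diff.

```bash
# Sketch only: one plausible spelling of "FSDP 8 vs. HSDP (4, 2)" on 8 GPUs.
# The shard-degree flag name and the degree ordering are assumptions.

# Baseline: plain FSDP sharding across all 8 GPUs.
export baseline_options="--parallelism.data_parallel_shard_degree=8"

# Test: HSDP with 4 replicas, each sharded over 2 GPUs (4 * 2 = 8).
export test_options="--parallelism.data_parallel_replicate_degree=4 --parallelism.data_parallel_shard_degree=2"
```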
tianyu-l left a comment:
Please move assets/ci_llama3_losses.txt to tests/assets/llama3_losses.txt.
If we want to extend this to more models, I would suggest creating a dedicated folder instead, e.g. tests/assets/losses/llama3.txt.
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --baseline-ngpus=4 --test-ngpus=8 --steps=1
export test_options="--parallelism.data_parallel_replicate_degree=4"
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --steps=10 --import-result assets/ci_llama3_losses.txt
rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*
Is this because you dump something to the folder, so that later runs complain it's not empty? I think we should make this dump folder optional so that the first run doesn't use it at all; see the sketch below.
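A minimal sketch of that suggestion, assuming the flag name used in the CI snippet above and a hypothetical `JOB_DUMP_FOLDER` environment variable; the skip-when-unset behavior is the proposal, not what the script currently does:

```bash
# Only pass --job-dump-folder when a dump location is explicitly requested,
# so the first run never creates or checks the folder at all.
dump_args=()
if [[ -n "${JOB_DUMP_FOLDER:-}" ]]; then
  dump_args+=(--job-dump-folder="${JOB_DUMP_FOLDER}")
fi
python3 scripts/loss_compare.py . . --assert-equal --steps=10 "${dump_args[@]}"
```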
This PR allows us to check whether the loss is consistent across commits/PRs.
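As a rough sketch of the intended CI flow: `--import-result` appears in the diff above, while the `--export-result` flag name is inferred from the PR title and may not match the actual option.

```bash
# 1) On a known-good commit, run the comparison once and export per-step losses:
python3 scripts/loss_compare.py . . --steps=10 --export-result tests/assets/llama3_losses.txt

# 2) On a later commit/PR, rerun and assert the fresh losses match the stored ones:
python3 scripts/loss_compare.py . . --steps=10 --import-result tests/assets/llama3_losses.txt --assert-equal
```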