[DNM][ci] Break down test-pipeline.yaml into test areas #29343
base: main
Conversation
Code Review
This pull request refactors the main CI pipeline configuration file into smaller, more manageable files based on test areas. This is a positive change for maintainability. My review focuses on the correctness of these new CI configuration files. I've identified a few critical issues related to incorrect GPU allocation and process counts in the distributed tests, as well as a potentially missed test case in the LoRA test suite. Addressing these will ensure the CI pipeline remains robust and correct after this refactoring.
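For context, a split per-area file might look roughly like the following. This is a minimal sketch assuming the keys already visible in this PR (group, num_gpus, working_dir, commands); the file path, group name, and step layout are illustrative, not the PR's actual contents.

    # .buildkite/test-areas/distributed.yaml (hypothetical path)
    group: Distributed
    steps:
      - label: "Distributed Tests (2 GPUs)"
        num_gpus: 2
        working_dir: "/vllm-workspace"
        commands:
          - pytest -v -s tests/distributed/test_sequence_parallel.py
          - pytest -v -s tests/distributed/test_context_parallel.py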
| - "pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" | ||
| - pytest -v -s tests/distributed/test_sequence_parallel.py | ||
| - pytest -v -s tests/distributed/test_context_parallel.py | ||
| - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 |
The Distributed Tests (2 GPUs) (H200) step is configured with num_gpus: 2. The CI runner will likely allocate GPUs with indices 0 and 1 for this job. However, the command hardcodes CUDA_VISIBLE_DEVICES=1,2. This can cause problems: if GPU 2 is not available, the test will fail. If it is available but not allocated to this job, it might interfere with other jobs. It's safer to rely on the CI runner's GPU allocation, which typically uses indices starting from 0. For a 2-GPU job, this would be 0,1.
    - CUDA_VISIBLE_DEVICES=0,1 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
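If a step ever does need an explicit device count, one defensive alternative (a sketch, not part of this PR; the step shape is assumed) is to assert the expected GPU count up front and otherwise rely on the runner's default allocation:

    - label: "Distributed Tests (2 GPUs)"
      num_gpus: 2
      commands:
        # Fail fast if the runner exposed fewer GPUs than the step declares.
        - test "$(nvidia-smi --list-gpus | wc -l)" -ge 2
        # No CUDA_VISIBLE_DEVICES override: default allocation starts at index 0.
        - VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048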
    num_gpus: 4
    working_dir: "/vllm-workspace"
    commands:
      - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020
done!
Can we also make it so that tests always run if the specific CI file is changed?
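One way to get that behavior, assuming the pipeline generator supports the source_file_dependencies key used elsewhere in vLLM's CI (the file path below is hypothetical), is for each area file to list itself as a dependency:

    - label: "Distributed Tests (2 GPUs)"
      source_file_dependencies:
        - .buildkite/test-areas/distributed.yaml  # the area's own CI file
        - vllm/distributed/
      commands:
        - pytest -v -s tests/distributed/test_sequence_parallel.py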
    num_gpus: 4
    working_dir: "/vllm-workspace"
    commands:
      - bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
These files appear to be EP-related and e2e integration tests.
Would you suggest leaving them here, or moving them to the EP file?
yeqcharlotte left a comment
What's your plan to eventually merge this?
I think you might want to merge the split earlier; maybe use some import rules so they're more consistent. Otherwise others will keep adding tests and you won't be able to rebase on top.
You can also double-check using @rzabarazesh's test coverage collection script to see whether there's a difference in coverage before and after.
    @@ -0,0 +1,150 @@
    group: Miscellaneous
I'm not a big fan of misc; I think this might end up capturing a bunch of random stuff.
What's your thought on the following? (a rough sketch of the resulting areas follows the list)
- moving V1 stuff to different test areas
- cpu stuff can be on its own test area
- one yaml for packaging and installation
- examples, prime rl compatibility tests can be moved to e2e integrations
- gptoss eval and lm eval can be merged into e2e engine accuracy
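For illustration only, that regrouping might shake out into area files like these (the names are assumed, not defined by this PR):

    .buildkite/test-areas/
      cpu.yaml                   # cpu tests on their own
      packaging.yaml             # packaging and installation
      e2e_integration.yaml       # examples, prime rl compatibility tests
      e2e_engine_accuracy.yaml   # gptoss eval and lm eval merged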
I agree we shouldn't have a misc group; it exists because I don't know where to place these jobs yet, and I'm going to figure that out after the restructuring...
- Some jobs here (V1 others, async engine) need to be broken down and integrated with tests in other areas
- Moved Prime RL to e2e_integration
Regarding "can be merged into e2e engine accuracy": we don't have this yet. Do you mean to create one?
My plan is to get the pipeline generator working with this refactor, then merge everything together.
Documentation preview: https://vllm--29343.org.readthedocs.build/en/29343/
Split up the tests in test-pipeline.yaml into 25 different test areas; each test area is one YAML file. I tried to keep a structure similar to the /tests/ directory.
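For reference, a rough sketch of the resulting layout; the directory and file names here are assumed for illustration, and the PR defines the actual 25 files:

    .buildkite/test-areas/       # hypothetical location of the area files
      basic_correctness.yaml
      distributed.yaml
      e2e_integration.yaml
      lora.yaml
      miscellaneous.yaml
      ...                        # one YAML file per test area, mirroring tests/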