[DNM][ci] Break down test-pipeline.yaml into test areas #29343
base: main
Conversation
Code Review
This pull request refactors the main CI pipeline configuration file into smaller, more manageable files based on test areas. This is a positive change for maintainability. My review focuses on the correctness of these new CI configuration files. I've identified a few critical issues related to incorrect GPU allocation and process counts in the distributed tests, as well as a potentially missed test case in the LoRA test suite. Addressing these will ensure the CI pipeline remains robust and correct after this refactoring.
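For context, a split per-area file might look roughly like the following. This is a minimal sketch assuming the keys already visible in this PR (group, num_gpus, working_dir, commands); the file path, group name, and step layout are illustrative, not the PR's actual contents.

    # .buildkite/test-areas/distributed.yaml (hypothetical path)
    group: Distributed
    steps:
      - label: "Distributed Tests (2 GPUs)"
        num_gpus: 2
        working_dir: "/vllm-workspace"
        commands:
          - pytest -v -s tests/distributed/test_sequence_parallel.py
          - pytest -v -s tests/distributed/test_context_parallel.py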
| - "pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" | ||
| - pytest -v -s tests/distributed/test_sequence_parallel.py | ||
| - pytest -v -s tests/distributed/test_context_parallel.py | ||
| - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 |
The Distributed Tests (2 GPUs) (H200) step is configured with num_gpus: 2. The CI runner will likely allocate GPUs with indices 0 and 1 for this job. However, the command hardcodes CUDA_VISIBLE_DEVICES=1,2. This can cause problems: if GPU 2 is not available, the test will fail. If it is available but not allocated to this job, it might interfere with other jobs. It's safer to rely on the CI runner's GPU allocation, which typically uses indices starting from 0. For a 2-GPU job, this would be 0,1.
    - CUDA_VISIBLE_DEVICES=0,1 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
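If a step ever does need an explicit device count, one defensive alternative (a sketch, not part of this PR; the step shape is assumed) is to assert the expected GPU count up front and otherwise rely on the runner's default allocation:

    - label: "Distributed Tests (2 GPUs)"
      num_gpus: 2
      commands:
        # Fail fast if the runner exposed fewer GPUs than the step declares.
        - test "$(nvidia-smi --list-gpus | wc -l)" -ge 2
        # No CUDA_VISIBLE_DEVICES override: default allocation starts at index 0.
        - VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048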
    num_gpus: 4
    working_dir: "/vllm-workspace"
    commands:
      - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020
done!
Can we also make it so that tests always run if the specific CI file is changed?
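One way to get that behavior, assuming the pipeline generator supports the source_file_dependencies key used elsewhere in vLLM's CI (the file path below is hypothetical), is for each area file to list itself as a dependency:

    - label: "Distributed Tests (2 GPUs)"
      source_file_dependencies:
        - .buildkite/test-areas/distributed.yaml  # the area's own CI file
        - vllm/distributed/
      commands:
        - pytest -v -s tests/distributed/test_sequence_parallel.py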
    num_gpus: 4
    working_dir: "/vllm-workspace"
    commands:
      - bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
These files appear to be EP-related and e2e integration tests.
Would you suggest leaving them here, or moving them to the EP file?
yeqcharlotte left a comment
What's your plan to eventually merge this?
I think you might want to merge the split earlier; maybe use some import rules so they're more consistent. Otherwise others will keep adding tests and you won't be able to rebase on top.
You can also double-check using @rzabarazesh's test coverage collection script to see whether there's a difference in coverage before and after.
    @@ -0,0 +1,150 @@
    group: Miscellaneous
I'm not a big fan of misc; I think this might end up capturing a bunch of random stuff.
What's your thought on the following? (a rough sketch of the resulting areas follows the list)
- moving V1 stuff to different test areas
- cpu stuff can be on its own test area
- one yaml for packaging and installation
- examples, prime rl compatibility tests can be moved to e2e integrations
- gptoss eval and lm eval can be merged into e2e engine accuracy
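For illustration only, that regrouping might shake out into area files like these (the names are assumed, not defined by this PR):

    .buildkite/test-areas/
      cpu.yaml                   # cpu tests on their own
      packaging.yaml             # packaging and installation
      e2e_integration.yaml       # examples, prime rl compatibility tests
      e2e_engine_accuracy.yaml   # gptoss eval and lm eval merged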
I agree we shouldn't have a misc group; it exists because I don't know where to place these jobs yet, and I'm going to figure that out after the restructuring...
- Some jobs here (V1 others, async engine) need to be broken down and integrated with tests in other areas
- Moved Prime RL to e2e_integration
Regarding "can be merged into e2e engine accuracy": we don't have this yet. Do you mean to create one?
My plan is to get the pipeline generator working with this refactor, then merge everything together.
Documentation preview: https://vllm--29343.org.readthedocs.build/en/29343/
Split up the tests in test-pipeline.yaml into 25 different test areas; each test area is one YAML file. I tried to keep a structure similar to the /tests/ directory.
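For reference, a rough sketch of the resulting layout; the directory and file names here are assumed for illustration, and the PR defines the actual 25 files:

    .buildkite/test-areas/       # hypothetical location of the area files
      basic_correctness.yaml
      distributed.yaml
      e2e_integration.yaml
      lora.yaml
      miscellaneous.yaml
      ...                        # one YAML file per test area, mirroring tests/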