[Refactor] [1/N] to simplify the vLLM serving architecture #28040
Conversation
For reference, from @DarkLight1337 in #27882

@chaunceyjiang @markmc @DarkLight1337 Just opened RFC #28218, might be related to this refactor. Would love to hear your thoughts!
Force-pushed from ab084a9 to 014ece9
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

/gemini review
Code Review
This pull request refactors the vLLM serving architecture by splitting the monolithic vllm/entrypoints/openai/api_server.py into smaller, more focused modules under vllm/entrypoints/serve/. This is a good improvement for maintainability. I've found two issues: one critical regression where some refactored API endpoints are not registered, and one high-severity issue where an API response type has changed, which could be a breaking change for clients.
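For readers unfamiliar with this pattern, here is a minimal sketch (not the actual vLLM code; the module, function, and endpoint names are illustrative assumptions) of how a monolithic FastAPI server can be split into functionality-specific routers that an app factory then registers:

```python
# Illustrative sketch only: names below are hypothetical, not vLLM's actual layout.
from fastapi import APIRouter, FastAPI

# A functionality-specific module (e.g. something under vllm/entrypoints/serve/)
# would own its own router instead of adding routes to one giant api_server.py.
health_router = APIRouter()


@health_router.get("/health")
async def health() -> dict[str, str]:
    # Trivial liveness endpoint; a real implementation would query the engine client.
    return {"status": "ok"}


def build_app() -> FastAPI:
    """Assemble the app from the per-functionality routers."""
    app = FastAPI()
    # Every router must be registered here; a missing include_router() call is
    # exactly the kind of regression the review above points out, since the
    # corresponding endpoints silently stop being served.
    app.include_router(health_router)
    return app
```

The benefit of this structure is that each endpoint group can be maintained and tested in isolation; the cost is a single registration point that has to stay complete.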
/gemini review
Code Review
This pull request introduces a significant and well-executed refactoring of the vLLM serving architecture by splitting the monolithic api_server.py into a more modular structure with functionality-specific routers. This greatly improves the maintainability and readability of the code. The code has been moved logically into new modules under vllm/entrypoints/serve/. While reviewing the changes, I identified a critical issue with state synchronization for the elastic endpoint scaling feature that could cause problems in a multi-worker production environment. My detailed feedback is in the review comment.
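As background on the state-synchronization concern (a generic illustration, not code from this PR): when an API server runs with multiple worker processes, any scaling state kept in ordinary module-level Python objects is private to each process, so one worker's update is never seen by the others. A standalone sketch of the effect:

```python
# Generic illustration (not vLLM code): module-level state is per-process.
import multiprocessing as mp

# Hypothetical in-process bookkeeping for elastic scaling decisions.
scale_state = {"num_active_endpoints": 1}


def scale_up_in_worker(result_queue):
    # This mutates the *child process's* copy of the dictionary only.
    scale_state["num_active_endpoints"] += 1
    result_queue.put(scale_state["num_active_endpoints"])


if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=scale_up_in_worker, args=(queue,))
    worker.start()
    print("worker saw:", queue.get())  # prints 2
    worker.join()
    print("main process still sees:", scale_state["num_active_endpoints"])  # prints 1
```

Hence, in a multi-worker deployment, scaling decisions need a single source of truth (a coordinating process, shared memory, or an external store) rather than per-worker variables.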
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 83a5ddc to e4e5489
Purpose
Refer to #28218.
Split vllm/entrypoints/openai/api_server.py based on different functionalities.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, including supported_models.md and examples for a new model.