Conversation

@chaunceyjiang (Collaborator) commented Nov 4, 2025

Purpose

Refer to #28218.

Split vllm/entrypoints/openai/api_server.py into separate modules by functionality.
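As a rough sketch of what the split looks like (illustrative only; the module path and handler names below are hypothetical, not the exact code in this PR), each functionality gets its own FastAPI APIRouter that the entrypoint later attaches to the app:

```python
# Illustrative sketch only -- the module path and handler names are
# hypothetical, e.g. something like vllm/entrypoints/serve/tokenize.py.
from fastapi import APIRouter, Request

router = APIRouter()

@router.post("/tokenize")
async def tokenize(raw_request: Request):
    # Each functional module owns its routes and delegates to a serving
    # handler attached to the application state during startup.
    handler = raw_request.app.state.tokenization_handler  # hypothetical attribute
    return await handler.handle(raw_request)              # hypothetical method
```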

Test Plan

Start the API server and confirm that every route is still registered after the split:

(APIServer pid=126097) INFO 12-01 11:13:27 [api_server.py:1504] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:38] Available routes are:
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/load_lora_adapter, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/unload_lora_adapter, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /adapters, Methods: POST
(APIServer pid=126097) INFO 12-01 11:13:27 [launcher.py:46] Route: /adapters/{adapter_name}, Methods: DELETE
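One quick sanity check for this list (a sketch, assuming the server from the log above is still listening on port 8000) is to read the registered routes back from FastAPI's OpenAPI schema:

```python
# Assumes a vLLM API server already running on http://0.0.0.0:8000,
# as in the startup log above.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/openapi.json") as resp:
    schema = json.load(resp)

# Print every registered path with its HTTP methods, mirroring the
# launcher log output.
for path, methods in sorted(schema["paths"].items()):
    print(path, sorted(m.upper() for m in methods))
```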

Test Result



@mergify mergify bot added the frontend label Nov 4, 2025
@chaunceyjiang chaunceyjiang changed the title from "[Refactor] [1/N] to simplify the vLLM serving architecture" to "[WIP][Refactor] [1/N] to simplify the vLLM serving architecture" Nov 4, 2025
@markmc (Member) commented Nov 4, 2025

For reference, from @DarkLight1337 in #27882:

I propose restructuring the code for online serving into the following modules:

  • vllm.entrypoints.serve.core: Contains the code for setting up the async client and FastAPI app. May include some common APIs such as health check.
  • vllm.entrypoints.serve.openai: Contains only the code for OpenAI endpoints (e.g. Completions API, Chat Completions API, Responses API)
  • vllm.entrypoints.serve.anthropic: Contains only the code for Anthropic endpoints (Messages API)
  • vllm.entrypoints.serve.jina: Contains only the code for JinaAI endpoints (Reranker API)
  • vllm.entrypoints.serve.vllm: Contains only the code for vLLM endpoints (e.g. Tokenize API, Pooling API, dev mode endpoints)

In vllm.entrypoints.serve, we can have the actual entrypoint which uses .core to build the server, then incrementally attach endpoints to the FastAPI app by importing relevant functions from the API-specific submodules.
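A minimal sketch of that assembly (the routers are defined inline here to keep it self-contained; in the proposal they would live in the API-specific submodules named above):

```python
# Sketch of the proposed incremental assembly; inline routers stand in
# for the vllm.entrypoints.serve.* submodules described above.
from fastapi import APIRouter, FastAPI

openai_router = APIRouter(prefix="/v1")     # would come from .openai
anthropic_router = APIRouter(prefix="/v1")  # would come from .anthropic

@openai_router.post("/completions")
async def completions() -> dict:
    return {"object": "text_completion"}  # placeholder body

@anthropic_router.post("/messages")
async def messages() -> dict:
    return {"type": "message"}  # placeholder body

def build_app() -> FastAPI:
    # .core would build the base app (async client, /health, ...),
    # then each API-specific router is attached incrementally.
    app = FastAPI()
    for router in (openai_router, anthropic_router):
        app.include_router(router)
    return app
```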

@esmeetu (Member) commented Nov 6, 2025

@chaunceyjiang @markmc @DarkLight1337 Just opened RFC #28218, which might be related to this refactor. Would love to hear your thoughts!

@chaunceyjiang chaunceyjiang reopened this Dec 1, 2025
@chaunceyjiang chaunceyjiang marked this pull request as ready for review December 1, 2025 14:22
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@chaunceyjiang chaunceyjiang changed the title from "[WIP][Refactor] [1/N] to simplify the vLLM serving architecture" to "[Refactor] [1/N] to simplify the vLLM serving architecture" Dec 1, 2025
@DarkLight1337 (Member)

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the vLLM serving architecture by splitting the monolithic vllm/entrypoints/openai/api_server.py into smaller, more focused modules under vllm/entrypoints/serve/. This is a good improvement for maintainability. I've found two issues: one critical regression where some refactored API endpoints are not registered, and one high-severity issue where an API response type has changed, which could be a breaking change for clients.
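The first of those issues (endpoints silently dropped during the move) is the kind of regression a small schema check can catch; a sketch, with the expected paths taken from the startup log in the PR description:

```python
# Sketch of a route-registration guard for the rebuilt app; the expected
# paths are a small subset of the startup log above.
from fastapi import FastAPI
from fastapi.testclient import TestClient

def assert_routes_registered(app: FastAPI) -> None:
    schema = TestClient(app).get("/openapi.json").json()
    expected = {"/health", "/tokenize", "/v1/chat/completions", "/v1/embeddings"}
    missing = expected - set(schema["paths"])
    assert not missing, f"unregistered routes: {missing}"
```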

@chaunceyjiang (Collaborator, Author)

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the vLLM serving architecture by splitting the monolithic api_server.py into a more modular structure with functionality-specific routers. This greatly improves the maintainability and readability of the code. The code has been moved logically into new modules under vllm/entrypoints/serve/. While reviewing the changes, I identified a critical issue with state synchronization for the elastic endpoint scaling feature that could cause problems in a multi-worker production environment. My detailed feedback is in the review comment.
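For context on why that matters, a minimal illustration (not vLLM's actual code) of how module-level state diverges once the app runs with more than one worker process:

```python
# Illustration only -- not vLLM's code. With several worker processes
# (e.g. `uvicorn app:app --workers 4`), each process holds its own copy
# of this flag, so a POST handled by one worker is invisible to a GET
# handled by another.
from fastapi import FastAPI

app = FastAPI()
_scaling_in_progress = False  # per-process state, not shared

@app.post("/scale")
async def start_scale() -> dict:
    global _scaling_in_progress
    _scaling_in_progress = True
    return {"started": True}

@app.get("/is_scaling")
async def is_scaling() -> dict:
    # May report False even while another worker is mid-scale; shared
    # state needs an external store or IPC instead.
    return {"scaling": _scaling_in_progress}
```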

@DarkLight1337 DarkLight1337 added ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs labels Dec 1, 2025
@mergify bot commented Dec 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 3, 2025 07:51
@vllm-bot vllm-bot merged commit 3f42b05 into vllm-project:main Dec 3, 2025
130 of 133 checks passed
@chaunceyjiang chaunceyjiang deleted the vllm_serve_refactor branch December 3, 2025 12:33