[Refactor] [1/N] to simplify the vLLM serving architecture #28040
Conversation
For reference, from @DarkLight1337 in #27882

@chaunceyjiang @markmc @DarkLight1337 Just opened RFC #28218, might be related to this refactor. Would love to hear your thoughts!
Force-pushed from ab084a9 to 014ece9
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

/gemini review
Code Review
This pull request refactors the vLLM serving architecture by splitting the monolithic vllm/entrypoints/openai/api_server.py into smaller, more focused modules under vllm/entrypoints/serve/. This is a good improvement for maintainability. I've found two issues: one critical regression where some refactored API endpoints are not registered, and one high-severity issue where an API response type has changed, which could be a breaking change for clients.
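For readers unfamiliar with this pattern, here is a minimal sketch (not the actual vLLM code; the module, function, and endpoint names are illustrative assumptions) of how a monolithic FastAPI server can be split into functionality-specific routers that an app factory then registers:

```python
# Illustrative sketch only: names below are hypothetical, not vLLM's actual layout.
from fastapi import APIRouter, FastAPI

# A functionality-specific module (e.g. something under vllm/entrypoints/serve/)
# would own its own router instead of adding routes to one giant api_server.py.
health_router = APIRouter()


@health_router.get("/health")
async def health() -> dict[str, str]:
    # Trivial liveness endpoint; a real implementation would query the engine client.
    return {"status": "ok"}


def build_app() -> FastAPI:
    """Assemble the app from the per-functionality routers."""
    app = FastAPI()
    # Every router must be registered here; a missing include_router() call is
    # exactly the kind of regression the review above points out, since the
    # corresponding endpoints silently stop being served.
    app.include_router(health_router)
    return app
```

The benefit of this structure is that each endpoint group can be maintained and tested in isolation; the cost is a single registration point that has to stay complete.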
/gemini review
Code Review
This pull request introduces a significant and well-executed refactoring of the vLLM serving architecture by splitting the monolithic api_server.py into a more modular structure with functionality-specific routers. This greatly improves the maintainability and readability of the code. The code has been moved logically into new modules under vllm/entrypoints/serve/. While reviewing the changes, I identified a critical issue with state synchronization for the elastic endpoint scaling feature that could cause problems in a multi-worker production environment. My detailed feedback is in the review comment.
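As background on the state-synchronization concern (a generic illustration, not code from this PR): when an API server runs with multiple worker processes, any scaling state kept in ordinary module-level Python objects is private to each process, so one worker's update is never seen by the others. A standalone sketch of the effect:

```python
# Generic illustration (not vLLM code): module-level state is per-process.
import multiprocessing as mp

# Hypothetical in-process bookkeeping for elastic scaling decisions.
scale_state = {"num_active_endpoints": 1}


def scale_up_in_worker(result_queue):
    # This mutates the *child process's* copy of the dictionary only.
    scale_state["num_active_endpoints"] += 1
    result_queue.put(scale_state["num_active_endpoints"])


if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=scale_up_in_worker, args=(queue,))
    worker.start()
    print("worker saw:", queue.get())  # prints 2
    worker.join()
    print("main process still sees:", scale_state["num_active_endpoints"])  # prints 1
```

Hence, in a multi-worker deployment, scaling decisions need a single source of truth (a coordinating process, shared memory, or an external store) rather than per-worker variables.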
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 83a5ddc to e4e5489
Purpose
Refer to #28218.
Split vllm/entrypoints/openai/api_server.py based on different functionalities.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, including supported_models.md and examples for a new model.