[Misc] Add In-Container restart capability through supervisord for openai server #28502
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
46faab3 to 9206e4b (force-push)
@HappyAmazonian, your PR description included your HuggingFace token. I removed it, but I also suggest you revoke it.
Question: typically our users rely on the cluster orchestrator, such as Kubernetes, to perform the restart. This also gives granular observability into how often the pod crashes and restarts. Can you please share some insights?
These are valid points. We can disable this feature by default, but consider it a first line of defense in scenarios where no external restart mechanism is available. We are trying to bring process resiliency without being opinionated on deployment patterns.
Since it looks like this does not require any new dependencies, I'm interested to see what an "off by default" approach would look like. If it's simple enough, it seems OK. Otherwise, I'd say this sounds like custom behavior that should be enabled by building your own image. Let's see what having it as an option looks like, though.
…enai server vllm-project#28502 Signed-off-by: Shen Teng <[email protected]>
172b24e to 3d0bec8 (force-push)
Hi, thanks for replying. We changed our plugin, so now, by default, it runs `vllm serve` directly.
Signed-off-by: Shen Teng <[email protected]>
Signed-off-by: Shen Teng <[email protected]>
Is `standard-supervisor` an AWS/SageMaker-specific package?
@simon-mo Yes, it's an Apache 2.0 licensed package we publish: https://github.com/aws/model-hosting-container-standards. It is not necessarily tied to SageMaker; it encapsulates a bunch of boilerplate code for inference containers. We could partner to mitigate the risks you perceive. The intention for the launcher is to add basic functionality like in-container restart and converting ENV variables to CLI args.
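The env-to-args conversion mentioned above could be sketched as follows. This is an illustrative guess, not the package's actual behavior: the `OPTION_*` naming scheme and the `env_to_args` helper are assumptions.

```shell
# Hypothetical sketch: turn OPTION_* environment variables into CLI flags,
# e.g. OPTION_MAX_MODEL_LEN=4096 becomes "--max-model-len 4096".
# The OPTION_ prefix and helper name are assumptions for illustration.
env_to_args() {
  env | while IFS='=' read -r name value; do
    case "$name" in
      OPTION_*)
        # Strip the prefix, lowercase, and map underscores to dashes.
        flag=$(echo "${name#OPTION_}" | tr 'A-Z_' 'a-z-')
        printf -- '--%s %s ' "$flag" "$value"
        ;;
    esac
  done
}
```

The launch command could then be assembled as `vllm serve $(env_to_args)`, keeping the container image configurable without editing its command line.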
Can this change be made in the orchestration system by overriding the entrypoint? AFAIK K8s supports this.
This PR specifically targets non-K8s-based orchestration. Besides, we would like to introduce more features into the launcher, like converting ENV variables to arguments. @simon-mo Would you be amenable to an entrypoint.sh script that makes the launcher conditional on an ENV variable being present? @HappyAmazonian Could you take a shot at this?
What are the non-K8s-based orchestrators that use Docker containers and do not offer restarts? Fargate?
Add configurable entrypoint.sh that allows toggling standard-supervisor usage via the VLLM_USE_STANDARD_SUPERVISOR environment variable.
- Default: direct `vllm serve` execution (standard-supervisor disabled)
- Set VLLM_USE_STANDARD_SUPERVISOR=true to enable standard-supervisor

Signed-off-by: Shen Teng <[email protected]>
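A minimal sketch of what such a conditional entrypoint might look like. The environment variable name comes from the commit message above; the exact launcher invocation and script layout are assumptions, not the PR's actual file.

```shell
#!/bin/sh
# Hypothetical entrypoint sketch: choose_launcher prints the command the
# entrypoint would exec, based on VLLM_USE_STANDARD_SUPERVISOR.
# Unset or "false" means plain vllm serve (the default).
choose_launcher() {
  if [ "${VLLM_USE_STANDARD_SUPERVISOR:-false}" = "true" ]; then
    echo "standard-supervisor vllm serve"
  else
    echo "vllm serve"
  fi
}

# A real entrypoint.sh would end with something like:
#   exec $(choose_launcher) "$@"
```

Keeping the branch in a small function makes the default path (no supervisor, no new process in the tree) easy to verify.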
47c7e05 to c9d6f6f (force-push)
Yes: Fargate, SageMaker Processing, Batch Transform, etc. There are legacy container orchestration patterns built on polling loops rather than event-driven architectures; that is the target audience here. @simon-mo The PR has been updated to reduce the blast radius: the launcher is now opt-in.
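For contrast with what supervisord provides, the restart-on-exit behavior being discussed can be sketched as a naive shell loop. This is illustrative only (the function name and retry limit are made up); supervisord does the same job more robustly, with backoff, logging, and process-group handling.

```shell
# Illustrative sketch: restart a command whenever it exits non-zero,
# giving up after 3 failures. This is the behavior an in-container
# supervisor supplies when no orchestrator restarts the process.
run_supervised() {
  tries=0
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 3 ]; then
      echo "giving up after $tries failures"
      return 1
    fi
    echo "process exited; restarting (attempt $tries)"
  done
}
```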
Purpose
Sometimes `vllm serve` exits accidentally. This PR adds `standard-supervisor`, which generates a config file for [supervisor](https://supervisord.org/) and starts the supervisor process with the launch command. This enables simple integration in the vLLM OpenAI server Dockerfile. The behavior is enabled via the `PROCESS_AUTO_RECOVERY=true` env var.

Test Plan
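To see what supervisord is being asked to do, here is a hedged illustration of a minimal supervisord config for auto-restarting `vllm serve`. This is not `standard-supervisor`'s actual output; the command flags, paths, and retry count are assumptions.

```shell
# Write an example supervisord config of the kind the launcher generates.
# All values below are illustrative, not standard-supervisor's real output.
cat > /tmp/supervisord-vllm.conf <<'EOF'
[supervisord]
nodaemon=true

[program:vllm]
command=vllm serve --host 0.0.0.0 --port 8000
autorestart=true
startretries=3
redirect_stderr=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
EOF
# supervisord -c /tmp/supervisord-vllm.conf  # supervisord restarts vllm on exit
```

`autorestart=true` is the key line: supervisord respawns the program whenever it exits, which is the in-container resiliency this PR targets.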
In our repo aws/model-hosting-container-standards, we fully test the supervisor integration: https://github.com/aws/model-hosting-container-standards/blob/main/python/tests/integration/test_supervisor_cli_integration.py
We've also manually tested the restart by killing the vllm process inside the Docker container.
To test this, we can build the openai server with
Start the vllm server container
Use `docker exec -it [container name] bash` to log into the running container, then use htop to kill the vllm subprocess and observe that the vllm process gets restarted.

Test Result