🚀 The feature, motivation and pitch
Feature Request: OpenAI-compatible input_video support for vLLM’s Chat Completions and Responses API
First, thank you for the incredible work on multimodal support and the recent EVS improvements for Qwen3-VL.
At the moment, vLLM fully supports video inside the Python API for models like Qwen3-VL, but the OpenAI-compatible /v1/chat/completions and /v1/responses endpoints do not expose any way to pass video inputs, even though the backend model executor now handles video embeddings.
This makes it impossible to perform video inference using vLLM via HTTP, because the HTTP parser does not convert input_video items into VideoItem, so the multimodal pipeline is never invoked.
Current Behavior
The vLLM OpenAI server currently supports:
/v1/chat/completions:
- {"type": "text"}
- {"type": "image_url"} (images provided as URLs or data URLs)
/v1/responses:
- {"type": "input_text"}
- {"type": "input_image"} (also URL/data URL, or a file ID)
However, neither endpoint accepts any form of video input. The following OpenAI-style content blocks are rejected:
Chat Completions (expected analogue to image_url):
```json
{
  "type": "video_url",
  "video_url": {
    "url": "data:video/mp4;base64,..."
  }
}
```
Responses API (expected analogue to input_image):
```json
{
  "type": "input_video",
  "video_url": "data:video/mp4;base64,..."
}
```
Because the REST layer does not recognize video_url / input_video content types, it never converts them into VideoItem, so the multimodal video pipeline in the backend is never invoked, despite being fully implemented.
As a result, the only working path for video inference today is the Python API:
```python
from vllm import LLM
llm.generate(...)
```
This requires running models locally and bypasses HTTP entirely, preventing remote or OpenAI-compatible deployment of video-capable models.
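For concreteness, a minimal sketch of that Python-only path, assuming a Qwen-style video placeholder in the prompt and using dummy frames in place of a decoded clip:
```python
# Sketch of the Python-API video path. The placeholder tokens below follow the
# Qwen chat format; real inputs would be decoded video frames, not zeros.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Thinking")

# 16 dummy RGB frames standing in for a decoded video clip.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)

prompt = (
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Summarize this video.<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": frames}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```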
Expected Behavior
vLLM’s /v1/chat/completions endpoint should accept:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-VL-30B-A3B-Thinking\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"video_url\",
            \"video_url\": {
              \"url\": \"data:video/mp4;base64,<your base64>\"
            }
          },
          { \"type\": \"text\", \"text\": \"Summarize this video.\" }
        ]
      }
    ],
    \"max_tokens\": 2048
  }"
```
and the /v1/responses endpoint should accept:
```bash
curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-VL-30B-A3B-Thinking\",
    \"input\": [
      {
        \"role\": \"user\",
        \"content\": [
          { \"type\": \"input_text\", \"text\": \"Summarize this video\" },
          {
            \"type\": \"input_video\",
            \"video_url\": \"data:video/mp4;base64,<your base64>\"
          }
        ]
      }
    ],
    \"max_output_tokens\": 2048
  }"
```
Why This Matters
- Qwen3-VL, InternVL, GLM-4V, etc. now support video natively inside vLLM
- But the OpenAI-compatible API provides no way to use it
- Users cannot deploy video models remotely via HTTP
- LM Studio and other HTTP-based clients cannot use video even though the backend supports it
- It prevents vLLM from being a drop-in replacement for OpenAI for multimodal agents
- OpenAI’s multimodal API already defines the input_* pattern (e.g., input_image, input_audio), and video is the next natural extension
This feature would unlock real production use-cases for video reasoning with vLLM.
Proposed Approach (High-Level)
I’m not prescribing an implementation, but here are the building blocks already in vLLM:
- VideoItem in vllm.multimodal.inputs
- _process_video_input()
- _postprocess_video_embeds_evs()
- EVS support in PR #29752 ([Feature] Add EVS (Efficient Video Sampling) Support for Qwen3-VL)
- process_vision_info() from qwen_vl_utils (can be used or reimplemented internally)
And the missing piece is:
Parsing input_video items in the OpenAI REST layer and passing video embeddings into LLM.generate() via multi_modal_data.
This change is isolated to the JSON parsing layer and requires no model changes or new endpoints.
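As a rough illustration of how small that parsing step could be, here is a hypothetical sketch; the function names, the temp-file round trip, and the OpenCV decode are illustrative choices, not existing vLLM code, and frame sampling / EVS are elided:
```python
# Hypothetical sketch of the missing REST-layer step: turn an OpenAI-style
# video content block into the multi_modal_data dict the engine already
# understands. Names here are illustrative, not existing vLLM functions.
import base64
import tempfile

import cv2  # used only to decode the clip into frames
import numpy as np


def video_bytes_from_block(block: dict) -> bytes:
    """Accept both the Chat Completions and Responses shapes of a video block."""
    url = block.get("video_url")
    if isinstance(url, dict):                  # {"video_url": {"url": ...}}
        url = url.get("url", "")
    if isinstance(url, str) and url.startswith("data:"):
        _, b64 = url.split(",", 1)             # strip "data:video/mp4;base64,"
        return base64.b64decode(b64)
    raise ValueError("only base64 data URLs are handled in this sketch")


def block_to_multi_modal_data(block: dict) -> dict:
    """Decode the clip into an array of RGB frames for multi_modal_data."""
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        f.write(video_bytes_from_block(block))
        f.flush()
        cap = cv2.VideoCapture(f.name)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ok, frame = cap.read()
        cap.release()
    if not frames:
        raise ValueError("could not decode any frames from the video")
    return {"video": np.stack(frames)}
```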
Conclusion
vLLM's backend now has strong video multimodal support, especially with the new EVS work. Accepting OpenAI-style video_url / input_video content blocks in the Chat Completions and Responses APIs would make vLLM fully capable of video inference over HTTP and bring it to feature parity with OpenAI's multimodal API.
I’d be happy to help test or iterate on a proposal if this feature is accepted.
Thanks for your great work on vLLM!
Alternatives
No response
Additional context
No response