
[Feature]: Add OpenAI-style input_video support to /v1/chat/completions for multimodal models (e.g., Qwen3-VL) #29754

@ehartford

Description

🚀 The feature, motivation and pitch

Feature Request: OpenAI-compatible input_video support for vLLM’s Chat Completions and Responses API

First, thank you for the incredible work on multimodal support and the recent EVS improvements for Qwen3-VL.

At the moment, vLLM fully supports video inside the Python API for models like Qwen3-VL, but the OpenAI-compatible /v1/chat/completions and /v1/responses endpoints do not expose any way to pass video inputs, even though the backend model executor now handles video embeddings.

This makes it impossible to perform video inference using vLLM via HTTP, because the HTTP parser does not convert input_video items into VideoItem, so the multimodal pipeline is never invoked.


Current Behavior

The vLLM OpenAI server currently supports:

/v1/chat/completions:

  • {"type": "text"}
  • {"type": "image_url"} — for images provided as URLs or data URLs

/v1/responses:

  • {"type": "input_text"}
  • {"type": "input_image"} — also URL/data URL or file ID

However, neither endpoint accepts any form of video input. The following OpenAI-style content blocks are rejected:

Chat Completions (expected analogue to image_url):

{
  "type": "video_url",
  "video_url": {
    "url": "data:video/mp4;base64,..."
  }
}

Responses API (expected analogue to input_image):

{
  "type": "input_video",
  "video_url": "data:video/mp4;base64,..."
}

Because the REST layer does not recognize video_url / input_video content types, it never converts them into VideoItem, so the multimodal video pipeline in the backend is never invoked, despite being fully implemented.

As a result, the only working path for video inference today is the Python API:

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Thinking")
outputs = llm.generate(...)  # video supplied via multi_modal_data

This requires running models locally and bypasses HTTP entirely, preventing remote or OpenAI-compatible deployment of video-capable models.
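
For illustration, here is roughly what that offline path looks like, with decoded frames passed through multi_modal_data. The zero-filled placeholder frames, their shape, and the video token in the prompt are assumptions; the exact prompt format depends on the model's chat template.

# Sketch of the offline Python path. The placeholder frames and the prompt's
# video token are illustrative assumptions, not an exact recipe.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Thinking")

# Frames decoded by the caller, e.g. a (num_frames, height, width, 3) uint8 array.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "<video placeholder per the model's chat template> Summarize this video.",
        "multi_modal_data": {"video": frames},
    },
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)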


Expected Behavior

vLLM’s /v1/chat/completions endpoint should accept:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "video_url",
            "video_url": {
              "url": "data:video/mp4;base64,<your base64>"
            }
          },
          { "type": "text", "text": "Summarize this video." }
        ]
      }
    ],
    "max_tokens": 2048
  }'

and /v1/responses should accept:

curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Thinking",
    "input": [
      {
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Summarize this video" },
          {
            "type": "input_video",
            "video_url": "data:video/mp4;base64,<your base64>"
          }
        ]
      }
    ],
    "max_output_tokens": 2048
  }'

Why This Matters

  • Qwen3-VL, InternVL, GLM-4V, etc. now support video natively inside vLLM
  • But the OpenAI-compatible API provides no way to use it
  • Users cannot deploy video models remotely via HTTP
  • LM Studio and other HTTP-based clients cannot use video even though the backend supports it
  • It prevents vLLM from being a drop-in replacement for OpenAI for multimodal agents
  • OpenAI’s multimodal API already defines the input_* pattern (e.g., input_image, input_audio), and video is the next natural extension

This feature would unlock real production use-cases for video reasoning with vLLM.


Proposed Approach (High-Level)

I’m not prescribing an implementation, but the building blocks already exist in vLLM:

  • the engine already accepts video through multi_modal_data (VideoItem) in the Python API
  • models such as Qwen3-VL, InternVL, and GLM-4V already consume video embeddings, including the recent EVS path
  • the OpenAI server already parses the analogous image_url / input_image content types

The missing piece is parsing video_url / input_video items in the OpenAI REST layer and passing the decoded video to the engine via multi_modal_data, just as LLM.generate() receives it today (a rough sketch follows below).

This change is isolated to the JSON parsing layer and requires no model changes or new endpoints.
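
To make the shape of that change concrete, here is a rough sketch of the parsing step. The function name and error handling are hypothetical and not vLLM's actual parser internals.

# Hypothetical sketch: parse_video_part is illustrative, not a real vLLM symbol.
import base64

def parse_video_part(part: dict) -> bytes:
    """Extract raw video bytes from an OpenAI-style content part.

    Handles both the Chat Completions shape
    ({"type": "video_url", "video_url": {"url": ...}}) and the
    Responses shape ({"type": "input_video", "video_url": ...}).
    """
    if part["type"] == "video_url":
        url = part["video_url"]["url"]
    elif part["type"] == "input_video":
        url = part["video_url"]
    else:
        raise ValueError(f"not a video content part: {part['type']}")

    if url.startswith("data:"):
        # data:video/mp4;base64,<payload>
        _, payload = url.split(",", 1)
        return base64.b64decode(payload)

    # Remote URLs would be fetched here, mirroring how image_url is handled today.
    raise NotImplementedError("remote video URL fetching not sketched")

The decoded bytes would then be wrapped as a VideoItem and attached to the request's multi_modal_data, so the existing backend video pipeline takes over unchanged.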


Conclusion

The backend of vLLM now has strong video multimodal support, especially with the new EVS work. Adding OpenAI-style video inputs (video_url for Chat Completions and input_video for the Responses API) would make vLLM fully capable of video inference over HTTP and bring it to feature parity with OpenAI’s multimodal API.

I’d be happy to help test or iterate on a proposal if this feature is accepted.

Thanks for your great work on vLLM!

Alternatives

No response

Additional context

No response

