
[Feature]: Add OpenAI-style input_video support to /v1/chat/completions for multimodal models (e.g., Qwen3-VL) #29754

@ehartford

Description

🚀 The feature, motivation and pitch

Feature Request: OpenAI-compatible input_video support for vLLM’s Chat Completions and Responses API

First, thank you for the incredible work on multimodal support and the recent EVS improvements for Qwen3-VL.

At the moment, vLLM fully supports video inside the Python API for models like Qwen3-VL, but the OpenAI-compatible /v1/chat/completions and /v1/responses endpoints do not expose any way to pass video inputs, even though the backend model executor now handles video embeddings.

This makes it impossible to perform video inference using vLLM via HTTP, because the HTTP parser does not convert input_video items into VideoItem, so the multimodal pipeline is never invoked.


Current Behavior

The vLLM OpenAI server currently supports:

/v1/chat/completions:

  • {"type": "text"}
  • {"type": "image_url"} — for images provided as URLs or data URLs

/v1/responses:

  • {"type": "input_text"}
  • {"type": "input_image"} — also URL/data URL or file ID

However, neither endpoint accepts any form of video input. The following OpenAI-style content blocks are rejected:

Chat Completions (expected analogue to image_url):

{
  "type": "video_url",
  "video_url": {
    "url": "data:video/mp4;base64,..."
  }
}

Responses API (expected analogue to input_image):

{
  "type": "input_video",
  "video_url": "data:video/mp4;base64,..."
}

Because the REST layer does not recognize video_url / input_video content types, it never converts them into VideoItem, so the multimodal video pipeline in the backend is never invoked, despite being fully implemented.

As a result, the only working path for video inference today is the Python API:

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Thinking")
outputs = llm.generate(...)  # video supplied via multi_modal_data

This requires running models locally and bypasses HTTP entirely, preventing remote or OpenAI-compatible deployment of video-capable models.
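
For illustration, here is roughly what that offline path looks like, with decoded frames passed through multi_modal_data. The zero-filled placeholder frames, their shape, and the video token in the prompt are assumptions; the exact prompt format depends on the model's chat template.

# Sketch of the offline Python path. The placeholder frames and the prompt's
# video token are illustrative assumptions, not an exact recipe.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Thinking")

# Frames decoded by the caller, e.g. a (num_frames, height, width, 3) uint8 array.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "<video placeholder per the model's chat template> Summarize this video.",
        "multi_modal_data": {"video": frames},
    },
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)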


Expected Behavior

vLLM’s /v1/chat/completions endpoint should accept:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "video_url",
            "video_url": {
              "url": "data:video/mp4;base64,<your base64>"
            }
          },
          { "type": "text", "text": "Summarize this video." }
        ]
      }
    ],
    "max_tokens": 2048
  }'

and /v1/responses should accept:

curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Thinking",
    "input": [
      {
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Summarize this video" },
          {
            "type": "input_video",
            "video_url": "data:video/mp4;base64,<your base64>"
          }
        ]
      }
    ],
    "max_output_tokens": 2048
  }'

Why This Matters

  • Qwen3-VL, InternVL, GLM-4V, etc. now support video natively inside vLLM
  • But the OpenAI-compatible API provides no way to use it
  • Users cannot deploy video models remotely via HTTP
  • LM Studio and other HTTP-based clients cannot use video even though the backend supports it
  • It prevents vLLM from being a drop-in replacement for OpenAI for multimodal agents
  • OpenAI’s multimodal API already defines the input_* pattern (e.g., input_image, input_audio), and video is the next natural extension

This feature would unlock real production use-cases for video reasoning with vLLM.


Proposed Approach (High-Level)

I’m not prescribing an implementation, but the building blocks already exist in vLLM:

  • the engine already accepts video through multi_modal_data (VideoItem) in the Python API
  • models such as Qwen3-VL, InternVL, and GLM-4V already consume video embeddings, including the recent EVS path
  • the OpenAI server already parses the analogous image_url / input_image content types

The missing piece is parsing video_url / input_video items in the OpenAI REST layer and passing the decoded video to the engine via multi_modal_data, just as LLM.generate() receives it today (a rough sketch follows below).

This change is isolated to the JSON parsing layer and requires no model changes or new endpoints.
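
To make the shape of that change concrete, here is a rough sketch of the parsing step. The function name and error handling are hypothetical and not vLLM's actual parser internals.

# Hypothetical sketch: parse_video_part is illustrative, not a real vLLM symbol.
import base64

def parse_video_part(part: dict) -> bytes:
    """Extract raw video bytes from an OpenAI-style content part.

    Handles both the Chat Completions shape
    ({"type": "video_url", "video_url": {"url": ...}}) and the
    Responses shape ({"type": "input_video", "video_url": ...}).
    """
    if part["type"] == "video_url":
        url = part["video_url"]["url"]
    elif part["type"] == "input_video":
        url = part["video_url"]
    else:
        raise ValueError(f"not a video content part: {part['type']}")

    if url.startswith("data:"):
        # data:video/mp4;base64,<payload>
        _, payload = url.split(",", 1)
        return base64.b64decode(payload)

    # Remote URLs would be fetched here, mirroring how image_url is handled today.
    raise NotImplementedError("remote video URL fetching not sketched")

The decoded bytes would then be wrapped as a VideoItem and attached to the request's multi_modal_data, so the existing backend video pipeline takes over unchanged.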


Conclusion

The backend of vLLM now has strong video multimodal support, especially with the new EVS work. Adding OpenAI-style video inputs (video_url for Chat Completions and input_video for the Responses API) would make vLLM fully capable of video inference over HTTP and bring it to feature parity with OpenAI’s multimodal API.

I’d be happy to help test or iterate on a proposal if this feature is accepted.

Thanks for your great work on vLLM!

Alternatives

No response

Additional context

No response

