
Conversation

@skyloevil (Contributor) commented Nov 30, 2025

Summary

This PR implements EVS (Efficient Video Sampling) support for Qwen3-VL models, enabling dynamic video token pruning to improve inference efficiency while maintaining accuracy.
This brings Qwen3-VL to feature parity with Qwen2.5-VL’s EVS capabilities.


Benchmark

Start:

vllm serve "Qwen/Qwen3-VL-4B-Instruct" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --video_pruning_rate 0.4 \
    --allowed-local-media-path /workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/

Bench:

vllm bench serve \
    --model "Qwen/Qwen3-VL-4B-Instruct" \
    --backend "openai-chat" \
    --base-url "http://localhost:8000/" \
    --endpoint "/v1/chat/completions" \
    --dataset-name "sharegpt" \
    --dataset-path "/workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/snapshots/947b139d0c591235a3f311e45a31a88461bad0d2/msrvtt_sharegpt.json" \
    --num-prompts 10 \
    --request-rate 1 \
    --save-result \
    --result-dir benchmarks_results \
    --result-filename test_video_benchmark.json
Pruning Rate 10%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.33
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.83
Peak output token throughput (tok/s):    58.00
Peak concurrent requests:                9.00
Total Token throughput (tok/s):          18.71
---------------Time to First Token----------------
Mean TTFT (ms):                          638.17
Median TTFT (ms):                        421.46
P99 TTFT (ms):                           3480.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.15
Median TPOT (ms):                        43.18
P99 TPOT (ms):                           306.62
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.64
Median ITL (ms):                         37.99
P99 ITL (ms):                            459.81
==================================================
 
Pruning Rate 20%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.33
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.83
Peak output token throughput (tok/s):    56.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          18.71
---------------Time to First Token----------------
Mean TTFT (ms):                          611.81
Median TTFT (ms):                        460.34
P99 TTFT (ms):                           3382.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.65
Median TPOT (ms):                        48.21
P99 TPOT (ms):                           253.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.86
Median ITL (ms):                         38.02
P99 ITL (ms):                            296.69
==================================================
 
Pruning Rate 30%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.21
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.73
---------------Time to First Token----------------
Mean TTFT (ms):                          538.68
Median TTFT (ms):                        384.46
P99 TTFT (ms):                           3314.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.24
Median TPOT (ms):                        38.95
P99 TPOT (ms):                           223.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.40
Median ITL (ms):                         37.57
P99 ITL (ms):                            279.25
==================================================
 
Pruning Rate 40%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.19
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.74
---------------Time to First Token----------------
Mean TTFT (ms):                          530.50
Median TTFT (ms):                        388.43
P99 TTFT (ms):                           3231.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.88
Median TPOT (ms):                        39.01
P99 TPOT (ms):                           195.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.09
Median ITL (ms):                         37.16
P99 ITL (ms):                            270.88
==================================================
 
Pruning Rate 50%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.20
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.73
---------------Time to First Token----------------
Mean TTFT (ms):                          521.40
Median TTFT (ms):                        361.01
P99 TTFT (ms):                           3426.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.53
Median TPOT (ms):                        39.37
P99 TPOT (ms):                           203.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.16
Median ITL (ms):                         37.55
P99 ITL (ms):                            258.28
==================================================
 
Pruning Rate 60%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.18
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    61.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.74
---------------Time to First Token----------------
Mean TTFT (ms):                          499.49
Median TTFT (ms):                        343.21
P99 TTFT (ms):                           3358.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.17
Median TPOT (ms):                        38.32
P99 TPOT (ms):                           130.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54
Median ITL (ms):                         36.99
P99 ITL (ms):                            190.42
==================================================
 
Pruning Rate 70%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.10
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.85
Peak output token throughput (tok/s):    58.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.75
---------------Time to First Token----------------
Mean TTFT (ms):                          481.76
Median TTFT (ms):                        320.49
P99 TTFT (ms):                           3424.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.20
Median TPOT (ms):                        38.07
P99 TPOT (ms):                           173.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.62
Median ITL (ms):                         36.81
P99 ITL (ms):                            195.79
==================================================
 
Pruning Rate 80%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.10
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.85
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.75
---------------Time to First Token----------------
Mean TTFT (ms):                          500.78
Median TTFT (ms):                        367.62
P99 TTFT (ms):                           3194.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          49.97
Median TPOT (ms):                        38.52
P99 TPOT (ms):                           161.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.89
Median ITL (ms):                         36.77
P99 ITL (ms):                            227.34
==================================================
   
Pruning Rate 90%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.00
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.86
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.77
---------------Time to First Token----------------
Mean TTFT (ms):                          452.43
Median TTFT (ms):                        305.85
P99 TTFT (ms):                           3918.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.27
Median TPOT (ms):                        37.67
P99 TPOT (ms):                           134.44
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.46
Median ITL (ms):                         36.46
P99 ITL (ms):                            196.09
==================================================
   
Pruning Rate 100%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  100.98
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.86
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.78
---------------Time to First Token----------------
Mean TTFT (ms):                          461.73
Median TTFT (ms):                        312.73
P99 TTFT (ms):                           3956.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.39
Median TPOT (ms):                        37.63
P99 TPOT (ms):                           133.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.15
Median ITL (ms):                         36.52
P99 ITL (ms):                            198.98
==================================================

Test

Start:

vllm serve "Qwen/Qwen3-VL-4B-Instruct" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --video_pruning_rate 0.5 \
    --allowed-local-media-path /workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/

Image Test:

python examples/online_serving/openai_chat_completion_client_for_multimodal.py

Video Test:

python examples/online_serving/openai_chat_completion_client_for_multimodal.py -c video

Log

(server log attached as a screenshot)

Changes

1. Core EVS Infrastructure

File: vllm/model_executor/models/qwen3_vl.py

a. Added SupportsMultiModalPruning interface

Enables EVS handling inside the V1 engine.

b. Processor-level placeholder generation (lines 1050–1087)

  • Computes per-frame token allocations from video_pruning_rate
  • First frame always receives full tokens_per_frame
  • Remaining tokens distributed across later frames
  • Generates an is_embed mask that matches expected token counts
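The allocation rule described above can be pictured with a small sketch. This is a minimal illustration only: `allocate_frame_tokens` and its rounding behavior are hypothetical, not the actual vLLM processor code, but it follows the stated policy of giving the first frame its full budget and spreading the remainder over later frames.

```python
import numpy as np

def allocate_frame_tokens(num_frames: int, tokens_per_frame: int,
                          pruning_rate: float) -> np.ndarray:
    """Hypothetical sketch: split the retained-token budget across frames,
    with the first frame always receiving its full allocation."""
    total = num_frames * tokens_per_frame
    retained = max(tokens_per_frame, int(total * (1.0 - pruning_rate)))
    alloc = np.zeros(num_frames, dtype=np.int64)
    alloc[0] = tokens_per_frame          # first frame keeps every token
    remaining = retained - tokens_per_frame
    if num_frames > 1 and remaining > 0:
        base, extra = divmod(remaining, num_frames - 1)
        alloc[1:] = base                 # even share for later frames
        alloc[1 : 1 + extra] += 1        # distribute the remainder
    return alloc
```

The `is_embed` mask would then mark exactly `alloc[t]` retained positions inside frame `t`'s placeholder span.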

c. Encoder-level video pruning (lines 1497–1549)

_postprocess_video_embeds_evs:

  • Applies EVS via cosine similarity
  • Obtains retention mask using compute_retention_mask
  • Appends MRoPE positions for pruned tokens
  • Includes debug logging of pruning statistics
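For intuition, similarity-based retention behaves roughly like the sketch below. This is a hypothetical NumPy stand-in, not the shared `compute_retention_mask` from `vllm.multimodal.evs`, which differs in detail; `keep_counts[t]` is assumed to give the per-frame retention budget.

```python
import numpy as np

def retention_mask_by_similarity(frames: np.ndarray, keep_counts) -> np.ndarray:
    """Sketch of EVS-style pruning: for each frame after the first, retain
    the tokens least similar (by cosine) to the co-located token in the
    previous frame, i.e. the least temporally redundant ones."""
    t, n, _ = frames.shape  # (frames, tokens_per_frame, hidden_dim)
    mask = np.zeros((t, n), dtype=bool)
    mask[0] = True          # the first frame is always fully retained
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    prev = unit(frames[0])
    for i in range(1, t):
        cur = unit(frames[i])
        sim = (cur * prev).sum(axis=-1)                 # per-token cosine
        keep = np.argsort(sim)[: int(keep_counts[i])]   # least similar first
        mask[i, keep] = True
        prev = cur
    return mask
```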

d. Image embedding post-processing (lines 1464–1495)

_postprocess_image_embeds_evs:

  • Appends MRoPE positions for images
  • Required when EVS is active to maintain compatibility

e. EVS-aware frame-offset extraction (lines 1615–1696)

Enhances:

  • iter_mm_grid_hw
  • _extract_frame_offsets_from_mask

Capabilities:

  • Handles both the EVS and non-EVS paths
  • Recovers sparse retention patterns, where each frame may keep a different number of tokens
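Mask-based offset recovery can be sketched as follows. The helper and its inputs are hypothetical (the real `_extract_frame_offsets_from_mask` reads the `is_embed` mask stored in `mm_position`), but the idea is the same: walk the retained positions in per-frame chunks.

```python
import numpy as np

def frame_offsets_from_mask(is_embed: np.ndarray, frame_lens) -> list:
    """Sketch: given the sparse is_embed mask over a placeholder span and
    the number of tokens retained per frame, recover the position of each
    frame's first retained token."""
    positions = np.flatnonzero(is_embed)   # indices of retained tokens
    assert positions.size == sum(frame_lens), "mask/frame length mismatch"
    offsets, cursor = [], 0
    for flen in frame_lens:
        offsets.append(int(positions[cursor]))  # start of this frame's run
        cursor += flen
    return offsets
```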

f. MRoPE position recomputation (lines 1698–1743)

recompute_mrope_positions:

  • Updates positional embeddings after pruning
  • Uses shared logic from vllm.multimodal.evs

g. Fixed MRoPE position calculation (lines 1765–1776)

  • Avoid negative text_len when frames are consecutive after pruning
  • Skip text position build when text_len <= 0
  • Guarantees monotonically increasing MRoPE indices
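The guard in (g) can be illustrated with a simplified, self-contained sketch. `build_positions` and its inputs are hypothetical; the actual fix lives in `get_mrope_input_positions`. The point is that when pruning makes two frames adjacent, `text_len` becomes zero (or negative), and the text segment must be skipped so positions stay monotonically increasing.

```python
import numpy as np

def build_positions(frame_offsets, frame_lens) -> np.ndarray:
    """Sketch of the guarded MRoPE position build: emit text positions only
    when there is text between consecutive frames (text_len > 0)."""
    pos_chunks = []
    st, st_idx = 0, 0
    for offset, flen in zip(frame_offsets, frame_lens):
        text_len = offset - st
        if text_len > 0:  # skip when frames are consecutive after pruning
            pos_chunks.append(
                np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
            )
            st_idx += text_len
        pos_chunks.append(np.broadcast_to(np.arange(flen), (3, flen)) + st_idx)
        st_idx += flen
        st = offset + flen
    return np.concatenate(pos_chunks, axis=1)
```

With the guard removed, a zero or negative `text_len` would make `np.broadcast_to` raise a ValueError, which is exactly the failure the fix addresses.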

2. Shared EVS Utilities

Imported from vllm.multimodal.evs:

  • compute_retention_mask – core EVS pruning logic
  • compute_retained_tokens_count
  • compute_mrope_for_media
  • recompute_mrope_positions

Reuses identical EVS algorithms from Qwen2.5-VL for consistency.
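For intuition, the retained-count arithmetic behaves roughly like this hypothetical stand-in for `compute_retained_tokens_count` (the exact semantics of the shared utility may differ):

```python
def retained_tokens_count(total_tokens: int, pruning_rate: float,
                          tokens_per_frame: int) -> int:
    """Sketch: keep (1 - rate) of the video tokens, but never fewer than
    one full frame, since the first frame is always fully retained."""
    return max(tokens_per_frame, int(total_tokens * (1.0 - pruning_rate)))
```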


@mergify mergify bot added the qwen Related to Qwen models label Nov 30, 2025
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces Efficient Video Sampling (EVS) for Qwen3-VL, enhancing inference efficiency by pruning video tokens. The implementation is well-structured, reusing existing EVS utilities and adding necessary model-specific logic for token distribution, embedding post-processing, and MRoPE position calculation. The changes also include a bug fix for MRoPE position computation.

My review focuses on a potential issue in the fallback logic for EVS. I've identified an inconsistency that could lead to incorrect behavior if the fallback path is triggered. Please see the detailed comment for more information.

Overall, this is a great feature addition that brings Qwen3-VL to parity with Qwen2.5-VL. The included tests and benchmarks are also very helpful.

@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch 2 times, most recently from 3f2f542 to 31b2195 on November 30, 2025 16:41
@skyloevil (Contributor, Author)

@DarkLight1337 @ywang96 PTAL.

@DarkLight1337 (Member)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces EVS (Efficient Video Sampling) support for Qwen3-VL models, which is a valuable feature for improving inference efficiency by enabling dynamic video token pruning. The implementation is comprehensive, touching on placeholder generation, embedding post-processing, and MRoPE position recalculation. The reuse of shared EVS utilities from vllm.multimodal.evs is a good practice for consistency. I've identified one critical issue that could lead to a crash under high video pruning rates, and I've provided a suggestion to resolve it.

@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch from 0a0d149 to 44a6e5a on December 1, 2025 03:40
@DarkLight1337 (Member)

Can you fix pre-commit?

skyloevil and others added 11 commits December 1, 2025 12:23
Implement multimodal pruning capabilities to optimize video token processing:
- Add video_pruning_rate configuration support
- Implement EVS-based video embedding pruning with retention masks
- Add MRoPE position recomputation for pruned sequences
- Add postprocessing for both image and video embeddings
- Include test coverage for the new functionality

Signed-off-by: zitian.zhao <[email protected]>
- Add INFO level log during model initialization to show EVS is enabled
  and display the configured pruning rate
- Add DEBUG level log during video processing to show detailed pruning
  statistics including original/retained token counts, video dimensions,
  and actual reduction percentage
- No functional changes, pure observability enhancement to help users
  verify EVS configuration and monitor pruning behavior

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
This commit addresses the ValueError that occurs when EVS (Efficient
Video Sampling) is enabled with video_pruning_rate > 0.

Changes:
- Add EVS detection logic in iter_mm_grid_hw method
- Implement _extract_frame_offsets_from_mask to extract frame offsets
  from the is_embed mask stored in mm_position
- Add fallback to uniform distribution when mask is unavailable
- Preserve original non-EVS behavior completely

The new implementation supports sparse EVS retention patterns where
different frames can have different numbers of retained tokens, which
is the actual behavior of the EVS pruning algorithm.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
When using EVS (Efficient Video Sampling), frame offsets are extracted
from the is_embed mask and may not be strictly increasing. This caused
text_len (offset - st) to become negative in get_mrope_input_positions,
leading to ValueError in np.broadcast_to.

Changes:
- Skip text position creation when text_len <= 0 (no text between frames)
- Update st_idx after adding text positions to maintain position continuity
- Use st_idx directly for video frame positions instead of (text_len + st_idx)

This ensures position indices remain monotonically increasing even when
frames are consecutive in the mask.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
- Remove tests/models/multimodal/generation/test_qwen3_vl.py
- Remove test_evs_fix.py (development test file)

These test files were part of the development process and are not
needed in the final implementation.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
The fallback logic in iter_mm_grid_hw had several issues:
1. Token distribution inconsistency: Used uniform distribution
   (total // t) instead of first-frame-full distribution used
   by the processor
2. Incorrect offset calculation: Only counted video_pad tokens,
   ignored timestamp and start/end tokens in the placeholder
3. Should never trigger: Processor always generates is_embed mask

Replace with RuntimeError to catch bugs early if mask is missing.

Based on code review feedback.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Co-authored-by: skyloevil <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch from 0128fbf to 0d44edd on December 1, 2025 04:24
if text_len > 0:
    llm_pos_ids_list.append(
        np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
    )
Member comment on this hunk:
cc @ywang96 @Isotr0py can you double check this updated M-RoPE implementation?

@skyloevil (Contributor, Author)

> Can you fix pre-commit?

Solved. @DarkLight1337
