
Conversation

@skyloevil (Contributor) commented Nov 30, 2025

Summary

This PR implements EVS (Efficient Video Sampling) support for Qwen3-VL models, enabling dynamic video token pruning to improve inference efficiency while maintaining accuracy.
This brings Qwen3-VL to feature parity with Qwen2.5-VL’s EVS capabilities.


Benchmark

Start:

vllm serve "Qwen/Qwen3-VL-4B-Instruct" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --video_pruning_rate 0.4 \
    --allowed-local-media-path /workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/

Bench:

vllm bench serve \
    --model "Qwen/Qwen3-VL-4B-Instruct" \
    --backend "openai-chat" \
    --base-url "http://localhost:8000/" \
    --endpoint "/v1/chat/completions" \
    --dataset-name "sharegpt" \
    --dataset-path "/workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/snapshots/947b139d0c591235a3f311e45a31a88461bad0d2/msrvtt_sharegpt.json" \
    --num-prompts 10 \
    --request-rate 1 \
    --save-result \
    --result-dir benchmarks_results \
    --result-filename test_video_benchmark.json
Pruning Rate 10%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.33
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.83
Peak output token throughput (tok/s):    58.00
Peak concurrent requests:                9.00
Total Token throughput (tok/s):          18.71
---------------Time to First Token----------------
Mean TTFT (ms):                          638.17
Median TTFT (ms):                        421.46
P99 TTFT (ms):                           3480.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.15
Median TPOT (ms):                        43.18
P99 TPOT (ms):                           306.62
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.64
Median ITL (ms):                         37.99
P99 ITL (ms):                            459.81
==================================================
 
Pruning Rate 20%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.33
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.83
Peak output token throughput (tok/s):    56.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          18.71
---------------Time to First Token----------------
Mean TTFT (ms):                          611.81
Median TTFT (ms):                        460.34
P99 TTFT (ms):                           3382.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.65
Median TPOT (ms):                        48.21
P99 TPOT (ms):                           253.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.86
Median ITL (ms):                         38.02
P99 ITL (ms):                            296.69
==================================================
 
Pruning Rate 30%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.21
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.73
---------------Time to First Token----------------
Mean TTFT (ms):                          538.68
Median TTFT (ms):                        384.46
P99 TTFT (ms):                           3314.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.24
Median TPOT (ms):                        38.95
P99 TPOT (ms):                           223.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.40
Median ITL (ms):                         37.57
P99 ITL (ms):                            279.25
==================================================
 
Pruning Rate 40%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.19
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.74
---------------Time to First Token----------------
Mean TTFT (ms):                          530.50
Median TTFT (ms):                        388.43
P99 TTFT (ms):                           3231.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.88
Median TPOT (ms):                        39.01
P99 TPOT (ms):                           195.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.09
Median ITL (ms):                         37.16
P99 ITL (ms):                            270.88
==================================================
 
Pruning Rate 50%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.20
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.73
---------------Time to First Token----------------
Mean TTFT (ms):                          521.40
Median TTFT (ms):                        361.01
P99 TTFT (ms):                           3426.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.53
Median TPOT (ms):                        39.37
P99 TPOT (ms):                           203.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.16
Median ITL (ms):                         37.55
P99 ITL (ms):                            258.28
==================================================
 
Pruning Rate 60%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.18
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.84
Peak output token throughput (tok/s):    61.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.74
---------------Time to First Token----------------
Mean TTFT (ms):                          499.49
Median TTFT (ms):                        343.21
P99 TTFT (ms):                           3358.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.17
Median TPOT (ms):                        38.32
P99 TPOT (ms):                           130.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54
Median ITL (ms):                         36.99
P99 ITL (ms):                            190.42
==================================================
 
Pruning Rate 70%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.10
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.85
Peak output token throughput (tok/s):    58.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.75
---------------Time to First Token----------------
Mean TTFT (ms):                          481.76
Median TTFT (ms):                        320.49
P99 TTFT (ms):                           3424.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.20
Median TPOT (ms):                        38.07
P99 TPOT (ms):                           173.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.62
Median ITL (ms):                         36.81
P99 ITL (ms):                            195.79
==================================================
 
Pruning Rate 80%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.10
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.85
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.75
---------------Time to First Token----------------
Mean TTFT (ms):                          500.78
Median TTFT (ms):                        367.62
P99 TTFT (ms):                           3194.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          49.97
Median TPOT (ms):                        38.52
P99 TPOT (ms):                           161.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.89
Median ITL (ms):                         36.77
P99 ITL (ms):                            227.34
==================================================
   
Pruning Rate 90%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  101.00
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.86
Peak output token throughput (tok/s):    59.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          18.77
---------------Time to First Token----------------
Mean TTFT (ms):                          452.43
Median TTFT (ms):                        305.85
P99 TTFT (ms):                           3918.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.27
Median TPOT (ms):                        37.67
P99 TPOT (ms):                           134.44
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.46
Median ITL (ms):                         36.46
P99 ITL (ms):                            196.09
==================================================
   
Pruning Rate 100%
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  100.98
Total input tokens:                      900
Total generated tokens:                  996
Request throughput (req/s):              0.99
Output token throughput (tok/s):         9.86
Peak output token throughput (tok/s):    60.00
Peak concurrent requests:                7.00
Total Token throughput (tok/s):          18.78
---------------Time to First Token----------------
Mean TTFT (ms):                          461.73
Median TTFT (ms):                        312.73
P99 TTFT (ms):                           3956.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.39
Median TPOT (ms):                        37.63
P99 TPOT (ms):                           133.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.15
Median ITL (ms):                         36.52
P99 ITL (ms):                            198.98
==================================================

Test

Start:

vllm serve "Qwen/Qwen3-VL-4B-Instruct" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --video_pruning_rate 0.5 \
    --allowed-local-media-path /workspace/dev_vllm/datasets/MSR-VTT/datasets--VLM2Vec--MSR-VTT/

Image Test:

python examples/online_serving/openai_chat_completion_client_for_multimodal.py

Video Test:

python examples/online_serving/openai_chat_completion_client_for_multimodal.py -c video

Log

(server log attached as a screenshot)

Changes

1. Core EVS Infrastructure

File: vllm/model_executor/models/qwen3_vl.py

a. Added SupportsMultiModalPruning interface

Enables EVS handling inside the V1 engine.

b. Processor-level placeholder generation (lines 1050–1087)

  • Computes per-frame token allocations from video_pruning_rate
  • First frame always receives full tokens_per_frame
  • Remaining tokens distributed across later frames
  • Generates an is_embed mask that matches expected token counts
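The allocation rule described above can be pictured with a small sketch. This is a minimal illustration only: `allocate_frame_tokens` and its rounding behavior are hypothetical, not the actual vLLM processor code, but it follows the stated policy of giving the first frame its full budget and spreading the remainder over later frames.

```python
import numpy as np

def allocate_frame_tokens(num_frames: int, tokens_per_frame: int,
                          pruning_rate: float) -> np.ndarray:
    """Hypothetical sketch: split the retained-token budget across frames,
    with the first frame always receiving its full allocation."""
    total = num_frames * tokens_per_frame
    retained = max(tokens_per_frame, int(total * (1.0 - pruning_rate)))
    alloc = np.zeros(num_frames, dtype=np.int64)
    alloc[0] = tokens_per_frame          # first frame keeps every token
    remaining = retained - tokens_per_frame
    if num_frames > 1 and remaining > 0:
        base, extra = divmod(remaining, num_frames - 1)
        alloc[1:] = base                 # even share for later frames
        alloc[1 : 1 + extra] += 1        # distribute the remainder
    return alloc
```

The `is_embed` mask would then mark exactly `alloc[t]` retained positions inside frame `t`'s placeholder span.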

c. Encoder-level video pruning (lines 1497–1549)

_postprocess_video_embeds_evs:

  • Applies EVS via cosine similarity
  • Obtains retention mask using compute_retention_mask
  • Appends MRoPE positions for pruned tokens
  • Includes debug logging of pruning statistics
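For intuition, similarity-based retention behaves roughly like the sketch below. This is a hypothetical NumPy stand-in, not the shared `compute_retention_mask` from `vllm.multimodal.evs`, which differs in detail; `keep_counts[t]` is assumed to give the per-frame retention budget.

```python
import numpy as np

def retention_mask_by_similarity(frames: np.ndarray, keep_counts) -> np.ndarray:
    """Sketch of EVS-style pruning: for each frame after the first, retain
    the tokens least similar (by cosine) to the co-located token in the
    previous frame, i.e. the least temporally redundant ones."""
    t, n, _ = frames.shape  # (frames, tokens_per_frame, hidden_dim)
    mask = np.zeros((t, n), dtype=bool)
    mask[0] = True          # the first frame is always fully retained
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    prev = unit(frames[0])
    for i in range(1, t):
        cur = unit(frames[i])
        sim = (cur * prev).sum(axis=-1)                 # per-token cosine
        keep = np.argsort(sim)[: int(keep_counts[i])]   # least similar first
        mask[i, keep] = True
        prev = cur
    return mask
```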

d. Image embedding post-processing (lines 1464–1495)

_postprocess_image_embeds_evs:

  • Appends MRoPE positions for images
  • Required when EVS is active to maintain compatibility

e. EVS-aware frame-offset extraction (lines 1615–1696)

Enhances:

  • iter_mm_grid_hw
  • _extract_frame_offsets_from_mask

Capabilities:

  • Handles both the EVS and non-EVS paths
  • Recovers sparse retention patterns, where each frame may keep a different number of tokens
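Mask-based offset recovery can be sketched as follows. The helper and its inputs are hypothetical (the real `_extract_frame_offsets_from_mask` reads the `is_embed` mask stored in `mm_position`), but the idea is the same: walk the retained positions in per-frame chunks.

```python
import numpy as np

def frame_offsets_from_mask(is_embed: np.ndarray, frame_lens) -> list:
    """Sketch: given the sparse is_embed mask over a placeholder span and
    the number of tokens retained per frame, recover the position of each
    frame's first retained token."""
    positions = np.flatnonzero(is_embed)   # indices of retained tokens
    assert positions.size == sum(frame_lens), "mask/frame length mismatch"
    offsets, cursor = [], 0
    for flen in frame_lens:
        offsets.append(int(positions[cursor]))  # start of this frame's run
        cursor += flen
    return offsets
```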

f. MRoPE position recomputation (lines 1698–1743)

recompute_mrope_positions:

  • Updates positional embeddings after pruning
  • Uses shared logic from vllm.multimodal.evs

g. Fixed MRoPE position calculation (lines 1765–1776)

  • Avoid negative text_len when frames are consecutive after pruning
  • Skip text position build when text_len <= 0
  • Guarantees monotonically increasing MRoPE indices
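The guard in (g) can be illustrated with a simplified, self-contained sketch. `build_positions` and its inputs are hypothetical; the actual fix lives in `get_mrope_input_positions`. The point is that when pruning makes two frames adjacent, `text_len` becomes zero (or negative), and the text segment must be skipped so positions stay monotonically increasing.

```python
import numpy as np

def build_positions(frame_offsets, frame_lens) -> np.ndarray:
    """Sketch of the guarded MRoPE position build: emit text positions only
    when there is text between consecutive frames (text_len > 0)."""
    pos_chunks = []
    st, st_idx = 0, 0
    for offset, flen in zip(frame_offsets, frame_lens):
        text_len = offset - st
        if text_len > 0:  # skip when frames are consecutive after pruning
            pos_chunks.append(
                np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
            )
            st_idx += text_len
        pos_chunks.append(np.broadcast_to(np.arange(flen), (3, flen)) + st_idx)
        st_idx += flen
        st = offset + flen
    return np.concatenate(pos_chunks, axis=1)
```

With the guard removed, a zero or negative `text_len` would make `np.broadcast_to` raise a ValueError, which is exactly the failure the fix addresses.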

2. Shared EVS Utilities

Imported from vllm.multimodal.evs:

  • compute_retention_mask – core EVS pruning logic
  • compute_retained_tokens_count
  • compute_mrope_for_media
  • recompute_mrope_positions

Reuses identical EVS algorithms from Qwen2.5-VL for consistency.
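For intuition, the retained-count arithmetic behaves roughly like this hypothetical stand-in for `compute_retained_tokens_count` (the exact semantics of the shared utility may differ):

```python
def retained_tokens_count(total_tokens: int, pruning_rate: float,
                          tokens_per_frame: int) -> int:
    """Sketch: keep (1 - rate) of the video tokens, but never fewer than
    one full frame, since the first frame is always fully retained."""
    return max(tokens_per_frame, int(total_tokens * (1.0 - pruning_rate)))
```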


@mergify mergify bot added the qwen Related to Qwen models label Nov 30, 2025
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces Efficient Video Sampling (EVS) for Qwen3-VL, enhancing inference efficiency by pruning video tokens. The implementation is well-structured, reusing existing EVS utilities and adding necessary model-specific logic for token distribution, embedding post-processing, and MRoPE position calculation. The changes also include a bug fix for MRoPE position computation.

My review focuses on a potential issue in the fallback logic for EVS. I've identified an inconsistency that could lead to incorrect behavior if the fallback path is triggered. Please see the detailed comment for more information.

Overall, this is a great feature addition that brings Qwen3-VL to parity with Qwen2.5-VL. The included tests and benchmarks are also very helpful.

@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch 2 times, most recently from 3f2f542 to 31b2195 on November 30, 2025 16:41
@skyloevil (Contributor, Author)

@DarkLight1337 @ywang96 PTAL.

@DarkLight1337 (Member)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces EVS (Efficient Video Sampling) support for Qwen3-VL models, which is a valuable feature for improving inference efficiency by enabling dynamic video token pruning. The implementation is comprehensive, touching on placeholder generation, embedding post-processing, and MRoPE position recalculation. The reuse of shared EVS utilities from vllm.multimodal.evs is a good practice for consistency. I've identified one critical issue that could lead to a crash under high video pruning rates, and I've provided a suggestion to resolve it.

@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch from 0a0d149 to 44a6e5a on December 1, 2025 03:40
@DarkLight1337 (Member)

Can you fix pre-commit?

skyloevil and others added 11 commits December 1, 2025 12:23
Implement multimodal pruning capabilities to optimize video token processing:
- Add video_pruning_rate configuration support
- Implement EVS-based video embedding pruning with retention masks
- Add MRoPE position recomputation for pruned sequences
- Add postprocessing for both image and video embeddings
- Include test coverage for the new functionality

Signed-off-by: zitian.zhao <[email protected]>
- Add INFO level log during model initialization to show EVS is enabled
  and display the configured pruning rate
- Add DEBUG level log during video processing to show detailed pruning
  statistics including original/retained token counts, video dimensions,
  and actual reduction percentage
- No functional changes, pure observability enhancement to help users
  verify EVS configuration and monitor pruning behavior

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
This commit addresses the ValueError that occurs when EVS (Efficient
Video Sampling) is enabled with video_pruning_rate > 0.

Changes:
- Add EVS detection logic in iter_mm_grid_hw method
- Implement _extract_frame_offsets_from_mask to extract frame offsets
  from the is_embed mask stored in mm_position
- Add fallback to uniform distribution when mask is unavailable
- Preserve original non-EVS behavior completely

The new implementation supports sparse EVS retention patterns where
different frames can have different numbers of retained tokens, which
is the actual behavior of the EVS pruning algorithm.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
When using EVS (Efficient Video Sampling), frame offsets are extracted
from the is_embed mask and may not be strictly increasing. This caused
text_len (offset - st) to become negative in get_mrope_input_positions,
leading to ValueError in np.broadcast_to.

Changes:
- Skip text position creation when text_len <= 0 (no text between frames)
- Update st_idx after adding text positions to maintain position continuity
- Use st_idx directly for video frame positions instead of (text_len + st_idx)

This ensures position indices remain monotonically increasing even when
frames are consecutive in the mask.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
- Remove tests/models/multimodal/generation/test_qwen3_vl.py
- Remove test_evs_fix.py (development test file)

These test files were part of the development process and are not
needed in the final implementation.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
The fallback logic in iter_mm_grid_hw had several issues:
1. Token distribution inconsistency: Used uniform distribution
   (total // t) instead of first-frame-full distribution used
   by the processor
2. Incorrect offset calculation: Only counted video_pad tokens,
   ignored timestamp and start/end tokens in the placeholder
3. Should never trigger: Processor always generates is_embed mask

Replace with RuntimeError to catch bugs early if mask is missing.

Based on code review feedback.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Co-authored-by: skyloevil <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
@skyloevil force-pushed the feat/qwen3-vl-evs-feat branch from 0128fbf to 0d44edd on December 1, 2025 04:24
if text_len > 0:
    llm_pos_ids_list.append(
        np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
    )
Member comment on this hunk:
cc @ywang96 @Isotr0py can you double check this updated M-RoPE implementation?

@skyloevil (Contributor, Author)

> Can you fix pre-commit?

Solved. @DarkLight1337
