[Feature] Add EVS (Efficient Video Sampling) Support for Qwen3-VL #29752
base: main
Conversation
Code Review
This pull request introduces Efficient Video Sampling (EVS) for Qwen3-VL, enhancing inference efficiency by pruning video tokens. The implementation is well-structured, reusing existing EVS utilities and adding necessary model-specific logic for token distribution, embedding post-processing, and MRoPE position calculation. The changes also include a bug fix for MRoPE position computation.
My review focuses on a potential issue in the fallback logic for EVS. I've identified an inconsistency that could lead to incorrect behavior if the fallback path is triggered. Please see the detailed comment for more information.
Overall, this is a great feature addition that brings Qwen3-VL to parity with Qwen2.5-VL. The included tests and benchmarks are also very helpful.
Force-pushed from 3f2f542 to 31b2195.
@DarkLight1337 @ywang96 PTAL.
/gemini review
Code Review
This pull request introduces EVS (Efficient Video Sampling) support for Qwen3-VL models, which is a valuable feature for improving inference efficiency by enabling dynamic video token pruning. The implementation is comprehensive, touching on placeholder generation, embedding post-processing, and MRoPE position recalculation. The reuse of shared EVS utilities from vllm.multimodal.evs is a good practice for consistency. I've identified one critical issue that could lead to a crash under high video pruning rates, and I've provided a suggestion to resolve it.
Force-pushed from 0a0d149 to 44a6e5a.
Can you fix pre-commit?
Implement multimodal pruning capabilities to optimize video token processing:
- Add video_pruning_rate configuration support
- Implement EVS-based video embedding pruning with retention masks
- Add MRoPE position recomputation for pruned sequences
- Add postprocessing for both image and video embeddings
- Include test coverage for the new functionality

Signed-off-by: zitian.zhao <[email protected]>
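For illustration, enabling the feature could look like the minimal sketch below, assuming `video_pruning_rate` is plumbed through the engine arguments the same way as the existing Qwen2.5-VL EVS support (the checkpoint name here is illustrative, not taken from this PR):

```python
# Hedged usage sketch: turn on EVS at engine construction time.
# `video_pruning_rate` follows the Qwen2.5-VL EVS convention; verify
# the exact argument name against the merged diff.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model name
    video_pruning_rate=0.3,            # ask EVS to drop ~30% of video tokens
)
```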
- Add INFO level log during model initialization to show that EVS is enabled and display the configured pruning rate
- Add DEBUG level log during video processing to show detailed pruning statistics, including original/retained token counts, video dimensions, and the actual reduction percentage
- No functional changes; pure observability enhancement to help users verify EVS configuration and monitor pruning behavior

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
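The two log sites might look roughly like the following (message wording and variable names are assumptions based on the commit message, not the exact diff):

```python
# INFO once at model init: confirms EVS is active and at what rate.
logger.info("EVS enabled for Qwen3-VL: video_pruning_rate=%.2f",
            video_pruning_rate)

# DEBUG per video: pruning statistics for monitoring.
logger.debug(
    "EVS pruning: retained %d/%d tokens for grid (t=%d, h=%d, w=%d), "
    "reduction=%.1f%%",
    retained_tokens, original_tokens, t, h, w,
    100.0 * (1 - retained_tokens / original_tokens),
)
```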
This commit addresses the ValueError that occurs when EVS (Efficient Video Sampling) is enabled with video_pruning_rate > 0.

Changes:
- Add EVS detection logic in the iter_mm_grid_hw method
- Implement _extract_frame_offsets_from_mask to extract frame offsets from the is_embed mask stored in mm_position
- Add fallback to uniform distribution when the mask is unavailable
- Preserve the original non-EVS behavior completely

The new implementation supports sparse EVS retention patterns where different frames can retain different numbers of tokens, which is the actual behavior of the EVS pruning algorithm.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
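A self-contained sketch of the idea behind `_extract_frame_offsets_from_mask` (the real helper's signature differs; this only shows how offsets can be recovered from a boolean `is_embed` mask, and it simplifies by assuming text tokens separate frames — the actual code also handles frames that become adjacent after pruning):

```python
import numpy as np

def frame_offsets_from_mask(is_embed: np.ndarray) -> list[int]:
    """Return the start offset of each contiguous run of retained
    (is_embed=True) positions. With per-frame timestamp text between
    frames, each run corresponds to one (possibly pruned) frame."""
    padded = np.concatenate(([False], is_embed))
    # A run starts wherever the mask flips False -> True.
    starts = np.nonzero(~padded[:-1] & padded[1:])[0]
    return starts.tolist()

# Example: two frames retaining 3 and 2 tokens, separated by text tokens.
mask = np.array([False, True, True, True, False, False, True, True])
assert frame_offsets_from_mask(mask) == [1, 6]
```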
When using EVS (Efficient Video Sampling), frame offsets are extracted from the is_embed mask and may not be strictly increasing. This caused text_len (offset - st) to become negative in get_mrope_input_positions, leading to a ValueError in np.broadcast_to.

Changes:
- Skip text position creation when text_len <= 0 (no text between frames)
- Update st_idx after adding text positions to maintain position continuity
- Use st_idx directly for video frame positions instead of (text_len + st_idx)

This ensures position indices remain monotonically increasing even when frames are consecutive in the mask.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
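In condensed form, the fixed logic looks like the sketch below (the surrounding method context is simplified into a standalone function; variable names follow the diff shown later in this thread):

```python
import numpy as np

def append_text_positions(llm_pos_ids_list: list, st: int, offset: int,
                          st_idx: int) -> int:
    """Sketch of the corrected step in get_mrope_input_positions.
    Returns the updated st_idx. text_len can be <= 0 when EVS pruning
    leaves two frames adjacent in the is_embed mask."""
    text_len = offset - st
    if text_len > 0:
        # Text between frames: same position index on all 3 MRoPE axes.
        llm_pos_ids_list.append(
            np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
        )
        st_idx += text_len  # keep positions monotonically increasing
    # Video frame positions then start at st_idx directly (not at
    # text_len + st_idx, which would double-count the text span).
    return st_idx
```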
Signed-off-by: zitian.zhao <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
- Remove tests/models/multimodal/generation/test_qwen3_vl.py
- Remove test_evs_fix.py (development test file)

These test files were part of the development process and are not needed in the final implementation.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Co-Authored-By: deitxfge <[email protected]> Signed-off-by: zitian.zhao <[email protected]>
The fallback logic in iter_mm_grid_hw had several issues:
1. Token distribution inconsistency: used uniform distribution (total // t) instead of the first-frame-full distribution used by the processor
2. Incorrect offset calculation: only counted video_pad tokens, ignoring the timestamp and start/end tokens in the placeholder
3. Should never trigger: the processor always generates the is_embed mask

Replace the fallback with a RuntimeError to catch bugs early if the mask is missing. Based on code review feedback.

Co-Authored-By: deitxfge <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
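A sketch of the stricter behavior (the `is_embed` field exists on vLLM's PlaceholderRange; the surrounding method context is condensed and the error message is illustrative):

```python
# Hedged sketch: fail fast instead of guessing a token distribution.
is_embed = mm_position.is_embed
if is_embed is None:
    # The processor always emits this mask when EVS is active, so a
    # missing mask indicates an upstream bug rather than a case to
    # approximate with a uniform distribution.
    raise RuntimeError(
        "EVS enabled but mm_position.is_embed is missing; cannot "
        "recover per-frame offsets for pruned video tokens."
    )
```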
Co-authored-by: skyloevil <[email protected]> Signed-off-by: zitian.zhao <[email protected]>
Signed-off-by: zitian.zhao <[email protected]>
Force-pushed from 0128fbf to 0d44edd.
```python
if text_len > 0:
    llm_pos_ids_list.append(
        np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
    )
```
Solved. @DarkLight1337
Summary
This PR implements EVS (Efficient Video Sampling) support for Qwen3-VL models, enabling dynamic video token pruning to improve inference efficiency while maintaining accuracy.
This brings Qwen3-VL to feature parity with Qwen2.5-VL’s EVS capabilities.
Benchmark
Start:
Bench:
Pruning Rate 10%
Pruning Rate 20%
Pruning Rate 30%
Pruning Rate 40%
Pruning Rate 50%
Pruning Rate 60%
Pruning Rate 70%
Pruning Rate 80%
Pruning Rate 90%
Pruning Rate 100%
Test
Start:
Image Test:
Video Test:
Log
Changes
1. Core EVS Infrastructure
File: vllm/model_executor/models/qwen3_vl.py

a. Added SupportsMultiModalPruning interface
Enables EVS handling inside the V1 engine.

b. Processor-level placeholder generation (lines 1050–1087)
Placeholders account for the configured video_pruning_rate.

c. Encoder-level video pruning (lines 1497–1549)
_postprocess_video_embeds_evs: prunes video embeddings using retention masks from compute_retention_mask.

d. Image embedding post-processing (lines 1464–1495)
_postprocess_image_embeds_evs: handles image embeddings when pruning is enabled.

e. EVS-aware frame-offset extraction (lines 1615–1696)
Enhances iter_mm_grid_hw and adds _extract_frame_offsets_from_mask.
Capabilities: supports sparse retention patterns where different frames retain different numbers of tokens.

f. MRoPE position recomputation (lines 1698–1743)
recompute_mrope_positions: recomputes positions for pruned sequences using vllm.multimodal.evs.

g. Fixed MRoPE position calculation (lines 1765–1776)
Fixes negative text_len when frames are consecutive after pruning by skipping text positions when text_len <= 0.

2. Shared EVS Utilities
Imported from vllm.multimodal.evs:
- compute_retention_mask – core EVS pruning logic
- compute_retained_tokens_count
- compute_mrope_for_media
- recompute_mrope_positions

Reuses identical EVS algorithms from Qwen2.5-VL for consistency.
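To make the mechanism concrete, here is a toy stand-in for the retention-mask idea — this is not the vllm.multimodal.evs implementation, just the pattern it follows: keep the first frame in full, then keep the tokens that differ most from the same spatial position in the previous frame.

```python
import torch

def toy_retention_mask(frames: torch.Tensor,
                       pruning_rate: float) -> torch.Tensor:
    """Toy stand-in for compute_retention_mask (not the real API).
    frames: (T, N, D) per-frame token embeddings.
    Returns a boolean mask over the T*N flattened video tokens."""
    t, n, _ = frames.shape
    # Dissimilarity of each token vs. the same position in the prior frame.
    diff = 1 - torch.cosine_similarity(frames[1:], frames[:-1], dim=-1)
    # Infinite score for frame 0 so it is always retained in full,
    # mirroring the first-frame-full distribution noted in this PR.
    scores = torch.cat([torch.full((1, n), torch.inf), diff])
    keep = int(round(t * n * (1 - pruning_rate)))
    idx = scores.flatten().topk(keep).indices
    mask = torch.zeros(t * n, dtype=torch.bool)
    mask[idx] = True
    return mask

# Apply the mask to drop pruned embeddings:
# pruned = video_embeds.flatten(0, 1)[toy_retention_mask(frames, 0.3)]
```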