Conversation

@x574chen
Collaborator

Support running the Qwen2.5VL ViT with TensorRT

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @x574chen, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented support for running the Qwen2.5VL Vision Transformer (ViT) engine with TensorRT, which is crucial for accelerating inference. This involved adding a new dedicated Python module for the Qwen2.5VL Vision Transformer, updating the model loading and ONNX export logic to correctly handle its specific configurations, and making minor adjustments to related utility and test files. The goal is to provide a more performant inference solution for Qwen2.5VL models within the DashInfer VLM toolkit.

Highlights

  • TensorRT Support for Qwen2.5VL: I've added comprehensive TensorRT support for the Qwen2.5VL Vision Transformer (ViT) engine, enabling optimized inference performance for these models.
  • Updated Model Support Documentation: The README.md has been updated to reflect the expanded support for various Qwen2.5-VL model sizes, including 3B, 7B, 32B, and 72B.
  • Code Refactoring and Modularity: I've refactored the default_weight_loader function into a new utils.py file, improving code organization and reusability across different vision transformer implementations.
  • Test Script Enhancements: Minor adjustments were made to the OpenAI API benchmark and chat completion test scripts, primarily related to API endpoint configurations and local image handling for testing.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Qwen2.5VL with TensorRT for the Vision Transformer. The changes include adding a new model definition file for Qwen2.5VL, refactoring some utility functions, and updating the model loading and TensorRT conversion logic to handle the new model. The changes are well-structured, but I've found a few issues. There is a critical bug in the calculation of image tokens for the TRT engine, which will likely lead to runtime errors. I've also identified a high-severity issue related to code duplication and a typo in the ONNX-to-plan conversion script. Additionally, there are several medium-severity suggestions to improve code clarity, remove redundancy, and maintain consistent coding style, such as adding missing newlines at the end of files.

* embed_dim
/ (embed_dim * (spatial_merge_size**2))
)
image_tokens = int(visual_inputs["input"].shape[1] * (spatial_merge_size**2))


critical

The calculation for image_tokens appears to be incorrect. The number of tokens should be reduced by the merger operation, which effectively divides the sequence length by spatial_merge_size**2. The current implementation uses multiplication, which would incorrectly increase the number of tokens and likely cause shape mismatch errors downstream. This seems to be a bug introduced when refactoring the code.

Suggested change
image_tokens = int(visual_inputs["input"].shape[1] * (spatial_merge_size**2))
image_tokens = int(visual_inputs["input"].shape[1] / (spatial_merge_size**2))
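To make the direction of the fix concrete, here is a minimal numeric sketch (with hypothetical shapes, not the PR's actual code) of why the merger divides rather than multiplies the sequence length:

```python
# Minimal numeric sketch (hypothetical shapes, not the PR's actual code).
# A Qwen2-VL-style patch merger concatenates each spatial_merge_size x
# spatial_merge_size group of patch features into one token, so the
# sequence length handed to the LLM shrinks by spatial_merge_size**2.
spatial_merge_size = 2
vit_seq_len = 1024  # number of patches entering the ViT (hypothetical)

image_tokens = vit_seq_len // spatial_merge_size**2   # correct: divide
wrong_tokens = vit_seq_len * spatial_merge_size**2    # the buggy multiply

print(image_tokens)  # 256
print(wrong_tokens)  # 4096
```

With the multiplication, downstream consumers would expect 16x more image tokens than the merger actually emits, which matches the shape-mismatch errors the comment predicts.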

Comment on lines +31 to +53
        self.is_qwen_2_5 = is_qwen_2_5
        if is_qwen_2_5:
            self.config = Qwen2_5_VLConfig.from_pretrained(
                model_path, trust_remote_code=True, revision=None, code_revision=None
            ).vision_config
            self.model_path = model_path
            self.input_embed_dim = (
                self.config.in_channels
                * self.config.temporal_patch_size
                * self.config.patch_size
                * self.config.patch_size
            )
        else:
            self.config = Qwen2VLVisionConfig.from_pretrained(
                model_path, trust_remote_code=True, revision=None, code_revision=None
            )
            self.model_path  = model_path
            self.input_embed_dim = (
                self.config.in_channels
                * self.config.temporal_patch_size
                * self.config.patch_size
                * self.config.patch_size
            )


high

There is significant code duplication in this __init__ method. The assignments to self.model_path and the calculation of self.input_embed_dim are identical in both branches of the if/else statement. This can be refactored to remove duplication, which improves maintainability. Additionally, there is a typo with extra spaces in self.model_path = model_path in the else block.

        self.is_qwen_2_5 = is_qwen_2_5
        if is_qwen_2_5:
            self.config = Qwen2_5_VLConfig.from_pretrained(
                model_path, trust_remote_code=True, revision=None, code_revision=None
            ).vision_config
        else:
            self.config = Qwen2VLVisionConfig.from_pretrained(
                model_path, trust_remote_code=True, revision=None, code_revision=None
            )
        self.model_path = model_path
        self.input_embed_dim = (
            self.config.in_channels
            * self.config.temporal_patch_size
            * self.config.patch_size
            * self.config.patch_size
        )
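As a sanity check on the shared input_embed_dim expression, plugging in Qwen2.5-VL's default vision-config values (treated as assumptions here: in_channels=3, temporal_patch_size=2, patch_size=14) gives the flattened patch width:

```python
# Sanity check of the input_embed_dim formula using Qwen2.5-VL's default
# vision config values (assumed here: in_channels=3, temporal_patch_size=2,
# patch_size=14).
in_channels = 3
temporal_patch_size = 2
patch_size = 14

input_embed_dim = in_channels * temporal_patch_size * patch_size * patch_size
print(input_embed_dim)  # 1176
```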

Comment on lines +259 to +262
b = v.size(0)
q = q.view(b, -1, n, d)
k = k.view(b, -1, n, d)
v = v.view(b, -1, n, d)


medium

These lines are redundant. The variable b is already available as a parameter to the function, and its value is the same as v.size(0). The calls to .view() on q, k, and v are also redundant as they already have the correct shape from the operations on lines 254-256. This code can be safely removed to improve clarity.
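A quick pure-Python sketch (hypothetical dimensions) of why view(b, -1, n, d) is a no-op here: inferring the -1 dimension from a tensor that already has shape (b, s, n, d) simply recovers s.

```python
# Hypothetical dimensions for a tensor that already has the (b, s, n, d)
# layout; view(b, -1, n, d) must infer the -1 dim as total // (b * n * d).
b, s, n, d = 2, 16, 4, 8
total = b * s * n * d

inferred = total // (b * n * d)
print(inferred == s)  # True: the reshape leaves the shape unchanged
```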

# Note: only used for debug
if name not in params_dict.keys():
continue
default_weight_loader(params_dict[name], loaded_weight)


medium

For consistency and adherence to POSIX standards, please add a newline character at the end of the file.

Suggested change
default_weight_loader(params_dict[name], loaded_weight)
default_weight_loader(params_dict[name], loaded_weight)

except Exception:
# NOTE: This exception is added for the purpose of setting breakpoint to
# debug weight loading issues.
raise


medium

For consistency and adherence to POSIX standards, please add a newline character at the end of the file.

Suggested change
raise
raise

'''
from ..visual_embedding.DFN_vit import Qwen2VisionTransformer
from ..visual_embedding.DFN_vit_2_5 import Qwen2_5VisionTransformer


medium

For consistency and adherence to POSIX standards, please add a newline character at the end of the file.

Suggested change
from ..visual_embedding.DFN_vit_2_5 import Qwen2_5VisionTransformer
from ..visual_embedding.DFN_vit_2_5 import Qwen2_5VisionTransformer

from .cache import *

from .. import Qwen2VisionTransformer
from .. import Qwen2VisionTransformer, Qwen2_5VisionTransformer


medium

For consistency and adherence to POSIX standards, please add a newline character at the end of the file.

Suggested change
from .. import Qwen2VisionTransformer, Qwen2_5VisionTransformer
from .. import Qwen2VisionTransformer, Qwen2_5VisionTransformer

Comment on lines +163 to +166
            if hasattr(self.hf_model_config, "architectures") and "Qwen2_5_VLForConditionalGeneration" in self.hf_model_config.architectures:
                is_qwen_2_5= True
            else:
                is_qwen_2_5 = False


medium

This if/else block can be simplified into a single boolean assignment, which improves conciseness and readability. Also, there is a typo in is_qwen_2_5= True, which is missing a space before the =.

            is_qwen_2_5 = hasattr(self.hf_model_config, "architectures") and "Qwen2_5_VLForConditionalGeneration" in self.hf_model_config.architectures
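If the architectures attribute can also be absent, a getattr default keeps the same one-liner shape without the hasattr call. Sketched below with a hypothetical stand-in for self.hf_model_config:

```python
from types import SimpleNamespace

# Hypothetical stand-in for self.hf_model_config; the real object comes
# from the Hugging Face config loader.
hf_model_config = SimpleNamespace(
    architectures=["Qwen2_5_VLForConditionalGeneration"]
)

is_qwen_2_5 = "Qwen2_5_VLForConditionalGeneration" in getattr(
    hf_model_config, "architectures", []
)
print(is_qwen_2_5)  # True
```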

@x574chen x574chen added this pull request to the merge queue Jul 29, 2025
Merged via the queue into modelscope:main with commit 4ebf617 Jul 29, 2025
4 checks passed
kzjeef pushed a commit that referenced this pull request Jul 29, 2025
* dashinfer vlm: add tensorrt support for qwen2.5vl

* update benchmark

---------

Co-authored-by: Xiaotong Chen <[email protected]>
