Description
Code:
3945858be258c95656fdeabcaf56413b35dd368e
Test method:
dashinfer_vlm_serve --model Qwen2.5-VL-3B-Instruct --host 127.0.0.1 --vision_engine tensorrt
Version:
transformers 4.54.0
torch 2.7.1
torchvision 0.22.1
onnx 1.18.0
tensorrt 10.5.0
tensorrt-cu12 10.13.0.35
tensorrt-cu12-bindings 10.13.0.35
tensorrt-cu12-libs 10.13.0.35
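Note that two TensorRT builds are listed at different versions (the `tensorrt` 10.5.0 wheel next to the `tensorrt-cu12` 10.13.0.35 wheels). A minimal sketch (nothing dashinfer-specific) to check which build the serving process actually imports:
import tensorrt as trt
# Print the version and location of the TensorRT build that Python resolves,
# since two different versions are installed side by side.
print(trt.__version__)
print(trt.__file__)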
Error log:
call setenv()
AllSpark python package start init.
[Info] No Multi-NUMA support on CUDA Version.
[INFO ] args: Namespace(host='127.0.0.1', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_keys=None, ssl=False, model='/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', vision_engine='tensorrt', device='cuda', max_length=32000, max_batch=128, parallel_size=1, enable_prefix_cache=False, quant_type=None, dtype='bfloat16', min_pixels=3136, max_pixels=12845056)
defaultdict(None, {'host': '127.0.0.1', 'port': 8000, 'allow_credentials': False, 'allowed_origins': ['*'], 'allowed_methods': ['*'], 'allowed_headers': ['*'], 'api_keys': None, 'ssl': False, 'model': '/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', 'vision_engine': 'tensorrt', 'device': 'cuda', 'max_length': 32000, 'max_batch': 128, 'parallel_size': 1, 'enable_prefix_cache': False, 'quant_type': None, 'dtype': 'bfloat16', 'min_pixels': 3136, 'max_pixels': 12845056})
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.27it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
model config:
{'vision_config': Qwen2_5_VLVisionConfig {
"depth": 32,
"fullatt_block_indexes": [
7,
15,
23,
31
],
"hidden_act": "silu",
"hidden_size": 1280,
"in_channels": 3,
"in_chans": 3,
"initializer_range": 0.02,
"intermediate_size": 3420,
"model_type": "qwen2_5_vl",
"num_heads": 16,
"out_hidden_size": 2048,
"patch_size": 14,
"spatial_merge_size": 2,
"spatial_patch_size": 14,
"temporal_patch_size": 2,
"tokens_per_second": 2,
"transformers_version": "4.54.0",
"window_size": 112
}
, 'text_config': Qwen2_5_VLTextConfig {
"architectures": [
"Qwen2_5_VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"image_token_id": null,
"initializer_range": 0.02,
"intermediate_size": 11008,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 128000,
"max_window_layers": 70,
"model_type": "qwen2_5_vl_text",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.54.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": null,
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 151936
}
, 'image_token_id': 151655, 'video_token_id': 151656, 'return_dict': True, 'output_hidden_states': False, 'torchscript': False, 'torch_dtype': torch.bfloat16, '_output_attentions': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'architectures': ['Qwen2_5_VLForConditionalGeneration'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'task_specific_params': None, 'problem_type': None, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': 151643, 'pad_token_id': None, 'eos_token_id': 151645, 'sep_token_id': None, 'decoder_start_token_id': None, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, '_name_or_path': '/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', '_commit_hash': None, '_attn_implementation_internal': None, 'transformers_version': '4.41.2', 'attention_dropout': 0.0, 'vision_start_token_id': 151652, 'vision_end_token_id': 151653, 'vision_token_id': 151654, 'hidden_act': 'silu', 'hidden_size': 2048, 'initializer_range': 0.02, 'intermediate_size': 11008, 'max_position_embeddings': 128000, 'max_window_layers': 70, 'model_type': 'Qwen_v20', 'num_attention_heads': 16, 'num_hidden_layers': 36, 'num_key_value_heads': 2, 'rms_norm_eps': 1e-06, 'rope_theta': 1000000.0, 'sliding_window': 32768, 'use_cache': True, 'use_sliding_window': False, 'rope_scaling': {'type': 'default', 'mrope_section': [16, 24, 24], 'rope_type': 'default'}, 'vocab_size': 151936, 'tf_legacy_loss': False, 'use_bfloat16': False, 'rotary_emb_base': 1000000.0, 'size_per_head': 128}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250729 23:12:53.070276 1730607 thread_pool_with_id.h:37] ThreadPoolWithID init with thread number: 1
I20250729 23:12:53.070394 1730607 thread_pool_with_id.h:37] ThreadPoolWithID init with thread number: 1
I20250729 23:12:53.070473 1730607 as_engine.cpp:107] AllSpark Init with Version: 2.4.0/(GitSha1:169754a8-dirty)
Qwen VL 2.5 start convert onnx.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Start converting ONNX model!
Loading safetensors checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.34it/s]
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:374: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
for t, h, w in grid_thw:
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:416: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
for grid_t, grid_h, grid_w in grid_thw:
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:435: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
cu_window_seqlens.extend(cu_seqlens_tmp.tolist())
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:436: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
window_index_id += (grid_t * llm_grid_h * llm_grid_w).item()
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:445: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
cu_window_seqlens = torch.tensor(
Export to ONNX file successfully! The ONNX file stays in /root/.cache/as_model/model.onnx
Start converting TRT engine!
[07/29/2025-23:13:41] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 10806, GPU 9316 (MiB)
[07/29/2025-23:13:42] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU -1917, GPU +6, now: CPU 8687, GPU 9322 (MiB)
[07/29/2025-23:13:42] [TRT] [I] ----------------------------------------------------------------
[07/29/2025-23:13:42] [TRT] [I] Input filename: /root/.cache/as_model/model.onnx
[07/29/2025-23:13:42] [TRT] [I] ONNX IR version: 0.0.8
[07/29/2025-23:13:42] [TRT] [I] Opset version: 17
[07/29/2025-23:13:42] [TRT] [I] Producer name: pytorch
[07/29/2025-23:13:42] [TRT] [I] Producer version: 2.7.1
[07/29/2025-23:13:42] [TRT] [I] Domain:
[07/29/2025-23:13:42] [TRT] [I] Model version: 0
[07/29/2025-23:13:42] [TRT] [I] Doc string:
[07/29/2025-23:13:42] [TRT] [I] ----------------------------------------------------------------
[07/29/2025-23:13:42] [TRT] [W] ModelImporter.cpp:653: Make sure input grid_thw has Int64 binding.
Succeeded parsing /root/.cache/as_model/model.onnx
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_12: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_10: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_10: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] Detected layernorm nodes in FP16.
[07/29/2025-23:13:43] [TRT] [W] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.
[07/29/2025-23:13:43] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/29/2025-23:13:43] [TRT] [W] Was not able to infer kOPT value(s) for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/29/2025-23:13:43] [TRT] [W] Was not able to infer kOPT value(s) for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/29/2025-23:13:44] [TRT] [I] Compiler backend is used during engine build.
[07/29/2025-23:13:55] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[07/29/2025-23:14:00] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 1: Myelin ([shape.cpp:verify_output_type:1583] Mismatched type for tensor ONNXTRT_squeezeTensor_6846_output, i32 vs. expected type:i64. In compileGraph at optimizer/myelin/codeGenerator.cpp:1346)
Traceback (most recent call last):
File "/dockerdata/dash-infer/dash-infer/python/.venv/bin/dashinfer_vlm_serve", line 10, in <module>
sys.exit(main())
File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 685, in main
init()
File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 94, in init
model_loader.load_model(direct_load=False, load_format="auto")
File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/model_loader.py", line 170, in serialize
onnx_trt_obj.generate_trt_engine(onnxFile, self.vision_model_path)
File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/trt/onnx_to_plan.py", line 203, in generate_trt_engine
raise RuntimeError("Failed building %s" % planFile)
RuntimeError: Failed building /root/.cache/as_model/model.plan
I20250729 23:14:01.364504 1730607 as_engine.cpp:113] ~AsEngine called
I20250729 23:14:01.364549 1730607 as_engine.cpp:119] model_state_map_ size:0
I20250729 23:14:01.364559 1730607 weight_manager.cpp:721] ~WeightManager
I20250729 23:14:01.364566 1730607 as_engine.cpp:143] ~AsEngineImpl finished.
I20250729 23:14:01.364686 1730787 thread_pool_with_id.h:93] dummy message for wake up.
I20250729 23:14:01.364728 1730787 thread_pool_with_id.h:47] Thread Pool with id: 0 Exit!!!
I20250729 23:14:01.364852 1730786 thread_pool_with_id.h:93] dummy message for wake up.
I20250729 23:14:01.364892 1730786 thread_pool_with_id.h:47] Thread Pool with id: 0 Exit!!!
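The build error above is an i32 vs. i64 mismatch inside Myelin, and the parser already warns "Make sure input grid_thw has Int64 binding". A minimal sketch (assuming the exported graph keeps grid_thw as a top-level input) to check what dtypes the ONNX export actually carries for its inputs:
import onnx
# Inspect the input dtypes of the exported vision model; TensorRT expects the
# grid_thw binding to be INT64, so an INT32 here would line up with the error.
m = onnx.load("/root/.cache/as_model/model.onnx")
for inp in m.graph.input:
    elem_type = inp.type.tensor_type.elem_type
    print(inp.name, onnx.TensorProto.DataType.Name(elem_type))
If the inputs are already int64, the failing step could also be retried standalone with `trtexec --onnx=/root/.cache/as_model/model.onnx` to check whether the tensorrt-cu12 10.13 libraries behave differently from the tensorrt 10.5.0 wheel.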
@x574chen Which model or transformers lib are you using?