[qwen3 vl] [qwen2 vl] Add multimodal RoPE and Deepstack implementation to mlx-lm language model port #237

Qwen3 VL models rely on "Deepstack" embeddings to improve image-to-text capabilities. Deepstack is not yet implemented in the mlx-lm language model port, so OCR quality with the unified architecture (i.e. `vision_add_on`) is subpar compared to the mlx-vlm implementation. Refs: [1](https://github.com/QwenLM/Qwen3-VL/blob/005de16/README.md?plain=1#L52) [2](https://github.com/Blaizzy/mlx-vlm/blob/f6ba0b5/mlx_vlm/models/qwen3_vl/language.py#L249)
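
For context, here is a rough sketch of what the Deepstack merge step is understood to do, based on a reading of reference [2]: after an early decoder layer, a per-layer visual feature vector is added to the hidden state at each image-token position, and text positions are left unchanged. This is an illustration under that assumption, not the mlx-vlm or mlx-lm code. The function name `deepstack_merge` and its parameters are hypothetical, and plain Python lists stand in for `mx.array` so the snippet runs without MLX.

```python
from typing import List

Vector = List[float]


def deepstack_merge(
    hidden_states: List[Vector],
    visual_mask: List[bool],
    visual_embeds: List[Vector],
) -> List[Vector]:
    """Add one visual feature vector to each hidden state at a visual position.

    Hypothetical helper for illustration only:
    - hidden_states: one vector per token in the sequence
    - visual_mask: True where the token is an image placeholder
    - visual_embeds: one vector per True entry in visual_mask, in order
    """
    if len(hidden_states) != len(visual_mask):
        raise ValueError("hidden_states and visual_mask must have equal length")
    if sum(visual_mask) != len(visual_embeds):
        raise ValueError("need exactly one visual embedding per masked position")

    remaining = iter(visual_embeds)
    merged = []
    for hidden, is_visual in zip(hidden_states, visual_mask):
        if is_visual:
            visual = next(remaining)
            # Residual-style addition at image-token positions.
            merged.append([h + v for h, v in zip(hidden, visual)])
        else:
            # Text positions pass through untouched.
            merged.append(list(hidden))
    return merged


# Toy example: a 3-token sequence where the last two tokens are image tokens.
hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mask = [False, True, True]
visual = [[0.5, 0.5], [0.25, -0.25]]
merged = deepstack_merge(hidden, mask, visual)
```

In a real port this would presumably run once per Deepstack level inside the decoder loop, with array masking instead of a Python loop; the sketch only shows the data flow that the language model side currently lacks.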


Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[qwen3 vl] [qwen2 vl] Add multimodal RoPE and Deepstack implementation to mlx-lm language model port #237

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions