[qwen3 vl] [qwen2 vl] Add multimodal RoPE and Deepstack implementation to mlx-lm language model port #237

@neilmehta24

Qwen3 VL models rely on "Deepstack" embeddings to improve image-to-text capabilities. mlx-lm does not implement Deepstack yet, so OCR quality with the unified arch (i.e. vision_add_on) is noticeably worse than with the mlx-vlm implementation. Ref: 1 2
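
For context, here is a rough sketch of what the missing Deepstack step could look like. It assumes that Deepstack means adding per-level visual features to the hidden states at image-token positions after each of the first few decoder layers. That is my reading of the technique, not a confirmed description of the mlx-vlm or mlx-lm code. The names `deepstack_merge` and `run_layers_with_deepstack` are hypothetical, and plain Python lists stand in for `mx.array` so the snippet has no dependencies.

```python
from typing import Callable, List

Vector = List[float]


def deepstack_merge(
    hidden_states: List[Vector],
    visual_mask: List[bool],
    visual_features: List[Vector],
) -> List[Vector]:
    """Add one level of visual features at image-token positions.

    hidden_states:   one vector per token in the sequence
    visual_mask:     True where the token is an image placeholder
    visual_features: one vector per True entry in visual_mask, in order
    """
    if len(hidden_states) != len(visual_mask):
        raise ValueError("mask length must match sequence length")
    if sum(visual_mask) != len(visual_features):
        raise ValueError("need exactly one feature vector per image token")

    features = iter(visual_features)
    merged = []
    for vec, is_visual in zip(hidden_states, visual_mask):
        if is_visual:
            # Image token: residual-style addition of the visual feature.
            feat = next(features)
            merged.append([h + f for h, f in zip(vec, feat)])
        else:
            # Text token: passed through unchanged.
            merged.append(list(vec))
    return merged


def run_layers_with_deepstack(
    hidden_states: List[Vector],
    layers: List[Callable[[List[Vector]], List[Vector]]],
    visual_mask: List[bool],
    deepstack_features: List[List[Vector]],
) -> List[Vector]:
    """Run decoder layers; after layer i, inject deepstack_features[i] if it exists."""
    for i, layer in enumerate(layers):
        hidden_states = layer(hidden_states)
        if i < len(deepstack_features):
            hidden_states = deepstack_merge(
                hidden_states, visual_mask, deepstack_features[i]
            )
    return hidden_states
```

In an actual port the same logic would presumably live in the language model's forward pass, with the vision add-on supplying the mask and the per-level features.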

Labels: bug
