Conversation

@daje0601 daje0601 commented Dec 2, 2025

Purpose

This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Background

Currently, vLLM's WhisperForConditionalGeneration does not implement the SupportsLoRA interface, so LoRA adapters cannot be used with Whisper models. As a result, users must deploy a separate model instance for each fine-tuned variant, which is wasteful in terms of GPU memory.

Changes

1. vllm/model_executor/models/whisper.py

  • Add the SupportsLoRA interface to WhisperForConditionalGeneration
  • Add the embedding_modules and embedding_padding_modules attributes required by LoRA
  • Update packed_modules_mapping to use simplified keys (qkv_proj, kv_proj) for LoRA compatibility (see the sketch below)
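
A hedged sketch of what these class attributes could look like (the key names follow the PR text; the values are illustrative assumptions, not copied from the diff):

# Illustrative only -- names follow the PR description, values are assumed.
class WhisperForConditionalGeneration(nn.Module, SupportsLoRA):
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],  # self-attention packs Q, K, V
        "kv_proj": ["k_proj", "v_proj"],             # cross-attention packs only K and V
    }
    embedding_modules = {}
    embedding_padding_modules = []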

2. vllm/lora/layers/column_parallel_linear.py

  • Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations
  • This is necessary because Whisper's cross-attention layers (encoder_attn.kv_proj) have only K and V projections, no Q
  • Update can_replace_layer() to accept both 2-module and 3-module configurations
  • Refactor slice_lora_a() to handle a variable number of slices dynamically (see the sketch below)
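
A minimal sketch of the slice-count-agnostic refactor (hypothetical: it assumes an n_slices attribute and may differ from the actual diff):

# Hypothetical sketch -- slice_lora_a() no longer hardcodes 3 slices but
# iterates over however many slices the packed layer declares
# (2 for kv_proj, 3 for qkv_proj).
def slice_lora_a(self, lora_a):
    return [lora_a[i] for i in range(self.n_slices)]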

3. vllm/lora/worker_manager.py

  • Add a fallback to max_target_positions when max_position_embeddings is not available (see the sketch below)
  • Whisper's config uses max_target_positions instead of max_position_embeddings
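
A minimal sketch of the fallback, assuming the manager reads the limit off the HF config (simplified, not the exact diff):

# Prefer max_position_embeddings; fall back to Whisper's
# max_target_positions when the former is missing.
max_pos = getattr(hf_config, "max_position_embeddings", None)
if max_pos is None:
    max_pos = getattr(hf_config, "max_target_positions", None)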

4. examples/offline_inference/whisper_multilora_inference.py

  • Add example script demonstrating Whisper Multi-LoRA inference

5. tests/lora/test_whisper_lora.py

  • Add unit tests for Whisper LoRA interface compliance (an illustrative sketch follows)
  • Add tests for KV-only configuration support
  • Add tests for WorkerLoRAManager Whisper compatibility
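
As an example of the interface-compliance checks, a hedged sketch of the first test in the results below (the test name matches; the body is illustrative and may differ from the PR):

# Illustrative test body; the exact assertion in the PR may differ.
from vllm.model_executor.models.whisper import WhisperForConditionalGeneration

def test_supports_lora_attribute():
    assert getattr(WhisperForConditionalGeneration, "supports_lora", False)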

Test Plan

# Run unit tests
pytest tests/lora/test_whisper_lora.py -v

# Run example (requires LoRA adapter)
python examples/offline_inference/whisper_multilora_inference.py

Test Results (Unit Tests)

tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_supports_lora_attribute PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_embedding_modules_defined PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_embedding_padding_modules_defined PASSED
tests/lora/test_whisper_lora.py::TestWhisperLoRAInterface::test_packed_modules_mapping_format PASSED
tests/lora/test_whisper_lora.py::TestMergedQKVParallelLinearWithLoRAKVOnly::test_can_replace_layer_accepts_2_modules PASSED
tests/lora/test_whisper_lora.py::TestWorkerLoRAManagerWhisperCompat::test_max_position_embeddings_fallback PASSED
tests/lora/test_whisper_lora.py::TestWorkerLoRAManagerWhisperCompat::test_max_position_embeddings_priority PASSED

Manual Testing

Tested with the openai/whisper-large-v3-turbo base model and custom LoRA adapters:

  • Server startup with --enable-lora flag: ✅
  • Single LoRA inference: ✅
  • Multi-LoRA switching between requests: ✅
  • Concurrent requests with different LoRAs: ✅

Example Usage

from vllm import LLM
from vllm.lora.request import LoRARequest

# Initialize with LoRA support
llm = LLM(
    model="openai/whisper-large-v3-turbo",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=64,
)

# Use different LoRA adapters per request
outputs = llm.generate(
    inputs,
    lora_request=LoRARequest("my_whisper_lora", 1, "/path/to/lora")
)
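
Here inputs stands for Whisper multimodal prompts. A sketch of their shape, following vLLM's offline Whisper examples (sample.wav is a placeholder):

# Hypothetical construction of `inputs` for the generate() call above.
import librosa

waveform, sr = librosa.load("sample.wav", sr=16000)
inputs = [{
    "prompt": "<|startoftranscript|>",
    "multi_modal_data": {"audio": (waveform, sr)},
}]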

Or, via the OpenAI-compatible server:

vllm serve yourname/yourmodel \
    --host 0.0.0.0 \
    --port 8181 \
    --dtype bfloat16 \
    --trust-remote-code \
    --enable-lora \
    --lora-modules \
        lora1=lora_module_path \
        lora2=lora_module_path \
    --max-lora-rank 32 \
    --gpu-memory-utilization 0.7 \
    --tensor-parallel-size 1
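
Once the server is up, an adapter is addressed by the name registered via --lora-modules. A hypothetical client call, assuming vLLM's OpenAI-compatible /v1/audio/transcriptions endpoint and the port above (sample.wav is a placeholder):

# Hypothetical request; "lora1" is the adapter name from --lora-modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8181/v1", api_key="EMPTY")
with open("sample.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(model="lora1", file=f)
print(transcription.text)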


@daje0601 daje0601 requested a review from jeejeelee as a code owner December 2, 2025 08:56

mergify bot commented Dec 2, 2025

Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/

@mergify mergify bot added the documentation label on Dec 2, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces multi-LoRA support for Whisper models, a valuable addition. Rather than a model-specific hack, the changes generalize the existing LoRA infrastructure to cover Whisper's architecture, in particular the KV-only packed layers in cross-attention. The comprehensive unit tests and clear example script round out the contribution. The code is clean, the logic is sound, and the changes are well documented. Overall, this is an excellent pull request.

@daje0601 daje0601 closed this Dec 2, 2025
@daje0601 daje0601 reopened this Dec 2, 2025
@daje0601 daje0601 force-pushed the whisper-multi-lora-support branch from 8897a78 to 93182eb on December 2, 2025 11:25
@daje0601 daje0601 force-pushed the whisper-multi-lora-support branch from 93182eb to ba3826b on December 2, 2025 11:36
@jeejeelee (Collaborator) commented

Will look at this PR ASAP, also cc @NickLucche
