-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
[Misc] Unify tokenizer registration #29767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Misc] Unify tokenizer registration #29767
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively unifies tokenizer registration by moving get_tokenizer to vllm.tokenizers and registering in-tree tokenizers like HfTokenizer and MistralTokenizer using a new decorator-based mechanism. This refactoring simplifies the tokenizer loading logic. I have identified two high-severity issues: one related to a breaking change that lacks a proper deprecation cycle, and another concerning a logic flaw in the new TokenizerRegistry.register method that could lead to silent failures. Addressing these will improve backward compatibility and the robustness of the new API.
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Hashem Hashemi <[email protected]>
Purpose
tokenizer_modeinstead oftokenizer_nameas the registry ID, with temporary backward compatibility.get_tokenizer, also registerHfTokenizerandMistralTokenizertoTokenizerRegistry.get_tokenizertovllm.tokenizers. The old location will be deprecated in the next PR.Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.