Hello,
I would like to ask about the limitations of sparse embeddings, specifically BM25, when it comes to language support, and in particular support for Persian (Farsi).
supported_languages = [
"arabic",
"danish",
"dutch",
"english",
"finnish",
"french",
"german",
"greek",
"hungarian",
"italian",
"norwegian",
"portuguese",
"romanian",
"russian",
"spanish",
"swedish",
"tamil",
"turkish",
]

Are the limitations mainly related to stop word handling, tokenization, or other language-specific preprocessing steps?
If the limitation is primarily related to stop words or preprocessing, is there a recommended way to extend or customize BM25 for Persian?
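To make the question concrete, here is a minimal sketch of the language-dependent steps in a BM25-style sparse embedding: character normalization, tokenization, stop word removal, and the standard BM25 term-frequency weight. It is not fastembed's actual code. The stop word set is a tiny hypothetical sample, the function names are illustrative, the default values for `k`, `b`, and `avg_len` are assumptions, and stemming is skipped entirely.

```python
import re
from collections import Counter

# Hypothetical, tiny sample of Persian stop words; a real list would be much longer.
PERSIAN_STOPWORDS = {"و", "در", "به", "از", "که", "را", "این", "با"}

# Map Arabic yeh (U+064A) and kaf (U+0643) to their Persian forms (U+06CC, U+06A9),
# a common normalization step for Persian text.
_NORMALIZE = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# \w is Unicode-aware in Python 3, so it matches Persian letters.
# Caveat: it splits on the zero-width non-joiner and on diacritics.
_TOKEN_RE = re.compile(r"\w+")


def normalize(text):
    """Lowercase (a no-op for Persian script) and unify Arabic/Persian letter variants."""
    return text.lower().translate(_NORMALIZE)


def tokenize(text, stopwords):
    """Split text into word tokens and drop stop words."""
    stop = {normalize(w) for w in stopwords}
    return [t for t in _TOKEN_RE.findall(normalize(text)) if t not in stop]


def bm25_tf_weights(text, stopwords, k=1.2, b=0.75, avg_len=256.0):
    """Return {term: BM25 term-frequency weight} for a single document.

    The IDF part of BM25 is left out here; it would have to be applied
    separately, for example at query time.
    """
    tokens = tokenize(text, stopwords)
    doc_len = len(tokens)
    weights = {}
    for term, tf in Counter(tokens).items():
        weights[term] = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))
    return weights


if __name__ == "__main__":
    doc = "این کتاب در مورد تاریخ ایران و تاریخ جهان است"
    for term, weight in bm25_tf_weights(doc, PERSIAN_STOPWORDS).items():
        print(term, round(weight, 3))
```

If the limitation is only the stop word list, something like `PERSIAN_STOPWORDS` above is the data I could contribute. If stemming is also required, that would be a separate question.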
I am eager to contribute and would be happy to provide the necessary data (e.g., stop word lists) to enable better Persian support.
https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py
https://huggingface.co/Qdrant/bm25/tree/main
Thanks for your guidance!