Sparse Embeddings (BM25) and Persian Language Support

Hello,
I would like to ask about the limitations of sparse embeddings, specifically BM25, when it comes to language support, and in particular Persian (Farsi).

The currently supported languages appear to be the following, and Persian is not among them:
```python
supported_languages = [
    "arabic",
    "danish",
    "dutch",
    "english",
    "finnish",
    "french",
    "german",
    "greek",
    "hungarian",
    "italian",
    "norwegian",
    "portuguese",
    "romanian",
    "russian",
    "spanish",
    "swedish",
    "tamil",
    "turkish",
]
```

Are the limitations mainly related to stop word handling, tokenization, or other language-specific preprocessing steps?

If the limitation is primarily related to stop words or preprocessing, is there a recommended way to extend or customize BM25 for Persian? 
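
As a rough illustration of what such customization could involve, the sketch below normalizes Persian text, removes stop words, and computes BM25-style term-frequency weights in plain Python. It does not use fastembed's API. The function names, the tiny stop word list, the decision to drop the zero-width non-joiner, and the `avg_len` default are all assumptions made for illustration, and no stemming is attempted.

```python
import re
from collections import Counter

# Map Arabic code points to their Persian equivalents (assumed normalization).
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})
ZWNJ = "\u200c"  # zero-width non-joiner ("half space")

# Hypothetical, deliberately tiny stop word list; a real one would be far longer.
# It is passed through the same character map so lookups stay consistent.
PERSIAN_STOPWORDS = {
    w.translate(CHAR_MAP)
    for w in ("و", "در", "به", "از", "که", "این", "را", "با", "است")
}


def preprocess_persian(text: str) -> list[str]:
    """Normalize characters, tokenize on word characters, drop stop words."""
    # Assumption: removing ZWNJ keeps half-space compounds as a single token.
    text = text.translate(CHAR_MAP).replace(ZWNJ, "")
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in PERSIAN_STOPWORDS]


def bm25_term_weights(tokens, k=1.2, b=0.75, avg_len=256.0):
    """Term-frequency part of BM25 for one document (no IDF applied here)."""
    counts = Counter(tokens)
    norm = k * (1 - b + b * len(tokens) / avg_len)
    return {term: tf * (k + 1) / (tf + norm) for term, tf in counts.items()}


tokens = preprocess_persian("این کتاب در \u0643تابخانه است و کتاب خوب است")
weights = bm25_term_weights(tokens)
```

With this kind of split, the stop word list and character normalization are the only language-specific pieces, so a Persian stop word list could be supplied as data without changing the weighting itself.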

I am eager to contribute and would be happy to provide the necessary data (e.g., stop word lists, etc.) to enable better Persian support.

https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py
https://huggingface.co/Qdrant/bm25/tree/main

Thanks for your guidance!

Sparse Embeddings (BM25) and Persian Language Support #558
