
[Bug]: BM25 always downloading from HuggingFace if local_files_only=False #560

@sglebs

Description


What happened?

fastembed/sparse/sparse_text_embedding.py has this:

        for EMBEDDING_MODEL_TYPE in self.EMBEDDINGS_REGISTRY:
            supported_models = EMBEDDING_MODEL_TYPE._list_supported_models()
            if any(model_name.lower() == model.model.lower() for model in supported_models):
                self.model = EMBEDDING_MODEL_TYPE(
                    model_name,
                    cache_dir,
                    threads=threads,
                    providers=providers,
                    cuda=cuda,
                    device_ids=device_ids,
                    lazy_load=lazy_load,
                    **kwargs,
                )
                return

which eventually gets here in BM25's constructor:

        self._model_dir = self.download_model(
            model_description,
            self.cache_dir,
            local_files_only=self._local_files_only,
            specific_model_path=self._specific_model_path,
        )

which eventually gets here:

                    return Path(
                        cls.download_files_from_huggingface(
                            hf_source,
                            cache_dir=cache_dir,
                            extra_patterns=extra_patterns,
                            **kwargs,
                        )
                    )

I am passing cache_dir=/tmp and local_files_only=False. The cached files already exist from a previous run. But unfortunately the code is like this:


        if local_files_only:
            disable_progress_bars()
            if metadata_file.exists():
                metadata = json.loads(metadata_file.read_text())
                verified = _verify_files_from_metadata(snapshot_dir, metadata, repo_files=[])
                if not verified:
                    logger.warning(
                        "Local file sizes do not match the metadata."
                    )  # do not raise, still make an attempt to load the model
            else:
                logger.warning(
                    "Metadata file not found. Proceeding without checking local files."
                )  # if users have downloaded models from hf manually, or they're updating from previous versions of
                # fastembed
            result = snapshot_download(
                repo_id=hf_source_repo,
                allow_patterns=allow_patterns,
                cache_dir=cache_dir,
                local_files_only=local_files_only,
                **kwargs,
            )
            return result

        repo_revision = model_info(hf_source_repo).sha
        repo_tree = list(list_repo_tree(hf_source_repo, revision=repo_revision, repo_type="model"))

It ends up doing the full dance of downloading files.

I don't want local_files_only; I am expecting "local files first" behaviour. After all, that is what a cache means: use the files if they are in the cache.

"local files only" should only determine whether you fail right away on a cache miss, or keep going and download from HF.
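The expected fallback order could be sketched like this. This is a minimal illustration of the policy, not fastembed's actual code; `resolve_model_dir`, `fetch_local`, and `fetch_remote` are hypothetical stand-ins for the two `snapshot_download` paths:

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_model_dir(
    fetch_local: Callable[[], Optional[Path]],
    fetch_remote: Callable[[], Path],
    local_files_only: bool,
) -> Path:
    """Local-first resolution: prefer the cache, download only on a miss.

    local_files_only only decides whether a cache miss is a hard error
    or a trigger for a download -- it never forces a network round trip
    when the cache already holds the files.
    """
    cached = fetch_local()  # None when the snapshot is absent or incomplete
    if cached is not None:
        return cached
    if local_files_only:
        raise FileNotFoundError("model not cached and downloads are disabled")
    return fetch_remote()
```

With this shape, a warm cache never touches the network regardless of the flag, which is the behaviour the Dockerfile warm-up below is counting on.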

What is the expected behaviour?

Load from local cached files if present.

A minimal reproducible example

I have this in my Dockerfile, so I was expecting it to load the files from the local cache:

RUN echo "from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding \n" > ${HF_HOME}/docker.py && \
    echo "embeddings = SparseTextEmbedding('Qdrant/bm25') \n" >> ${HF_HOME}/docker.py && \
    echo "embedded_junk_vector = embeddings.embed('what is the arity of this vector?') \n" >> ${HF_HOME}/docker.py && \
    echo "print('Model cached successfully at:', embeddings.model_cache_dir) \n" >> ${HF_HOME}/docker.py && \
    python3 ${HF_HOME}/docker.py

So, in my code later on I was expecting it to load from local files on a subsequent run.
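Until the loader itself goes local-first, a caller-side workaround is to attempt a cache-only load and fall back to the downloading path. This is a sketch, not tested against fastembed's API: it assumes the constructor forwards local_files_only (as the snippets above suggest) and that a cache miss surfaces as the ValueError shown in the stack trace at the end of this report. The model factory is injected here so the pattern is visible on its own:

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def load_local_first(factory: Callable[..., T], model_name: str, **kwargs: Any) -> T:
    """Try the local cache first; only touch the network on a cache miss."""
    try:
        return factory(model_name, local_files_only=True, **kwargs)
    except ValueError:
        # Cache miss (or incomplete cache): allow the download path.
        return factory(model_name, **kwargs)


# With fastembed this would be called as:
#   embeddings = load_local_first(SparseTextEmbedding, "Qdrant/bm25")
```

Note the downside: if the cache is cold and HF is unreachable, this still fails, but a warm cache no longer depends on connectivity.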

What Python version are you on? e.g. python --version

Python 3.12.11

FastEmbed version

fastembed==0.7.3

What os are you seeing the problem on?

macOS

Relevant stack traces and/or logs

If there are any connectivity issues with HF at runtime and I use BM25, I get this:


  File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/sparse_text_embedding.py", line 77, in __init__
    self.model = EMBEDDING_MODEL_TYPE(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/bm25.py", line 119, in __init__
    self._model_dir = self.download_model(
                      ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastembed/common/model_management.py", line 458, in download_model
    raise ValueError(f"Could not load model {model.model} from any source.")
ValueError: Could not load model Qdrant/bm25 from any source.
