[FEAT] Add Offline Entity Verification via Local Wikipedia Dataset #3

@liujuanjuan1984

Is your feature request related to a problem? Please describe.

Entity verification in entity_service.py currently relies exclusively on online Wikipedia API calls, managed by wiki_extractor.py. When processing a large number of new entities, the service makes a separate network request, one after another, for each entity not found in our local database. Because every lookup pays network latency and is subject to API rate limits, this serial, online-only approach is a significant bottleneck for the entire data ingestion and processing pipeline.

Describe the solution you'd like

I propose introducing a new, optional offline verification mode that leverages a local Wikipedia dataset (e.g., a database or an index file populated from a Wikipedia dump). This feature would not replace the existing online mode but would act as a high-speed, comprehensive first-pass check before resorting to network calls.

The enhanced workflow in AsyncEntityService.batch_get_or_create_entities would be:

  1. Check Local App DB (Current): First, check the application's own database for the entity.
  2. Check Offline Wikipedia Dataset (New): If not found there, query the local Wikipedia dataset for the entity's existence, wikibase_item, disambiguation status, and related page metadata. Because this is a local query, it avoids network latency and API rate limits entirely.
  3. Fallback to Online API (Current): Only if the entity is not found in the offline dataset, the service should then fall back to the existing online API call (get_wiki_page_info).

This hybrid approach should substantially accelerate verification for most entities, reserving the slower online calls for entities that are too new or too obscure to appear in the local dataset snapshot.
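The three-step workflow above can be sketched roughly as follows. This is only an illustration under stated assumptions: `WikiInfo`, `APP_DB`, `OFFLINE_DATASET`, `fake_online_lookup`, `verify_entity`, and `batch_verify` are hypothetical names, the two dictionaries stand in for the application database and the local Wikipedia dataset, and the online step is a stub in place of `get_wiki_page_info` that makes no network call.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WikiInfo:
    title: str
    wikibase_item: Optional[str]  # placeholder IDs below, not real Wikidata items
    is_disambiguation: bool
    source: str  # which tier answered: "app_db", "offline", or "online"


# Hypothetical in-memory stand-ins for the first two tiers.
APP_DB: Dict[str, WikiInfo] = {
    "Known Entity": WikiInfo("Known Entity", "Q-PLACEHOLDER-1", False, "app_db"),
}
OFFLINE_DATASET: Dict[str, WikiInfo] = {
    "Mercury": WikiInfo("Mercury", None, True, "offline"),
}
ONLINE_CALLS: List[str] = []  # records which titles reached the slow path


async def fake_online_lookup(title: str) -> Optional[WikiInfo]:
    """Stub for the existing online call; performs no network I/O."""
    ONLINE_CALLS.append(title)
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    if title == "Brand New Thing":
        return WikiInfo(title, "Q-PLACEHOLDER-2", False, "online")
    return None


async def verify_entity(title: str) -> Optional[WikiInfo]:
    # 1. Application database (current behaviour).
    if title in APP_DB:
        return APP_DB[title]
    # 2. Offline Wikipedia dataset (proposed new step).
    if title in OFFLINE_DATASET:
        return OFFLINE_DATASET[title]
    # 3. Online API, only when both local tiers miss.
    return await fake_online_lookup(title)


async def batch_verify(titles: List[str]) -> Dict[str, Optional[WikiInfo]]:
    results: Dict[str, Optional[WikiInfo]] = {}
    for title in titles:
        results[title] = await verify_entity(title)
    return results
```

Running `asyncio.run(batch_verify([...]))` over a mixed list of titles and then inspecting `ONLINE_CALLS` shows which entities still needed the online fallback; in this sketch only titles missing from both local tiers reach it.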

Describe alternatives you've considered

One alternative is to parallelize the online API requests. However, this could easily hit Wikipedia's API rate limits and get our IP address blocked, and every request would still be exposed to network instability. A local-first approach avoids both problems for any entity covered by the dataset: it is faster, more robust, and independent of external factors.
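For comparison, a parallel version of the online path would probably need an explicit concurrency cap to stay within rate limits. The sketch below shows one way to do that with `asyncio.Semaphore`; `fetch_all` and `FakeApi` are hypothetical names, the limit of 3 is an arbitrary example value rather than a documented Wikipedia limit, and the fetch function is a stub that makes no network call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    titles: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 3,
) -> Dict[str, dict]:
    """Run fetch() for every title with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(title: str):
        async with semaphore:  # wait here while max_concurrency calls are running
            return title, await fetch(title)

    pairs = await asyncio.gather(*(fetch_one(t) for t in titles))
    return dict(pairs)


class FakeApi:
    """Stub API that records the peak number of simultaneous requests."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def fetch(self, title: str) -> dict:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return {"title": title, "exists": True}
```

Even with a cap like this, each request still pays network latency and can fail, which is the main reason the local-first design is preferred in this proposal.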

Additional context

  • Implementation will require a new data access layer or service to interface with the local Wikipedia dataset.
  • Configuration options will be needed to enable this mode and specify the path or connection details for the dataset.
  • This enhancement primarily impacts app/services/entity_service.py by introducing the new verification step. It will also require developing a local counterpart to the API-calling functions in app/services/wiki_extractor.py.
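As a rough illustration of the data access layer and configuration described above, the sketch below uses SQLite from the Python standard library. Everything here is an assumption made for illustration: the class names `OfflineWikiConfig`, `OfflinePage`, and `OfflineWikiStore`, the `pages` table and its columns, and the choice of SQLite itself. The actual dataset format is still to be decided.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OfflineWikiConfig:
    enabled: bool = False       # offline mode is opt-in
    db_path: str = ":memory:"   # path to the local dataset (in-memory for the demo)


@dataclass(frozen=True)
class OfflinePage:
    title: str
    wikibase_item: Optional[str]
    is_disambiguation: bool


class OfflineWikiStore:
    """Hypothetical local counterpart to the online page-info lookup."""

    def __init__(self, config: OfflineWikiConfig) -> None:
        self.config = config
        self.conn = sqlite3.connect(config.db_path) if config.enabled else None

    def create_schema(self) -> None:
        if self.conn is None:
            return
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "title TEXT PRIMARY KEY, "
            "wikibase_item TEXT, "
            "is_disambiguation INTEGER NOT NULL DEFAULT 0)"
        )

    def add_page(self, page: OfflinePage) -> None:
        if self.conn is None:
            raise RuntimeError("offline verification is disabled")
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
            (page.title, page.wikibase_item, int(page.is_disambiguation)),
        )
        self.conn.commit()

    def get_page_info(self, title: str) -> Optional[OfflinePage]:
        """Return the page record, or None on a miss or when the mode is disabled."""
        if self.conn is None:
            return None
        row = self.conn.execute(
            "SELECT title, wikibase_item, is_disambiguation FROM pages WHERE title = ?",
            (title,),
        ).fetchone()
        if row is None:
            return None
        return OfflinePage(row[0], row[1], bool(row[2]))
```

Returning `None` both for a miss and for a disabled store lets the caller fall through to the online API in either case. Since `sqlite3` calls are blocking, an async service would likely need to run them off the event loop (for example with `asyncio.to_thread`).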

Labels

enhancement (New feature or request)
