Description
Is your feature request related to a problem? Please describe.
The current entity verification process in entity_service.py relies exclusively on online API calls to Wikipedia, managed by wiki_extractor.py. When processing a large number of new entities, the service makes sequential network requests for each entity not found in our local database. This serial, online-only approach is a significant performance bottleneck due to network latency and API rate limits, which slows down the entire data ingestion and processing pipeline.
Describe the solution you'd like
I propose introducing a new, optional offline verification mode that leverages a local Wikipedia dataset (e.g., a database or an index file populated from a Wikipedia dump). This feature would not replace the existing online mode but would act as a high-speed, comprehensive first-pass check before resorting to network calls.
The enhanced workflow in AsyncEntityService.batch_get_or_create_entities would be:
- Check Local App DB (Current): First, check the application's own database for the entity.
- Check Offline Wikipedia Dataset (New): If not found, query the local Wikipedia dataset for the entity's existence, wikibase_item, disambiguation status, etc. This would be extremely fast as it's a local query.
- Fallback to Online API (Current): Only if the entity is not found in the offline dataset, the service should then fall back to the existing online API call (get_wiki_page_info).
This hybrid approach would dramatically accelerate verification for the vast majority of entities, reserving slower online calls only for very new or obscure entities not present in the local dataset snapshot.
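To make the ordering concrete, here is a minimal sketch of the three-step workflow above. The name `verify_entity` and the `Lookup` callable type are hypothetical, introduced only for illustration; the real signatures of `batch_get_or_create_entities` and `get_wiki_page_info` are not shown in this issue, so each tier is passed in as a plain async function that returns `None` on a miss.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical lookup signature: returns page info, or None on a miss.
Lookup = Callable[[str], Awaitable[Optional[dict]]]


async def verify_entity(
    name: str,
    app_db_lookup: Lookup,
    offline_lookup: Optional[Lookup],
    online_lookup: Lookup,
) -> tuple[Optional[dict], str]:
    """Return (info, source) from the first tier that knows the entity."""
    tiers = [("app_db", app_db_lookup)]
    if offline_lookup is not None:
        # The offline tier is optional; skipping it keeps today's behavior.
        tiers.append(("offline", offline_lookup))
    tiers.append(("online", online_lookup))

    for source, lookup in tiers:
        info = await lookup(name)
        if info is not None:
            return info, source
    return None, "not_found"
```

Passing `offline_lookup=None` reproduces the current two-step flow, which is one way the new mode could stay opt-in.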
Describe alternatives you've considered
One alternative is to parallelize the online API requests. However, this could easily lead to hitting Wikipedia's API rate limits and getting our IP address blocked. It is also less reliable and still subject to network instability. A local-first approach is faster, more robust, and independent of external factors.
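For comparison, the parallel alternative might look roughly like the sketch below, which caps the number of in-flight requests with `asyncio.Semaphore`. The function `fetch_all_bounded` and its `fetch` parameter are hypothetical; `fetch` stands in for whatever async call wraps the online API. A cap like this reduces the risk of hitting rate limits, but every lookup still pays network latency.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional


async def fetch_all_bounded(
    names: Iterable[str],
    fetch: Callable[[str], Awaitable[Optional[dict]]],
    max_concurrency: int = 4,
) -> dict:
    """Fetch every name concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(name: str):
        async with semaphore:  # wait here if the cap is already reached
            return name, await fetch(name)

    pairs = await asyncio.gather(*(fetch_one(n) for n in names))
    return dict(pairs)
```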
Additional context
- Implementation will require a new data access layer or service to interface with the local Wikipedia dataset.
- Configuration options will be needed to enable this mode and specify the path or connection details for the dataset.
- This enhancement primarily impacts app/services/entity_service.py by introducing the new verification step. It will also require developing a local counterpart to the API-calling functions in app/services/wiki_extractor.py.
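As a starting point for discussion, the data access layer and its configuration could be sketched with SQLite from the standard library. Everything here is an assumption rather than an agreed design: the `OfflineWikiConfig` and `OfflineWikiStore` names, the `pages` table, and its columns are hypothetical, and a real dataset built from a Wikipedia dump may need a different schema or storage engine.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class OfflineWikiConfig:
    """Hypothetical settings for the optional offline mode."""
    enabled: bool = False
    dataset_path: str = ":memory:"  # path to the local SQLite dataset


class OfflineWikiStore:
    """Hypothetical local counterpart to the online page-info lookup."""

    def __init__(self, config: OfflineWikiConfig) -> None:
        self._conn = sqlite3.connect(config.dataset_path)
        self._conn.row_factory = sqlite3.Row
        # Assumed schema; the primary key on title gives indexed lookups.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "title TEXT PRIMARY KEY, "
            "wikibase_item TEXT, "
            "is_disambiguation INTEGER NOT NULL DEFAULT 0)"
        )

    def load_pages(self, rows: Iterable[tuple]) -> None:
        """Insert (title, wikibase_item, is_disambiguation) rows."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.executemany(
                "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", rows
            )

    def get_page_info(self, title: str) -> Optional[dict]:
        """Return page info for an exact title match, or None on a miss."""
        row = self._conn.execute(
            "SELECT title, wikibase_item, is_disambiguation "
            "FROM pages WHERE title = ?",
            (title,),
        ).fetchone()
        if row is None:
            return None
        return {
            "title": row["title"],
            "wikibase_item": row["wikibase_item"],
            "is_disambiguation": bool(row["is_disambiguation"]),
        }
```

Returning `None` on a miss lets the caller treat it as the signal to fall back to the online API. Exact-title matching is a simplification; redirects and title normalization would need separate handling.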