-
Notifications
You must be signed in to change notification settings - Fork 21
Feat: change ingester data insertion policy #1638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat: change ingester data insertion policy #1638
Conversation
2e861d1 to
f761ab5
Compare
f761ab5 to
8f5cafe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR changes the ingester's data insertion policy from "never update" to "only update null fields", controlled by a new PRIO_DB environment variable. The implementation replaces Django's ORM bulk_create with raw SQL queries using PostgreSQL's ON CONFLICT clause with COALESCE operations to selectively update fields based on the priority policy.
Key changes:
- Added dynamic SQL query generation function
_generate_model_insert_querythat creates upsert queries with field-level update control - Replaced
bulk_createwith raw SQL execution usingcursor.executemany()inconsume_buffer - Introduced
PRIO_DBenvironment variable to toggle between database-priority and data-priority update modes
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| backend/kernelCI_app/constants/ingester.py | Adds PRIO_DB constant with documentation (contains typo in env var name) |
| backend/kernelCI_app/models.py | Adds module docstring noting explicit id column requirement for ingester |
| backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py | Implements dynamic SQL query generation and replaces ORM with raw SQL for data insertion |
| backend/kernelCI_app/tests/unitTests/commands/monitorSubmissions/kcidbng_ingester_test.py | Updates test to mock database connections instead of Django ORM methods |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
8f5cafe to
8c15328
Compare
backend/kernelCI_app/tests/unitTests/commands/monitorSubmissions/kcidbng_ingester_test.py
Outdated
Show resolved
Hide resolved
cf256e7 to
cf8caf0
Compare
Now allows for overwrites of null data Closes kernelci#1552
cf8caf0 to
27a4d35
Compare
|
kind of related: https://forum.djangoproject.com/t/speeding-up-postgres-bulk-create-by-using-unnest/36508 I wonder how much time we are taking on the |
| ON CONFLICT (id) | ||
| DO UPDATE SET | ||
| _timestamp = GREATEST(checkouts._timestamp, EXCLUDED._timestamp), | ||
| id = COALESCE(checkouts.id, EXCLUDED.id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as id is the conflict target we could remove it from here (and the other tables too) if it is easy, checkouts.id and EXCLUDED.id will be the same
gustavobtflores
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a comment about the id's in the insert_queries.py, pre-approved
Description
Changes the insertion policy of the ingester from "never update" to "only update null fields". This follows the policy adopted by kcidb, it always prioritizes the incoming data in the following logic (considering existing / incoming):
Changes
consume_bufferinstead of Django'sbulk_createHow to test
Start the ingester and insert some data to it referencing the same object, check if the policy is respected.
Example realistic data for testing can be found here, but other tests are encouraged.
Future changes
Closes #1552