Skip to content

Conversation

@MarceloRobert
Copy link
Collaborator

@MarceloRobert MarceloRobert commented Nov 27, 2025

Description

Changes the insertion policy of the ingester from "never update" to "only update null fields". This follows the policy adopted by kcidb, it always prioritizes the incoming data in the following logic (considering existing / incoming):

  • null + new data = new data
  • existing data + new data = existing data
  • existing data + null = existing data

Changes

  • Added a new command to generate the SQL insertion query for any model
    • This command generates a python file with a dictionary relating a table with its insertion query and which fields should be updated
  • Used SQL queries on consume_buffer instead of Django's bulk_create
  • Updated unit tests

How to test

Start the ingester and insert some data to it referencing the same object, check if the policy is respected.
Example realistic data for testing can be found here, but other tests are encouraged.

Future changes

  1. Adding automated integration tests so that we can guarantee that this policy is kept regardless of changes in the ingester is also useful.

Closes #1552

@MarceloRobert MarceloRobert self-assigned this Nov 27, 2025
@MarceloRobert MarceloRobert added the Database Issue that alters only configs of a database itself label Nov 27, 2025
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch 2 times, most recently from 2e861d1 to f761ab5 Compare November 27, 2025 18:00
@MarceloRobert MarceloRobert changed the title WIP: Feat/ingester django tests Feat: change ingester data insertion policy Nov 27, 2025
@MarceloRobert MarceloRobert marked this pull request as ready for review November 27, 2025 18:11
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch from f761ab5 to 8f5cafe Compare November 27, 2025 18:19
Copilot finished reviewing on behalf of MarceloRobert November 27, 2025 18:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the ingester's data insertion policy from "never update" to "only update null fields", controlled by a new PRIO_DB environment variable. The implementation replaces Django's ORM bulk_create with raw SQL queries using PostgreSQL's ON CONFLICT clause with COALESCE operations to selectively update fields based on the priority policy.

Key changes:

  • Added dynamic SQL query generation function _generate_model_insert_query that creates upsert queries with field-level update control
  • Replaced bulk_create with raw SQL execution using cursor.executemany() in consume_buffer
  • Introduced PRIO_DB environment variable to toggle between database-priority and data-priority update modes

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File Description
backend/kernelCI_app/constants/ingester.py Adds PRIO_DB constant with documentation (contains typo in env var name)
backend/kernelCI_app/models.py Adds module docstring noting explicit id column requirement for ingester
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py Implements dynamic SQL query generation and replaces ORM with raw SQL for data insertion
backend/kernelCI_app/tests/unitTests/commands/monitorSubmissions/kcidbng_ingester_test.py Updates test to mock database connections instead of Django ORM methods

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch from 8f5cafe to 8c15328 Compare November 27, 2025 19:15
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch 3 times, most recently from cf256e7 to cf8caf0 Compare November 28, 2025 21:28
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch from cf8caf0 to 27a4d35 Compare November 28, 2025 21:32
@MarceloRobert MarceloRobert added the Ingester The issue relates to the ingester tool, including the command itself and related functions. label Nov 28, 2025
@tales-aparecida
Copy link

tales-aparecida commented Nov 29, 2025

kind of related: https://forum.djangoproject.com/t/speeding-up-postgres-bulk-create-by-using-unnest/36508

I wonder how much time we are taking on the make_*_instance methods

ON CONFLICT (id)
DO UPDATE SET
_timestamp = GREATEST(checkouts._timestamp, EXCLUDED._timestamp),
id = COALESCE(checkouts.id, EXCLUDED.id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as id is the conflict target we could remove it from here (and the other tables too) if it is easy, checkouts.id and EXCLUDED.id will be the same

Copy link
Contributor

@gustavobtflores gustavobtflores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a comment about the id's in the insert_queries.py, pre-approved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Database Issue that alters only configs of a database itself Ingester The issue relates to the ingester tool, including the command itself and related functions.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Verify if django-ingester is respecting kcidb limitations on fields update

3 participants