mdashti commented Dec 10, 2025

Ticket(s) Closed

What

This PR adds validation to aggregation queries to ensure that field names exist in the schema and are configured as fast fields. Previously, aggregations would silently return empty results when given invalid field names, making it difficult to debug typos and configuration errors.

Why

Users were experiencing confusing behavior where aggregations with typos in field names would succeed but return empty results ({"buckets": []}), with no indication that the field didn't exist. This made it impossible to distinguish between:

  • A typo in the field name (configuration error)
  • A valid field with no data
  • A query that matched no documents

This behavior was inconsistent with other Tantivy query types (like ExistsQuery) and user expectations from SQL databases and Elasticsearch, where invalid column/field references return clear errors.

How

Modified the aggregation field accessor functions in accessor_helpers.rs:

  1. get_ff_reader(): Now validates field existence before returning a column reader. Returns:

    • FieldNotFound error if the field doesn't exist in the schema
    • SchemaError if the field exists but isn't configured as a fast field
    • Empty column only if the field is valid but has no values in the segment
  2. get_all_ff_reader_or_empty(): Added the same validation logic for terms aggregations that handle multiple column types

The validation checks the schema to distinguish between non-existent fields and valid fields that happen to be empty in a particular segment, ensuring we only error on actual configuration problems.
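The three outcomes described above can be sketched roughly as follows. This is a minimal illustration, not the code in accessor_helpers.rs: the `FieldEntry` struct, the `SegmentColumns` map, the error payloads, and the function name `get_ff_reader_sketch` are all hypothetical stand-ins, and a plain `Vec<u64>` plays the role of a column reader.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema entry; not Tantivy's own type.
#[derive(Debug, Clone)]
struct FieldEntry {
    is_fast: bool,
}

/// Variant names follow the PR description; the String payloads are an assumption.
#[derive(Debug, PartialEq)]
enum AggError {
    FieldNotFound(String),
    SchemaError(String),
}

/// Toy per-segment column store: field name -> values present in this segment.
type SegmentColumns = HashMap<String, Vec<u64>>;

/// Validate against the schema first, then fall back to an empty column
/// only when the field is valid but has no values in this segment.
fn get_ff_reader_sketch(
    schema: &HashMap<String, FieldEntry>,
    segment: &SegmentColumns,
    field_name: &str,
) -> Result<Vec<u64>, AggError> {
    // Case 1: the field is not in the schema at all (likely a typo).
    let entry = schema
        .get(field_name)
        .ok_or_else(|| AggError::FieldNotFound(field_name.to_string()))?;

    // Case 2: the field exists but is not configured as a fast field.
    if !entry.is_fast {
        return Err(AggError::SchemaError(format!(
            "field `{}` is not configured as a fast field",
            field_name
        )));
    }

    // Case 3: valid fast field; an absent column in this segment means "no values".
    Ok(segment.get(field_name).cloned().unwrap_or_default())
}
```

With a schema containing a fast field `price` and a non-fast field `title`, a lookup of `prcie` would yield `FieldNotFound`, `title` would yield `SchemaError`, and `price` on a segment without that column would yield an empty result rather than an error.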

Tests

  • Added test test_aggregation_invalid_field_returns_error covering all major aggregation types (date_histogram, histogram, terms, avg, range)
  • Fixed existing tests that were inadvertently using invalid field names

Breaking Change: Code using invalid field names in aggregations will now receive errors instead of empty results. This is intentional to catch configuration errors early.

PSeitz commented Dec 10, 2025

The main difference is that we don't always have a fixed schema with JSON fields, which means a field could exist on one segment, but not another one.

Even if a requested field is not part of a JSON field, it would break use cases where you federate a query over different indices.

Adding some validation on top should be easy with the get_fast_field_names method.
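One way to read this suggestion is as a check performed above the per-segment accessors: collect the field names an aggregation request refers to, then compare them against the fast fields known across all segments, so that a field present in only one segment (for example a JSON path) is still accepted. The sketch below assumes exactly that and nothing more. The `requested` set stands in for whatever get_fast_field_names returns (its exact signature is not shown in this thread), and both `unknown_agg_fields` and the per-segment name sets are hypothetical.

```rust
use std::collections::{BTreeSet, HashSet};

/// Return the requested field names that no segment knows as a fast field.
/// A field counts as known if at least one segment has it, which tolerates
/// JSON paths that appear on some segments but not others.
fn unknown_agg_fields(
    requested: &HashSet<String>,
    per_segment_fast_fields: &[HashSet<String>],
) -> BTreeSet<String> {
    // Union of fast-field names over every segment.
    let known: HashSet<&str> = per_segment_fast_fields
        .iter()
        .flat_map(|segment| segment.iter().map(String::as_str))
        .collect();

    // Anything requested but absent from the union is a likely typo.
    // BTreeSet keeps the result in a stable, sorted order for error messages.
    requested
        .iter()
        .filter(|name| !known.contains(name.as_str()))
        .cloned()
        .collect()
}
```

A caller could reject the request when the returned set is non-empty, while the segment-level accessors keep returning empty columns. A federated query over several indices would need its own policy for what counts as known, which this sketch does not address.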

Development

Successfully merging this pull request may close these issues.

aggregations silently return empty results for invalid fields
