Commit c696042

Add search_sources and search_documents methods to VectorIndex
1 parent d75d585 commit c696042

11 files changed

+1412
-719
lines changed

docs/modules/index/index.md

Lines changed: 67 additions & 1 deletion
@@ -64,4 +64,70 @@ The Index Module provides a vector indexing system for Django applications. It e
 ```
 
 5. Build your indexes with `manage.py rebuild_indexes`
-6. Query your index with `MyIndex().search("query")`
+6. Query your index with `MyIndex().search_sources("query")`
+
+## Querying Indexes
+
+When indexes are built, source objects are often chunked into many separate Documents before they are embedded and inserted into the index.
+
+This means that a query to the underlying index can return multiple Documents from the same source object. For example, if you have a `Book` Django model with a long summary to be embedded, searching the index might return many Documents from the same Book.
+
+This can be fine in some cases: in RAG applications, the most relevant chunks are usually what you want, even if they all come from the same source.
+
+In other cases, such as finding similar content, this behaviour can be a hindrance.
+
+To solve this, Vector Indexes provide two query methods depending on your needs:
+
+### Document Search
+
+```python
+
+MyVectorIndex().search_documents("Similar to this")
+```
+
+`search_documents` returns a queryset-like interface over Document objects. If the underlying vector provider returns multiple Documents from the same source object, these will all be returned.
+
+This is useful for RAG-like applications where the most relevant chunks are important.
+
+### Source Search
+
+```python
+
+MyVectorIndex().search_sources("Similar to this")
+```
+
+When using the `search_sources` method, a Vector Index will attempt to map results from the index back to the original source objects. For example, in the `Book` case above, when using `ModelSource(model=Book)`, this method will return a queryset-like interface over `Book` models.
+
+As the underlying storage provider is likely to return multiple Documents for the same source object, this method overfetches Documents to attempt to ensure enough source objects are returned for your query.
+
+This overfetching behaviour can be customised:
+
+```python
+
+MyVectorIndex().search_sources(
+    "Similar to this",
+    overfetch_multiplier=4,
+    max_overfetch_iterations=3
+)
+```
+
+Where:
+
+- `overfetch_multiplier` defines how many multiples of the requested limit will be retrieved from the index. For example, if you request 5 results and provide an `overfetch_multiplier` of 4, 20 Documents will be retrieved from the index internally. The top 5 unique sources from these will then be returned.
+- `max_overfetch_iterations` defines the maximum number of times the underlying search will be repeated to collect enough unique source objects. For example, if the initial search doesn't return enough unique objects, it will be repeated with an increasing number of items, up to `max_overfetch_iterations` times.
+
+### Converting Between Result Types
+
+You can convert between result types on an existing queryset:
+
+```python
+# Start with document search, convert to sources
+docs = MyVectorIndex().search_documents("query")
+sources = docs.as_sources()
+
+# Start with source search, convert to documents
+sources = MyVectorIndex().search_sources("query")
+docs = sources.as_documents()
+```
+
+This can be useful in RAG applications where you want to use `Documents` for building context, but then present source objects to users as the 'Sources referenced'.
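
The overfetching described for `search_sources` in the documentation above can be sketched in isolation. This is a minimal illustration and not the library's implementation: the `fetch` callable, the `unique_sources` function name, and the linear growth of the fetch size on each iteration are all assumptions made for the example. The documentation only states that the search is repeated "with an increasing number of items". The default values are taken from the documentation's example call, not from the library's actual defaults.

```python
from typing import Callable, Hashable, Sequence


def unique_sources(
    fetch: Callable[[int], Sequence[tuple[Hashable, float]]],
    limit: int,
    overfetch_multiplier: int = 4,       # value from the docs example, not a confirmed default
    max_overfetch_iterations: int = 3,   # value from the docs example, not a confirmed default
) -> list[Hashable]:
    """Return up to `limit` unique source keys, best match first.

    `fetch(n)` is a hypothetical stand-in for the vector index: it returns
    the top `n` (source_key, score) pairs ordered by relevance.
    """
    unique: list[Hashable] = []
    for iteration in range(1, max_overfetch_iterations + 1):
        # Assumed growth policy: fetch one more multiple of the base size each pass.
        fetch_size = limit * overfetch_multiplier * iteration
        results = fetch(fetch_size)
        # dict.fromkeys de-duplicates while preserving relevance order.
        unique = list(dict.fromkeys(key for key, _score in results))
        # Stop once we have enough sources, or the index has nothing more to give.
        if len(unique) >= limit or len(results) < fetch_size:
            break
    return unique[:limit]


# Hypothetical index contents: several chunks belong to the same book.
chunks = [
    ("book-1", 0.9), ("book-1", 0.8), ("book-1", 0.7),
    ("book-2", 0.6), ("book-1", 0.5), ("book-3", 0.4),
]


def fake_fetch(n: int) -> list[tuple[str, float]]:
    return chunks[:n]


print(unique_sources(fake_fetch, limit=2, overfetch_multiplier=1))
```

With `overfetch_multiplier=1`, the first pass fetches only two chunks, both from `book-1`, so a second, larger pass is needed before two distinct sources are found. A higher multiplier trades a larger initial fetch for fewer repeated searches.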

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "django-ai-core"
-version = "0.1.0"
+version = "0.1.1"
 description = "Django AI Core provides developer-focused AI features for implementing AI tooling in to Django sites."
 readme = "README.md"
 license = "BSD-3-Clause"

src/django_ai_core/contrib/index/base.py

Lines changed: 27 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -7,10 +7,10 @@
77

88
if TYPE_CHECKING:
99
from .embedding import EmbeddingTransformer
10-
from .query import QueryHandler, ResultQuerySetMixin
10+
from .query import QueryHandler
1111
from .schema import Document
1212
from .source import Source
13-
from .storage.base import StorageProvider
13+
from .storage.base import BaseStorageQuerySet, StorageProvider
1414

1515

1616
logger = logging.getLogger(__name__)
@@ -85,16 +85,37 @@ def update(self, documents: Iterable["Document"]):
8585
if isinstance(source, HasPostIndexUpdateHook):
8686
source.post_index_update(self)
8787

88-
def search(self, query: str) -> "ResultQuerySetMixin":
89-
"""Search the index and return a queryish object of results.
88+
def search_sources(
89+
self,
90+
query: str,
91+
*,
92+
overfetch_multiplier: int | None = None,
93+
max_overfetch_iterations: int | None = None,
94+
) -> "BaseStorageQuerySet":
95+
"""Search the index and return a queryish object of results
96+
mapped back to original source objects.
9097
Args:
9198
query: The search query string
9299
Returns:
93100
ResultQuerySet instance for the search results
94101
"""
95-
return self.query_handler.search(query)
102+
return self.query_handler.search_sources(
103+
query,
104+
overfetch_multiplier=overfetch_multiplier,
105+
max_overfetch_iterations=max_overfetch_iterations,
106+
)
107+
108+
def search_documents(self, query: str) -> "BaseStorageQuerySet":
109+
"""Search the index and return a queryish object of Documents
110+
as stored in the underlying index.
111+
Args:
112+
query: The search query string
113+
Returns:
114+
ResultQuerySet instance for the search results
115+
"""
116+
return self.query_handler.search_documents(query)
96117

97-
def find_similar(self, obj: object) -> "ResultQuerySetMixin":
118+
def find_similar(self, obj: object) -> "BaseStorageQuerySet":
98119
"""Find objects similar to the given object.
99120
Args:
100121
obj: The object to find similar objects to
