RFC: On-Demand Collection Loading via loadSubset #676

KyleAMathews · 2025-10-14T18:20:05Z

KyleAMathews
Oct 14, 2025
Maintainer

Summary

Now released as part of DB 0.5 — https://x.com/tan_stack/status/1988638050662895789 & https://tanstack.com/blog/tanstack-db-0.5-query-driven-sync

TanStack DB collections currently perform full dataset synchronization before becoming ready, limiting their ability to scale to large datasets. This RFC introduces on-demand subset loading through an optional loadSubset function that collections can provide, enabling efficient pagination, filtered loading, and progressive sync patterns. Collections without loadSubset remain in eager mode with full sync behavior, while collections implementing loadSubset can load specific data subsets based on live query predicates, allowing immediate rendering with synchronous data access when subsets are already loaded.

Background

TanStack DB collections materialize subsets of database data in memory on the client. Collections manage data synchronization through a sync function provided in CollectionConfig, which is called once when the collection is created.

The current sync function signature accepts parameters for managing collection state:

sync: (params: {
  collection: Collection<T, TKey, any, any, any>
  begin: () => void
  write: (message: Omit<ChangeMessage<T>, 'key'>) => void
  commit: () => void
  markReady: () => void
  truncate: () => void
}) => void | CleanupFn

Collections track active subscriptions through reference counting. When the first subscription is created via subscribeChanges(), the collection calls startSync() to begin data synchronization. The collection transitions through loading states until markReady() is called, at which point collection.status === 'ready' and subscriptions receive the initial state. When the last subscription is removed and gcTime has elapsed, any cleanup function is called.
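To make the lifecycle above concrete, here is a minimal sketch of an eager sync function. The types are simplified local stand-ins for the ones named in the signature, and `createEagerSync` and `traceSync` are hypothetical helpers for illustration, not part of TanStack DB.

```typescript
// Local stand-ins for the shapes named in the RFC (not imported from @tanstack/db).
type ChangeMessage<T> = { type: 'insert' | 'update' | 'delete'; value: T }
type SyncParams<T> = {
  begin: () => void
  write: (message: ChangeMessage<T>) => void
  commit: () => void
  markReady: () => void
}
type CleanupFn = () => void
type Todo = { id: number; title: string }

// Hypothetical eager sync: write one full snapshot, then mark the collection ready.
function createEagerSync(snapshot: Todo[]) {
  return (params: SyncParams<Todo>): CleanupFn => {
    params.begin()
    for (const value of snapshot) {
      params.write({ type: 'insert', value })
    }
    params.commit()
    // Eager mode: ready only after the whole dataset has been written.
    params.markReady()
    return () => {
      // A real implementation would close sockets or abort requests here.
    }
  }
}

// Minimal harness that records the order of calls a sync function makes.
function traceSync(sync: (params: SyncParams<Todo>) => CleanupFn): string[] {
  const calls: string[] = []
  const cleanup = sync({
    begin: () => { calls.push('begin') },
    write: (message) => { calls.push(`write:${message.value.id}`) },
    commit: () => { calls.push('commit') },
    markReady: () => { calls.push('markReady') },
  })
  cleanup()
  calls.push('cleanup')
  return calls
}
```

Running `traceSync(createEagerSync(rows))` shows that `markReady` is reached only after every row has been written and committed, which is the cost the Problem section describes.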

Live queries execute through useLiveQuery(), which subscribes to collections and returns data synchronously when available or enters a loading state when waiting for data. The preload() method allows awaiting collection readiness before rendering.

QueryCollection integrates TanStack Query by calling a queryFn with query parameters and managing the resulting data as a collection. In the current implementation, the queryFn is called once with the full query definition.

Electric collections sync data from Postgres via HTTP Shape subscriptions, materializing the shape log into collection state. Shapes support filtering via WHERE clauses, ordering, and column selection. Other collection types like Trailbase, RxDB, and similar systems follow comparable patterns of syncing data from backend sources into local collections.

Problem

The current eager-only synchronization model creates five significant limitations that prevent TanStack DB from scaling to real-world application data sizes:

1. Cannot scale to large datasets

Collections must sync their entire dataset before collection.status === 'ready', making them impractical for large tables. A collection of millions of posts cannot reasonably sync everything to the client. Applications need the ability to load only the subset required for the current view—such as the first 20 posts ordered by creation date—without waiting for complete dataset synchronization.

2. Predicates don't influence initial sync

Subscription predicates cannot affect initial synchronization because startSync() executes before predicates are available. Even subscriptions with narrow predicates like where(status === 'active').limit(10) trigger full collection sync before returning any data.

3. No progressive loading patterns

Some use cases benefit from both immediate subset loading and background full syncing. An infinite scroll feed might want to immediately load the first page while syncing the full dataset in the background for offline support. The current model forces a binary choice between immediate readiness (local-only collections) or full sync (eager mode).

4. Pagination cannot load from backend

The useLiveInfiniteQuery prototype demonstrates this limitation clearly. The current implementation can only paginate through data already present in the collection. When a user clicks "load more," the hook slices the next page from already-synced data. True infinite loading requires fetching additional pages from the backend on demand, but collections lack the API to support this pattern.

5. Inefficient for live query collections

Live query collections subscribe to all source collections, forcing each source to perform full synchronization even when the query only needs a small subset. For example:

from(postsCollection)
  .join(usersCollection, ...)
  .where(posts.userId === currentUser)
  .limit(10)

This query needs 10 posts for a specific user, but postsCollection syncs its entire dataset before the query can execute.

The lack of on-demand loading prevents TanStack DB from handling application data at realistic scale across all collection types—whether Electric, QueryCollection, Trailbase, RxDB, or others.

Proposal

Sync Function Return Type

The sync function in CollectionConfig will be extended to optionally return a SyncConfigRes object:

export type LoadSubsetFn = (options: LoadSubsetOptions) => true | Promise<void>
export type CleanupFn = () => void

export type SyncConfigRes = {
  cleanup?: CleanupFn
  loadSubset?: LoadSubsetFn
}

export interface SyncConfig<
  T extends object = Record<string, unknown>,
  TKey extends string | number = string | number,
> {
  sync: (params: {
    collection: Collection<T, TKey, any, any, any>
    begin: () => void
    write: (message: Omit<ChangeMessage<T>, 'key'>) => void
    commit: () => void
    markReady: () => void
    truncate: () => void
  }) => void | CleanupFn | SyncConfigRes
}

Collections can return:

void: Eager mode with no cleanup
CleanupFn: Eager mode with cleanup
SyncConfigRes: On-demand mode if loadSubset is provided, otherwise eager mode
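As a rough sketch of how a collection might interpret the three return shapes listed above: the types are simplified copies of the ones in this RFC, while `resolveSyncResult` and `ResolvedSync` are hypothetical names introduced only for illustration.

```typescript
// Simplified copies of the RFC's types; LoadSubsetOptions is reduced to one field.
type LoadSubsetOptions = { limit?: number }
type LoadSubsetFn = (options: LoadSubsetOptions) => true | Promise<void>
type CleanupFn = () => void
type SyncConfigRes = { cleanup?: CleanupFn; loadSubset?: LoadSubsetFn }
type SyncResult = void | CleanupFn | SyncConfigRes

// Hypothetical normalized form a collection could keep internally.
type ResolvedSync = {
  mode: 'eager' | 'on-demand'
  cleanup?: CleanupFn
  loadSubset?: LoadSubsetFn
}

function resolveSyncResult(result: SyncResult): ResolvedSync {
  // A bare function is the existing CleanupFn form: eager mode with cleanup.
  if (typeof result === 'function') {
    return { mode: 'eager', cleanup: result }
  }
  // An object is SyncConfigRes: on-demand only when loadSubset is present.
  if (typeof result === 'object' && result !== null) {
    return {
      mode: result.loadSubset ? 'on-demand' : 'eager',
      cleanup: result.cleanup,
      loadSubset: result.loadSubset,
    }
  }
  // void: eager mode with no cleanup.
  return { mode: 'eager' }
}
```

Because a plain `CleanupFn` and `void` still resolve to eager mode, existing sync functions keep their current behavior under this reading.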

Sync Modes

Collections operate in one of two modes based on their sync function return value:

Eager Mode (current behavior, default):

Collection does not return a loadSubset function
Performs full initial sync when first subscription is created
Calls markReady() only after entire dataset is synced
Subscriptions receive data immediately from already-synced collection
collection.status === 'ready' after full sync completes
preload() waits for full initial sync to complete

On-Demand Mode:

Collection returns a loadSubset function in SyncConfigRes
No automatic initial sync when collection is created
Data is loaded only when subscriptions request specific subsets via loadSubset()
Each subscription independently loads the data it needs
preload() is typically a no-op since collections don't perform background syncing

Collections can implement additional loading patterns on top of on-demand mode. A common pattern is progressive mode, where the collection returns a loadSubset function (making it on-demand) but also initiates a background full sync:

Both subset loading and background sync race—whichever completes first wins
If background sync completes first, pending loadSubset() calls can abort and return early since data is already present
collection.status === 'ready' after full background sync completes
preload() waits for full background sync to complete
Best for offline-first applications that want immediate data access but also want to cache the full dataset
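The race described above could be sketched roughly as follows. This is an illustration under assumptions, not the Electric implementation: `createProgressiveLoader`, `fetchSubset`, and `onFullSyncComplete` are hypothetical names, and the collection is assumed to call `onFullSyncComplete` once its background sync has written everything.

```typescript
type LoadSubsetOptions = { limit?: number }

// Hypothetical helper: fetchSubset loads one subset and should honor the AbortSignal.
function createProgressiveLoader(
  fetchSubset: (options: LoadSubsetOptions, signal: AbortSignal) => Promise<void>,
) {
  let fullySynced = false
  const pending = new Set<{ controller: AbortController; resolve: () => void }>()

  const loadSubset = (options: LoadSubsetOptions): true | Promise<void> => {
    // Once the full dataset is present, every subset is already loaded.
    if (fullySynced) return true
    const controller = new AbortController()
    return new Promise<void>((resolve, reject) => {
      const entry = { controller, resolve }
      pending.add(entry)
      fetchSubset(options, controller.signal)
        .then(resolve, reject)
        .finally(() => pending.delete(entry))
    })
  }

  // Called by the collection when the background full sync finishes.
  const onFullSyncComplete = (): void => {
    fullySynced = true
    pending.forEach((entry) => {
      // Resolve first so a later abort rejection is ignored (the promise is settled).
      entry.resolve()
      entry.controller.abort()
    })
    pending.clear()
  }

  return { loadSubset, onFullSyncComplete }
}
```

Resolving before aborting means a pending caller sees a successful load even if the aborted request later rejects.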

Collections will only expose syncMode configuration if they support both eager and on-demand modes. Otherwise, they default to their single supported mode.

LoadSubset Function Contract

When a collection provides a loadSubset function, it accepts the following options:

export type LoadSubsetOptions = {
  /** The where expression to filter the data */
  where?: BasicExpression<boolean>
  /** The order by clause to sort the data */
  orderBy?: OrderBy
  /** The limit of the data to load */
  limit?: number
  /**
   * The subscription that triggered the load.
   * Advanced sync implementations can use this for:
   * - LRU caching keyed by subscription
   * - Reference counting to track active subscriptions
   * - Subscribing to subscription events (e.g., finalization/unsubscribe)
   * @optional Available when called from CollectionSubscription, may be undefined for direct calls
   */
  subscription?: Subscription
}

The function returns:

true: Data is already present in the collection, subscription can proceed synchronously
Promise<void>: Data is being loaded, resolves when subset is available

The loadSubset function is responsible for:

Determining if the requested subset is already loaded
If not, initiating data fetching from the backend/source
Writing fetched data to the collection via write() and commit()
Returning true for immediate access or a Promise that resolves when loading completes
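The four responsibilities above can be sketched as a small factory. Everything here is a simplified assumption: `where` is reduced to a single status filter, subsets are tracked by an exact-match key rather than real predicate containment, and `createLoadSubset`, `SyncApi`, and `fetchRows` are hypothetical names.

```typescript
type Row = { id: number; status: string }
// Simplified options: the real `where` is an expression tree, not a plain object.
type LoadSubsetOptions = { where?: { status: string }; limit?: number }
type SyncApi = {
  begin: () => void
  write: (message: { type: 'insert'; value: Row }) => void
  commit: () => void
}

function createLoadSubset(
  sync: SyncApi,
  fetchRows: (options: LoadSubsetOptions) => Promise<Row[]>,
) {
  const loaded = new Set<string>() // subsets already written to the collection
  const inFlight = new Map<string, Promise<void>>() // subsets currently loading

  return (options: LoadSubsetOptions): true | Promise<void> => {
    const key = JSON.stringify([options.where?.status ?? null, options.limit ?? null])
    // 1. Already loaded: the subscription can proceed synchronously.
    if (loaded.has(key)) return true
    // Share one request between identical concurrent calls.
    const existing = inFlight.get(key)
    if (existing) return existing
    // 2-3. Fetch, then write the rows inside one begin/commit transaction.
    const promise = fetchRows(options)
      .then((rows) => {
        sync.begin()
        for (const value of rows) sync.write({ type: 'insert', value })
        sync.commit()
        loaded.add(key)
      })
      .finally(() => {
        inFlight.delete(key)
      })
    inFlight.set(key, promise)
    // 4. The promise resolves once the subset is available.
    return promise
  }
}
```

An exact-match key is the crudest possible tracking; the Predicate Deduplication section describes the containment checks a real collection would want instead.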

Live Query Integration

Live queries interact with loadSubset through the subscription lifecycle:

When useLiveQuery() creates a subscription with predicates, the subscription system calls loadSubset() if the collection provides one
Live queries begin executing immediately, processing any available data
The live query remains in loading state until all loadSubset() promises resolve
If all loadSubset() calls return true, the live query executes completely synchronously with no loading state
Once all subsets are loaded, the live query transitions to ready and remains ready indefinitely
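One way to picture the loading-state rule above is a helper that folds several `loadSubset()` results into one. `combineLoadResults` is a hypothetical name; the sketch only shows the all-`true` fast path versus waiting on the pending promises.

```typescript
type LoadResult = true | Promise<void>

// Returns true only when every source reported its subset as already loaded;
// otherwise returns a promise that resolves when all pending loads finish.
function combineLoadResults(results: LoadResult[]): LoadResult {
  const pending = results.filter(
    (result): result is Promise<void> => result !== true,
  )
  if (pending.length === 0) return true
  return Promise.all(pending).then(() => undefined)
}
```

A live query could skip its loading state entirely when this returns `true`, and otherwise stay in loading until the returned promise settles.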

For pagination via useLiveInfiniteQuery:

Calling fetchNextPage() updates the subscription's limit predicate
The updated predicate triggers a new loadSubset() call with the increased limit
The collection determines if additional data needs fetching or if it's already present

For joins in live query collections:

As data flows through the left collection, new related entity IDs may be discovered
The live query collection calls loadSubset() on the right collection with predicates like where: inArray(ref('userId'), ['abc', 'xyz'])
This creates dynamic, data-driven subset loading throughout query execution
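A sketch of the join-driven loading described above, under stated assumptions: `ref` and `inArray` are local stand-ins that build plain objects (not the library's expression builders), and `createJoinLoader` is a hypothetical helper that requests only IDs it has not asked for before.

```typescript
// Local stand-ins for expression builders; they only produce plain objects.
type Ref = { kind: 'ref'; path: string }
type InArray = { kind: 'inArray'; ref: Ref; values: string[] }
const ref = (path: string): Ref => ({ kind: 'ref', path })
const inArray = (target: Ref, values: string[]): InArray => ({
  kind: 'inArray',
  ref: target,
  values,
})

type Post = { id: string; userId: string }
type LoadRight = (options: { where: InArray }) => true | Promise<void>

// rightField is the column on the right collection that the join matches against.
function createJoinLoader(loadRight: LoadRight, rightField: string) {
  const requested = new Set<string>()
  return (leftRows: Post[]): true | Promise<void> => {
    // Deduplicate IDs in this batch, then drop the ones requested earlier.
    const fresh = Array.from(new Set(leftRows.map((post) => post.userId))).filter(
      (id) => !requested.has(id),
    )
    if (fresh.length === 0) return true
    fresh.forEach((id) => requested.add(id))
    return loadRight({ where: inArray(ref(rightField), fresh) })
  }
}
```

Each batch of left-side rows therefore produces at most one `loadSubset()` call on the right collection, covering only newly discovered IDs.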

Synchronous Data Access

The true return value enables flicker-free rendering when navigating between components:

User navigates to a component that needs posts where status === 'active' limited to 20
Collection has already loaded this subset from a previous component
loadSubset() returns true immediately
useLiveQuery() executes completely synchronously in the same render
Component renders with data immediately, no loading state flicker

Collection-Specific Implementations

Electric Collections:

Track loaded subsets using Shape subscription handles and offsets
For each column used in WHERE clauses, maintain disjoint sets of loaded ranges (lower and upper bounds)
When loadSubset() is called, compare requested predicate against loaded ranges
Create new Shape subscriptions only for gaps in loaded data
Support progressive mode by starting background full-table Shape subscription alongside subset loading
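The gap detection in the list above could look roughly like this for a single numeric column. The half-open interval convention and the `findGaps` name are assumptions made for the sketch; the RFC does not specify how ranges are represented.

```typescript
// Half-open interval [lower, upper) over one numeric column (an assumption).
type Range = { lower: number; upper: number }

// Returns the parts of `requested` not covered by any range in `loaded`.
function findGaps(loaded: Range[], requested: Range): Range[] {
  const sorted = [...loaded].sort((a, b) => a.lower - b.lower)
  const gaps: Range[] = []
  let cursor = requested.lower // everything below cursor is known to be covered
  for (const range of sorted) {
    if (range.upper <= cursor) continue // entirely before the uncovered part
    if (range.lower >= requested.upper) break // entirely after the request
    if (range.lower > cursor) gaps.push({ lower: cursor, upper: range.lower })
    cursor = Math.max(cursor, range.upper)
    if (cursor >= requested.upper) break
  }
  if (cursor < requested.upper) gaps.push({ lower: cursor, upper: requested.upper })
  return gaps
}
```

With loaded ranges `[0, 10)` and `[20, 30)`, a request for `[5, 25)` yields the single gap `[10, 20)`, so only that slice would need a new Shape subscription.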

QueryCollection:

Integrate LoadSubsetOptions into TanStack Query's meta object
Set queryKey dynamically based on predicate parameters
Use predicate deduplication mechanism to identify if cached data satisfies the request
Return true if non-stale data covers the predicate
Call queryFn with predicate parameters in meta for new data fetching
Track pagination cursors on the subscription object for continuation requests
Detect pagination continuations by checking if all predicates remain the same except for an increased limit value
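The continuation check in the last item might be sketched as below. `isPaginationContinuation` is a hypothetical name, and comparing predicates via `JSON.stringify` is a shortcut that assumes both calls serialize their `where` and `orderBy` with the same key order.

```typescript
// where/orderBy are left opaque here; only structural equality matters.
type LoadSubsetOptions = { where?: unknown; orderBy?: unknown; limit?: number }

function isPaginationContinuation(
  previous: LoadSubsetOptions,
  next: LoadSubsetOptions,
): boolean {
  // Both requests must be limited, and the new limit must be larger.
  if (previous.limit === undefined || next.limit === undefined) return false
  if (next.limit <= previous.limit) return false
  // Every other predicate must be unchanged (naive structural comparison).
  const sameWhere =
    JSON.stringify(previous.where ?? null) === JSON.stringify(next.where ?? null)
  const sameOrder =
    JSON.stringify(previous.orderBy ?? null) === JSON.stringify(next.orderBy ?? null)
  return sameWhere && sameOrder
}
```

When this returns `true`, the collection could reuse the cursor stored for the previous request and fetch only the rows beyond the old limit.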

Live Query Collections:

Subscribe to source collections and call their loadSubset() functions with translated predicates
For from({ post: postsCollection }).where(({ post }) => eq(post.userId, '123')).limit(10), call postsCollection.loadSubset({ where: eq(ref('userId'), '123'), limit: 10 })
For joins, as left-side data arrives, extract related IDs and call loadSubset() on right-side collection
Dynamic predicate updates as new data flows through the query

Predicate Deduplication

A predicate deduplication library will provide set operations on PredicateObject instances to help collections track loaded subsets:

Containment checking: Determine if a new predicate's result set is fully contained within previously loaded data
Hybrid tracking: Maintain one combined query for unlimited predicates and separate entries for limited predicates
Merge optimization: Combine compatible predicates to minimize tracking overhead

Collections use these tools to:

Check if loadSubset() requests are already satisfied by loaded data
Track which data ranges have been fetched
Determine when to reuse cached data versus fetching new data

The exact mechanics of predicate set operations are outside the scope of this RFC and will be documented separately. Collections are free to implement their own tracking mechanisms based on their specific requirements (e.g., staleness policies, cache eviction strategies).
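Since the RFC leaves the mechanics open, the following is only a toy illustration of containment checking and hybrid tracking. It assumes predicates are conjunctions of equality conditions and that ordering is fixed; `Subset`, `whereImplies`, and `isSatisfiedBy` are hypothetical names.

```typescript
// Toy predicate model: a conjunction of equality conditions on columns.
type Conditions = Record<string, string | number>
type Subset = { where: Conditions; limit?: number }

// True when every row matching `narrow` also matches `wide`.
function whereImplies(narrow: Conditions, wide: Conditions): boolean {
  return Object.keys(wide).every((column) => narrow[column] === wide[column])
}

function isSatisfiedBy(requested: Subset, loaded: Subset): boolean {
  if (loaded.limit === undefined) {
    // Unlimited loaded subset: any request at least as restrictive is covered.
    return whereImplies(requested.where, loaded.where)
  }
  // Limited loaded subset: only the same filter with a limit no larger is
  // safely covered, since a narrower filter may need rows beyond the loaded page.
  return (
    requested.limit !== undefined &&
    requested.limit <= loaded.limit &&
    whereImplies(requested.where, loaded.where) &&
    whereImplies(loaded.where, requested.where)
  )
}
```

The asymmetry between the two branches is the reason for tracking unlimited and limited predicates separately.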

Error Handling

When loadSubset() encounters an error:

Collection calls setError() to record the error in collection state
The loadSubset() promise rejects with the error
Live query captures the error and returns it in the result map
Components can render error states via useLiveQuery().error
Unhandled errors at the loadSubset() call site are also caught and recorded

Retry logic is collection-dependent. Collections may implement automatic retries, exponential backoff, or leave retry decisions to the application layer.
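As a hedged sketch of the error flow above, a wrapper can record failures and still reject so the live query sees them. `withErrorTracking` is a hypothetical helper, and `setError` is passed in as a plain callback rather than taken from any real collection API.

```typescript
type LoadSubsetOptions = { limit?: number }
type LoadSubsetFn = (options: LoadSubsetOptions) => true | Promise<void>

function withErrorTracking(
  loadSubset: LoadSubsetFn,
  setError: (error: unknown) => void,
): LoadSubsetFn {
  return (options) => {
    try {
      const result = loadSubset(options)
      if (result === true) return true
      // Asynchronous failure: record it, then rethrow so the promise rejects.
      return result.catch((error: unknown) => {
        setError(error)
        throw error
      })
    } catch (error) {
      // Synchronous throw at the call site: record it and surface a rejection.
      setError(error)
      return Promise.reject(error)
    }
  }
}
```

Retries are deliberately left out; a collection could layer backoff around the inner `loadSubset` without changing this wrapper.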

Preload Behavior

The preload() method behavior varies by sync mode:

Eager Mode:

preload() calls startSync() and waits for collection.status === 'ready'
Ensures full dataset is loaded before rendering
Existing behavior preserved

On-Demand Mode:

preload() is typically a no-op since collections don't perform background syncing
Developers should call await liveQuery.preload() instead to ensure specific query data is loaded
Documentation will educate users about this shift from collection-level to query-level preloading

Progressive Mode (collection-implemented):

preload() waits for background full sync to complete
Ensures offline-first applications have complete dataset cached
Query-level preloading works immediately via subset loading while background sync continues

Areas Requiring Prototyping

Pagination State Tracking: How collections track pagination state (cursors, offsets, keyset values) across multiple subscriptions with different pagination requirements will be determined through implementation. Collections must match new loadSubset() calls against previous calls to determine what data to fetch, but the exact mechanism for tracking this state needs prototyping.

Definition of Success

This proposal succeeds if it enables the following outcomes:

1. Large Dataset Support

Collections can handle tables with millions of rows without requiring full synchronization. A live query displaying the first 20 posts from a 10-million-row table loads only those 20 rows and transitions to ready state in under 2 seconds (network-dependent).

2. Predicate-Driven Loading

Subscription predicates directly control initial data loading. A live query with where(({ post }) => eq(post.status, 'active')).limit(10) loads exactly 10 active records, not the entire collection. The collection's loadSubset() function is called with the subscription's exact predicate parameters.

3. Synchronous Data Access

When navigating between components that request identical or already-loaded subsets, useLiveQuery() returns data synchronously with zero loading state flicker. loadSubset() returns true for 100% of navigation cases where data is already present.

4. True Infinite Scroll

useLiveInfiniteQuery can fetch additional pages from the backend by calling fetchNextPage(). Each call increases the limit predicate, triggers loadSubset(), and fetches only the incremental data not already loaded. A 1000-row feed loads in chunks of 20 rows on demand rather than syncing all 1000 rows upfront.

5. Efficient Query Collections

Live queries over multiple source collections only trigger subset loading on those sources. A query from({ post: postsCollection }).where(({ post }) => eq(post.userId, '123')).limit(10) loads 10 posts and only the related users, not the full posts and users collections. Query execution time is proportional to result set size, not source collection size.

6. Progressive Mode Viability

Collections can implement progressive loading patterns where subset requests resolve immediately while background full sync continues. An infinite scroll feed displays the first page in under 1 second while the full dataset syncs for offline support over the next 30 seconds.

7. Backward Compatibility

Existing collections continue working without changes. Collections without loadSubset() functions maintain current eager-mode behavior. No breaking changes to collection API or live query usage patterns.

8. QueryCollection Integration

QueryCollection successfully integrates LoadSubsetOptions into TanStack Query's queryFn calls. Developers can access meta.where, meta.orderBy, and meta.limit to construct backend API requests. Pagination continuations correctly pass previous cursors to subsequent queryFn calls.
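To illustrate how a queryFn might turn these options into a backend request, here is a sketch over a deliberately simplified shape. The real `where` and `orderBy` values are expression objects, so `SimpleLoadSubsetOptions`, `toQueryString`, and the `sort`/`limit` parameter names are all assumptions for this example.

```typescript
// Simplified, hypothetical shapes; not the library's expression types.
type EqExpression = { op: 'eq'; field: string; value: string | number }
type OrderByClause = { field: string; direction: 'asc' | 'desc' }
type SimpleLoadSubsetOptions = {
  where?: EqExpression
  orderBy?: OrderByClause[]
  limit?: number
}

// Builds a query string for an imagined REST endpoint.
function toQueryString(options: SimpleLoadSubsetOptions): string {
  const params = new URLSearchParams()
  if (options.where) {
    params.set(options.where.field, String(options.where.value))
  }
  if (options.orderBy && options.orderBy.length > 0) {
    // Descending columns get a leading '-', a common REST convention.
    const sort = options.orderBy
      .map((clause) => (clause.direction === 'desc' ? `-${clause.field}` : clause.field))
      .join(',')
    params.set('sort', sort)
  }
  if (options.limit !== undefined) {
    params.set('limit', String(options.limit))
  }
  return params.toString()
}
```

A queryFn could append the result to its endpoint URL, so each distinct predicate maps to a distinct request.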

pawelblaszczyk5 · 2025-10-14T18:58:28Z

pawelblaszczyk5
Oct 14, 2025

Stoked for this, sounds really great. One thing that I may be missing - will the current “eager” mode with the built in collections still be supported, eg Electric one?

4 replies

KyleAMathews Oct 14, 2025
Maintainer Author

Yeah eager is still the default.

pawelblaszczyk5 Oct 14, 2025

> Yeah eager is still the default.

So how will someone opt in to this with the Electric collection? I understood from the RFC that the mode depends on the collection implementation and its return value? 🤔

KyleAMathews Oct 14, 2025
Maintainer Author

There'll be a new syncMode option. Electric will expose eager (default), on-demand, and progressive.

pawelblaszczyk5 Oct 14, 2025

Thanks a lot for explaining!

fezproof · 2025-10-15T03:01:15Z

fezproof
Oct 15, 2025

If collections can load data in subsets, could this unlock a sort of SSR with Tanstack db?
If you could pass an initial set of data to a collection, you could load it as part of an SSR page load and then populate the collection from there.

1 reply

KyleAMathews Oct 15, 2025
Maintainer Author

Yeah! This is a really nice benefit of this approach — it'll make SSR very efficient.

mhsnook · 2025-11-07T16:59:14Z

mhsnook
Nov 7, 2025

Hi, question -- first off, I love this! Thank you.

I wonder if you could clarify, under this approach, I would provide my own loader function which would accept this "where" argument, and I could write a function that, say, responds to certain parameters by fetching from an entirely different API? e.g. I might fetch cards from a REST API when I want them one at a time, or one language at a time, but then I might also want "most recent from my friends; weighted by my preferences" or something like that, which might be an entirely different API.

(On the other thread where I requested the ability to basically just cat different react-query caches, this was an important reason for why; I sometimes have vastly different fetching strategies.)

I love to hear that each segment will have its own preload, subscriber, and loading state.

Note that useShape does not have the same kind of subscription-management that we know and love around here, note on abort controllers.

> Note that if you have multiple components using the same component, this will stop updates for all subscribers. Which is probably not what you want. We plan to add a better API for unsubscribing from updates & cleaning up shapes that are no longer needed. If interested, please file an issue to start a discussion.

0 replies

firatoezcan · 2025-11-12T15:40:00Z

firatoezcan
Nov 12, 2025

Hey Kyle, this RFC is great, I don't really have anything to add to the API/implementation plan, however one thing that would be great for adoption of Tanstack DB is maybe some transitional solution until this RFC is discussed enough and the implementation starts (which will probably be a ton of work as well).

Transitional solution in the sense of: How can we achieve some parts of what this RFC wants to achieve with what we currently have, even if it means using janky workarounds.

The benefit here would be that projects can start building on top of TanStack DB right now to get the benefits it offers already (greenfield projects that have a half-life of at least 5 years).
Since data loading is such a fundamental aspect of an application, it's hard to reason about migrating over after a year's worth of code has been written already + it would require thinking in a different way anyways.

From the RFC I would say the important stuff that would suffice for a lot of applications are:

Large Dataset Support
Predicate-Driven Loading
Synchronous Data Access

If there is some way that this can be done in a not-perfect-but-still-ok-ish way today, I would say it'd be worthwhile to maybe write dedicated documentation for this (with obvious disclaimers that this is WIP and to be changed).

I wouldn't mind chiming in here and diving deep, but obviously I am not sure if with what's currently available it would even be a reasonable attempt.

Let me know what you (or other maintainers) think about this :)

1 reply

KyleAMathews Nov 12, 2025
Maintainer Author

Heh so amusingly we just released it!

https://x.com/tan_stack/status/1988638050662895789 & https://tanstack.com/blog/tanstack-db-0.5-query-driven-sync
