
Conversation

Contributor

@MarkWolters MarkWolters commented Oct 15, 2025

This PR adds the option to specify that graph indexes should be written in parallel rather than sequentially. By default the existing sequential write behavior is maintained; parallel writes are only used when the withParallelWrites(true) option is set through the OnDiskGraphIndexWriter.Builder class. The testing results below show the speedup achieved in the write phase across a range of core counts. These gains appear to scale linearly with dataset size (i.e., writing a dataset of 10 million records takes about 10x as long as a dataset of 1 million records, but the parallel vs. sequential speedup is roughly equal).

ETA: Testing also showed that the performance gains on a "prod-like" i3.4xlarge with 16 vCPUs and 8 disks striped in RAID 0 were roughly equivalent to the gains on a 64-vCPU m5.16xlarge with standard SSDs.

[Benchmark table with columns: sequential, parallel, speedup]
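For reference, a minimal usage sketch of the new option. The builder construction and write(...) call below are assumptions based on the existing OnDiskGraphIndexWriter API; only withParallelWrites(true) comes from this PR, and the jvector import is omitted since its package path is not shown in this thread.

import java.io.IOException;
import java.util.Map;

class ParallelWriteUsageSketch {
    // Builder usage and write(...) signature are assumed; withParallelWrites(true)
    // is the new option introduced by this PR.
    static void writeInParallel(OnDiskGraphIndexWriter.Builder builder) throws IOException {
        try (var writer = builder
                .withParallelWrites(true)   // opt in to parallel L0 record writes
                .build()) {
            writer.write(Map.of());          // feature-state suppliers omitted for brevity
        }
    }
}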

Contributor

github-actions bot commented Oct 15, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Comment on lines 84 to 85
buffer.clear();
var writer = new ByteBufferIndexWriter(buffer);
Member

The ownership model and lifecycle for the buffer here is a bit ambiguous to me, especially because the buffer is passed into this class as a parameter. I wonder if we can make the ByteBufferIndexWriter hide some of the implementation details so that we do not have a buffer.clear() and the buffer management logic at the end of this method. Instead, we could make the ByteBufferIndexWriter manage all of that, by adding a clone and a reset method, perhaps?

Contributor Author

Updated ByteBufferIndexWriter with CloneBuffer (to avoid confusion with Object.clone()) and updated reset to include clearing the buffer. Moved that logic out of NodeRecordTask to clarify ownership in the latest commit.
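For readers following along, a rough sketch of the shape described here. The class name, method names, and details below are approximations of the idea, not the actual ByteBufferIndexWriter implementation.

import java.nio.ByteBuffer;

// Approximate sketch of the ownership model discussed above.
class ByteBufferIndexWriterSketch {
    private final ByteBuffer buffer;
    private final int initialPosition;

    ByteBufferIndexWriterSketch(ByteBuffer buffer) {
        this.buffer = buffer;
        this.initialPosition = buffer.position();
        reset(); // the writer, not the caller, is responsible for clearing
    }

    /** Clears the buffer and restores the starting position so the writer can be reused. */
    final void reset() {
        buffer.clear();
        buffer.position(initialPosition);
    }

    /** Copies out the bytes written so far, so the thread-local buffer can be reused safely. */
    ByteBuffer cloneBuffer() {
        ByteBuffer source = buffer.duplicate();
        source.flip();                      // limit = bytes written, position = 0
        source.position(initialPosition);   // skip anything before the starting position
        ByteBuffer copy = ByteBuffer.allocate(source.remaining());
        copy.put(source);
        copy.flip();
        return copy;
    }
}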

Comment on lines 187 to 211
for (int newOrdinal = 0; newOrdinal <= maxOrdinal; newOrdinal++) {
    final int ordinal = newOrdinal;
    final long fileOffset = baseOffset + (long) ordinal * recordSize;

    Future<NodeRecordTask.Result> future = executor.submit(() -> {
        var view = viewPerThread.get();
        var buffer = bufferPerThread.get();

        var task = new NodeRecordTask(
                ordinal,
                ordinalMapper,
                graph,
                view,
                inlineFeatures,
                featureStateSuppliers,
                recordSize,
                fileOffset,
                buffer
        );

        return task.call();
    });

    futures.add(future);
}
Member

IIUC, this doesn't have any back pressure mechanism, and we need that to prevent excessive memory utilization. I think we might want to consider an implementation that uses semaphores to manage concurrency.

Contributor Author

In the latest commit I've added backpressure mechanisms to both this loop and the file channel write loop. I'm not entirely clear on what solution you had in mind vis-a-vis semaphores, but I am open to re-implementing in another fashion if you feel that the code could be improved.
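For reference, a minimal sketch of the semaphore-based bounding that was floated in this thread. MAX_IN_FLIGHT, executor, and buildRecord are placeholders, not names from this PR.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

// Bound the number of in-flight record-building tasks with a semaphore.
class BoundedSubmitSketch {
    static final int MAX_IN_FLIGHT = 1024; // illustrative cap

    static void submitAll(ExecutorService executor, int maxOrdinal) throws InterruptedException {
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
        for (int ordinal = 0; ordinal <= maxOrdinal; ordinal++) {
            final int current = ordinal;
            inFlight.acquire();              // blocks once MAX_IN_FLIGHT tasks are pending
            executor.submit(() -> {
                try {
                    buildRecord(current);    // placeholder for the NodeRecordTask work
                } finally {
                    inFlight.release();      // a finished task frees a permit
                }
            });
        }
        inFlight.acquire(MAX_IN_FLIGHT);     // returns only after all outstanding tasks complete
    }

    static void buildRecord(int ordinal) {
        // placeholder: build and buffer the L0 record for this ordinal
    }
}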


// result.data is already a copy made in NodeRecordTask to avoid
// race conditions with thread-local buffer reuse
afc.write(result.data, result.fileOffset).get();
Member

The .get() here negates the use of the async file channel.

Member

We could use a semaphore to limit the number of write tasks in flight.

Contributor Author

This section has been rewritten to remove the get() call at this point in the code and to split the write process into sub-tasks. Again, I'm happy to go with a different implementation if the code can be improved, but I'm not sure exactly what you had in mind.
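As a point of comparison, here is a sketch of one non-blocking alternative: handing completion to a CompletionHandler instead of calling get(), with a semaphore capping the writes in flight. The class name, field names, and the cap of 64 are illustrative assumptions, not code from this PR.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.Semaphore;

class AsyncWriteSketch {
    private final AsynchronousFileChannel afc;
    private final Semaphore inFlightWrites = new Semaphore(64);

    AsyncWriteSketch(AsynchronousFileChannel afc) {
        this.afc = afc;
    }

    void writeRecord(ByteBuffer data, long fileOffset) throws InterruptedException {
        inFlightWrites.acquire(); // back pressure: wait when too many writes are outstanding
        afc.write(data, fileOffset, null, new CompletionHandler<Integer, Void>() {
            @Override
            public void completed(Integer bytesWritten, Void attachment) {
                // real code would also handle short writes by re-issuing the remainder
                inFlightWrites.release();
            }

            @Override
            public void failed(Throwable exc, Void attachment) {
                inFlightWrites.release();
                // real code would record the failure and surface it to the caller
            }
        });
    }
}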

@MarkWolters MarkWolters requested a review from Copilot October 21, 2025 11:55
Copilot AI left a comment

Pull Request Overview

This PR introduces parallel write capability for graph indexes to improve write throughput. The implementation maintains backward compatibility by defaulting to sequential writes unless explicitly enabled via withParallelWrites(true). Testing shows linear performance scaling with dataset size, with speedups consistent across different hardware configurations.

Key changes:

  • Added ParallelGraphWriter class that orchestrates parallel L0 record building using thread pools and asynchronous file I/O
  • Introduced ByteBufferIndexWriter for in-memory record construction before bulk disk writes
  • Added withParallelWrites(boolean) builder option to OnDiskGraphIndexWriter.Builder

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • ParallelWriteExample.java: Example demonstrating parallel vs sequential write usage patterns and benchmark comparison
  • Grid.java: Enables parallel writes in the production Grid.buildOnDisk method
  • ParallelGraphWriter.java: Core parallel writer implementation with thread pooling, memory-aware backpressure, and async I/O
  • OnDiskGraphIndexWriter.java: Updated to support optional parallel write mode via builder configuration
  • NodeRecordTask.java: Task implementation for building individual node records in parallel worker threads
  • GraphIndexWriterTypes.java: New enum defining available writer types (sequential vs parallel)
  • GraphIndexWriter.java: Added factory methods for creating appropriate writer builders based on type
  • ByteBufferIndexWriter.java: IndexWriter implementation for writing to ByteBuffers in memory


Contributor

sam-herman commented Oct 21, 2025

@MarkWolters this is a very cool change!

re:

sequential writes will only be used when the withParallelWrites(true) option is set through the OnDiskGraphIndexWriter.Builder class

Did you mean parallel writes will only be used when the withParallelWrites(true) option is set?

Contributor

@sam-herman sam-herman left a comment

Added more comments, primarily around two aspects:

  1. Calculation of the buffer - Seems like we need to consider alignment and hardware parallelism to reduce buffer sizes.
  2. Batching, GC, and malloc overhead - It seems as if we are creating a task per ordinal, which could be somewhat expensive in terms of memory allocation and GC overhead.

* This task is designed to be executed in a thread pool, with each worker thread
* owning its own ImmutableGraphIndex.View for thread-safe neighbor iteration.
*/
class NodeRecordTask implements Callable<NodeRecordTask.Result> {
Contributor

What would be the memory impact of creating this object for each write operation?

Contributor Author

That depends on dataset size and vector dimensionality, but the number of NodeRecordTasks in existence at any time is bounded by a buffer that is sized based on available memory as a backpressure mechanism.

@Override
public Result call() throws Exception {
    // Writer automatically clears buffer on construction
    var writer = new ByteBufferIndexWriter(buffer);
Contributor

Why not make the ByteBufferIndexWriter thread local and avoid recreating it on every call?

Contributor Author

I don't think there's a significant difference. The constructor is trivially cheap: it just stores the buffer reference and initial position, and the real cost (the ByteBuffer allocation) is already thread-local. ByteBufferIndexWriter itself is tiny, only two fields, so the allocation cost is negligible.
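For completeness, the thread-local variant being discussed would look roughly like the sketch below; the class name and recordSize value are placeholders. Either way, the large per-thread ByteBuffer is allocated once and reused, which is where the real cost lives.

import java.nio.ByteBuffer;

class ThreadLocalWriterSketch {
    static final int recordSize = 4096; // placeholder record size in bytes

    // One writer (and one backing buffer) per worker thread, reset between records.
    static final ThreadLocal<ByteBufferIndexWriter> WRITER_PER_THREAD =
            ThreadLocal.withInitial(() -> new ByteBufferIndexWriter(ByteBuffer.allocate(recordSize)));
}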

*
* @return the number of records to buffer before writing
*/
private int calculateOptimalBufferSize() {
Contributor

Should we take graph properties into this calculation to consider alignment with device blocks?
Also, if we know the supported parallelism of the hardware, we could leverage that as well to reduce the buffer size in the calculation and achieve a minimal queue depth.

Contributor Author

"Should we take graph properties into this calculation to consider alignment with device blocks?" - I'm afraid I don't really understand what you are asking here. Can you point me to an example of what you are referring to, or go into a little more detail?
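For what it's worth, one possible reading of the block-alignment question is sketched below: size each batched write as a whole number of device blocks. The 4 KiB block size and the helper names are illustrative assumptions, not part of this PR.

class BufferSizingSketch {
    static final int DEVICE_BLOCK_SIZE = 4096; // illustrative; real devices vary

    /** Round a desired batch size (in bytes) down to a whole number of device blocks. */
    static long alignToBlocks(long desiredBatchBytes) {
        return Math.max(DEVICE_BLOCK_SIZE, (desiredBatchBytes / DEVICE_BLOCK_SIZE) * DEVICE_BLOCK_SIZE);
    }

    /** Convert the aligned byte budget into a record count for the batching loop. */
    static int recordsPerBatch(long desiredBatchBytes, int recordSize) {
        return (int) Math.max(1, alignToBlocks(desiredBatchBytes) / recordSize);
    }
}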

final long fileOffset = baseOffset + (long) ordinal * recordSize;

// Submit task to build this record
Future<NodeRecordTask.Result> future = executor.submit(() -> {
Contributor

@sam-herman sam-herman Oct 23, 2025

This will create a new future object for each ordinal? Should we batch those to prevent GC and memory allocation overhead?

Contributor Author

It is batched. The buffer is checked and, if full, written and cleared on lines 190-193:

if (futures.size() >= bufferSize) {
    writeRecordsAsync(futures);
    futures.clear();
}

Contributor

Please correct me if I'm misunderstanding, but it seems to me that for every ordinal we will (eventually) create a task object? If this is the case, can we avoid that?

Contributor

I'm also wondering about this. If we are wrong, it may be useful to add a comment in the code explaining this point. Others might trip here too in the future.

Contributor Author

You are correct in your summary @sam-herman, but it seemed to me like the natural splitting point. I could assign a range of ordinals to each task (which, I suppose, is what you were suggesting by batching, and which I misunderstood) instead of only a single ordinal. What do you think of that? cc: @marianotepper

Contributor

I don't have a specific suggestion here, beyond the ones made above. I'm not sure if that one would be over-tuning the implementation either. We could run a microbenchmark to see whether memory usage increases significantly or not.

Contributor

@sam-herman sam-herman Oct 29, 2025

Hi @MarkWolters,

I've been thinking more about this, and aside from the memory allocation aspect, I have another concern related to disk alignment and throughput (as mentioned in my earlier comment regarding ParallelGraphWriter line 215).

Given a file of length N bytes and a disk capable of k concurrent writes, the ideal scenario for maximizing throughput would be to minimize the disk queue depth Q. In theory, if we have k threads each handling a contiguous buffer of size B = N/k, we could keep Q close to zero.

However, in the current implementation, it looks like the writes of each node and its adjacency list are being distributed randomly among the threads. This seems to reduce the number of contiguous writes, which could increase the queue depth and lower overall throughput.

Is this understanding correct? Also, for the graphs you mentioned in the description, it might be helpful to measure the actual disk queue depth to assess how effectively we're maximizing throughput.

Contributor Author

The eventual solution I settled on is similar to this suggestion, although it uses dataset size in terms of number of vectors rather than absolute file size. Monitoring the disk queue in Java would be tricky as we're talking about writing OS-specific code, and I don't think this enhancement should introduce that level of risk.
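A sketch of the contiguous-partitioning idea discussed in this thread: split the ordinal range [0, maxOrdinal] into contiguous chunks, one per worker, so each worker writes one contiguous region of the file. All names here are illustrative, not code from this PR.

import java.util.ArrayList;
import java.util.List;

class RangePartitionSketch {
    static final class OrdinalRange {
        final int startInclusive;
        final int endInclusive;

        OrdinalRange(int startInclusive, int endInclusive) {
            this.startInclusive = startInclusive;
            this.endInclusive = endInclusive;
        }
    }

    static List<OrdinalRange> partition(int maxOrdinal, int workers) {
        int total = maxOrdinal + 1;
        int chunk = (total + workers - 1) / workers;     // ceil(total / workers)
        List<OrdinalRange> ranges = new ArrayList<>();
        for (int start = 0; start <= maxOrdinal; start += chunk) {
            ranges.add(new OrdinalRange(start, Math.min(maxOrdinal, start + chunk - 1)));
        }
        return ranges;
    }
}

Each range then covers the contiguous byte span from baseOffset + startInclusive * recordSize to baseOffset + (endInclusive + 1) * recordSize, so one worker's writes stay sequential on disk.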

}

@Override
public Result call() throws Exception {
Contributor

I think that it would be useful for SequentialWriter and writeL0RecordsSequential to share this implementation somehow, so that we do not have code duplication.

Contributor Author

I 100% agree. My only question is whether that belongs in this PR or not. Originally the logic was duplicated between OnDiskSequentialGraphIndexWriter and OnDiskGraphIndexWriter, so there's not actually any new duplication being added here. If the consensus is that I should tackle that here I will do so. What do you think @marianotepper and @sam-herman?

Contributor

True, we can do that in a subsequent PR.

Contributor

@marianotepper marianotepper left a comment

LGTM
