# [RFC] Add Random Access Write Support to IndexOutput

## Summary

This RFC proposes adding random access write capabilities to Lucene's `IndexOutput` API, enabling use cases that require non-sequential writes while remaining compatible with Lucene's existing architecture.

## Motivation

Currently, Lucene's `IndexOutput` class is designed exclusively for sequential, append-only writes. This design provides important benefits:

- **Immutability**: Index files are written once and never modified
- **Efficient checksumming**: Checksums can be computed incrementally during writes without re-reading data
- **Atomic commits**: Files are finalized atomically, ensuring index consistency
- **Simplicity**: Sequential writes are easier to reason about and optimize

However, there are emerging use cases where random access write capabilities would provide significant benefits:

### Use Case 1: Modern Storage Optimization

Modern storage devices (SSDs, NVMe) handle random writes efficiently and support concurrent writes to different regions of a file. Random access writes could therefore allow index structures to be built in parallel, which may reduce total write time.

**Reference**: [JVector PR #542](https://github.com/datastax/jvector/pull/542) demonstrates performance improvements from parallel writes during vector index construction.
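
As a rough illustration of the access pattern (not JVector's or Lucene's implementation), the sketch below uses only the JDK. `FileChannel.write(ByteBuffer, long)` performs positional writes, so several threads can fill disjoint regions of one file without sharing a file pointer. The class name `ParallelRegionWriteDemo`, the temp file, and the fixed-size region layout are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: each worker fills its own disjoint region of one file. */
final class ParallelRegionWriteDemo {

  /** Writes {@code regions} blocks of {@code regionSize} bytes concurrently; returns the file contents. */
  static byte[] writeRegions(int regions, int regionSize) {
    ExecutorService pool = Executors.newFixedThreadPool(regions);
    try {
      Path file = Files.createTempFile("parallel-regions", ".bin");
      try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
        List<Future<Void>> pending = new ArrayList<>();
        for (int i = 0; i < regions; i++) {
          final int region = i;
          Callable<Void> task = () -> {
            byte[] block = new byte[regionSize];
            Arrays.fill(block, (byte) (region + 1)); // region i holds the byte value i + 1
            ByteBuffer buf = ByteBuffer.wrap(block);
            long pos = (long) region * regionSize;
            while (buf.hasRemaining()) {
              pos += channel.write(buf, pos); // positional write, no shared file pointer
            }
            return null;
          };
          pending.add(pool.submit(task));
        }
        for (Future<Void> f : pending) {
          f.get(); // propagate any worker failure
        }
      }
      byte[] contents = Files.readAllBytes(file);
      Files.delete(file);
      return contents;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    byte[] contents = writeRegions(4, 1024);
    System.out.println("bytes written: " + contents.length);
  }
}
```

Note that this pattern depends on positional writes rather than a single seek-then-write file pointer, so an API meant to serve this use case would likely need either positional write methods or one output per region.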

### Use Case 2: Mutable Index Structures

Some applications require the ability to update index structures in-place without full rewrites:
- Dynamic graph updates for vector search indices
- In-place modifications during index optimization
- Incremental updates to reduce I/O overhead

**References**:
- [OpenSearch JVector Issue #169](https://github.com/opensearch-project/opensearch-jvector/issues/169) discusses mutable index requirements
- [OpenSearch k-NN Issue #2715](https://github.com/opensearch-project/k-NN/issues/2715) discusses reducing I/O during frequent merges

### Use Case 3: Graph Index Construction Optimization

When building graph-based indices (HNSW, Vamana) with inlined vectors:
- Construction currently requires maintaining a separate flat vector file so that vectors can be scored while the graph is built
- Random access writes would allow reusing the partially written graph index file instead
- This would eliminate the redundant storage and I/O operations

**Reference**: _[To be added]_

## Proposed Solution

### Option 1: New Abstract Class Extending IndexOutput

Introduce a new `RandomAccessIndexOutput` abstract class that extends `IndexOutput`:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

/**
 * An IndexOutput that supports random access writes in addition to sequential writes.
 *
 * <p>This class allows seeking to arbitrary positions within the file and writing
 * data at those positions. Implementations must handle checksum computation appropriately
 * when random access writes are performed.
 *
 * <p>Instances of this class are <b>not</b> thread-safe.
 *
 * @lucene.experimental
 */
public abstract class RandomAccessIndexOutput extends IndexOutput {

  protected RandomAccessIndexOutput(String resourceDescription, String name) {
    super(resourceDescription, name);
  }

  /**
   * Sets the file pointer to the specified position for the next write.
   *
   * @param pos the position in bytes from the beginning of the file
   * @throws IOException if an I/O error occurs
   * @throws IllegalArgumentException if pos is negative or beyond the current file length
   */
  public abstract void seek(long pos) throws IOException;

  /**
   * Forces any buffered output to be written to the underlying storage.
   *
   * @throws IOException if an I/O error occurs
   */
  public abstract void flush() throws IOException;

  /**
   * Returns the checksum of bytes in the specified range.
   * This allows computing checksums for specific regions when random access writes are used.
   *
   * @param startOffset the starting position (inclusive)
   * @param endOffset the ending position (exclusive)
   * @return the checksum value for the specified range
   * @throws IOException if an I/O error occurs
   * @throws IllegalArgumentException if the range is invalid
   */
  public abstract long getChecksum(long startOffset, long endOffset) throws IOException;
}
```
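
A typical use of `seek` is back-patching: reserve space for a value that is only known later, keep writing, then return and fill it in. The sketch below shows that flow against a small in-memory stand-in so that it runs without Lucene on the classpath. `SeekableBufferOutput` and `BackPatchDemo` are hypothetical names, the big-endian byte order is an arbitrary choice for the example, and only the `seek` bounds check mirrors the contract proposed above.

```java
import java.util.Arrays;

/** Hypothetical in-memory stand-in for the proposed class; not Lucene code. */
final class SeekableBufferOutput {
  private byte[] buf = new byte[16];
  private int length;  // number of valid bytes written so far
  private int pointer; // position of the next write

  /** Mirrors the proposed seek(): only positions within the written range are legal. */
  void seek(long pos) {
    if (pos < 0 || pos > length) {
      throw new IllegalArgumentException("pos=" + pos + " is outside [0, " + length + "]");
    }
    pointer = (int) pos;
  }

  void writeByte(byte b) {
    if (pointer == buf.length) {
      buf = Arrays.copyOf(buf, buf.length * 2);
    }
    buf[pointer++] = b;
    length = Math.max(length, pointer); // overwrites do not grow the file
  }

  /** Writes a big-endian int (byte order chosen arbitrarily for this sketch). */
  void writeInt(int v) {
    writeByte((byte) (v >>> 24));
    writeByte((byte) (v >>> 16));
    writeByte((byte) (v >>> 8));
    writeByte((byte) v);
  }

  long getFilePointer() {
    return pointer;
  }

  byte[] toByteArray() {
    return Arrays.copyOf(buf, length);
  }
}

final class BackPatchDemo {
  /** Writes [int payloadLength][payload], filling in the length after the payload is written. */
  static byte[] writeLengthPrefixed(byte[] payload) {
    SeekableBufferOutput out = new SeekableBufferOutput();
    long lengthPos = out.getFilePointer();
    out.writeInt(0); // placeholder for the length
    for (byte b : payload) {
      out.writeByte(b);
    }
    long end = out.getFilePointer();
    out.seek(lengthPos);
    out.writeInt(payload.length); // overwrite the placeholder
    out.seek(end); // resume appending at the end
    return out.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(writeLengthPrefixed(new byte[] {7, 8, 9})));
  }
}
```

Seeking back to the end after the patch matters: without it, subsequent sequential writes would overwrite the payload.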

### Option 2: Capability-Based Approach

Alternatively, add optional methods to `IndexOutput` whose default implementations throw `UnsupportedOperationException`. This is loosely analogous to how `IndexInput` exposes random access reads through `randomAccessSlice`, which returns a `RandomAccessInput`:

```java
// In IndexOutput class
public void seek(long pos) throws IOException {
  throw new UnsupportedOperationException(
    "This IndexOutput implementation does not support random access writes");
}
```
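
One consequence of this option is that callers only learn at runtime whether an output supports seeking. The sketch below illustrates that with stand-in classes rather than Lucene's own; `SketchOutput`, `AppendOnlySketchOutput`, and `CapabilityProbeDemo` are hypothetical names introduced for this example.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for IndexOutput under Option 2; not Lucene code. */
abstract class SketchOutput {
  abstract void writeByte(byte b);

  /** Optional operation: the default rejects random access, as in the snippet above. */
  void seek(long pos) {
    throw new UnsupportedOperationException(
        "This output does not support random access writes");
  }
}

/** An append-only implementation that inherits the throwing default. */
final class AppendOnlySketchOutput extends SketchOutput {
  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

  @Override
  void writeByte(byte b) {
    bytes.write(b);
  }

  int size() {
    return bytes.size();
  }
}

final class CapabilityProbeDemo {
  /** Returns true if the seek succeeded, false if the output is append-only. */
  static boolean trySeek(SketchOutput out, long pos) {
    try {
      out.seek(pos);
      return true;
    } catch (UnsupportedOperationException e) {
      return false; // caller must fall back to a sequential strategy
    }
  }

  public static void main(String[] args) {
    System.out.println("seek supported: " + trySeek(new AppendOnlySketchOutput(), 0));
  }
}
```

With Option 1 the same distinction is visible in the type system, so a codec that needs seeking can require a `RandomAccessIndexOutput` at compile time instead of probing.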

## Design Considerations

### Checksum Handling

Random access writes complicate incremental checksum computation. Proposed approaches:

1. **Range-based checksums**: The `getChecksum(startOffset, endOffset)` method allows computing checksums for specific regions
2. **Invalidation on seek**: Seeking invalidates the current checksum; callers must explicitly recompute
3. **Implementation-specific**: Leave checksum strategy to implementations (e.g., in-memory implementations could maintain full checksums)
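
To make the problem concrete, the sketch below uses the JDK's `java.util.zip.CRC32` to show a running checksum going stale after an in-place overwrite, and a range-based recomputation in the spirit of approach 1. The class and method names are hypothetical, and the byte array stands in for file contents; a real implementation would have to re-read the range from storage.

```java
import java.util.zip.CRC32;

/** Hypothetical demo of a stale incremental checksum versus a recomputed one. */
final class ChecksumAfterOverwriteDemo {

  /** CRC32 over data[start, end), analogous in spirit to getChecksum(startOffset, endOffset). */
  static long crcOfRange(byte[] data, int start, int end) {
    if (start < 0 || end > data.length || start > end) {
      throw new IllegalArgumentException("invalid range [" + start + ", " + end + ")");
    }
    CRC32 crc = new CRC32();
    crc.update(data, start, end - start);
    return crc.getValue();
  }

  /** Returns {stale running CRC, recomputed CRC} after one byte is overwritten in place. */
  static long[] staleVersusRecomputed() {
    byte[] file = {1, 2, 3, 4, 5, 6, 7, 8};
    CRC32 running = new CRC32();
    running.update(file, 0, file.length); // as if accumulated during sequential writes
    file[2] = 42; // in-place overwrite after a seek; the running CRC never sees it
    return new long[] {running.getValue(), crcOfRange(file, 0, file.length)};
  }

  public static void main(String[] args) {
    long[] values = staleVersusRecomputed();
    System.out.println("stale:      " + Long.toHexString(values[0]));
    System.out.println("recomputed: " + Long.toHexString(values[1]));
  }
}
```

The recomputation costs a full pass over the range, which is the trade-off approaches 2 and 3 try to manage by deferring or delegating it.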

### Safety Guarantees

- Random access writes should only be used during index construction, not for modifying committed segments
- Implementations should validate that seeks don't extend beyond current file length
- Thread-safety remains the caller's responsibility (consistent with existing `IndexOutput` contract)

## Alternatives Considered

1. **External libraries**: Projects could implement random access writes outside Lucene, but this creates fragmentation and prevents sharing optimizations
2. **Custom Directory implementations**: Possible but requires duplicating significant infrastructure

## Open Questions

1. Should random access writes be restricted to specific Directory implementations?
2. What are the implications for index verification and corruption detection?
3. Should there be explicit markers in the index format to indicate random-access-written files?
4. How should this interact with encryption or compression layers?

## Backward Compatibility

This proposal is intended to be backward compatible:
- Under Option 1, the existing `IndexOutput` contract is unchanged; under Option 2, `IndexOutput` gains only optional methods whose defaults throw `UnsupportedOperationException`
- New functionality is opt-in via the new class or the optional methods
- Existing `IndexOutput` implementations remain unchanged
- Existing codecs and applications that don't need random access continue using standard `IndexOutput` unchanged



[RFC] Add Random Access Write Support to IndexOutput #15420
