[RFC] Add Random Access Write Support to IndexOutput #15420

@sam-herman

Summary

This RFC proposes adding random access write capabilities to Lucene's IndexOutput interface to enable use cases that require non-sequential writes while maintaining compatibility with Lucene's existing architecture.

Motivation

Currently, Lucene's IndexOutput interface is designed exclusively for sequential, append-only writes. This design provides important benefits:

  • Immutability: Index files are written once and never modified
  • Efficient checksumming: Checksums can be computed incrementally during writes without re-reading data
  • Atomic commits: Files are finalized atomically, ensuring index consistency
  • Simplicity: Sequential writes are easier to reason about and optimize

However, there are emerging use cases where random access write capabilities would provide significant benefits:

Use Case 1: Modern Storage Optimization

Modern storage devices (SSDs, NVMe) can handle random writes efficiently and support concurrent writes to different file regions. Random access writes could enable parallel construction of index structures, which may reduce total write time.

Reference: JVector PR #542 demonstrates performance improvements from parallel writes during vector index construction.
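
To make "concurrent writes to different file regions" concrete, here is a minimal sketch using the JDK's positional FileChannel.write(ByteBuffer, long), which does not move the channel's own position and so can be called from several threads. It bypasses Lucene's Directory and IndexOutput entirely. ParallelRegionWriter is a hypothetical name, and this is not a description of how JVector PR #542 is implemented.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper for illustration; not part of Lucene or JVector. */
final class ParallelRegionWriter {

  /** Writes each chunk at its own precomputed offset, one task per chunk. */
  static void writeChunks(Path file, byte[][] chunks, int threads) throws Exception {
    // Precompute where each chunk starts so that no two tasks touch the same region.
    long[] offsets = new long[chunks.length];
    long next = 0;
    for (int i = 0; i < chunks.length; i++) {
      offsets[i] = next;
      next += chunks[i].length;
    }

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try (FileChannel channel =
        FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      List<Future<?>> pending = new ArrayList<>();
      for (int i = 0; i < chunks.length; i++) {
        final int idx = i;
        pending.add(pool.submit(() -> {
          writeFully(channel, ByteBuffer.wrap(chunks[idx]), offsets[idx]);
          return null;
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for completion and propagate any write failure
      }
    } finally {
      pool.shutdown();
    }
  }

  /** Positional write loop: a single write call may write fewer bytes than requested. */
  private static void writeFully(FileChannel channel, ByteBuffer buf, long position)
      throws IOException {
    while (buf.hasRemaining()) {
      position += channel.write(buf, position);
    }
  }
}
```

The offsets are fixed before any task starts, so the final file layout is the same regardless of the order in which the tasks complete.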

Use Case 2: Mutable Index Structures

Some applications require the ability to update index structures in-place without full rewrites:

  • Dynamic graph updates for vector search indices
  • In-place modifications during index optimization
  • Incremental updates to reduce I/O overhead

References:

Use Case 3: Graph Index Construction Optimization

When building graph-based indices (HNSW, Vamana) with inlined vectors:

  • Construction currently requires maintaining separate flat vector files for scoring
  • Random access writes would allow reusing the partially written graph index file instead
  • This would eliminate the redundant storage and I/O operations

Reference: [To be added]
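
As a rough illustration of this use case, the sketch below stores each node as a fixed-size record with its vector inlined. It reads vectors back from the partially written file for scoring and later patches neighbor lists in place. The record layout, the class name InlineGraphFile, and the use of a raw FileChannel are all assumptions made for the example; they do not correspond to any existing Lucene or JVector format.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical record layout per node: [dim floats][degree int][maxDegree neighbor ints]. */
final class InlineGraphFile {
  private final FileChannel channel; // must be opened for both READ and WRITE
  private final int dim;
  private final int maxDegree;

  InlineGraphFile(FileChannel channel, int dim, int maxDegree) {
    this.channel = channel;
    this.dim = dim;
    this.maxDegree = maxDegree;
  }

  private long vectorBytes() { return (long) dim * Float.BYTES; }
  private long recordSize() { return vectorBytes() + Integer.BYTES + (long) maxDegree * Integer.BYTES; }
  private long recordOffset(int node) { return node * recordSize(); }

  /** Writes the inlined vector at the start of the node's record. */
  void writeVector(int node, float[] vector) throws IOException {
    if (vector.length != dim) throw new IllegalArgumentException("expected " + dim + " dimensions");
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    buf.asFloatBuffer().put(vector); // the view has its own position; buf still spans all bytes
    writeFully(buf, recordOffset(node));
  }

  /** Reads a vector back from the partially written file, e.g. for scoring during construction. */
  float[] readVector(int node) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    readFully(buf, recordOffset(node));
    buf.flip();
    float[] vector = new float[dim];
    buf.asFloatBuffer().get(vector);
    return vector;
  }

  /** Overwrites the neighbor list of an already written node in place. */
  void writeNeighbors(int node, int[] neighbors) throws IOException {
    if (neighbors.length > maxDegree) throw new IllegalArgumentException("too many neighbors");
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (1 + neighbors.length));
    buf.putInt(neighbors.length);
    for (int n : neighbors) buf.putInt(n);
    buf.flip();
    writeFully(buf, recordOffset(node) + vectorBytes());
  }

  int[] readNeighbors(int node) throws IOException {
    long base = recordOffset(node) + vectorBytes();
    ByteBuffer header = ByteBuffer.allocate(Integer.BYTES);
    readFully(header, base);
    header.flip();
    int degree = header.getInt();
    ByteBuffer buf = ByteBuffer.allocate(degree * Integer.BYTES);
    readFully(buf, base + Integer.BYTES);
    buf.flip();
    int[] neighbors = new int[degree];
    buf.asIntBuffer().get(neighbors);
    return neighbors;
  }

  private void writeFully(ByteBuffer buf, long pos) throws IOException {
    while (buf.hasRemaining()) pos += channel.write(buf, pos);
  }

  private void readFully(ByteBuffer buf, long pos) throws IOException {
    while (buf.hasRemaining()) {
      int n = channel.read(buf, pos);
      if (n < 0) throw new EOFException("read past end of file at " + pos);
      pos += n;
    }
  }
}
```

Fixed-size records keep offsets computable from the node id alone, which is what lets a neighbor list be patched without rewriting the file. A variable-length layout would need an offset table instead.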

Proposed Solution

Option 1: New Interface Extending IndexOutput

Introduce a new RandomAccessIndexOutput abstract class that extends IndexOutput:

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

/**
 * An IndexOutput that supports random access writes in addition to sequential writes.
 *
 * <p>This class allows seeking to arbitrary positions within the file and writing
 * data at those positions. Implementations must handle checksum computation appropriately
 * when random access writes are performed.
 *
 * <p>Instances of this class are <b>not</b> thread-safe.
 *
 * @lucene.experimental
 */
public abstract class RandomAccessIndexOutput extends IndexOutput {

  protected RandomAccessIndexOutput(String resourceDescription, String name) {
    super(resourceDescription, name);
  }

  /**
   * Sets the file pointer to the specified position for the next write.
   *
   * @param pos the position in bytes from the beginning of the file
   * @throws IOException if an I/O error occurs
   * @throws IllegalArgumentException if pos is negative or beyond the current file length
   */
  public abstract void seek(long pos) throws IOException;

  /**
   * Forces any buffered output to be written to the underlying storage.
   *
   * @throws IOException if an I/O error occurs
   */
  public abstract void flush() throws IOException;

  /**
   * Returns the checksum of bytes in the specified range.
   * This allows computing checksums for specific regions when random access writes are used.
   *
   * @param startOffset the starting position (inclusive)
   * @param endOffset the ending position (exclusive)
   * @return the checksum value for the specified range
   * @throws IOException if an I/O error occurs
   * @throws IllegalArgumentException if the range is invalid
   */
  public abstract long getChecksum(long startOffset, long endOffset) throws IOException;
}
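
To show how the seek and range-checksum contract above might behave, here is a small in-memory sketch. It deliberately does not extend Lucene's IndexOutput so that it compiles without Lucene on the classpath. InMemoryRandomAccessOutput is a hypothetical name, and CRC32 is an assumption chosen for illustration.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Stand-alone sketch of the proposed contract; not a Lucene class. */
final class InMemoryRandomAccessOutput {
  private byte[] buffer = new byte[16];
  private int length;   // logical file length in bytes
  private int position; // where the next write lands

  void writeByte(byte b) {
    if (position == buffer.length) {
      buffer = Arrays.copyOf(buffer, buffer.length * 2);
    }
    buffer[position++] = b;
    length = Math.max(length, position); // overwrites do not grow the file
  }

  void writeBytes(byte[] src, int offset, int len) {
    for (int i = 0; i < len; i++) {
      writeByte(src[offset + i]);
    }
  }

  /** Moves the write position; seeking past the current length is rejected. */
  void seek(long pos) {
    if (pos < 0 || pos > length) {
      throw new IllegalArgumentException("pos=" + pos + " is outside [0, " + length + "]");
    }
    position = (int) pos;
  }

  long getFilePointer() { return position; }

  long length() { return length; }

  /** Checksum of [startOffset, endOffset), computed over the current contents. */
  long getChecksum(long startOffset, long endOffset) {
    if (startOffset < 0 || endOffset > length || startOffset > endOffset) {
      throw new IllegalArgumentException("invalid range [" + startOffset + ", " + endOffset + ")");
    }
    CRC32 crc = new CRC32();
    crc.update(buffer, (int) startOffset, (int) (endOffset - startOffset));
    return crc.getValue();
  }

  /** Whole-file checksum, recomputed from the contents rather than tracked incrementally. */
  long getChecksum() {
    return getChecksum(0, length);
  }
}
```

Because the bytes are held in memory, the checksum can always be recomputed from the final contents. A file-backed implementation would have to re-read the range from storage or track it some other way.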

Option 2: Capability-Based Approach

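Alternatively, add optional methods to IndexOutput whose default implementations throw UnsupportedOperationException, so that only implementations that support random access override them. This is loosely analogous to how IndexInput exposes random access reads through RandomAccessInput: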

// In IndexOutput class
public void seek(long pos) throws IOException {
  throw new UnsupportedOperationException(
    "This IndexOutput implementation does not support random access writes");
}
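
The sketch below shows how callers might deal with this capability-based variant. SketchOutput stands in for IndexOutput, and every class and method name here is hypothetical. The trySeek probe is an assumption about how callers could detect support and is not part of the proposal.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical stand-in for IndexOutput under Option 2; not Lucene's actual class. */
abstract class SketchOutput {
  abstract void writeByte(byte b);

  /** Throwing default, as in the snippet above: implementations opt in by overriding. */
  void seek(long pos) {
    throw new UnsupportedOperationException(
        "This output does not support random access writes");
  }
}

/** Append-only implementation that inherits the throwing default. */
final class SequentialOnlyOutput extends SketchOutput {
  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

  @Override
  void writeByte(byte b) {
    bytes.write(b);
  }

  byte[] toByteArray() {
    return bytes.toByteArray();
  }
}

/** Seek-capable implementation backed by a fixed-capacity array. */
final class SeekableOutput extends SketchOutput {
  private final byte[] data;
  private int length;
  private int position;

  SeekableOutput(int capacity) {
    data = new byte[capacity];
  }

  @Override
  void writeByte(byte b) {
    data[position++] = b;
    length = Math.max(length, position);
  }

  @Override
  void seek(long pos) {
    if (pos < 0 || pos > length) {
      throw new IllegalArgumentException("pos=" + pos + " is outside [0, " + length + "]");
    }
    position = (int) pos;
  }

  byte[] toByteArray() {
    return Arrays.copyOf(data, length);
  }
}

final class Capabilities {
  /** Returns true if the seek succeeded, false if the output is append-only. */
  static boolean trySeek(SketchOutput out, long pos) {
    try {
      out.seek(pos);
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }
}
```

One trade-off this makes visible: with Option 2, lack of support only surfaces at runtime, whereas Option 1 lets a codec require a RandomAccessIndexOutput in its method signatures and fail at compile time.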

Design Considerations

Checksum Handling

Random access writes complicate incremental checksum computation. Proposed approaches:

  1. Range-based checksums: The getChecksum(startOffset, endOffset) method allows computing checksums for specific regions
  2. Invalidation on seek: Seeking invalidates the current checksum; callers must explicitly recompute
  3. Implementation-specific: Leave checksum strategy to implementations (e.g., in-memory implementations could maintain full checksums)
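
Approach 2 could look roughly like the following sketch, in which a running CRC32 is marked invalid on the first seek and must be rebuilt from the final contents before it can be read again. InvalidatingChecksum and its method names are hypothetical, and the choice of CRC32 is an assumption made for illustration.

```java
import java.util.zip.CRC32;

/** Hypothetical tracker for the "invalidation on seek" strategy. */
final class InvalidatingChecksum {
  private final CRC32 crc = new CRC32();
  private boolean valid = true;

  /** Called for every write; only updates the running value while it is still meaningful. */
  void onWrite(byte[] bytes, int offset, int len) {
    if (valid) {
      crc.update(bytes, offset, len);
    }
  }

  /** Called for every seek; after this the running value no longer matches the file contents. */
  void onSeek() {
    valid = false;
  }

  boolean isValid() {
    return valid;
  }

  /** Returns the checksum, or fails if a seek happened and no recomputation followed. */
  long getValue() {
    if (!valid) {
      throw new IllegalStateException("checksum invalidated by seek; recompute first");
    }
    return crc.getValue();
  }

  /** Rebuilds the checksum from the final contents, which requires re-reading the data. */
  void recompute(byte[] fileContents) {
    crc.reset();
    crc.update(fileContents, 0, fileContents.length);
    valid = true;
  }
}
```

The recompute step is where the cost moves: purely sequential writers keep the cheap incremental path, while any writer that seeks pays for one extra pass over the data.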

Safety Guarantees

  • Random access writes should only be used during index construction, not for modifying committed segments
  • Implementations should validate that seeks don't extend beyond current file length
  • Thread-safety remains the caller's responsibility (consistent with existing IndexOutput contract)

Alternatives Considered

  1. External libraries: Projects could implement random access writes outside Lucene, but this creates fragmentation and prevents sharing optimizations
  2. Custom Directory implementations: Possible but requires duplicating significant infrastructure

Open Questions

  1. Should random access writes be restricted to specific Directory implementations?
  2. What are the implications for index verification and corruption detection?
  3. Should there be explicit markers in the index format to indicate random-access-written files?
  4. How should this interact with encryption or compression layers?

Backward Compatibility

This proposal is intended to be backward compatible:

  • Option 1 leaves the existing IndexOutput contract untouched; Option 2 only adds methods whose defaults throw UnsupportedOperationException
  • New functionality is opt-in via the new class or the optional methods
  • Existing IndexOutput implementations remain unchanged
  • Existing codecs and applications that don't need random access continue using standard IndexOutput unchanged
