How to read a single row group batch, given only the row group bytes and the schema bytes #804

@keller-mark

Use case

I want to use readParquet to read data from a single row group, without loading the full parquet file's bytes.

readParquetStream does not cover this use case, for two reasons: 1) I need random access to row groups (e.g., only the Kth row group, rather than all of them in order), and 2) I want to keep low-level control over how the parquet bytes are fetched (e.g., they do not always come from a URL; they may come from a local file or over some other protocol).

I tried manually concatenating the row group bytes with the schema bytes and passing the result to readParquet or to arrow-js-ffi's parseRecordBatch, but neither worked. I assume the footer bytes contain references to byte positions/offsets that are no longer valid once the bytes are concatenated this way.
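
To illustrate the offset assumption, here is a minimal sketch of how the footer is located in an unencrypted parquet file, based on the file layout in the Parquet format specification: the file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, and the column chunk offsets stored in that footer are absolute positions in the original file. The `locateFooter` function and `FooterRange` type are hypothetical names for this example only; they are not part of parquet-wasm or arrow-js-ffi.

```typescript
// Layout: "PAR1" | row groups ... | footer metadata | 4-byte LE footer length | "PAR1"
const MAGIC = "PAR1";

// Hypothetical helper type: byte range of the footer metadata within the file.
interface FooterRange {
  start: number; // inclusive
  end: number;   // exclusive
}

// `tail` holds the last bytes of the file (at least 8); `fileSize` is the full file length.
function locateFooter(tail: Uint8Array, fileSize: number): FooterRange {
  const n = tail.length;
  if (n < 8) {
    throw new Error("need at least the last 8 bytes of the file");
  }
  // The file must end with the ASCII magic "PAR1".
  for (let i = 0; i < 4; i++) {
    if (tail[n - 4 + i] !== MAGIC.charCodeAt(i)) {
      throw new Error("missing trailing PAR1 magic");
    }
  }
  // The 4 bytes before the magic give the footer metadata length (little-endian).
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  const footerLength = view.getUint32(n - 8, true);
  const end = fileSize - 8;
  const start = end - footerLength;
  if (start < 4) {
    throw new Error("footer length exceeds file size");
  }
  return { start, end };
}
```

Because the offsets inside the footer refer to positions in the full file, a buffer made of only one row group plus the footer would place that row group at a different position than the footer expects, which is consistent with the failure described above.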
