Write data streaming to a parquet file #542

@andresgutgon

What?

Hi, we're currently using @dsnp/parquetjs to write Parquet files in Node, but it is a fork of an old package and doesn't look well maintained.

So I came across this repo, which looks very active, but it's not clear to me whether we can do what we're doing now with parquet-wasm. Maybe you can help me understand.

What do we want to do?

We want to iterate over a huge PostgreSQL table with a cursor, so that we receive the rows in batches and store them in a Parquet file.

So I was wondering whether that's possible with parquet-wasm: handle the data as a stream and, at the end, save the file to disk.

This is how we do it with @dsnp/parquetjs:

import { ParquetWriter } from '@dsnp/parquetjs'

const BATCH_SIZE = 4096
const SQL_QUERY = 'SELECT * FROM users'
const FILE_PATH = '/path/to/file.parquet'
async function writeParquet(): Promise<string> {
  return new Promise<string>((resolve) => {
    // Created lazily on the first batch, once the column fields are known.
    let writer: ParquetWriter | undefined
    // The details of the source don't matter here.
    // `batchQuery` iterates a pg cursor and calls `onBatch`
    // with N rows for each batch.
    OUR_POSTGREST_DB.batchQuery(SQL_QUERY, {
      batchSize: BATCH_SIZE,
      onBatch: async (batch) => {
        if (!writer) {
          // `buildParquetSchema`, `size` and `ROW_GROUP_SIZE` come from the
          // surrounding class, which is not shown here.
          const schema = this.buildParquetSchema(batch.fields)
          writer = await ParquetWriter.openFile(schema, FILE_PATH, {
            rowGroupSize: Math.max(size, ROW_GROUP_SIZE),
          })
        }

        for (const row of batch.rows) {
          // I think this doesn't write to the Parquet file right away;
          // it accumulates as many rows as defined in `rowGroupSize`.
          await writer.appendRow(row)
        }

        if (batch.lastBatch) {
          await writer.close()
          resolve(FILE_PATH)
        }
      },
    })
  })
}
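
To make the buffering behaviour I'm describing in the `appendRow` comment concrete, here is a minimal sketch of what I assume happens: rows are collected in memory and handed off as one row group once `rowGroupSize` is reached, with a final short group on close. `RowGroupBuffer` and `flushGroup` are hypothetical names for illustration only; they are not part of @dsnp/parquetjs or parquet-wasm, and a real writer would encode each group and write it to the file instead of just recording its size.

```typescript
// Hypothetical helper for illustration; not an API of any Parquet library.
type Row = Record<string, unknown>

class RowGroupBuffer {
  private rows: Row[] = []
  private readonly rowGroupSize: number
  private readonly flushGroup: (group: Row[]) => void

  constructor(rowGroupSize: number, flushGroup: (group: Row[]) => void) {
    if (rowGroupSize <= 0) throw new Error('rowGroupSize must be positive')
    this.rowGroupSize = rowGroupSize
    // A real writer would encode the group and write it to disk here.
    this.flushGroup = flushGroup
  }

  // Buffer one row and flush as soon as a full row group is available.
  appendRow(row: Row): void {
    this.rows.push(row)
    if (this.rows.length >= this.rowGroupSize) this.flush()
  }

  // Flush whatever is left, which may be a shorter final group.
  close(): void {
    if (this.rows.length > 0) this.flush()
  }

  private flush(): void {
    const group = this.rows
    this.rows = []
    this.flushGroup(group)
  }
}

// Usage: 10 rows arriving in batches of 4, grouped into row groups of 3.
const groupSizes: number[] = []
const buffer = new RowGroupBuffer(3, (group) => groupSizes.push(group.length))
const allRows: Row[] = Array.from({ length: 10 }, (_, id) => ({ id }))
for (let start = 0; start < allRows.length; start += 4) {
  for (const row of allRows.slice(start, start + 4)) buffer.appendRow(row)
}
buffer.close()
console.log(groupSizes) // [3, 3, 3, 1]
```

The point is that the batch size from the cursor and the row group size are independent, so memory usage is bounded by `rowGroupSize` rather than by the size of the table. That is the property I'd like to keep if we move to parquet-wasm.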

Thanks for the help!
