What?
Hi, we're currently using @dsnp/parquetjs to write Parquet files in Node. But it's a fork of an old package and doesn't look well maintained.
So I came across this repo, which looks very active, but it's not clear to me whether we can do what we're doing now with parquet-wasm. Maybe you can help me understand.
What do we want to do?
We want to iterate over a huge PostgreSQL table with a cursor, so we get batches of rows that we then want to write to a Parquet file.
So I was wondering whether that's possible with parquet-wasm: handle the data as a stream and, at the end, save the file to disk.
This is how we do it with @dsnp/parquetjs:
import { ParquetWriter } from '@dsnp/parquetjs'

const BATCH_SIZE = 4096
const ROW_GROUP_SIZE = 4096
const SQL_QUERY = 'SELECT * FROM users'

async function writeParquet(): Promise<string> {
  return new Promise<string>((resolve) => {
    const filePath = '/path/to/file.parquet'
    let writer: ParquetWriter | undefined
    // The details of batchQuery don't matter: it runs a pg cursor
    // iteration and we receive N rows per batch in the `onBatch` callback
    OUR_POSTGREST_DB.batchQuery(SQL_QUERY, {
      batchSize: BATCH_SIZE,
      onBatch: async (batch) => {
        if (!writer) {
          // Our helper that maps the pg field list to a parquetjs schema
          const schema = buildParquetSchema(batch.fields)
          writer = await ParquetWriter.openFile(schema, filePath, {
            rowGroupSize: Math.max(batch.rows.length, ROW_GROUP_SIZE),
          })
        }
        for (const row of batch.rows) {
          // I think this doesn't write to the file yet, but accumulates
          // rows until `rowGroupSize` is reached
          await writer.appendRow(row)
        }
        if (batch.lastBatch) {
          await writer.close()
          resolve(filePath)
        }
      },
    })
  })
}

Thanks for the help!
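For reference, the accumulate-then-flush pattern in the snippet above (buffer rows until a row group is full, then write it out) can be sketched with no Parquet library at all. `fetchBatches` stands in for the pg cursor and `RowGroupBuffer` stands in for a Parquet writer's row-group buffering; both names are hypothetical, purely for illustration:

```typescript
type Row = Record<string, unknown>

// Simulates a cursor yielding batches of rows from a large table.
async function* fetchBatches(
  totalRows: number,
  batchSize: number,
): AsyncGenerator<Row[]> {
  for (let offset = 0; offset < totalRows; offset += batchSize) {
    const n = Math.min(batchSize, totalRows - offset)
    yield Array.from({ length: n }, (_, i) => ({ id: offset + i }))
  }
}

// Accumulates rows and "flushes" a row group whenever the threshold is
// hit, mirroring what rowGroupSize does in parquetjs.
class RowGroupBuffer {
  private rows: Row[] = []
  flushed = 0
  constructor(private rowGroupSize: number) {}
  append(row: Row): void {
    this.rows.push(row)
    if (this.rows.length >= this.rowGroupSize) this.flush()
  }
  private flush(): void {
    if (this.rows.length === 0) return
    this.flushed += 1 // a real writer would encode and write bytes here
    this.rows = []
  }
  close(): void {
    this.flush() // write any trailing partial row group
  }
}

async function run(): Promise<number> {
  const buffer = new RowGroupBuffer(4096)
  for await (const batch of fetchBatches(10_000, 1000)) {
    for (const row of batch) buffer.append(row)
  }
  buffer.close()
  return buffer.flushed // 10 000 rows at 4096/group → 3 row groups
}
```

The point of the sketch is that the writer, not the cursor, decides when bytes hit the disk; whatever library sits behind `RowGroupBuffer` just needs an append-style API plus a final close.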