Skip to content

Reading data from a Parquet file without the fuss #49

@sfkeller

Description

@sfkeller

How about reading data from a Parquet file without the fuss (inspired by https://duckdb.org/docs/guides/file_formats/parquet_import )?

Below a very sketchy, untested script how to implement such a thing. I actually could only guess what parquet.schema returns, since there's no docs yet about this (see #48).

The resulting table product_example2 from reading product_example.parquet (as defined in https://github.com/CrunchyData/pg_parquet) should be created without having created table product_example2 before hand:

CREATE OR REPLACE FUNCTION import_foreign_schema_from_parquet(uri TEXT, table_name TEXT DEFAULT 'parquet_table') RETURNS VOID AS $$
DECLARE
    parquet_info RECORD;
    columns TEXT := '';
BEGIN
    -- Loop through the metadata returned by parquet.metadata function
    FOR parquet_info IN EXECUTE 'SELECT column_name, type_name FROM parquet.metadata(' || quote_literal(uri) || ')' LOOP
        -- Append column name and type to the columns string
        columns := columns || quote_ident(parquet_info.column_name) || ' ' || 
                   CASE parquet_info.type_name
                       WHEN 'INT32' THEN 'INTEGER'
                       WHEN 'INT64' THEN 'BIGINT'
                       WHEN 'FLOAT' THEN 'REAL'
                       WHEN 'DOUBLE' THEN 'DOUBLE PRECISION'
                       WHEN 'BOOLEAN' THEN 'BOOLEAN'
                       WHEN 'UTF8' THEN 'TEXT'
                       WHEN 'LIST' THEN 'ANYELEMENT[]' 
                       WHEN 'TIMESTAMP_MICROS' THEN 'TIMESTAMP'
                       WHEN 'TIMESTAMP_MILLIS' THEN 'TIMESTAMPTZ'
                       ELSE 'TEXT'  -- Default to TEXT for unknown types
                   END || ', ';
    END LOOP;

    -- Remove trailing comma and space
    columns := RTRIM(columns, ', ');

    -- Execute the CREATE TABLE statement with the generated columns
    EXECUTE 'CREATE TABLE ' || quote_ident(table_name) || ' (' || columns || ');';
END;
$$ LANGUAGE plpgsql;

-- Import foreign "schema" (i.e. table definition) from Parquet - creates table "product_example2":
SELECT import_foreign_schema_from_parquet('/tmp/product_example.parquet', 'product_example2');

-- Copy the Parquet file to the newly created table product_example2: 
COPY product_example2 FROM '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- Show table product_example2 - should have the same structure as the original table product_example:
SELECT * FROM product_example2
UNION ALL
SELECT * FROM product_example;

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions