@yohplala

Replaces PR #953
I am very sorry, I made a mess of the commit history, so I restarted from scratch.

Text from PR #953 applies:

PR aiming to solve #949

The solution implemented is an "at read time" solution. Step by step:

  • a new ParquetFile attribute, global_cats, has been created to store categorical values,
  • when reading a row group (with ParquetFile.read_row_group_file), this global attribute is passed down to read_col,
  • read_col uses this global attribute and populates it with the new categorical values encountered as successive row groups are read,
  • whenever new values are found (or existing values whose codes are inconsistent with previous row groups), it creates a remapping table specific to this row group and uses it to update the column values to the correct codes.
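
The remapping step above can be sketched as follows. This is a minimal illustration only, not the code in this PR: the function name remap_codes, its signature, and the use of a plain Python list for the global categories are all assumptions made for the example.

```python
import numpy as np


def remap_codes(codes, local_cats, global_cats):
    """Translate row-group-local category codes into global codes.

    Hypothetical helper: `global_cats` is a plain list shared across
    row groups and extended in place with values not seen before.
    """
    # Position of every value already known globally.
    index = {value: i for i, value in enumerate(global_cats)}
    # Remapping table for this row group: local code -> global code.
    remap = np.empty(len(local_cats), dtype=np.int64)
    for local_code, value in enumerate(local_cats):
        if value not in index:
            index[value] = len(global_cats)
            global_cats.append(value)
        remap[local_code] = index[value]
    codes = np.asarray(codes)
    # Keep -1 (the pandas convention for a missing value) untouched.
    out = np.full(codes.shape, -1, dtype=np.int64)
    valid = codes >= 0
    out[valid] = remap[codes[valid]]
    return out
```

In this sketch, a second row group whose dictionary lists the same values in a different order, or adds new ones, ends up with codes that agree with the first row group.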

Additional modifications are:

  • when slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. A future modification could reuse it to retrieve categorical values, but after slicing fewer categorical values would remain relevant.
    In any case, the next read_row_group_file operation repopulates it with the right values,
  • global_cats has been added to __getstate__ to ensure correct pickling,
  • 2 test cases have been provided, testing 1 or 2 categorical columns, appending up to 3 times, using categorical strings and integers.
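
The reset-on-slice and pickling behaviour described above can be sketched with a toy class. RowGroupView is a hypothetical stand-in, not fastparquet's ParquetFile; only the attribute name global_cats comes from this PR, and storing it as a dict keyed by column name is an assumption.

```python
class RowGroupView:
    """Toy stand-in for a file object holding row groups."""

    def __init__(self, row_groups, global_cats=None):
        self.row_groups = list(row_groups)
        # Cache of categorical values collected while reading.
        self.global_cats = {} if global_cats is None else global_cats

    def __getitem__(self, item):
        selected = self.row_groups[item]
        if not isinstance(item, slice):
            selected = [selected]
        # The sliced view starts with an empty cache; it would be
        # repopulated the next time row groups are read.
        return RowGroupView(selected)

    def __getstate__(self):
        # Include the cache so that it survives pickling.
        return {"row_groups": self.row_groups,
                "global_cats": self.global_cats}

    def __setstate__(self, state):
        self.row_groups = state["row_groups"]
        self.global_cats = state["global_cats"]
```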

Finally, datapage v2 CI workflow now runs also.

@martindurant
Member

datapage v2 CI workflow now runs also

Well done!

@yohplala
Author

@martindurant
A quick word: I am still working on this PR. A new test case I am writing shows that it does not work when row filtering is combined with nulls. I am investigating.

@martindurant
Member

@yohplala , you might be interested in looking at the progress in kylebarron/arro3#313 , to see to what extent arro3 can meet your requirements, and maybe enumerate what fastparquet can do that that package still cannot.

@yohplala
Author

@yohplala , you might be interested in looking at the progress in kylebarron/arro3#313 , to see to what extent arro3 can meet your requirements, and maybe enumerate what fastparquet can do that that package still cannot.

Thanks Martin, I will be happy to review.
I will first focus on the ongoing fixes in this PR and PR #956. I also have a feature I would like to work on and then propose: implementing pf = ParquetFile.create_empty(fn, file_scheme, partition_on), from which we could then call pf.write_row_groups() and/or mutate with the other methods that are already available.
I think this proposal is not far off, and I would like to keep some time for it. Then yes, I will be happy to check out this new project.
