Evolution #818

Merged
merged 15 commits into from
Nov 14, 2022


martindurant
Member

@martindurant martindurant commented Nov 2, 2022

Starts to fix #817

Handles

  • passing in an explicit set of dtypes in to_pandas
  • picking the most recent schema from the files of a dataset (where there is no global schema) by looking for new columns. Does not handle columns whose types have been upcast.

(NB: if there is a [_common]_metadata file, other schemas will not be probed, on the assumption
that this file reflects the most recent situation.)

So far, we only handle additional fields, and specifying explicit dtypes, not
auto casting.
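
To illustrate the behaviour described above, here is a minimal pure-Python sketch. It is not the code in this PR: the names `evolved_schema`, `apply_overrides` and `conform_row`, and the representation of a schema as a dict of column name to dtype name, are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical representation: column name -> dtype name.
Schema = Dict[str, str]


def evolved_schema(file_schemas: List[Schema],
                   common: Optional[Schema] = None) -> Schema:
    """Combine per-file schemas (oldest first) into one evolved schema."""
    if common is not None:
        # A global schema is trusted as-is; per-file schemas are not probed.
        return dict(common)
    merged: Schema = {}
    for schema in file_schemas:
        for name, dtype in schema.items():
            # Only new columns are picked up. A column whose dtype changed
            # keeps the first dtype seen: no upcasting is attempted.
            merged.setdefault(name, dtype)
    return merged


def apply_overrides(schema: Schema, dtypes: Optional[Schema] = None) -> Schema:
    """Explicitly supplied dtypes take precedence over the detected ones."""
    return {**schema, **(dtypes or {})}


def conform_row(row: dict, schema: Schema) -> dict:
    """Pad a row from an older file with None for columns it lacks."""
    return {name: row.get(name) for name in schema}


old = {"id": "int64", "name": "object"}
new = {"id": "int64", "name": "object", "score": "float64"}

schema = evolved_schema([old, new])
padded = conform_row({"id": 1, "name": "a"}, schema)
overridden = apply_overrides(schema, {"score": "float32"})
```

Here `schema` gains the `score` column from the newer file, `padded` carries `None` for `score`, and `overridden` reports `float32` for `score` because the explicit dtype wins.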
@martindurant
Member Author

@yohplala, I thought you would find this idea and implementation interesting (I am not asking for you to do anything, just for your curiosity)

@martindurant
Member Author

@rjzamora, this is a really useful new feature! I could use just a little help integrating with dask. The two failing tests could be simplified so that they pass both with and without this change, but I wouldn't mind your thoughts.

Secondly, doing this also revealed that row-filtering, which fastparquet supports at the ParquetFile API level (in .to_pandas()), cannot be used from dask, which calls fastparquet.core functions directly. I'm not sure what to do about that, short of duplicating the relevant code.
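
For readers unfamiliar with the distinction, the sketch below contrasts skipping whole row groups with filtering individual rows. It is a pure-Python stand-in under stated assumptions, not fastparquet or dask code: the in-memory `groups` structure, the `read` function and its `row_filter` flag are hypothetical, and row groups are assumed to be non-empty.

```python
import operator
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory stand-ins for the example.
RowGroup = Dict[str, List[Any]]   # column name -> values
Filter = Tuple[str, str, Any]     # (column, operator, value)

OPS = {">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le, "==": operator.eq}


def group_may_match(group: RowGroup, flt: Filter) -> bool:
    """Coarse check from the column's min/max, as statistics would allow."""
    col, op, val = flt
    lo, hi = min(group[col]), max(group[col])
    if op == ">":
        return hi > val
    if op == ">=":
        return hi >= val
    if op == "<":
        return lo < val
    if op == "<=":
        return lo <= val
    return lo <= val <= hi  # "=="


def read(groups: List[RowGroup], filters: List[Filter],
         row_filter: bool = False) -> List[dict]:
    """Skip row groups that cannot match; optionally filter rows as well."""
    rows = []
    for group in groups:
        if not all(group_may_match(group, f) for f in filters):
            continue  # the whole row group is skipped
        n = len(next(iter(group.values())))
        for i in range(n):
            row = {col: values[i] for col, values in group.items()}
            if row_filter and not all(OPS[op](row[c], v) for c, op, v in filters):
                continue  # row-level filtering drops individual rows
            rows.append(row)
    return rows


groups = [{"x": [1, 2, 3]}, {"x": [4, 5, 6]}, {"x": [7, 8, 9]}]
coarse = [r["x"] for r in read(groups, [("x", ">", 4)])]
exact = [r["x"] for r in read(groups, [("x", ">", 4)], row_filter=True)]
```

With row-group filtering only, `coarse` still contains the value 4, because its row group as a whole may match; `exact` drops it. The second step is the part that, per the comment above, is only reachable through the higher-level API.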

@martindurant martindurant merged commit 647bccf into dask:main Nov 14, 2022
@martindurant martindurant deleted the evolution branch November 14, 2022 22:01
@martindurant martindurant mentioned this pull request Nov 17, 2022
Closed
Development

Successfully merging this pull request may close these issues.

Schema evolution