Add Dask DataFrame options to loader and materializer configs #2809
Conversation
Branch force-pushed from 92db81d to 6a7e7f5.
My main concern here is that this is a breaking change for current users. We avoid breaking changes as best we can outside of larger version bumps, the next being 0.10.0, which won't happen for a few months.
What are your thoughts on introducing these new `read` and `to` keys as sibling Selector keys in the current schema to allow for back compat?
I think that's reasonable. This PR could mark the current flat keys as deprecated and introduce the new `read` and `to` keys alongside them.
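The back-compat idea discussed above could be sketched as a normalizer that accepts both the deprecated flat config and the new nested `read` key. The helper and the option key names below are hypothetical, not Dagster's actual API:

```python
# Hypothetical utility-option keys that would sit beside `read` in the schema.
READ_OPTION_KEYS = {"sample", "repartition", "lower"}

def normalize_loader_config(config):
    """Return a config with the source spec nested under a `read` key."""
    if "read" in config:
        return config  # already in the new nested form
    # Deprecated flat form: anything that isn't a utility option is the source spec.
    source = {k: v for k, v in config.items() if k not in READ_OPTION_KEYS}
    options = {k: v for k, v in config.items() if k in READ_OPTION_KEYS}
    return {"read": source, **options}

# Both spellings normalize to the same nested shape.
old_style = {"parquet": {"path": "/tmp/data.parquet"}}  # illustrative path
new_style = {"read": {"parquet": {"path": "/tmp/data.parquet"}}}
assert normalize_loader_config(old_style) == new_style
```

With something like this in place, existing flat configs would keep working while emitting a deprecation warning, and the flat keys could be dropped in 0.10.0.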
The DataFrame loader configs currently specify where to load data from. Place these under a key named `from` to make room for additional loader options.
Add options for performing some common operations on loaded dataframes, such as sampling rows, repartitioning the dataframe, and lowercasing column names.
The DataFrame materializer configs currently specify where to write data to. Place these under a key named `to` to make room for additional materializer options.
Modify the DataFrame materializer to allow materialization in multiple formats.
Add options for performing some common operations on materialized dataframes, such as sampling rows, repartitioning the dataframe, and resetting the index.
This was intended to be `read`, not `from`, matching the underlying function names like `read_parquet`.
Allow any combination of materializer types to be specified, with no specific type required.
Branch force-pushed from 6a7e7f5 to 91d4619.
I've split out just the portion of this PR that moves the options under the `read` and `to` keys.
This PR modifies the loader and materializer for the Dask DataFrame DagsterType to perform some common utility operations, such as sampling or repartitioning, on dataframes. This lets pipelines easily adjust loaded and stored dataframes without introducing additional solids or custom code.
To accomplish this, the existing configs for specifying where to read dataframes from or write dataframes to are placed under `read` and `to` keys.

Before:
After:
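The original Before/After snippets were not captured above. As a hedged reconstruction (the path and exact selector nesting are illustrative, not taken from the PR), expressed as Python dicts:

```python
# Before: the loader config is a flat selector naming the source format.
before = {"parquet": {"path": "/data/trips.parquet"}}  # hypothetical path

# After: the same selector moves under a `read` key, leaving room for
# sibling option keys such as `sample` or `repartition`.
after = {"read": {"parquet": {"path": "/data/trips.parquet"}}}

# The source spec itself is unchanged; only its position in the schema moves.
assert after["read"] == before
```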
Then, additional config keys can be added to perform common operations on dataframes being read or written. For example, the following config samples 0.1% of the rows in the dataset and repartitions the dataframe to one partition on load.
Example:
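The example config was not captured above. A sketch of what it would look like, with illustrative option key names and the Dask DataFrame calls they would map to shown in comments:

```python
# Sample 0.1% of the rows and repartition to one partition on load.
loader_config = {
    "read": {"parquet": {"path": "/data/trips.parquet"}},  # hypothetical path
    "sample": {"frac": 0.001},         # -> df.sample(frac=0.001)
    "repartition": {"npartitions": 1}  # -> df.repartition(npartitions=1)
}

# 0.1% of rows:
assert loader_config["sample"]["frac"] == 1 / 1000
```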
In this implementation, the possible options are:
Read: sampling rows, repartitioning the dataframe, and lowercasing column names.
Write: sampling rows, repartitioning the dataframe, and resetting the index.
Additionally, the materializer now allows multiple output formats.
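With any combination of materializer types allowed, a single config could request several formats at once. A hypothetical sketch of the shape and the dispatch it implies (the real implementation would call Dask's `to_parquet` / `to_csv` methods):

```python
# One materializer config, two output formats.
materializer_config = {
    "to": {
        "parquet": {"path": "/out/trips.parquet"},  # illustrative paths
        "csv": {"path": "/out/trips-*.csv"},
    }
}

def materialization_calls(config):
    """List the (method_name, kwargs) pairs the materializer would invoke."""
    return [(f"to_{fmt}", opts) for fmt, opts in config["to"].items()]

calls = materialization_calls(materializer_config)
assert [name for name, _ in calls] == ["to_parquet", "to_csv"]
```

Because no specific type is required, an empty `to` dict would simply produce no writes, and each format entry carries its own destination options.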