
Add Dask DataFrame options to loader and materializer configs #2809

Closed
wants to merge 11 commits

Conversation


@kinghuang kinghuang commented Aug 11, 2020

This PR modifies the loader and materializer for the Dask DataFrame DagsterType to perform common utility operations on DataFrames, such as sampling or repartitioning. This allows pipelines to easily adjust loaded and stored DataFrames without introducing additional solids or custom code.

To accomplish this, the existing configs for specifying where to read DataFrames from or write them to are placed under read and to keys.

Before:

solids:
  example:
    inputs:
      dataframe:
        parquet:
          path: s3://some_bucket/some_path

After:

solids:
  example:
    inputs:
      dataframe:
        read:
          parquet:
            path: s3://some_bucket/some_path

Then, additional config keys can be added to perform common operations on dataframes being read or written. For example, the following config samples 0.1% of the rows in the dataset and repartitions the dataframe to one partition on load.

Example:

solids:
  example:
    inputs:
      dataframe:
        read:
          parquet:
            path: s3://some_bucket/some_path
          sample: 0.001
          repartition:
            npartitions: 1

In this implementation, the possible options are:

Read:

  • sample
  • repartition
  • lower_cols (lowercase column names)
  • reset_index

Write:

  • sample
  • repartition
  • reset_index

Additionally, the materializer now allows multiple output formats.
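Hypothetically (the PR text shows only the read-side config), a materializer config mirroring the loader's shape could then list more than one format under the to key at once; the outputs/result nesting below is assumed for illustration, not taken from the PR:

solids:
  example:
    outputs:
      - result:
          to:
            parquet:
              path: s3://some_bucket/some_path
            csv:
              path: s3://some_bucket/some_path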

@alangenfeld alangenfeld (Member) left a comment

My main concern here is that this is a breaking change for current users. We avoid breaking changes as best we can outside of larger version bumps, the next being 0.10.0, which won't happen for a few months.

What are your thoughts on introducing these new read and to keys as sibling Selector keys in the current schema to allow for back compat?

@kinghuang (Contributor, Author)
What are your thoughts on introducing these new read and to keys as sibling Selector keys in the current schema to allow for back compat?

I think that's reasonable. This PR could mark the current flat keys as deprecated and introduce the read and to keys as replacements.
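The back-compat behaviour being discussed — accept either the legacy flat format keys or the new nested read key — can be sketched in plain Python. This is not Dagster's config API (which would use Selector fields as suggested above); the function and key names are illustrative:

```python
# Formats the loader recognizes (illustrative subset).
FORMATS = {"parquet", "csv", "json"}

def normalize_loader_config(config: dict) -> dict:
    """Return the config in the new nested shape, wrapping legacy flat keys."""
    if "read" in config:
        # already using the new nested key
        return config
    legacy = {k: v for k, v in config.items() if k in FORMATS}
    if legacy:
        # legacy flat keys: wrap them under "read" (the deprecated path)
        return {"read": legacy}
    raise ValueError("no recognized format key in loader config")

print(normalize_loader_config({"parquet": {"path": "s3://bucket/key"}}))
# {'read': {'parquet': {'path': 's3://bucket/key'}}}
```

Keeping both shapes valid in one schema is what lets existing pipeline configs continue to pass validation until the flat keys are removed in a larger version bump.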

Commits:

  • The DataFrame loader configs currently specify where to load data from. Place these under a key named “from” to make room for additional loader options.
  • Add options for performing some common operations on loaded dataframes, such as sampling rows, repartitioning the dataframe, and lowercasing column names.
  • The DataFrame materializer configs currently specify where to write data to. Place these under a key named “to” to make room for additional materializer options.
  • Modify the DataFrame materializer to allow materialization in multiple formats.
  • Add options for performing some common operations on materialized dataframes, such as sampling rows, repartitioning the dataframe, and resetting the index.
  • This was intended to be read, not “from”, matching the underlying function names like read_parquet.
  • Allow any combination of materializer types to be specified, with no specific type required.
@kinghuang (Contributor, Author)

I've split out just the portion of this PR that moves the options under read and to into #2821. I'll follow up with a separate PR for the rest.
