
Add Dask DataFrame options to loader and materializer configs #2809

Closed
wants to merge 11 commits

Conversation


@kinghuang kinghuang commented Aug 11, 2020

This PR modifies the loader and materializer for the Dask DataFrame DagsterType to perform common utility operations on DataFrames, such as sampling or repartitioning. This allows pipelines to easily adjust loaded and stored DataFrames without introducing additional solids or custom code.

To accomplish this, the existing configs for specifying where to read DataFrames from or write them to are placed under read and to keys.

Before:

solids:
  example:
    inputs:
      dataframe:
        parquet:
          path: s3://some_bucket/some_path

After:

solids:
  example:
    inputs:
      dataframe:
        read:
          parquet:
            path: s3://some_bucket/some_path

Then, additional config keys can be added to perform common operations on dataframes being read or written. For example, the following config samples 0.1% of the rows in the dataset and repartitions the dataframe to one partition on load.

Example:

solids:
  example:
    inputs:
      dataframe:
        read:
          parquet:
            path: s3://some_bucket/some_path
          sample: 0.001
          repartition:
            npartitions: 1

In this implementation, the possible options are:

Read:

  • sample
  • repartition
  • lower_cols (lowercase column names)
  • reset_index

Write:

  • sample
  • repartition
  • reset_index

Additionally, the materializer now allows multiple output formats.
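Hypothetically (the PR text shows only the read-side config), a materializer config mirroring the loader's shape could then list more than one format under the to key at once; the outputs/result nesting below is assumed for illustration, not taken from the PR:

solids:
  example:
    outputs:
      - result:
          to:
            parquet:
              path: s3://some_bucket/some_path
            csv:
              path: s3://some_bucket/some_path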

@alangenfeld alangenfeld (Member) left a comment

My main concern here is that this is a breaking change for current users. We avoid breaking changes as best we can outside of larger version bumps, the next being 0.10.0, which won't happen for a few months.

What are your thoughts on introducing these new read and to keys as sibling Selector keys in the current schema to allow for back compat?

@kinghuang (Contributor, Author)
What are your thoughts on introducing these new read and to keys as sibling Selector keys in the current schema to allow for back compat?

I think that's reasonable. This PR could mark the current flat keys as deprecated and introduce the read and to keys as replacements.
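The back-compat behaviour being discussed — accept either the legacy flat format keys or the new nested read key — can be sketched in plain Python. This is not Dagster's config API (which would use Selector fields as suggested above); the function and key names are illustrative:

```python
# Formats the loader recognizes (illustrative subset).
FORMATS = {"parquet", "csv", "json"}

def normalize_loader_config(config: dict) -> dict:
    """Return the config in the new nested shape, wrapping legacy flat keys."""
    if "read" in config:
        # already using the new nested key
        return config
    legacy = {k: v for k, v in config.items() if k in FORMATS}
    if legacy:
        # legacy flat keys: wrap them under "read" (the deprecated path)
        return {"read": legacy}
    raise ValueError("no recognized format key in loader config")

print(normalize_loader_config({"parquet": {"path": "s3://bucket/key"}}))
# {'read': {'parquet': {'path': 's3://bucket/key'}}}
```

Keeping both shapes valid in one schema is what lets existing pipeline configs continue to pass validation until the flat keys are removed in a larger version bump.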

Commits:

  • The DataFrame loader configs currently specify where to load data from. Place these under a key named “from” to make room for additional loader options.
  • Add options for performing some common operations on loaded dataframes, such as sampling rows, repartitioning the dataframe, and lowercasing column names.
  • The DataFrame materializer configs currently specify where to write data to. Place these under a key named “to” to make room for additional materializer options.
  • Modify the DataFrame materializer to allow materialization in multiple formats.
  • Add options for performing some common operations on materialized dataframes, such as sampling rows, repartitioning the dataframe, and resetting the index.
  • This was intended to be read, not “from”, matching the underlying function names like read_parquet.
  • Allow any combination of materializer types to be specified, with no specific type required.
@kinghuang (Contributor, Author)

I've split out just the portion of this PR that moves the options under read and to into #2821. I'll follow up with a separate PR for the rest.
