
Create Dask resource #2811

Merged (27 commits, merged Sep 18, 2020)

Conversation

@kinghuang (Contributor) commented Aug 11, 2020:

This PR creates a Dask resource, which manages a Dask client and optional cluster. The resource is to be used by solids and the Dask DataFrame type's materializer to compute Dask graphs.

The resource can be configured to connect to a pre-existing cluster by its scheduler address, or create a cluster on demand.

Here are example resource configs for typical use cases.

Connect to an existing Dask cluster via its scheduler:

resources:
  dask:
    config:
      client:
        name: my-dagster-pipeline
        address: tcp://dask-scheduler-here:8786

Create a local cluster with 4 workers and 1 thread per worker. The resource will create a Cluster object and pass it as the address option to the client:

resources:
  dask:
    config:
      client:
        name: my-dagster-pipeline
      cluster:
        local:
          n_workers: 4
          threads_per_worker: 1
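
For reference, this config is roughly equivalent to the following plain dask.distributed calls (an illustrative sketch, not the resource's actual implementation):

from dask.distributed import Client, LocalCluster

# Build the cluster first, then hand it to the client as its address,
# mirroring what the resource does with the config above.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(address=cluster, name="my-dagster-pipeline")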

The DataFrame type does not require a Dask resource, but if one is provided under the dask key, the type's materializer (if configured) will run with the resource's client as the current client. Otherwise, the global client and scheduler apply.
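
In rough pseudocode, the materializer behaves like this (materialize_df, its signature, and the parquet write are illustrative assumptions, not the actual dagster_dask code):

def materialize_df(context, df, path):
    # Use the resource's client as the current client when a `dask`
    # resource is configured; otherwise Dask's global client/scheduler
    # applies.
    dask_resource = getattr(context.resources, "dask", None)
    if dask_resource is not None:
        with dask_resource.client.as_current():
            df.to_parquet(path)
    else:
        df.to_parquet(path)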

Related to the above, the compute field has been removed from all of the materialization options. Providing a way to not compute a Dask DataFrame materialization does not make sense: the future is never returned from the pipeline, so it can never be computed at a later time. Materializations must be computed if specified.
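
To illustrate why (a standalone Dask example, not code from this PR):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), npartitions=1)

# With compute=False, to_parquet returns a lazy object instead of
# writing immediately. If nothing ever calls .compute() on it, the
# file is never actually written.
delayed_write = df.to_parquet("example.parquet", compute=False)
delayed_write.compute()  # the write only happens here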

Solids using the Dask resource for computation should consider making the resource's client the current client by using the as_current() context manager:

@solid(…)
def some_solid(context):
  with context.resources.dask.client.as_current():
    …
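
A more complete version of that pattern might look like this (the solid body, column name, and resource key are illustrative assumptions):

from dagster import solid

@solid(required_resource_keys={"dask"})
def sum_column(context, df):
    # Run the computation against the resource's client rather than
    # whatever the global default client happens to be.
    with context.resources.dask.client.as_current():
        return df["value"].sum().compute()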

This PR carries the commits from #2821, and is meant to be merged after it.

@alangenfeld (Member) commented:

can you add some basic tests to make sure this doesn't get broken over time?

@alangenfeld (Member) commented:

other than that I think this is fine, we just need to settle out the schema changes on the PR this depends on

@kinghuang kinghuang marked this pull request as draft August 16, 2020 19:00
@kinghuang kinghuang force-pushed the dask-resource branch 2 times, most recently from 1249860 to aa75183 on September 1, 2020 22:06
@kinghuang (Contributor, Author) commented:

Added a couple of tests for local clusters created via the resource.

I've also changed the implementation to not require a Dask resource on the DataFrame type so that it is possible to simply run with Dask's global config (as before). And, I'm thinking of making set_as_default an option instead of forcing it to be False.
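
For context, set_as_default is an existing dask.distributed.Client flag, so exposing it as a resource option would map straight through to the client constructor (illustrative only):

from dask.distributed import Client

# With set_as_default=False, the new client does not register itself
# as Dask's global default client.
client = Client(set_as_default=False)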

@kinghuang kinghuang marked this pull request as ready for review September 7, 2020 18:48
@kinghuang kinghuang marked this pull request as draft September 7, 2020 19:48
@kinghuang kinghuang marked this pull request as ready for review September 7, 2020 21:46
@kinghuang (Contributor, Author) commented:

@alangenfeld This is ready for review. Thanks!

@kinghuang (Contributor, Author) commented Sep 8, 2020:

Found an error with Client creation. Putting this back on draft for now.

@kinghuang kinghuang marked this pull request as draft September 8, 2020 17:54
@kinghuang kinghuang marked this pull request as ready for review September 8, 2020 18:56
@kinghuang (Contributor, Author) commented:

Updated. The code formatting issues identified by Black in dagster_dask/data_frame.py have been fixed in #2888.

@@ -288,8 +287,7 @@
     "options": {
         "path": (Any, True, "Path to a target filename."),
         "key": (String, True, "Datapath within the files."),
-        "compute": (Bool, False, "Whether or not to execute immediately."),

Member:
did you mean to remove this? just concerned about breaking changes

Contributor Author:

Yes, all the compute options have been removed, because they don't make sense in the context of materialization configs. Setting compute=False will cause Dask to return a future, which in the materialization flow will never be computed. This would result in AssetMaterialization objects being yielded for assets that don't actually exist.

Member:

Can you add a note to Changes.md at the repo root - something like

## 0.9.6 (Upcoming)
**Breaking Changes**
* [dagster-dask] removed the `compute` options key which would result in un-executed futures if used 

but add some more context on where that key is

Contributor Author:

Added a note to the changelog.

Contributor Author:

Rebased and updated for post-0.9.6.

raise ValueError(f"Unknown cluster type “{cluster_type}”.")

# Import the cluster module by name, and get the cluster Class.
cluster_module = import_module(cluster_meta["module"])

Member:

a bit concerned about the late import here, and how missing dependency failures will manifest, but I can't come up with a solution that really solves for that. This same problem exists in the executor as well.

Contributor Author:

A few ideas:

  1. Each cluster type's field description could mention its required module.
  2. Is there a hook for validating resource configs? That might be a good time to check if a module exists.
  3. The dagster_dask module could declare extras_require for the modules. Though, I'm not sure if this is something this module wants to take on, considering Dask's distributed module doesn't already do it, and it might not be obvious to users.
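
For reference, here is a minimal sketch of the late-import pattern under discussion, with a clearer error when the optional dependency is missing (the registry and function names are illustrative, not the actual dagster_dask code):

from importlib import import_module

# Hypothetical mapping of cluster types to their provider modules.
DASK_CLUSTER_TYPES = {
    "local": {"module": "dask.distributed", "class": "LocalCluster"},
    "kube": {"module": "dask_kubernetes", "class": "KubeCluster"},
}

def load_cluster_class(cluster_type):
    meta = DASK_CLUSTER_TYPES.get(cluster_type)
    if meta is None:
        raise ValueError(f"Unknown cluster type {cluster_type}.")
    try:
        cluster_module = import_module(meta["module"])
    except ImportError as err:
        # Point the user at the extra package this cluster type needs.
        raise ImportError(
            f"Cluster type {cluster_type} requires the {meta['module']} "
            "package, which is not installed."
        ) from err
    return getattr(cluster_module, meta["class"])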

Contributor Author:

As an aside, I'm aware that the resource and executor will have separate implementations for configuring and creating Dask clients and clusters with this PR. I'm thinking of following up with another PR later to bring them together under a common implementation.

Contributor Author:

I've added the module name to each cluster config's description. For example, the kube type's description now says:

Kubernetes cluster config. Requires dask_kubernetes.
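
An example kube cluster config might then look like this (hypothetical; the option names follow dask_kubernetes.KubeCluster and may not match the final schema exactly):

resources:
  dask:
    config:
      client:
        name: my-dagster-pipeline
      cluster:
        kube:
          namespace: dagster  # requires the dask_kubernetes package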

@alangenfeld (Member) commented:

Great, those further tweaks all sound good. Will get this merged once I get a clean buildkite run.

An aside: not sure if you caught this yet, but you can run make black and make isort at the repo root to catch formatting issues. There's also make pylint, but that is very slow, so maybe just reference the CLI args and point them at the files you changed.

@kinghuang (Contributor, Author) commented:

Ah, I've been running those manually, and obviously getting it wrong compared to the automated checks. I'll give that a try!

Commit messages from the dask-resource branch:

* Create a Dask resource to represent a Dask client and optional cluster. The resource may create a client that connects to an existing cluster or create a new cluster, depending on the resource config.
* Providing a way to not compute a Dask DataFrame materialization does not make sense, as the future is never returned and cannot be computed at a later time. Materializations must be computed if specified.
* Require a dask resource to be specified to use a Dask DataFrame. Materializations will be performed using the client in the resource instead of whatever the default client happens to be.
* The passthrough_df solid simply returns the input dataframe.
* The dask_pipeline pipeline simply calls the passthrough_df solid. It is configured with a dask_resource.
* Run a pipeline with a custom Dask local cluster config and test that the client's scheduler has the configured number of workers and threads per worker.
* The scheduler config is really the client config, and is not exclusive to specifying a cluster config. Rename scheduler to client, and also add the missing options from the Client initializer.
* Reduce the resource requirements to run the unit tests on the Dask resource.
* Directly use Dagster config types instead of inferring from native Python types.
* Explicitly set the Dask resource's client as the current client using a context manager, then test using the current client.
* While the distributed module is technically the same as dask.distributed, the latter is the canonical way of accessing it.
* Only one cluster type may be configured. It is not valid to create multiple clusters for a client.
* If no cluster configuration is provided, set _cluster to None.
* Use the cluster property to check and get the cluster object, instead of directly accessing the underlying attribute.
@kinghuang (Contributor, Author) commented:

@alangenfeld Do I need some sort of config files to run black and isort? I just rebased and ran make black on the project root and it reformatted 348 files across the project. Similar for make isort.

@kinghuang (Contributor, Author) commented:

I've discarded all the changes outside of dagster_dask and pushed an update in the meantime.

@alangenfeld (Member) commented:

Ah shoot, ya, you can see what versions of black and isort we are set to here: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dev-requirements.txt. Should have mentioned that.

@kinghuang (Contributor, Author) commented:

Ah, much better. Unfortunately, it's undone some of the changes the other run of black applied. Updated! 😅

@alangenfeld alangenfeld merged commit bb63b6c into dagster-io:master Sep 18, 2020
@kinghuang kinghuang deleted the dask-resource branch September 18, 2020 18:05