Co-authored-by: Michael Adkins <[email protected]>
I was able to get it to connect to an existing client, but I couldn't get asynchronous mode working, because if I make this async, then users need to call Also, I couldn't get sync_compatible working on it. Thoughts?
Thanks again for your work on this @ahuang11!
I think this is almost ready; just missing the
results in:
This is looking good! I have some minor questions and comments.
README.md
Outdated
However, you must `await client.compute` before exiting out of the context manager.
Running `await dask_collection.compute()` will result in an error: `TypeError: 'coroutine' object is not iterable`.
This seems contrary, can you clarify? Is the second bit a Dask bug?
Will have to try with dask alone.
import dask
from dask.distributed import Client

async with Client(asynchronous=True) as client:
    df = dask.datasets.timeseries("2000", "2001", partition_freq="4w")
    print(type(df))
    print(type(df.describe()))
    print(type(df.describe().compute()))  # errors on this line here
    summary_df = df.describe().compute()
I guess this works
import dask
from dask.distributed import Client, wait

async with Client(asynchronous=True) as client:
    df = dask.datasets.timeseries("2000", "2001", partition_freq="4w")
    summary_df = await df.describe().compute(sync=False)[0].result()
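For reference, here is a minimal sketch of the pattern the README line describes: with an asynchronous client, pass the collection to `client.compute(...)` and await the returned future before leaving the context manager. This is an illustration, not the code in this PR. It assumes a throwaway in-process cluster (`processes=False`) and uses a small `dask.array` in place of the timeseries DataFrame.

```python
import asyncio

import dask.array as da
from dask.distributed import Client


async def main():
    # Assumption: processes=False keeps the cluster inside this process,
    # which is enough for a small demonstration.
    async with Client(asynchronous=True, processes=False) as client:
        # Build a lazy collection; nothing is computed yet.
        total = da.ones((1000, 1000), chunks=(250, 250)).sum()
        # client.compute returns a future, which is awaitable in async mode.
        future = client.compute(total)
        # Await the result before the context manager closes the client.
        result = await future
    return float(result)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The key point is that the await is applied to the future returned by the client, not to the collection's own `.compute()` method.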
Thanks for dealing with all the comments!
Summary
Adds `get_dask_client`. This is intended to be called within tasks that run on workers, and is useful for operating on dask collections, such as a dask.DataFrame.
Without invoking this, a task running on a worker does not automatically get a client connected to the full cluster. The task will therefore attempt to perform the work serially within the worker itself, potentially overwhelming that single worker.
Internally, this context manager is a simple wrapper around distributed.worker_client with separate_thread=False fixed.
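A rough sketch of what such a wrapper could look like, based only on the description above. The name `get_dask_client_sketch`, the `timeout` passthrough, and the example task `total_on_cluster` are illustrative assumptions, not the code added in this PR.

```python
from contextlib import contextmanager

import dask
from distributed import worker_client


@contextmanager
def get_dask_client_sketch(timeout=None):
    """Yield the client available to the worker running the current task.

    Hypothetical stand-in for get_dask_client: it only pins
    separate_thread=False on distributed.worker_client.
    """
    with worker_client(timeout=timeout, separate_thread=False) as client:
        yield client


def total_on_cluster(values):
    """Example task body; it must run on a Dask worker."""
    with get_dask_client_sketch() as client:
        lazy_total = dask.delayed(sum)(values)
        # Submit to the whole cluster instead of computing inside this worker.
        return client.compute(lazy_total).result()
```

Because `separate_thread=False` keeps the calling task on its worker thread while it waits, the cluster needs at least one other free thread or worker to run the submitted work.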
With it:
Without it (a single worker takes on the entire job):
Why a separate util function, from Michael:
"""
I think we should document our required pattern for accessing the client. If we want, we can add a utility to the library that, as you noted, just sets the required keyword argument. The utility function is nice because:
"""
Relevant Issue(s)
Closes #26
Checklist