Struggling with dask delayed to parallelize tidal analysis #194
Comments
Haven't looked in detail, but it seems like you might have a huge number of delayed tasks? If so, it will probably run much better with fewer, coarser tasks. Or you might be able to use dask array/xarray to do the chunking for you? |
Ah, yes, there are about 125K delayed tasks. But I'm not sure how to make the tasks coarser. The tidal analysis function can't take larger chunks, like multiple time series at once, so I need to run it at each (lat, lon) point. |
So if I subsample the grid with nsub=4 instead of nsub=2 (35K delayed tasks), the kernel doesn't die and the notebook works. What controls whether the kernel will die (e.g. how many delayed tasks is "huge")? What would I need to do differently with my workflow to get Dask to work with nsub=2?
|
I am only just getting into dask/xarray so more experienced people may have
a better suggestion, but it looks like one of these approaches should
assist:
https://xarray.pydata.org/en/stable/generated/xarray.apply_ufunc.html#xarray.apply_ufunc
or
http://dask.pydata.org/en/latest/array-api.html#dask.array.apply_along_axis
So rather than the section of your notebook that does a list comprehension:

coefs = [usolve(t=t, u=z[:,j,i], v=None, lat=lat,
                trend=False, nodal=False, verbose=False,
                Rayleigh_min=0.95, method='ols',
                conf_int='linear') for j in range(jj) for i in range(ii)]
Open the dataset with the 'chunks' parameter set for lat and lon to
something reasonable that partitions your problem across the available
processors; then, using one of those ufunc approaches, the scheduling
should be taken care of transparently, processing your dataset chunk by
chunk rather than flooding the scheduler with a new function call for
each grid coordinate... I think!
I think your question may pertain more to dask usage rather than
pangeo specifically...
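For illustration, a minimal sketch of the apply_ufunc route, assuming a dataset ds with a variable z(time, lat, lon) opened with chunks on lat and lon, and reusing the usolve call from the snippet above (the coef.A[0] return value is just a stand-in for whatever scalar you want back):

import xarray as xr

def tidal_amp(u, t, lat):
    # analyze one 1-D time series; placeholder for the real per-point analysis
    coef = usolve(t=t, u=u, v=None, lat=lat, trend=False, nodal=False,
                  verbose=False, Rayleigh_min=0.95, method='ols',
                  conf_int='linear')
    return coef.A[0]

amp = xr.apply_ufunc(
    tidal_amp, ds.z,
    kwargs={'t': t, 'lat': lat},
    input_core_dims=[['time']],   # the function consumes the time dimension
    vectorize=True,               # loop over the remaining (lat, lon) points
    dask='parallelized',          # one task per chunk instead of one per point
    output_dtypes=[float],
)
result = amp.compute()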
|
Each task takes up about 10 kB and a millisecond of overhead:

In [1]: from dask.distributed import Client, wait
In [2]: from distributed.utils import format_bytes
In [3]: import psutil
In [4]: client = Client()
In [5]: start = psutil.Process().memory_info().rss
In [6]: format_bytes(start)
Out[6]: '104.30 MB'
In [7]: def inc(x):
   ...:     return x + 1
   ...:
In [8]: %time futures = client.map(inc, range(100000))
CPU times: user 3.41 s, sys: 72.8 ms, total: 3.48 s
Wall time: 3.49 s
In [9]: %time _ = wait(futures)
CPU times: user 32.4 s, sys: 235 ms, total: 32.6 s
Wall time: 33.1 s
In [10]: end = psutil.Process().memory_info().rss
In [11]: format_bytes(end)
Out[11]: '1.07 GB'
In [12]: format_bytes((end - start) / len(futures))
Out[12]: '9.64 kB'

If your tasks deal with times that are shorter than this, then you might consider bundling several such operations within a single task, perhaps using a for loop within your function. Typically I watch the diagnostic dashboard and, if I see a lot of white space in the task stream plot (the central plot on the status page) and only sporadic thin vertical bars, that is a sign that my system is spending most of its time in scheduling overhead (the ~1 ms cost) rather than actual computation. See dashboard video at this time |
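For illustration, a minimal sketch of that bundling idea (analyze_point, all_args, and client are hypothetical names standing in for the per-point analysis, its argument tuples, and the distributed client):

from dask import delayed

def analyze_block(block):
    # run many cheap per-point analyses inside one task, so each task carries
    # enough work to amortize the ~1 ms per-task scheduling overhead
    return [analyze_point(*args) for args in block]

# group the per-point argument tuples into blocks of, say, 500
blocks = [all_args[i:i + 500] for i in range(0, len(all_args), 500)]
futures = client.compute([delayed(analyze_block)(b) for b in blocks])
results = [r for block_result in client.gather(futures) for r in block_result]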
@mrocklin, I checked out the dashboard, and it seems that dask with 35k delayed tasks is actually working very nicely (not too much white space). So I then ran the tidal analysis in serial mode to see how much slower it was compared to my Dask workflow, and it took 1 hour 15 min instead of 1 min 52 s. That's a speedup of 40, i.e. perfect linear speedup with the 40 CPUs in my Dask cluster. So I'm pretty happy with the workflow; I just wish it worked with 100K or 500K tasks also. |
You might check the System tab to watch your Scheduler's memory use during
execution. This will help you to understand if you're reaching a memory
limit on your notebook/scheduler pod due to having too much administrative
data. 10kB per task * 100k tasks is around a GB. I think that notebook
pods have something like 2-4GB?
|
Or, more generally, I don't know why things are crashing. You'll probably
have to investigate typical things like "did I run out of memory?" to
diagnose further.
|
@mrocklin, looks like the kernel dies when the memory on the "System" tab hits about 3.7 GB. It looks like the myworker.yml config I'm using has a 6GB limit:

jovyan@jupyter-rsignell-2dusgs:~$ more myworker.yml
metadata:
spec:
  restartPolicy: Never
  containers:
  - args:
      - dask-worker
      - --nthreads
      - '2'
      - --no-bokeh
      - --memory-limit
      - 6GB
      - --death-timeout
      - '60'
    image: pangeo/worker:2018-03-28
    name: dask-worker
    securityContext:
      capabilities:
        add: [SYS_ADMIN]
      privileged: true
    env:
      - name: GCSFUSE_BUCKET
        value: pangeo-data
      - name: EXTRA_CONDA_PACKAGES
        value: utide -c conda-forge
    resources:
      limits:
        cpu: "1.75"
        memory: 6G
      requests:
        cpu: "1.75"
        memory: 6G

Does this info point to memory being the reason the kernel dies? If so, can I just increase the limits in myworker.yml? |
That is the memory limit for your workers. Your scheduler is running in
the same process as your notebook (see cluster.scheduler) which I think has
a 4GB limit as defined in the JupyterHub configuration.
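For reference, a quick way to check this from the notebook itself, reusing the psutil trick from the measurement earlier in this thread (a sketch, assuming the scheduler was started in-process alongside the notebook kernel):

import psutil
from distributed.utils import format_bytes

# the scheduler runs in the same process as the notebook kernel here, so the
# kernel's resident memory approximates notebook + scheduler usage
rss = psutil.Process().memory_info().rss
print("notebook/scheduler RSS:", format_bytes(rss))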
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
I'm revisiting this topic to give a summary of how I eventually addressed this issue. I was able to make it work by splitting the large list of 500,000 dask tasks into 30,000-task chunks and letting Dask work on each chunk in series. This way Dask stays busy but we don't run out of scheduler memory. Each chunk takes about 3.25 minutes to run: about 2 minutes in the scheduler, then about 1.5 minutes in the workers (with 120 CPUs) doing the tasks. The tasks each take about 300 ms to run. In the Dask documentation it says:
30,000 tasks taking 2 minutes in the scheduler works out to 4 milliseconds per task. Obviously it's far from optimal to be spending 60% of our time in the scheduler, and it would be great to be able to follow this advice from the Dask documentation to create fewer, longer-running tasks:
Each task, however, is a call to a tidal analysis program with a single time series as input, so it would seem that we would need to modify the code for this program to accept multiple time series if we are to see any additional benefit. Does this seem like an accurate assessment? |
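For illustration, a minimal sketch of that submit-in-series pattern (tasks and client are hypothetical names; tasks is the full list of delayed calls):

results = []
chunk = 30_000
for i in range(0, len(tasks), chunk):
    # only this chunk's graph is sent to the scheduler at a time
    futures = client.compute(tasks[i:i + chunk])
    results.extend(client.gather(futures))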
It looks like you are effectively chunking your dataset with each chunk being size 1 in lat and lon. If so, you could use xarray.apply_ufunc (the ufunc approach mentioned above) over larger chunks instead. |
Have you seen https://examples.dask.org/applications/embarrassingly-parallel.html#Handling-very-large-simulation-with-Bags ? I'm not sure this applies to your problem though, as your input data is more complex than in the example. Trying to launch all at once would probably overwhelm the scheduler as you experienced... |
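For illustration, a minimal sketch of that bag pattern (point_args and analyze_point are hypothetical names for the per-point inputs and the analysis function):

import dask.bag as db

# one element per grid point, grouped into a modest number of partitions
bag = db.from_sequence(point_args, npartitions=6000)
results = bag.map(analyze_point).compute()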
Thanks @dcherian and @guillaumeeb. I'm going to try recasting using that bag approach and I will report back. Also, just for reference, here is my current notebook solution. |
I'd love to see a performance report if you have time to generate one.
https://docs.dask.org/en/latest/diagnostics-distributed.html#capture-diagnostics
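The capture pattern from that docs page looks roughly like this (the computation inside the block is whatever you want profiled; bag and analyze_point are the hypothetical names from the sketch above):

from dask.distributed import performance_report

# everything computed inside the block is captured into a standalone HTML report
with performance_report(filename="dask-report.html"):
    results = bag.map(analyze_point).compute()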
|
@mrocklin , cool, I did not know about this capability. Here's the dask performance report you requested! |
Hrm, the scheduler and worker administrative profiles are blank. That's
odd. Have you by any chance turned off profiling?
|
@mrocklin, I regenerated the dask performance report, and this time the scheduler and admin profiles are not blank! |
Thanks @rsignell-usgs! I guess what we're seeing there is that the scheduler and worker administrative threads just don't seem to be busy at all. They don't seem to be under load. You might add the following to your config to make sure that communication costs end up on the main thread, and see if that changes things qualitatively:

distributed:
  comm:
    offload: False

Looking at your task stream plot, I also notice that you're spending a lot of time in inv (matrix inversion). |
Thanks @mrocklin, I'll check both of those things out. I'm pretty sure there is no reason to invert the matrix. As for the communications setting, just to make sure: I just add

distributed:
  comm:
    offload: False

to the end of my ~/.dask/config.yaml file? |
And thanks @guillaumeeb for the tip on dask bag for this workflow. I can now run 20% faster and don't have to create ugly loops or wait for the giant list of delayed functions to get created! The bag approach is in cells [17-25] of this new notebook. |
Glad to hear that! Just one question though: why are you using 60,000 partitions? That's quite a lot. |
You'll want to include that in any other config you already have. You
might already have a `distributed:` section, so you would want to add the
comm section to that (assuming that you don't already have one).
|
@guillaumeeb, I tried 6,000 partitions instead of 60,000 partitions for my 500,000 tasks, which shortened my run time by only 4%, but it made my performance report 10 times smaller. Thanks for the idea! 😸 @mrocklin, adding those lines to the dask config indeed did the trick. Now I have filled-in admin tabs in my new smaller Dask performance report! @dcherian , I did take a look at the ufunc approach you suggested, but the bag approach here seems more straightforward and likely to be comparable in performance. Do you agree? |
The administrative tabs in the performance report still only report a few
hundred milliseconds of time spent doing administrative work (like
deserializing tasks or communicating and such). So I'm not sure how much
overhead we're seeing there.
|
I just tried running the same job on a different HPC system with the same number of cores (120), and instead of taking 2.5 minutes it takes 25 minutes. The CPUs are not the same, but what could explain an order of magnitude difference in performance for these 500 tasks? Here is the slow performance report: https://nbviewer.jupyter.org/github/rsignell-usgs/testing/blob/master/dask_6000_poseidon.html |
It looks like you're spending a ton of time calling inv. You might look at
your blas settings (over subscription of threads maybe?) and see if you can
use a solve function instead?
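In code terms, the two suggestions look roughly like this (A and b are stand-ins for whatever linear system the tidal fit builds):

import numpy as np

# solving the system directly is typically faster and more stable than
# forming an explicit inverse
x = np.linalg.solve(A, b)          # instead of x = np.linalg.inv(A) @ b

# to avoid BLAS thread oversubscription when each dask worker already uses
# its own cores, pin the BLAS libraries to one thread, for example:
#   export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1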
|
Understanding (and using) dask bag allowed me to solve this use case. |
I am doing detiding using ttide (I found it a bit faster than utide) for a similar problem: I have 1.5 million points. I do chunking, i.e. n points (500) per task. Another thing that helped make the workers start faster is using the batch_size argument to client.map, as shown here:

with get_client(args) as client:
    # save dashboard link to the disk
    with Path("dashboard_link.html").open("w") as f:
        f.write(f"""<html><body><a href="{client.dashboard_link}">Dashboard</a></body></html>""")

    n_tasks = len(inputs_chunks)

    # launch execution
    logger.info("Submitting %d tasks", n_tasks)
    max_n_batches = max(int(0.01 * n_tasks), 1)
    batch_size = n_tasks // max_n_batches
    logger.info("batch_size for client.map: %d", batch_size)
    futures = client.map(compute_surge_map_many, inputs_chunks, batch_size=batch_size)

    progress(futures)
    chunk_results = client.gather(futures)
    vals = np.array(sum(chunk_results, start=[])).T
    logger.info("Client gathered results: type=%s", type(vals[0]))

Do you have any suggestions for a better way to calculate the batch size, or maybe other optimizations? Currently it takes 20 minutes to process on 800 CPUs (1 thread per process), and the task stream is not as nicely filled as yours on the dashboard. It would be cool if I could save the results directly from the workers, without gathering them into one netCDF, but I read that this is not supported for netCDF4/HDF5... |
I told @rabernat I'd post this on SO, but the simplified test I made to isolate the problem for SO ended up working, so I'm posting here in case the pangeo env or workflow has something to do with the issue.
I'm trying to use dask delayed to do an embarrassingly parallel problem: apply a time series function (tidal analysis) to each (lat, lon) location in a 3D (time, lat, lon) data cube. This simple example worked fine:
but my tide notebook with tidal analysis on model output loaded from Zarr looks good up to the line where the kernel dies.
The logs look good just before the kernel dies, but then I can't access the logs since the kernel has died. 😢
Does anyone see anything obviously wrong with my tide notebook?
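For context, the shape of the workflow is roughly the following (a sketch with hypothetical names; the real notebook calls a utide-style solver for each point):

import dask
from dask import delayed

lazy = [delayed(analyze_series)(z[:, j, i])
        for j in range(nlat) for i in range(nlon)]
coefs = dask.compute(*lazy)   # one task per (lat, lon) point, which is what gets huge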