Failure to perform compute_Sv with dask inputs #1212
Investigation

After some initial investigation, I found that the first computation is where things fail. The following snippet reproduces the problem:

```python
from distributed import Client
import dask.array
import xarray as xr
import numpy as np

# Use 2 workers so each gets more memory
client = Client(n_workers=2)

# Same shape as the actual data
ch, pt, rs = 5, 46187, 3957

sample_interval = xr.DataArray(dask.array.ones((ch, pt)), name="sample_interval", coords={"ch": np.arange(ch), "pt": np.arange(pt)})
sound_speed = xr.DataArray(dask.array.ones((ch, pt)) * 1053, name="sound_speed", coords={"ch": np.arange(ch), "pt": np.arange(pt)})
range_sample = xr.DataArray(dask.array.from_array(np.arange(rs)), name="range_sample", coords={"rs": np.arange(rs)})

res = range_sample * sample_interval * sound_speed / 2
res.compute()
```

The above code WILL FAIL with a similar error, with the memory spilling and blowup visible in the dask dashboard:

```
2023-11-06 17:17:59,276 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 11.84 GiB -- Worker memory limit: 16.00 GiB
2023-11-06 17:17:59,495 - distributed.worker.memory - WARNING - Worker is at 83% memory usage. Pausing worker. Process memory: 13.36 GiB -- Worker memory limit: 16.00 GiB
2023-11-06 17:17:59,663 - distributed.worker.memory - WARNING - Worker is at 43% memory usage. Resuming worker. Process memory: 6.91 GiB -- Worker memory limit: 16.00 GiB
```

If we run the same snippet with the shape `ch, pt, rs = 5, 20000, 2000`, the computation finishes just fine.
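For rough intuition (back-of-the-envelope numbers, not from the thread, assuming the broadcast product materializes as a dense float64 array of shape `(ch, pt, rs)`):

```python
# Dense size of the fully broadcast result for the failing vs. passing shapes
for ch, pt, rs in [(5, 46187, 3957), (5, 20000, 2000)]:
    gib = ch * pt * rs * 8 / 2**30  # float64 -> 8 bytes per element
    print(f"ch={ch}, pt={pt}, rs={rs}: ~{gib:.1f} GiB")

# ch=5, pt=46187, rs=3957: ~6.8 GiB -> a few intermediates of this size overwhelm a 16 GiB worker
# ch=5, pt=20000, rs=2000: ~1.5 GiB -> fits comfortably, which is why the smaller shape succeeds
```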
So the most obvious problem here comes from this computation. Also, I suspect that there's some excessive broadcasting going on that may overload worker memory. I'm still looking into this.
Using two workers doesn't seem to be helpful, since most of the memory sticks to just one worker here.

Chunking
I would suggest chunking along one of the dimensions. Maybe looking at arithmetic simplifications of the ops would also be helpful? Like, is it possible to avoid the expansion, etc.?
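A minimal sketch of that suggestion against the reproduction snippet above (the `pt` chunk size of 2000 is arbitrary, and whether this alone fixes `compute_Sv` is a separate question):

```python
# Re-chunk the 2-D inputs along the ping/time dimension before multiplying, so each
# output chunk is a 5 x 2000 x 3957 block (~0.3 GiB) rather than one ~6.8 GiB array.
sample_interval_c = sample_interval.chunk({"pt": 2000})
sound_speed_c = sound_speed.chunk({"pt": 2000})

res = range_sample * sample_interval_c * sound_speed_c / 2
print(res.data.chunksize)  # chunk shape of the lazy result
res = res.compute()        # each task now materializes only one small chunk
```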
Yeah, I tried just chunking. And yeah, I can also look for stuff that avoids expansion altogether.
Yeah, separating out the steps and looking at the graphs built for each step would likely be useful for figuring out exactly where the bottleneck is -- though I think you're already doing it! :)
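A sketch of that kind of step-by-step inspection, reusing the arrays from the reproduction snippet above (the attributes printed here are just one way to look at each step):

```python
# Build each step separately so its chunk layout and task count can be inspected
step1 = range_sample * sample_interval
step2 = step1 * sound_speed
res = step2 / 2

for name, arr in [("step1", step1), ("step2", step2), ("res", res)]:
    darr = arr.data  # underlying dask array
    print(
        name,
        "shape:", darr.shape,
        "chunks:", darr.chunksize,
        "tasks:", len(darr.__dask_graph__()),
        f"~{darr.nbytes / 2**30:.1f} GiB if fully materialized",
    )

# dask.visualize(res) would render the full task graph (needs `import dask` and graphviz)
```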
I'll put the (potential) fix for this issue into a separate issue, since the idea is extensive enough that it needs its own separate thing. To summarize, this issue is caused by the accumulation of task products that build up because Dask waits for every chunk to compute before setting the final output array.
Closed by merging of #1331
Overview

Currently, `compute_Sv` does not work with a merged Sv file (raw_converted_combined/2013/x0097_2_wt_20130824_232034_f0031.zarr) when opened as dask arrays. Worker memory gets overwhelmed, and even spilling to disk by dask won't cut it. The dask traceback is non-descriptive and not very useful, so I tried to perform `compute_Sv` without using dask; that seemed to work, and I was able to profile the memory usage. It seems like there are a few spots where huge memory spikes occur.

Memory Line-by-Line Profile Analysis
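The line-by-line numbers below were presumably collected with something like memory_profiler; a minimal sketch of how such a profile can be taken (the profiling setup and the `open_converted` call are assumptions, not stated in the issue):

```python
# pip install memory-profiler
from memory_profiler import profile

import echopype as ep

@profile  # prints per-line memory increments when the decorated function runs
def run_calibration(path):
    ed = ep.open_converted(path)        # open the combined .zarr as an EchoData object
    return ep.calibrate.compute_Sv(ed)  # the call whose internals are profiled below

if __name__ == "__main__":
    run_calibration("raw_converted_combined/2013/x0097_2_wt_20130824_232034_f0031.zarr")
```

To profile echopype's internal functions line by line, as in the sections below, the `@profile` decorator would go on those functions instead.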
echopype/calibrate/calibrate_ek.py::__init__
We can see that during the instantiation of the EK60 calibrator class, `compute_echo_range` has the largest increase in memory usage. In this instance, it's 929.8 MiB.
echopype/calibrate/range.py::compute_range_EK

Digging into the `compute_echo_range` method of the EK60 calibrator class, `compute_range_EK` is what is actually being called. From the result above, the creation of `range_meter` caused quite a spike in memory usage: 1540.4 MiB. A broadcasting operation is happening here, as the shapes are:

- `range_sample`: (3957,)
- `sample_interval`: (5, 46187)
- `sound_speed`: (5, 46187)
Additionally, the operation of getting non-null values spiked the memory usage again, this time by about 4x. This is then cleaned up by a `where` operation on the `range_meter` variable calculated above.

echopype/calibrate/calibrate_ek.py::_cal_power_samples
Once `compute_range` is finished, the process continues to calling `_cal_power_samples` to perform the Sv computation. As seen in the snippet above, `range_mod_TVG_EK` increased the memory usage by 7130.0 MiB ... that's around 7 GB! Then another spike of 5454.9 MiB happens during the computation above.

echopype/calibrate/range.py::range_mod_TVG_EK
Digging into the `range_mod_TVG_EK` function, we can see that the subtraction between `range_meter` and the `mod_Ex60` function result blows up the memory here.
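Putting rough numbers on that (my estimate, assuming both operands of the subtraction are full-size float64 arrays of shape (5, 46187, 3957)):

```python
ch, pt, rs = 5, 46187, 3957
one_array_gib = ch * pt * rs * 8 / 2**30  # ~6.8 GiB per full-size float64 array

# range_meter, the mod_Ex60(...) result, and the subtraction output would each be this size,
# so the subtraction alone can have ~3 such arrays resident at once -- already above the
# 16 GiB worker limit seen in the dashboard warnings.
print(f"~{3 * one_array_gib:.0f} GiB peak for the subtraction if nothing is chunked or freed")
```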