Improve download performance of large time extents for small time chunks #267
Hi @veenstrajelmer, this question is a bit tricky! The optimisation of the Dask chunking is new and experimental, and we should remember that there might not be an optimal way to chunk the data for every situation and dataset. That said, let's see what we can do here and what the toolbox could do better.

First, the ARCO data used by the subset is replicated in two services: "arco-geo-series" (optimised for big areas over small time ranges) and "arco-time-series" (optimised for small areas over long time ranges). The two services hold the same data organised with different Zarr chunking strategies (not to be confused with the Dask chunks). Note that the toolbox will try to choose the best service, and that the service cannot be changed after you have opened the dataset with `open_dataset`. So one problem I see with your routine is that when you open the full multi-year range and then extract one day at a time, the service and chunking chosen at open time are not well suited to those daily slices. Let's compare by downloading only one day with the subset:

By default the toolbox uses "arco-geo-series" and it is relatively fast, whereas if I force the service to "arco-time-series" it takes much longer and downloads a lot more data: 17 s vs 190 s. Then, using the proper service for your daily extractions, what I would advise is to not use any chunking, i.e. not setting any chunking option when opening the dataset. In a nutshell, opening the dataset would look like something along these lines.
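A minimal sketch of such a call; the dataset id, variable and area are hypothetical examples, and the relevant points are the forced `service` and the absence of any chunking argument:

```python
import copernicusmarine

# Sketch only: the dataset id, variable and area below are hypothetical examples.
ds = copernicusmarine.open_dataset(
    dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",
    variables=["thetao"],
    minimum_longitude=-5,
    maximum_longitude=5,
    minimum_latitude=48,
    maximum_latitude=55,
    start_datetime="2010-01-01",
    end_datetime="2012-12-31",
    service="arco-geo-series",  # force the service optimised for this access pattern
    # no chunking option is passed, so the toolbox keeps its default behaviour
)
```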
Please tell me if I have understood your problem correctly and whether my answer helps improve your process!
This is quite amazing, thanks! I noticed it also works quite well to use
Super, I will close this issue then!
As originally documented in Deltares/dfm_tools#1033, download performance is sub-optimal for large time ranges when a user attempts to download them in small time subsets. In that example a dataset spanning several years is opened, and each day is subsetted and saved to a separate file. The example below shows that when we manually loop over the separate months first (instead of opening the entire time range), the performance of downloading each day is significantly better:
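The original example lives in the linked dfm_tools issue and is not reproduced here; a minimal sketch of the month-by-month approach, assuming a hypothetical dataset id, area and period:

```python
import copernicusmarine
import pandas as pd

dataset_id = "cmems_mod_glo_phy_my_0.083deg_P1D-m"  # hypothetical example id

for month_start in pd.date_range("2010-01-01", "2010-12-01", freq="MS"):
    month_end = month_start + pd.offsets.MonthEnd(0)
    # Open one month at a time instead of the full multi-year range.
    ds = copernicusmarine.open_dataset(
        dataset_id=dataset_id,
        minimum_longitude=-5,
        maximum_longitude=5,
        minimum_latitude=48,
        maximum_latitude=55,
        start_datetime=str(month_start.date()),
        end_datetime=str(month_end.date()),
    )
    # Save each day of that month to its own file.
    for day in pd.date_range(month_start, month_end, freq="D"):
        ds_day = ds.sel(time=slice(str(day.date()), str(day.date())))
        ds_day.to_netcdf(f"subset_{day.date()}.nc")
```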
This makes sense, since the chunks of an arbitrary monthly dataset are the following:
And the chunks of the entire dataset are the following:
This explains the slow performance for the latter when extracting a single timestep.
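For reference, a quick way to inspect this Dask chunking once a dataset has been opened (`ds` as in the sketch above):

```python
# Mapping of dimension name -> tuple of Dask chunk sizes for the whole dataset.
print(dict(ds.chunks))

# Chunk shape of the first data variable's underlying Dask array.
first_var = next(iter(ds.data_vars.values()))
print(first_var.name, first_var.data.chunksize)
```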
Rechunking is possible with this code:
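A minimal sketch of one way to do this, assuming `ds` comes from `copernicusmarine.open_dataset` as above; the original snippet may instead have requested these chunk sizes at open time, which is what produces the xarray warning quoted below:

```python
# Rechunk so that each Dask chunk holds one timestep and the full depth/spatial extent
# (these chunk sizes are assumptions based on the daily slices being extracted).
ds = ds.chunk({"time": 1, "depth": -1, "latitude": -1, "longitude": -1})
```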
However, this is also not optimal for performance, since the chunking is then misaligned with the original dataset. It also gives:
"UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading."
I extracted these dataset chunks from the `coordinate_for_subset` dict in the `get_optimum_dask_chunking()` function:
• Time: 2520
• Depth: 1
• Latitude: 16
• Longitude: 16
So this is completely different from the slice I would like to get out of it (1 time, 141 depth, 60 latitude, 96 longitude), which I realise is inefficient. Changing the `chunk_size_limit` argument of `copernicusmarine.open_dataset()` (both higher and lower) only affects the chunks for latitude and longitude, not depth and time. I can well imagine why small chunks are beneficial for the spatial dimensions, but I think it would be great if the source dataset could also be chunked in smaller pieces over the time dimension.
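For completeness, a sketch of the kind of experiment described above, assuming the same hypothetical dataset id as in the earlier examples; the `chunk_size_limit` values are arbitrary:

```python
import copernicusmarine

# Compare the resulting Dask chunking for a few chunk_size_limit values.
for limit in (50, 100, 200):
    ds = copernicusmarine.open_dataset(
        dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",  # hypothetical example id
        chunk_size_limit=limit,
    )
    print(limit, dict(ds.chunks))
```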