Improve download performance of large time extents for small time chunks #267

Closed
veenstrajelmer opened this issue Jan 13, 2025 · 3 comments

veenstrajelmer commented Jan 13, 2025

As originally documented in Deltares/dfm_tools#1033, the download performance is sub-optimal for large time ranges when a user attempts to download them in small time subsets. In the example, a dataset of several years is opened and each day is subsetted and saved to a separate file. The example below shows that when we manually loop over the separate months first (instead of opening the entire time range at once), the performance of downloading each day is significantly better:

import copernicusmarine
import pandas as pd

# spatial extents
longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37

# time extents
# be sure to start on the 1st of a month and end on the last day of a month,
# since the monthly_periods generator below is too simple for other dates
date_min = '2017-12-01'
date_max = '2022-07-31'

# option 1: make a list of start/stop times (tuples) with monthly frequency
# looping over these smaller periods improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# option 2: a single start/stop tuple for the entire time range (still written per day)
# this is the default behaviour of dfm_tools and it is slow
# comment out the next line to use the faster monthly approach above
monthly_periods = [(pd.Timestamp(date_min), pd.Timestamp(date_max))]

for period in monthly_periods:
    varkey = 'uo'
    dataset = copernicusmarine.open_dataset(
         dataset_id = 'med-cmcc-cur-rean-d',
         variables = [varkey],
         minimum_longitude = longitude_min,
         maximum_longitude = longitude_max,
         minimum_latitude = latitude_min,
         maximum_latitude = latitude_max,
         # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
         # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
         start_datetime = period[0].isoformat(),
         end_datetime = period[1].isoformat(),
    )
    
    freq = "D" # 1 netcdf file per day
    # loop over the days within the current period only
    period_range = pd.period_range(period[0], period[1], freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'cmems_{varkey}_{date_str}.nc'
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(name_output)

This makes sense, since the chunks of an arbitrary monthly dataset are the following:

dataset.chunks
Out[3]: Frozen({'time': (4, 4, 4, 4, 4, 4, 4, 3), 'depth': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'latitude': (60,), 'longitude': (97,)})

And the chunks of the entire dataset are the following:

dataset.chunks
Out[5]: Frozen({'time': (1308, 396), 'depth': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'latitude': (8, 16, 16, 16, 4), 'longitude': (4, 32, 32, 29)})

This explains the slow performance for the latter when extracting a single timestep.
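To make the difference concrete, here is a rough back-of-the-envelope comparison based on the time chunk sizes printed above (an estimate, not a measurement): selecting a single day still forces Dask to load the entire time chunk that day falls in.

# time chunk sizes taken from the dataset.chunks outputs above
time_chunk_monthly = 4        # typical time chunk when opening a single month
time_chunk_fullrange = 1308   # first time chunk when opening the full range at once
ratio = time_chunk_fullrange // time_chunk_monthly
print(f"a single-day selection touches ~{ratio}x more timesteps per chunk "
      "when the full time range is opened at once")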

Rechunking is possible with this code:

# rechunk to optimize for single time extraction
dataset = dataset.chunk(time=1, latitude=-1, longitude=-1, depth=-1)

However, this is also not optimal for performance, since the requested chunking is then misaligned with the chunks of the original dataset. It also raises "UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.". I extracted the following dataset chunks from the coordinate_for_subset dict in the get_optimum_dask_chunking() function:
• Time: 2520
• Depth: 1
• Latitude: 16
• Longitude: 16

So this is completely different from the slice I would like to get out of it (1 time, 141 depth, 60 latitude, 96 longitude), which I realize is inefficient. Changing the chunk_size_limit argument of copernicusmarine.open_dataset() (both higher and lower) only affects the chunks for latitude and longitude, not those for depth and time. I can well imagine why small chunks are beneficial for the spatial dimensions, but I think it would be great if the source dataset could also be chunked into smaller pieces along the time dimension.
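A minimal sketch of how to compare the resulting chunks for different chunk_size_limit values (the values below are arbitrary examples; the spatial and temporal extents are the same as above):

import copernicusmarine

for limit in (10, 100, 1000):
    ds = copernicusmarine.open_dataset(
        dataset_id='med-cmcc-cur-rean-d',
        variables=['uo'],
        minimum_longitude=12.5,
        maximum_longitude=16.5,
        minimum_latitude=34.5,
        maximum_latitude=37,
        start_datetime='2017-12-01',
        end_datetime='2022-07-31',
        chunk_size_limit=limit,
    )
    # in my tests only the latitude/longitude chunks change, time and depth do not
    print(limit, dict(ds.chunks))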

renaudjester (Collaborator) commented Jan 13, 2025

Hi @veenstrajelmer,

This question is a bit tricky! This optimisation of the Dask chunking is new and experimental, and we should keep in mind that there may not be an optimal way to chunk the data for all situations and datasets.

Now, let's see what we can do in this situation and what the toolbox could do better.

First, the data for the subset (ARCO data) is replicated in two services: "arco-geo-series" (optimised for big areas over small time ranges) and "arco-time-series" (optimised for small areas over long time ranges). They contain the same data organised with different Zarr chunking strategies (not to be confused with Dask chunks). Note that the toolbox will try to choose the best service and that the service cannot be changed after you have opened the dataset using "open_dataset".

So one problem I see with your routine is that open_dataset is called on a long time range, so the selected service is "arco-time-series". However, the optimal service for downloading only one day of data is "arco-geo-series". Note that the selected service is reported in the DEBUG logs of the subset command. Unfortunately, the log does not appear with the open_dataset option; that is a bug.

Let's compare. First, let's download only one day with subset:

import copernicusmarine
from time import time
import logging

longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37
varkey = 'uo'
logging.getLogger("copernicusmarine").setLevel(logging.DEBUG)
top = time()
copernicusmarine.subset(
        dataset_id = 'med-cmcc-cur-rean-d',
        variables = [varkey],
        minimum_longitude = longitude_min,
        maximum_longitude = longitude_max,
        minimum_latitude = latitude_min,
        maximum_latitude = latitude_max,
        start_datetime = "2017-12-01",
        end_datetime = "2017-12-01",
        # service = "arco-time-series"
)
print(time() - top)  # elapsed time in seconds

By default the toolbox uses "arco-geo-series" and it is relatively fast:
[screenshot: subset log output with the default "arco-geo-series" service, ~17 s]

Whereas if I force the service to "arco-time-series", it takes much longer and downloads a lot more data: 17 s vs 190 s:
[screenshot: subset log output when forcing service="arco-time-series", ~190 s]

Then, using the proper service (by calling your open_dataset with service="arco-geo-series") doesn't seem to improve the download with your code, because of the Dask chunking. The Dask chunking is optimised for the data that you request when opening the dataset, so it is not optimised for selecting only one day. You can optimise the chunking "manually" for that specific selection (only one day of the dataset), but as you saw it's not easy.

What I would advise is to not use any chunking, i.e. not use Dask at all. Since you know that one day of data is not that big, you can set chunk_size_limit=0. On my computer, downloading one day then takes 30 s (instead of 182 s with the original code).

In a nutshell, opening the dataset would look like this:

dataset = copernicusmarine.open_dataset(
         dataset_id = 'med-cmcc-cur-rean-d',
         variables = [varkey],
         minimum_longitude = longitude_min,
         maximum_longitude = longitude_max,
         minimum_latitude = latitude_min,
         maximum_latitude = latitude_max,
         # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
         # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
         start_datetime = period[0].isoformat(),
         end_datetime = period[1].isoformat(),
         service="arco-geo-series",
         chunk_size_limit=0,
)

Please tell me if I have understood your problem correctly and if my answer improves your process!

veenstrajelmer (Author) commented

This is quite amazing, thanks! I noticed that chunk_size_limit=1 also works quite well, but I guess that is dataset dependent. Given the code in load_data_object_from_load_request(), chunk_size_limit=None results in the same behaviour as 0, so I will use that, in combination with service="arco-geo-series":

optimum_dask_chunking = (
    get_optimum_dask_chunking(
        retrieval_service.service,
        load_request.geographical_parameters,
        load_request.temporal_parameters,
        load_request.depth_parameters,
        load_request.variables,
        chunks_factor_size_limit,
    )
    if chunks_factor_size_limit
    else None
)
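The opening call in my loop would then look roughly like this (a sketch, assuming chunk_size_limit accepts None as the snippet above suggests; the other arguments are the same as in the original example):

dataset = copernicusmarine.open_dataset(
    dataset_id = 'med-cmcc-cur-rean-d',
    variables = [varkey],
    minimum_longitude = longitude_min,
    maximum_longitude = longitude_max,
    minimum_latitude = latitude_min,
    maximum_latitude = latitude_max,
    start_datetime = period[0].isoformat(),
    end_datetime = period[1].isoformat(),
    service="arco-geo-series",
    # None is falsy, so get_optimum_dask_chunking() is skipped and no Dask chunking is applied
    chunk_size_limit=None,
)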

renaudjester (Collaborator) commented

Super, I will close this issue then!
