Improve download performance of large time extents for small time chunks #267

Closed
veenstrajelmer opened this issue Jan 13, 2025 · 3 comments

veenstrajelmer commented Jan 13, 2025

As originally documented in Deltares/dfm_tools#1033, the download performance is sub-optimal for large time ranges when a user attempts to download them in small time subsets. In the example, a dataset of several years is opened and each day is subsetted and saved to a separate file. The example below shows that when we manually loop over the separate months first (instead of opening the entire time range at once), the performance of downloading each day is significantly better:

import copernicusmarine
import pandas as pd

# spatial extents
longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37

# time extents
# be sure to start on the 1st of a month and end on the last day of a month,
# since the monthly_periods generator below is too simple for other dates
date_min = '2017-12-01'
date_max = '2022-07-31'

# option 1: make a list of start/stop times (tuples) with monthly frequency
# looping over these smaller periods improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# option 2: a single start/stop tuple for the entire time range (still written per day)
# this is the default behaviour of dfm_tools and it is slow
# comment out the next line to use the faster monthly approach above
monthly_periods = [(pd.Timestamp(date_min), pd.Timestamp(date_max))]

for period in monthly_periods:
    varkey = 'uo'
    dataset = copernicusmarine.open_dataset(
         dataset_id = 'med-cmcc-cur-rean-d',
         variables = [varkey],
         minimum_longitude = longitude_min,
         maximum_longitude = longitude_max,
         minimum_latitude = latitude_min,
         maximum_latitude = latitude_max,
         # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
         # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
         start_datetime = period[0].isoformat(),
         end_datetime = period[1].isoformat(),
    )
    
    freq = "D" # 1 netcdf file per day
    # loop over the days within the current period only
    period_range = pd.period_range(period[0], period[1], freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'cmems_{varkey}_{date_str}.nc'
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(name_output)

This makes sense, since the chunks of an arbitrary monthly dataset are the following:

dataset.chunks
Out[3]: Frozen({'time': (4, 4, 4, 4, 4, 4, 4, 3), 'depth': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'latitude': (60,), 'longitude': (97,)})

And the chunks of the entire dataset are the following:

dataset.chunks
Out[5]: Frozen({'time': (1308, 396), 'depth': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'latitude': (8, 16, 16, 16, 4), 'longitude': (4, 32, 32, 29)})

This explains the slow performance for the latter when extracting a single timestep.
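To make the difference concrete, here is a rough back-of-the-envelope comparison based on the time chunk sizes printed above (an estimate, not a measurement): selecting a single day still forces Dask to load the entire time chunk that day falls in.

# time chunk sizes taken from the dataset.chunks outputs above
time_chunk_monthly = 4        # typical time chunk when opening a single month
time_chunk_fullrange = 1308   # first time chunk when opening the full range at once
ratio = time_chunk_fullrange // time_chunk_monthly
print(f"a single-day selection touches ~{ratio}x more timesteps per chunk "
      "when the full time range is opened at once")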

Rechunking is possible with this code:

# rechunk to optimize for single time extraction
dataset = dataset.chunk(time=1, latitude=-1, longitude=-1, depth=-1)

However, this is also not optimal for performance, since the requested chunking is then misaligned with the chunks of the original dataset. It also raises "UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.". I extracted the following dataset chunks from the coordinate_for_subset dict in the get_optimum_dask_chunking() function:
• Time: 2520
• Depth: 1
• Latitude: 16
• Longitude: 16

So this is completely different from the slice I would like to get out of it (1 time, 141 depth, 60 latitude, 96 longitude), which I realize is inefficient. Changing the chunk_size_limit argument of copernicusmarine.open_dataset() (both higher and lower) only affects the chunks for latitude and longitude, not those for depth and time. I can well imagine why small chunks are beneficial for the spatial dimensions, but I think it would be great if the source dataset could also be chunked into smaller pieces along the time dimension.
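A minimal sketch of how to compare the resulting chunks for different chunk_size_limit values (the values below are arbitrary examples; the spatial and temporal extents are the same as above):

import copernicusmarine

for limit in (10, 100, 1000):
    ds = copernicusmarine.open_dataset(
        dataset_id='med-cmcc-cur-rean-d',
        variables=['uo'],
        minimum_longitude=12.5,
        maximum_longitude=16.5,
        minimum_latitude=34.5,
        maximum_latitude=37,
        start_datetime='2017-12-01',
        end_datetime='2022-07-31',
        chunk_size_limit=limit,
    )
    # in my tests only the latitude/longitude chunks change, time and depth do not
    print(limit, dict(ds.chunks))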

renaudjester (Collaborator) commented Jan 13, 2025

Hi @veenstrajelmer,

This question is a bit tricky! This optimisation of the Dask chunking is new and experimental, and we should keep in mind that there may not be an optimal way to chunk the data for all situations and datasets.

Now, let's see what we can do in this situation and what the toolbox could do better.

First, the data for the subset (ARCO data) is replicated in two services: "arco-geo-series" (optimised for big areas over small time ranges) and "arco-time-series" (optimised for small areas over long time ranges). They contain the same data organised with different Zarr chunking strategies (not to be confused with Dask chunks). Note that the toolbox will try to choose the best service and that the service cannot be changed after you have opened the dataset using "open_dataset".

So one problem I see with your routine is that open_dataset is called on a long time range, so the selected service is "arco-time-series". However, the optimal service for downloading only one day of data is "arco-geo-series". Note that the selected service is reported in the DEBUG logs of the subset command. Unfortunately, the log does not appear with the open_dataset option; that is a bug.

Let's compare. First, let's download only one day with subset:

import copernicusmarine
from time import time
import logging

longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37
varkey = 'uo'
logging.getLogger("copernicusmarine").setLevel(logging.DEBUG)
top = time()
copernicusmarine.subset(
        dataset_id = 'med-cmcc-cur-rean-d',
        variables = [varkey],
        minimum_longitude = longitude_min,
        maximum_longitude = longitude_max,
        minimum_latitude = latitude_min,
        maximum_latitude = latitude_max,
        start_datetime = "2017-12-01",
        end_datetime = "2017-12-01",
        # service = "arco-time-series"
)
print(time() - top)  # elapsed time in seconds

By default the toolbox uses "arco-geo-series" and it is relatively fast:
[screenshot: subset log output with the default "arco-geo-series" service, ~17 s]

Whereas if I force the service to "arco-time-series", it takes much longer and downloads a lot more data: 17 s vs 190 s:
[screenshot: subset log output when forcing service="arco-time-series", ~190 s]

Then, using the proper service (by calling your open_dataset with service="arco-geo-series") doesn't seem to improve the download with your code, because of the Dask chunking. The Dask chunking is optimised for the data that you request when opening the dataset, so it is not optimised for selecting only one day. You can optimise the chunking "manually" for that specific selection (only one day of the dataset), but as you saw it's not easy.

What I would advise is to not use any chunking, i.e. not use Dask at all. Since you know that one day of data is not that big, you can set chunk_size_limit=0. On my computer, downloading one day then takes 30 s (instead of 182 s with the original code).

In a nutshell, opening the dataset would look like this:

dataset = copernicusmarine.open_dataset(
         dataset_id = 'med-cmcc-cur-rean-d',
         variables = [varkey],
         minimum_longitude = longitude_min,
         maximum_longitude = longitude_max,
         minimum_latitude = latitude_min,
         maximum_latitude = latitude_max,
         # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
         # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
         start_datetime = period[0].isoformat(),
         end_datetime = period[1].isoformat(),
         service="arco-geo-series",
         chunk_size_limit=0,
)

Please tell me if I have understood your problem correctly and if my answer improves your process!

veenstrajelmer (Author) commented

This is quite amazing, thanks! I noticed that chunk_size_limit=1 also works quite well, but I guess that is dataset dependent. Given the code in load_data_object_from_load_request(), chunk_size_limit=None results in the same behaviour as 0, so I will use that, in combination with service="arco-geo-series":

optimum_dask_chunking = (
    get_optimum_dask_chunking(
        retrieval_service.service,
        load_request.geographical_parameters,
        load_request.temporal_parameters,
        load_request.depth_parameters,
        load_request.variables,
        chunks_factor_size_limit,
    )
    if chunks_factor_size_limit
    else None
)
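The opening call in my loop would then look roughly like this (a sketch, assuming chunk_size_limit accepts None as the snippet above suggests; the other arguments are the same as in the original example):

dataset = copernicusmarine.open_dataset(
    dataset_id = 'med-cmcc-cur-rean-d',
    variables = [varkey],
    minimum_longitude = longitude_min,
    maximum_longitude = longitude_max,
    minimum_latitude = latitude_min,
    maximum_latitude = latitude_max,
    start_datetime = period[0].isoformat(),
    end_datetime = period[1].isoformat(),
    service="arco-geo-series",
    # None is falsy, so get_optimum_dask_chunking() is skipped and no Dask chunking is applied
    chunk_size_limit=None,
)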

renaudjester (Collaborator) commented

Super, I will close this issue then!
