
Improve CMEMS download performance #1033

Closed · Fixed by #1049

veenstrajelmer (Collaborator) opened this issue and commented Oct 23, 2024

Downloading long timeseries from CMEMS with dfm_tools is slow, even though the actual download happens at a daily frequency. This is probably because, by default, the entire requested dataset is opened first, after which the daily subsets are retrieved from it:

```python
# excerpt from dfmt.download_CMEMS()
dataset = copernicusmarine.open_dataset(
    dataset_id=dataset_id,
    variables=[varkey],
    minimum_longitude=longitude_min,
    maximum_longitude=longitude_max,
    minimum_latitude=latitude_min,
    maximum_latitude=latitude_max,
    start_datetime=date_min,
    end_datetime=date_max,
)
Path(dir_output).mkdir(parents=True, exist_ok=True)
if freq is None:
    date_str = f"{date_min.strftime('%Y%m%d')}_{date_max.strftime('%Y%m%d')}"
    name_output = f'{file_prefix}{varkey}_{date_str}.nc'
    output_filename = Path(dir_output, name_output)
    if output_filename.is_file() and not overwrite:
        print(f'"{name_output}" found and overwrite=False, returning.')
        return
    print(f'xarray writing netcdf file: {name_output}')
    dataset.to_netcdf(output_filename)
else:
    period_range = pd.period_range(date_min, date_max, freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'{file_prefix}{varkey}_{date_str}.nc'
        output_filename = Path(dir_output, name_output)
        if output_filename.is_file() and not overwrite:
            print(f'"{name_output}" found and overwrite=False, continuing.')
            continue
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(output_filename)
```

The example below shows that splitting the request into monthly chunks makes the download much faster than retrieving the entire period at once:

dfm_tools-dependent example (the second `monthly_periods` assignment overwrites the first; keep only the variant you want to test):

```python
import dfm_tools as dfmt
import pandas as pd

# spatial extents
lon_min, lon_max, lat_min, lat_max = 12.5, 16.5, 34.5, 37

# time extents
date_min = '2017-12-01'
date_max = '2022-07-31'

# variant 1: list of monthly (start, stop) tuples
# this approach improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# variant 2: a single (start, stop) tuple to download all at once (but still per day)
# this is the default behaviour and it is slow
monthly_periods = [(date_min, date_max)]

for period in monthly_periods:
    dfmt.download_CMEMS(varkey='uo',
                        longitude_min=lon_min, longitude_max=lon_max,
                        latitude_min=lat_min, latitude_max=lat_max,
                        date_min=period[0], date_max=period[1],
                        dir_output='.', overwrite=True,
                        dataset_id='med-cmcc-cur-rean-d')
```

Example without dfm_tools dependency:

```python
import copernicusmarine
import pandas as pd

# spatial extents
longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37

# time extents
# be sure to start on the 1st of a month and end on the last day of a month,
# since the monthly_periods generator below is too simple for other dates
date_min = '2017-12-01'
date_max = '2022-07-31'

# variant 1: list of monthly (start, stop) tuples
# this approach improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# variant 2: a single (start, stop) tuple to download all at once (but still per day)
# this is the default behaviour of dfm_tools and it is slow
monthly_periods = [(pd.Timestamp(date_min), pd.Timestamp(date_max))]

for period in monthly_periods:
    varkey = 'uo'
    dataset = copernicusmarine.open_dataset(
        dataset_id='med-cmcc-cur-rean-d',
        variables=[varkey],
        minimum_longitude=longitude_min,
        maximum_longitude=longitude_max,
        minimum_latitude=latitude_min,
        maximum_latitude=latitude_max,
        # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
        # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
        start_datetime=period[0].isoformat(),
        end_datetime=period[1].isoformat(),
    )

    freq = 'D'  # one netcdf file per day
    # iterate over the current period only, not the full date_min/date_max range,
    # so the daily subsets always fall inside the opened dataset
    period_range = pd.period_range(period[0], period[1], freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'cmems_{varkey}_{date_str}.nc'
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(name_output)
```
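The simple `MS`/`ME` zip above only works when `date_min` is the first day of a month and `date_max` the last day of one. A more general monthly-period generator (a sketch, not part of dfm_tools; `monthly_periods_clipped` is a hypothetical helper name) could clip each calendar month to the requested range:

```python
import pandas as pd

def monthly_periods_clipped(date_min, date_max):
    """Return (start, stop) Timestamp tuples per calendar month,
    clipped to the requested range, so date_min/date_max do not
    have to align with month boundaries."""
    tmin, tmax = pd.Timestamp(date_min), pd.Timestamp(date_max)
    # p.start_time/p.end_time are the month boundaries of each Period;
    # normalize() drops the 23:59:59.999999999 part of end_time
    return [(max(p.start_time, tmin), min(p.end_time.normalize(), tmax))
            for p in pd.period_range(tmin, tmax, freq='M')]

# example: a range that starts and ends mid-month
for start, stop in monthly_periods_clipped('2017-12-15', '2018-02-10'):
    print(start.date(), stop.date())
# → 2017-12-15 2017-12-31
#   2018-01-01 2018-01-31
#   2018-02-01 2018-02-10
```

With this, the month-boundary caveat in the comments above would no longer apply.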

Todo:
