
Improve CMEMS download performance #1033

Closed · Fixed by #1049

veenstrajelmer (Collaborator) opened this issue and commented Oct 23, 2024

Downloading long timeseries from CMEMS with dfm_tools is slow, even though the actual download happens at a daily frequency. This is probably because, by default, the entire requested dataset is opened first, after which the daily subsets are retrieved from it:

```python
# excerpt from dfmt.download_CMEMS()
dataset = copernicusmarine.open_dataset(
    dataset_id=dataset_id,
    variables=[varkey],
    minimum_longitude=longitude_min,
    maximum_longitude=longitude_max,
    minimum_latitude=latitude_min,
    maximum_latitude=latitude_max,
    start_datetime=date_min,
    end_datetime=date_max,
)
Path(dir_output).mkdir(parents=True, exist_ok=True)
if freq is None:
    date_str = f"{date_min.strftime('%Y%m%d')}_{date_max.strftime('%Y%m%d')}"
    name_output = f'{file_prefix}{varkey}_{date_str}.nc'
    output_filename = Path(dir_output, name_output)
    if output_filename.is_file() and not overwrite:
        print(f'"{name_output}" found and overwrite=False, returning.')
        return
    print(f'xarray writing netcdf file: {name_output}')
    dataset.to_netcdf(output_filename)
else:
    period_range = pd.period_range(date_min, date_max, freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'{file_prefix}{varkey}_{date_str}.nc'
        output_filename = Path(dir_output, name_output)
        if output_filename.is_file() and not overwrite:
            print(f'"{name_output}" found and overwrite=False, continuing.')
            continue
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(output_filename)
```

The example below shows that splitting the request into monthly chunks makes the download much faster than retrieving the entire period at once:

dfm_tools-dependent example (the second `monthly_periods` assignment overwrites the first; keep only the variant you want to test):

```python
import dfm_tools as dfmt
import pandas as pd

# spatial extents
lon_min, lon_max, lat_min, lat_max = 12.5, 16.5, 34.5, 37

# time extents
date_min = '2017-12-01'
date_max = '2022-07-31'

# variant 1: list of monthly (start, stop) tuples
# this approach improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# variant 2: a single (start, stop) tuple to download all at once (but still per day)
# this is the default behaviour and it is slow
monthly_periods = [(date_min, date_max)]

for period in monthly_periods:
    dfmt.download_CMEMS(varkey='uo',
                        longitude_min=lon_min, longitude_max=lon_max,
                        latitude_min=lat_min, latitude_max=lat_max,
                        date_min=period[0], date_max=period[1],
                        dir_output='.', overwrite=True,
                        dataset_id='med-cmcc-cur-rean-d')
```

Example without dfm_tools dependency:

```python
import copernicusmarine
import pandas as pd

# spatial extents
longitude_min, longitude_max, latitude_min, latitude_max = 12.5, 16.5, 34.5, 37

# time extents
# be sure to start on the 1st of a month and end on the last day of a month,
# since the monthly_periods generator below is too simple for other dates
date_min = '2017-12-01'
date_max = '2022-07-31'

# variant 1: list of monthly (start, stop) tuples
# this approach improves performance significantly
date_range_start = pd.date_range(start=date_min, end=date_max, freq='MS')
date_range_end = pd.date_range(start=date_min, end=date_max, freq='ME')
monthly_periods = [(start, end) for start, end in zip(date_range_start, date_range_end)]

# variant 2: a single (start, stop) tuple to download all at once (but still per day)
# this is the default behaviour of dfm_tools and it is slow
monthly_periods = [(pd.Timestamp(date_min), pd.Timestamp(date_max))]

for period in monthly_periods:
    varkey = 'uo'
    dataset = copernicusmarine.open_dataset(
        dataset_id='med-cmcc-cur-rean-d',
        variables=[varkey],
        minimum_longitude=longitude_min,
        maximum_longitude=longitude_max,
        minimum_latitude=latitude_min,
        maximum_latitude=latitude_max,
        # temporarily convert back to strings because of https://github.com/mercator-ocean/copernicus-marine-toolbox/issues/261
        # TODO: revert, see https://github.com/Deltares/dfm_tools/issues/1047
        start_datetime=period[0].isoformat(),
        end_datetime=period[1].isoformat(),
    )

    freq = 'D'  # one netcdf file per day
    # iterate over the current period only, not the full date_min/date_max range,
    # so the daily subsets always fall inside the opened dataset
    period_range = pd.period_range(period[0], period[1], freq=freq)
    for date in period_range:
        date_str = str(date)
        name_output = f'cmems_{varkey}_{date_str}.nc'
        dataset_perperiod = dataset.sel(time=slice(date_str, date_str))
        print(f'xarray writing netcdf file: {name_output}')
        dataset_perperiod.to_netcdf(name_output)
```
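The simple `MS`/`ME` zip above only works when `date_min` is the first day of a month and `date_max` the last day of one. A more general monthly-period generator (a sketch, not part of dfm_tools; `monthly_periods_clipped` is a hypothetical helper name) could clip each calendar month to the requested range:

```python
import pandas as pd

def monthly_periods_clipped(date_min, date_max):
    """Return (start, stop) Timestamp tuples per calendar month,
    clipped to the requested range, so date_min/date_max do not
    have to align with month boundaries."""
    tmin, tmax = pd.Timestamp(date_min), pd.Timestamp(date_max)
    # p.start_time/p.end_time are the month boundaries of each Period;
    # normalize() drops the 23:59:59.999999999 part of end_time
    return [(max(p.start_time, tmin), min(p.end_time.normalize(), tmax))
            for p in pd.period_range(tmin, tmax, freq='M')]

# example: a range that starts and ends mid-month
for start, stop in monthly_periods_clipped('2017-12-15', '2018-02-10'):
    print(start.date(), stop.date())
# → 2017-12-15 2017-12-31
#   2018-01-01 2018-01-31
#   2018-02-01 2018-02-10
```

With this, the month-boundary caveat in the comments above would no longer apply.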

Todo:
