
Notes on CDS datasets

Pontus Lurcock edited this page Jul 7, 2020 · 1 revision

Summary of datasets available via the CDS API

The CDS API

The CDS API is exposed as a REST API over HTTP. The REST API itself is not officially defined or documented. According to the Copernicus API How-To, the REST API should not be used directly, but rather through the cdsapi Python client library, which is available through pip and conda-forge.

The API exposed by the cdsapi Python library is also undocumented; the officially recommended way to use it is to build a request interactively via the web interface for the dataset of interest and then click the "Show API request" button. The details of the available parameters and valid parameter combinations are thus not explicitly documented, and can only be determined by manual exploration of the web interface.
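For example, the generated code takes roughly the following shape. The dataset name and parameter values below are illustrative, not copied from the web interface, and may not be valid as-is:

```python
# Illustrative request dictionary of the kind produced by the
# "Show API request" button; keys and allowed values vary per dataset.
request = {
    "product_type": "reanalysis",
    "variable": ["2m_temperature"],
    "year": "2019",
    "month": "01",
    "day": "01",
    "time": ["00:00", "12:00"],
    "format": "netcdf",
}

def download(dataset: str, request: dict, target: str) -> None:
    """Submit a request via the cdsapi client and block until the result
    file has been downloaded (requires credentials in ~/.cdsapirc)."""
    import cdsapi  # imported here so the sketch loads without cdsapi installed
    cdsapi.Client().retrieve(dataset, request, target)
```

Calling `download("reanalysis-era5-single-levels", request, "era5.nc")` would then block until the CDS has processed the request and written the file.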

Available services and datasets

A total of 66 climate datasets are listed at https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset:

| Identifier | Description |
| --- | --- |
| cems-fire-historical | Fire danger indices historical data from the Copernicus Emergency Management |
| cems-glofas-forecast | River discharge and related forecasted data by the Global Flood Awareness |
| cems-glofas-historical | River discharge and related historical data from the Global Flood Awareness |
| derived-near-surface-meteorological-variables | Near surface meteorological variables from 1979 to 2018 derived from |
| derived-utci-historical | Thermal comfort indices derived from ERA5 reanalysis |
| ecv-for-climate-change | Essential climate variables for assessment of climate variability from |
| efas-forecast | River discharge and related forecasted data by the European Flood Awareness |
| efas-historical | River discharge and related historical data from the European Flood Awareness |
| insitu-glaciers-elevation-mass | Glaciers elevation and mass change data from 1850 to present from the |
| insitu-glaciers-extent | Glaciers distribution data from the Randolph Glacier Inventory for year |
| insitu-gridded-observations-europe | E-OBS daily gridded meteorological data for Europe from 1950 to present |
| projections-cmip5-daily-pressure-levels | CMIP5 daily data on pressure levels |
| projections-cmip5-daily-single-levels | CMIP5 daily data on single levels |
| projections-cmip5-monthly-pressure-levels | CMIP5 monthly data on pressure levels |
| projections-cmip5-monthly-single-levels | CMIP5 monthly data on single levels |
| projections-cordex-single-levels | CORDEX regional climate model data on single levels for Europe |
| reanalysis-era5-land | ERA5-Land hourly data from 1981 to present |
| reanalysis-era5-land-monthly-means | ERA5-Land monthly averaged data from 1981 to present |
| reanalysis-era5-pressure-levels | ERA5 hourly data on pressure levels from 1979 to present |
| reanalysis-era5-pressure-levels-monthly-means | ERA5 monthly averaged data on pressure levels from 1979 to present |
| reanalysis-era5-single-levels | ERA5 hourly data on single levels from 1979 to present |
| reanalysis-era5-single-levels-monthly-means | ERA5 monthly averaged data on single levels from 1979 to present |
| reanalysis-uerra-europe-complete | Complete UERRA regional reanalysis for Europe from 1961 to 2019 |
| reanalysis-uerra-europe-height-levels | UERRA regional reanalysis for Europe on height levels from 1961 to 2019 |
| reanalysis-uerra-europe-pressure-levels | UERRA regional reanalysis for Europe on pressure levels from 1961 to |
| reanalysis-uerra-europe-single-levels | UERRA regional reanalysis for Europe on single levels from 1961 to 2019 |
| reanalysis-uerra-europe-soil-levels | UERRA regional reanalysis for Europe on soil levels from 1961 to 2019 |
| satellite-aerosol-properties | Aerosol properties gridded data from 1995 to present derived from satellite |
| satellite-albedo | Surface albedo 10-daily gridded data from 1981 to present |
| satellite-carbon-dioxide | Carbon dioxide data from 2002 to present derived from satellite observations |
| satellite-fire-burned-area | Fire burned area from 2001 to present derived from satellite observations |
| satellite-lai-fapar | Leaf area index and fraction absorbed of photosynthetically active radiation 10-daily gridded data from 1981 to present |
| satellite-land-cover | Land cover classification gridded maps from 1992 to present derived from |
| satellite-methane | Methane data from 2002 to present derived from satellite observations |
| satellite-ocean-colour | Ocean colour daily data from 1997 to present derived from satellite observations |
| satellite-ozone | Ozone monthly gridded data from 1970 to present |
| satellite-sea-ice | Sea ice monthly and daily gridded data from 1978 to present derived from |
| satellite-sea-level-black-sea | Sea level daily gridded data from satellite altimetry for the Black Sea |
| satellite-sea-level-global | Sea level daily gridded data from satellite altimetry for the global |
| satellite-sea-level-mediterranean | Sea level daily gridded data from satellite altimetry for the Mediterranean Sea from 1993 to present |
| satellite-sea-surface-temperature-ensemble-product | Sea surface temperature daily gridded data from 1981 to 2016 derived |
| satellite-sea-surface-temperature | Sea surface temperature daily data from 1981 to present derived from |
| satellite-soil-moisture | Soil moisture gridded data from 1978 to present |
| seasonal-monthly-pressure-levels | Seasonal forecast monthly statistics on pressure levels |
| seasonal-monthly-single-levels | Seasonal forecast monthly statistics on single levels |
| seasonal-original-pressure-levels | Seasonal forecast daily data on pressure levels |
| seasonal-original-single-levels | Seasonal forecast daily data on single levels |
| seasonal-postprocessed-pressure-levels | Seasonal forecast anomalies on pressure levels |
| seasonal-postprocessed-single-levels | Seasonal forecast anomalies on single levels |
| sis-agroclimatic-indicators | Agroclimatic indicators from 1951 to 2099 derived from climate projections |
| sis-agrometeorological-indicators | Agrometeorological indicators from 1979 to 2018 derived from reanalysis |
| sis-ecv-cmip5-bias-corrected | Essential climate variables for water sector applications derived from CMIP5 projections |
| sis-european-energy-sector | Climate data for the European energy sector from 1979 to 2016 derived from ERA-Interim |
| sis-fisheries-ocean-fronts | Ocean fronts data for the Northwest European Shelf and Mediterranean |
| sis-heat-and-cold-spells | Heat waves and cold spells in Europe derived from climate projections |
| sis-ocean-wave-indicators | Ocean surface wave indicators for the European coast from 1977 to 2100 |
| sis-ocean-wave-timeseries | Ocean surface wave time series for the European coast from 1976 to 2100 derived from climate projections |
| sis-offshore-windfarm-indicators | Performance indicators for offshore wind farms in Europe from 1977 to |
| sis-shipping-arctic | Arctic route availability and cost projection derived from climate projections |
| sis-shipping-consumption-on-routes | Ship performance along standard shipping routes derived from reanalysis |
| sis-temperature-statistics | Temperature statistics for Europe derived from climate projections |
| sis-urban-climate-cities | Climate variables for cities in Europe from 2008 to 2017 |
| sis-water-level-change-indicators | Water level change indicators for the European coast from 1977 to 2100 |
| sis-water-level-change-timeseries | Water level change time series for the European coast from 1977 to 2100 |
| sis-water-quality-swicca | Water quality indicators for European rivers |
| sis-water-quantity-swicca | Water quantity indicators for Europe |

Additionally, the Atmosphere Data Store provides five CAMS datasets via the CDS API; they are listed at https://ads.atmosphere.copernicus.eu/cdsapp#!/search?type=dataset. For the other four Copernicus services listed at https://www.copernicus.eu/en (Marine, Land, Security, Emergency), I have not found any public CDS API endpoints.

Available parameters

There are 68 request parameter keys currently available for the various CDS datasets:

algorithm
area
arrival_port
bias_correction
city
dataset
day
definition
departure_port
emissions_scenario
end_year
ensemble_member
ensemble_statistics
epoch
experiment
file_version
forecast_start_month
format
gcm_model
grid_resolution
height_level
horizontal_aggregation
horizontal_resolution
indicator
leadtime_hour
leadtime_month
model
model_levels
month
nominal_day
origin
originating_centre
percentile
period
pressure_level
processing_level
processinglevel
product_type
product_version
projection
rcm_model
reference_dataset
region
return_period
satellite
sea
sensor
sensor_and_algorithm
sensor_on_satellite
simulation_version
soil_level
start_year
stat
statistics
step
system
temporal_aggregation
temporal_resolution
time
time_aggregation
type
type_of_record
type_of_sensor
variable
version
vertical_aggregation
vertical_level
year

Each dataset only supports a small subset of these keys. The subset of supported keys varies from dataset to dataset, as does the set of allowed values for each key. Additionally, there can be complex interdependencies between the parameters (e.g. the available years depend on the selected product version, or the percentile key is only supported for particular values of time_aggregation). These constraints can only be determined by manual experimentation in the web interface.
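Where such rules have been discovered, a plugin can at least encode them as explicit pre-submission checks. The rule below is purely hypothetical (invented parameter values), but shows the shape such checks might take:

```python
# Hypothetical constraint: "percentile" is only accepted when
# "time_aggregation" has one of a few specific values. Rules like this
# can only be discovered by experimentation in the web interface.
def validate(request: dict) -> list:
    """Return a list of human-readable constraint violations."""
    errors = []
    if "percentile" in request:
        allowed = {"daily_maximum", "daily_minimum"}  # illustrative values
        if request.get("time_aggregation") not in allowed:
            errors.append(
                "'percentile' requires 'time_aggregation' in %s" % sorted(allowed)
            )
    return errors
```

For instance, `validate({"percentile": "90", "time_aggregation": "monthly_mean"})` returns one violation, while a request without `percentile` passes unchecked.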

Dataset sizes and download/processing times

Depending on the request parameters and the selected dataset, download sizes can range from a few kB to multiple GB. Requests are queued before being processed, so the total time to produce a downloadable data file depends on both the queue time and the processing time. In many cases data is returned within a few seconds, but in my testing I occasionally encountered much longer waits (in the most extreme cases, around 25 hours for UERRA regional reanalyses). Queue times can vary widely with the current load on the CDS servers. The Python API does not provide any way to query the expected queue or processing time for a request without actually executing it.

Regarding the actual submission of requests, the underlying, undocumented REST API works asynchronously: after submitting a request, the client can repeatedly poll its status, then download the result once it becomes available. However, the Python API library hides this mechanism behind a synchronous interface: it only exposes a synchronous retrieve method, which blocks until the request has been processed and the data file downloaded.
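The submit/poll/download pattern itself can be illustrated with an in-memory stand-in. The class below only simulates a queue; it does not reproduce the actual (undocumented) REST endpoints:

```python
import itertools

class FakeQueue:
    """In-memory stand-in for the CDS request queue (illustration only)."""

    def __init__(self):
        self._ids = itertools.count()
        self._jobs = {}

    def submit(self, request: dict) -> int:
        """Accept a request and return a job identifier immediately."""
        job_id = next(self._ids)
        self._jobs[job_id] = {"polls_left": 2, "request": request}
        return job_id

    def status(self, job_id: int) -> str:
        """Report 'queued' for the first few polls, then 'completed'."""
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "queued"
        return "completed"

    def result(self, job_id: int) -> dict:
        """Return the 'downloaded' payload (here: the request itself)."""
        return self._jobs[job_id]["request"]

def retrieve_blocking(queue: FakeQueue, request: dict) -> dict:
    """Outline of what the synchronous retrieve method does: submit,
    poll until the job completes, then download the result."""
    job_id = queue.submit(request)
    while queue.status(job_id) != "completed":
        pass  # a real client sleeps between polls
    return queue.result(job_id)
```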

Data formats in the Climate Data Store

The great majority of datasets in the Climate Data Store return data in NetCDF or GRIB format, or as a zip or .tar.gz archive of multiple NetCDF or GRIB files. In many cases, NetCDF and GRIB are both offered as options, with NetCDF automatically converted server-side from GRIB (and sometimes marked ‘experimental’). In some cases, requesting NetCDF output produces an error but GRIB files are successfully produced from the same dataset.

In cases where an archive is produced, it usually contains one NetCDF file per variable, or per unique combination of parameters. For instance, a request for data from the ‘Water quantity indicators for Europe’ dataset for two horizontal aggregation levels, two percentile levels, and two emissions scenarios produces an archive containing eight (= 2 × 2 × 2) individual NetCDF files.
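The expected archive contents can be enumerated as the Cartesian product of the multi-valued request parameters. The file-naming scheme below is hypothetical; the real member names must be read from actual output:

```python
import itertools

def expected_members(params: dict) -> list:
    """One (hypothetically named) NetCDF member per parameter combination."""
    keys = sorted(params)
    return [
        "_".join("%s-%s" % (k, v) for k, v in zip(keys, combo)) + ".nc"
        for combo in itertools.product(*(params[k] for k in keys))
    ]
```

For the water-quantity example above, passing two values each for `horizontal_aggregation`, `percentile`, and `emissions_scenario` yields eight member names.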

A few of the datasets are by their nature not suitable for representation as xcubes – for example, ‘Ship performance along standard shipping routes derived from reanalysis and seasonal forecasts’ and ‘Performance indicators for offshore wind farms in Europe’, which do not contain any geographical co-ordinates.

The API does not offer any information on the exact format of the data within the returned NetCDF or GRIB files, but many of the dataset web pages claim conformance with the Climate and Forecast (CF) conventions at various versions from 1.3 to 1.6. Nevertheless, for each dataset, it will be necessary to manually examine output files to confirm that the format can be normalized into an xcube and determine the variable names and metadata that will be returned by the store plugin's describe_data method. For data that are returned as an archive of multiple files, these files must be unpacked, individually read, and merged into a single cube, which may be challenging if (for example) they have differing resolutions.
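For zip archives, the unpacking step itself is simple with the standard library; the extracted files could then be opened and combined (e.g. with xarray's `open_mfdataset`), subject to the compatibility caveats just mentioned:

```python
import zipfile
from pathlib import Path

def extract_netcdf_members(archive: str, dest: str) -> list:
    """Extract the .nc members of a zip archive and return their paths."""
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if n.endswith(".nc")]
        zf.extractall(dest, members=names)
    return [Path(dest) / n for n in names]
```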

Steps required to support a new dataset

  1. Determine request format and valid parameters by manual experimentation in the web interface.

  2. Based on the request parameters, write a JSON schema which can be returned by get_open_data_params_schema. The JSON schema will usually correspond to a superset of the actual valid parameters, since there are often restrictions on parameters and parameter combinations which are too complex to be representable in a JSON schema.

  3. Write code to transform the request parameters supplied to the CDS Plugin into the corresponding parameters for the Python CDS API library.

  4. Examine output files from the CDS API to determine their structure and naming conventions, and use this information to write a DatasetDescriptor for the xarray Dataset which will be returned from the CDS Plugin.

  5. Write code to process the data returned from the CDS API into a normalized xcube, which may involve operations such as combining multiple data files, editing variable metadata, or rasterizing vector data.
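Steps 2 and 3 might be sketched as follows; the schema fragment and the parameter mapping are hypothetical (the real ones depend on the dataset, and in the actual plugin the schema would be built from xcube's schema classes rather than a raw dict):

```python
# Hypothetical open-params schema (step 2). It is deliberately a superset
# of what the CDS accepts: cross-parameter constraints cannot be expressed.
OPEN_PARAMS_SCHEMA = {
    "type": "object",
    "properties": {
        "variable_names": {"type": "array", "items": {"type": "string"}},
        "time_range": {"type": "array", "items": {"type": "string"}},
        "bbox": {"type": "array", "items": {"type": "number"}},
    },
    "required": ["variable_names", "time_range"],
}

def to_cds_request(open_params: dict) -> dict:
    """Translate store-level parameters into a CDS request dict (step 3).
    Naive sketch: single-year requests only; bbox reordering omitted."""
    start, _end = open_params["time_range"]
    request = {
        "variable": open_params["variable_names"],
        "year": start[:4],
        "format": "netcdf",
    }
    if "bbox" in open_params:
        request["area"] = open_params["bbox"]  # real CDS 'area' ordering differs
    return request
```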

Further information

For each of the 66 CDS datasets available at https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset, I have created a directory within //fs1/file/home/pont/projects/xcube/cds/datasets/dirs containing the following:

  1. An info.yaml file containing some essential information about the dataset: short identifier (also used as the directory name), description, URL for web interface, output container format, etc.

  2. A request.py file produced using the web interface, containing Python code for an example request via the CDS API library. These requests are constructed to include as many values for as many request parameters as possible, to serve as a partial description of the request syntax and starting point for developing support in the plugin. (In some cases, this means that the request cannot be executed as such because it exceeds the limits on maximum amount of data which may be requested.)

  3. A data output file in a data subdirectory, to serve as an example of the output format and a starting point for the design of a data descriptor and file import code. The file is produced from a request for one or a few variables over a limited temporal and geographical range, in order to demonstrate the file format without producing an excessive amount of data. Nevertheless, the minimum requestable amount of data for some datasets exceeds 1 GB.

  4. For a few of the datasets, a notes.txt file containing additional information.