
Consider 12-hour offset for CMEMS data #878

Closed
22 tasks done
veenstrajelmer opened this issue Jul 4, 2024 · 0 comments · Fixed by #1088

veenstrajelmer commented Jul 4, 2024

The functions copernicusmarine.subset() and copernicusmarine.open_dataset() always return start-of-interval time samples (e.g. start of hour, day, month, year) because of the underlying ARCO format. Native datasets (retrieved with copernicusmarine.get(), or with opendap before December 2023) use a mix of start-of-interval and center-of-interval timestamps. We had mid-day timestamps when using opendap and now have midnight timestamps, but the actual data is the same. This is documented in https://help.marine.copernicus.eu/en/articles/8656000-differences-between-netcdf-and-arco-formats. In dfm_tools the copernicus opendap server was used to retrieve data until approximately December 2023 (noon timestamps, center-of-interval). After that (v0.18.0 onwards), the copernicusmarine toolbox is used (midnight timestamps, start-of-interval).
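A minimal sketch of the start-of-interval vs. center-of-interval difference, using a toy xarray dataset (the variable name and values are made up, only the time handling matters):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy daily-mean dataset with ARCO-style start-of-interval timestamps
# (midnight), as copernicusmarine.open_dataset() returns them.
times = pd.date_range("2015-11-01", periods=3, freq="D")
ds = xr.Dataset({"uo": ("time", np.zeros(3))}, coords={"time": times})

# Shifting the time coordinate by 12 hours labels each daily mean at the
# center of its averaging interval (noon), matching the old opendap
# behaviour. The data values are untouched.
ds_centered = ds.assign_coords(time=ds["time"] + pd.Timedelta(hours=12))

print(ds_centered["time"].values[0])  # 2015-11-01T12:00:00.000000000
```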

Most of the datasets we use are daily means, so consider correcting for this by adding an offset of 12 hours. Frequently used datasets are documented in dfmt.copernicusmarine_get_dataset_id():

def copernicusmarine_get_dataset_id(varkey, date_min, date_max):
    # TODO: maybe get dataset_id from 'copernicusmarine describe --include-datasets --contains <search_token>'
    product = copernicusmarine_get_product(date_min, date_max)
    if varkey in ['bottomT','tob','mlotst','siconc','sithick','so','thetao','uo','vo','usi','vsi','zos']: # for physchem
        # resolution is 1/12 degrees in lat/lon dimension, but a bit more/less in alternating cells
        if product == 'analysisforecast': # forecast: https://data.marine.copernicus.eu/product/GLOBAL_ANALYSISFORECAST_PHY_001_024/description
            if varkey in ['uo','vo']: # anfc dataset is split over multiple urls
                dataset_id = 'cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m'
            elif varkey in ['so']:
                dataset_id = 'cmems_mod_glo_phy-so_anfc_0.083deg_P1D-m'
            elif varkey in ['thetao']:
                dataset_id = 'cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m'
            else:
                dataset_id = 'cmems_mod_glo_phy_anfc_0.083deg_P1D-m'
        else: # reanalysis: https://data.marine.copernicus.eu/product/GLOBAL_MULTIYEAR_PHY_001_030/description
            dataset_id = 'cmems_mod_glo_phy_my_0.083deg_P1D-m'
    elif varkey in ['nppv','o2','talk','dissic','ph','spco2','no3','po4','si','fe','chl','phyc']: # for bio
        # resolution is 1/4 degrees
        if product == 'analysisforecast': # forecast: https://data.marine.copernicus.eu/product/GLOBAL_ANALYSISFORECAST_BGC_001_028/description
            if varkey in ['nppv','o2']:
                dataset_id = 'cmems_mod_glo_bgc-bio_anfc_0.25deg_P1D-m'
            elif varkey in ['talk','dissic','ph']:
                dataset_id = 'cmems_mod_glo_bgc-car_anfc_0.25deg_P1D-m'
            elif varkey in ['spco2']:
                dataset_id = 'cmems_mod_glo_bgc-co2_anfc_0.25deg_P1D-m'
            elif varkey in ['no3','po4','si','fe']:
                dataset_id = 'cmems_mod_glo_bgc-nut_anfc_0.25deg_P1D-m'
            elif varkey in ['chl','phyc']:
                dataset_id = 'cmems_mod_glo_bgc-pft_anfc_0.25deg_P1D-m'
        else: # reanalysis: https://data.marine.copernicus.eu/product/GLOBAL_MULTIYEAR_BGC_001_029/description
            dataset_id = 'cmems_mod_glo_bgc_my_0.25_P1D-m'
    else:
        raise KeyError(f"unknown varkey for cmems: {varkey}")
    return dataset_id

The PUM states that the daily averaged products are centered at noon, not at midnight; this issue restores that behaviour:
[Image: excerpt from the PUM stating that daily averaged products are centered at noon]

Some usecases:

  • downloading data and interpolating it to model boundaries to serve as boundary conditions; in this case it makes sense to move the daily average to noon, since that timestamp is representative for (and in the middle of) the entire day.
  • using the data as validation data for a model; then it is best to also compare against daily averages of the model. With xarray these would most probably end up at midnight as well, so no timeshift is desired. When comparing to instantaneous model values it is slightly more convenient to have the cmems data at midday, but it does not matter much: comparing a daily mean to an instantaneous value at midnight or noon is not accurate anyway.
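To illustrate the second usecase: daily-averaging model output with xarray indeed labels the means at midnight by default, matching the uncorrected ARCO convention. A sketch with synthetic hourly "model output" (the variable name and values are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two days of synthetic hourly model output
times = pd.date_range("2020-11-01", periods=48, freq="h")
ds = xr.Dataset({"zos": ("time", np.arange(48.0))}, coords={"time": times})

# xarray's resample labels each daily mean at the *start* of the day
# (midnight), the same convention as the uncorrected ARCO daily means,
# so no timeshift is needed before comparing the two.
ds_daily = ds.resample(time="1D").mean()

print(ds_daily["time"].values)
# ['2020-11-01T00:00:00.000000000' '2020-11-02T00:00:00.000000000']
```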

Check performance and behaviour (like file names and extents) with:

import dfm_tools as dfmt

# spatial extents
lon_min, lon_max, lat_min, lat_max = 12.5, 16.5, 34.5, 37

# time extents
date_min = '2015-11-01'
# date_max = '2020-07-31'
date_max = '2015-11-02'

dataset_id = 'cmems_mod_glo_phy_my_0.083deg_P1D-m' # daily means, corrected
dataset_id = 'cmems_mod_glo_phy_my_0.083deg_P1M-m' # monthly means, not corrected, intermediate days are also downloaded as empty files in case of freq="D"
# dataset_id = 'med-cmcc-cur-rean-d' # daily means, corrected
freq = "Y"
# freq = "M"
# freq = "D"

varkey_dict = {'cmems_mod_glo_phy_my_0.083deg_P1D-m':'uo',
               'cmems_mod_glo_phy_my_0.083deg_P1M-m':'so',
               'med-cmcc-cur-rean-d':'vo'}

dfmt.download_CMEMS(varkey=varkey_dict[dataset_id],
                    longitude_min=lon_min, longitude_max=lon_max, latitude_min=lat_min, latitude_max=lat_max,
                    date_min=date_min, date_max=date_max, freq=freq,
                    dir_output=".", overwrite=True, dataset_id=dataset_id)


# import xarray as xr; ds = xr.open_dataset(r"c:\DATA\checkouts\dfm_tools\tests\uo_2015.nc"); print(ds.time)

Todo:

The new implementation:

  • downloads the requested time range (buffered on the outside) in files per day/month/year
  • all files are named after the period, so "2020" for "Y", "2020-11" for "M" and "2020-11-06" for "D". The 12-hour timestamp is not visible in the filenames.
  • consequently, downloading daily means at noon (corrected) for 1 Nov to 2 Nov 2020 with monthly files results in one "2020-10" file with one timestep (31 Oct 12:00) and one "2020-11" file with two timesteps (1 Nov 12:00 and 2 Nov 12:00)
  • when downloading monthly means with daily freq (or yearly means with monthly/daily freq), empty files are created; this was also the case before. This can be avoided by downloading monthly means with monthly or yearly freq.
  • currently, only daily means are corrected with an offset; the yearly/monthly/hourly/3-hourly/6-hourly averaged datasets are not corrected.
  • some products have datasets whose names do not follow the convention. A daily mean dataset called *rean-d is also corrected with 12 hours. This is temporary hardcoding that will be removed in #1090 (remove hardcoded offset for "rean-d" datasets).

Alternative approach
Alternatively, request an argument for copernicusmarine.open_dataset() to return averaged values at either the mid-time or the start-time of the averaging interval. That would completely resolve the complexity around this issue. Also request attributes: at the moment it is not clear from the dataset that the time is not instantaneous but averaged. Check whether insitu timeseries are instantaneous rather than averaged. Requested the new argument and/or metadata via [email protected] on 10-7-2024; the request is registered under ticket [MDSOP-179] and mercator-ocean/copernicus-marine-toolbox#271.

Potential projects: BES>>Malta, EDITO
