
"Dimension time already exists" error when opening aerosol dataset #634

Closed

HelenClifton opened this issue May 2, 2018 · 10 comments

@HelenClifton

HelenClifton commented May 2, 2018

Expected behavior

@forman, @JanisGailis Executing use case 22 (happy path) with the cate GUI.
Downloaded dataset esacci.AEROSOL.mon.L3C.AER_PRODUCTS.AATSR.Envisat.ORAC.03-02.r1.
Expected it to open.

Actual behavior

An error message is seen

[Screenshot: error dialog "Dimension time already exists" (uc22_time_error)]
Traceback here:
uc22_time_error.txt

Notes:

  • From the cate CLI, the dataset downloads without error (command cate ds copy), but the error occurs when it is opened (command cate res open).
  • I was using a clean install of cate-2.0.0-dev.10 (not the auto-updated version).
  • Downloading and opening this dataset had worked with dev.9.
  • It may be related to #623 (Generate time dim if it doesn't exist).

Steps to reproduce the problem

Select data store: ESA CCI Open Data Portal
Highlight data source: esacci.AEROSOL.mon.L3C.AER_PRODUCTS.AATSR.Envisat.ORAC.03-02.r1
Click on “Download and/or open remote dataset”
In “Download Data Source” window:
• Select Time Constraint and enter start time of 2010-01-01 and end time of 2010-12-31
• Select Region Constraint, enter:
o Lon. from: -100
o Lon. to: 80
o Lat. from: -90
o Lat. to: 90
• Select Variables Constraint, select variable AOD550_mean from the list
• Select “Download and make local data source” and enter Unique identifier UC22_aerosol
• Click on “Download & Open Local”

Specifications

Windows 7 Professional
cate-2.0.0-dev.10

@forman forman self-assigned this May 3, 2018
@forman
Member

forman commented May 3, 2018

It is related to #623.

@forman
Member

forman commented May 3, 2018

@papesci, it is this piece of code that I marked with a TODO in my branch that causes this. Obviously @kbernat once added these lines to fix the "missing time dimension" problem, but instead of really fixing it, they left us with a "dead" time dimension in all of our localized datasets, without the actual time coordinate variable in there. Aaargh! 👎
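
For illustration, here is a minimal sketch in plain xarray (not the actual cate code path) of how such a dead dimension reproduces the reported error; the variable name is borrowed from the reproduction steps above:

```python
import numpy as np
import xarray as xr

# A localized dataset that ended up with a "dead" time dimension:
# the dimension is there, but no "time" coordinate variable.
ds = xr.Dataset(
    {"AOD550_mean": (("time", "lat", "lon"), np.zeros((1, 180, 360)))},
    coords={
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(-179.5, 179.5, 360),
    },
)

# Code that later tries to generate the missing time dimension (#623)
# then trips over the dimension that already exists:
try:
    ds = ds.expand_dims("time")
except ValueError as error:
    print(error)  # Dimension time already exists.
```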

@papesci, @JanisGailis we have to review how we normalize datasets opened by open_dataset in the future. I am strongly voting for doing all normalisation by default, including adding a missing time coordinate variable / dimension, plus correcting (see #620) and renaming lat/lon coordinate variables / dimensions. The current situation is less than satisfying: we have operations open_dataset, normalize, adjust_spatial_attrs, adjust_temporal_attrs, fix_lon_360 that all do some kind of normalisation (see also #555 for single grid cells with lat/lon dimensions retained).
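
To illustrate the "add a missing time coordinate" step, a sketch of what it could look like; the helper name and the reliance on the time_coverage_start / time_coverage_end global attributes are assumptions for illustration, not necessarily how we will implement it:

```python
import pandas as pd
import xarray as xr

def add_missing_time_coord(ds: xr.Dataset) -> xr.Dataset:
    """Attach a 1-D datetime64 'time' coordinate to a dataset that has
    a size-1 'time' dimension but no coordinate variable, taking the
    centre of the coverage interval from the global attributes."""
    if "time" in ds.dims and "time" not in ds.coords:
        start = pd.Timestamp(ds.attrs["time_coverage_start"])
        end = pd.Timestamp(ds.attrs["time_coverage_end"])
        ds = ds.assign_coords(time=("time", [start + (end - start) / 2]))
    return ds
```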

@JanisGailis
Member

I completely agree that all normalization must be done by default when the dataset is opened.

However, I don't see a problem with splitting the normalization routines into many smaller methods, each doing a single thing. For one, I actually need adjust_spatial_attrs and adjust_temporal_attrs: quite a few operations can potentially change the spatiotemporal parameters of a dataset, in which case I need to rerun some parts of the normalization, such as adjusting the attributes, so that the dataset stays normalized with up-to-date attributes. Running some generic normalize that does many other things aside from what I need would not be ideal. A different matter is whether we need separate operation interfaces, or whether I should just use _opimpl instead. But then, I don't think it hurts to have these as util operations. A sketch of the pattern follows below.
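
A minimal sketch of that pattern, assuming ACDD-style bounding-box attribute names; these functions are illustrative stand-ins for the cate operations of the same names, not their real implementations:

```python
import xarray as xr

def adjust_spatial_attrs(ds: xr.Dataset) -> xr.Dataset:
    """Recompute the bounding-box attributes from the current
    lat/lon coordinates (ACDD-style attribute names assumed)."""
    ds = ds.copy()
    ds.attrs["geospatial_lat_min"] = float(ds.lat.min())
    ds.attrs["geospatial_lat_max"] = float(ds.lat.max())
    ds.attrs["geospatial_lon_min"] = float(ds.lon.min())
    ds.attrs["geospatial_lon_max"] = float(ds.lon.max())
    return ds

def subset_spatial(ds: xr.Dataset, lat: slice, lon: slice) -> xr.Dataset:
    # The subset changes the spatial extent, so only the spatial
    # attribute step needs to be re-run -- not a full normalize().
    return adjust_spatial_attrs(ds.sel(lat=lat, lon=lon))
```

For example, subset_spatial(ds, lat=slice(-90, 90), lon=slice(-100, 80)) would match the region constraint from the issue above and leave the attributes up to date.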

@papesci
Contributor

papesci commented May 4, 2018

Yes indeed, we have to support users as much as possible by making their work easier. That also means automatically executing all the phases preparatory to accessing the data.
That said, we also have to deal with datasets that are not normalizable, for example a dataset with no lat or lon dimension. In that case I think the system should still open the dataset, but with a warning notifying the user that they are about to access a dataset with an unrecognized structure.
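
Something along these lines (a sketch of the open-with-warning behaviour; the function name is hypothetical, not an actual cate API):

```python
import warnings
import xarray as xr

def open_dataset_checked(path: str) -> xr.Dataset:
    """Open a dataset; warn instead of failing when its structure
    is not recognizable enough to normalize."""
    ds = xr.open_dataset(path)
    if "lat" not in ds.dims or "lon" not in ds.dims:
        warnings.warn(f"{path}: no recognizable lat/lon dimensions; "
                      "opening without normalization.")
    return ds
```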

@forman
Member

forman commented May 7, 2018

@JanisGailis I agree as long as you are referring to software design - composing a generic normalize() operation from smaller specialized operations, each useful on its own. That's absolutely fine. Then normalize() should have a set of kwargs that control the behaviour, e.g. gen_missing_time_dim=True.

However, I was referring to the individual normalisation steps that should happen in Cate's open_dataset() so that a maximum number of operations can be applied without forcing users to add intermediate normalisation steps. By default, open_dataset() should apply the most common normalisations that we are aware of, so that we have a guarantee that if there are time coordinates, there is a 1-D variable "time" of type datetime64 with dimension "time"; that if there are longitude coordinates, there is a 1-D variable "lon", etc.; and that the corresponding CF attributes are set correctly.
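
Phrased as assertions, the guarantee would be roughly this (a sketch; check_normalized is a hypothetical name, not a proposed API):

```python
import numpy as np
import xarray as xr

def check_normalized(ds: xr.Dataset) -> None:
    """Assert the invariants a default-normalizing open_dataset()
    would guarantee."""
    if "time" in ds.coords:
        assert ds["time"].dims == ("time",)
        assert np.issubdtype(ds["time"].dtype, np.datetime64)
    if "lon" in ds.coords:
        assert ds["lon"].dims == ("lon",)
```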

Leaving it up to the users to decide which normalisation to apply to a dataset before another operation can be used is very unfriendly and frustrating for them. That is what I can report from a two-day workshop at the climate office last week. We cannot expect users to know what's in a dataset before they open and inspect it. Otherwise we would need to ship a per-ECV handbook with the software explaining which kind of normalisation should be applied to each dataset.

@forman
Member

forman commented May 7, 2018

This is the current confusing situation (as of 07.05.2018); a consolidation sketch follows the list:

  • normalize is called
    • in open_dataset after cate.core.ds.open_dataset, only if the normalize kwarg is set (default is True)
    • in read_netcdf after xr.open_dataset, only if the normalize kwarg is set (default is True)
  • normalize_impl is always called
    • in EsaCciOdpDataSource.open_dataset after open_xarray_dataset and before subset_spatial_impl
    • in EsaCciOdpDataSource._make_local after xr.open_dataset and before subset_spatial_impl
    • in LocalDataSource.open_dataset and LocalDataSource._make_local for the same reason as above, btw. with a lot of ugly code duplication.
  • adjust_spatial_attrs_impl is always called
    • as a post-processor in EsaCciOdpDataSource._make_local after subset_spatial_impl
    • as a post-processor in LocalDataSource._make_local after subset_spatial_impl (code duplication!)
  • adjust_spatial_attrs is always called
    • as a post-processor in coregister
    • as a post-processor in pearson_correlation
    • as a post-processor in subset_spatial
  • adjust_temporal_attrs_impl is always called
    • as a post-processor in DatasetLike.convert after converting a DataFrame to Dataset
  • adjust_temporal_attrs is called
    • as a post-processor after(?) normalize in open_dataset and read_netcdf, if their normalize kwarg is set
    • always as a post-processor in temporal_aggregation
    • always as a post-processor in subset_temporal
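
Given that inventory, the consolidation I have in mind would look roughly like this; the kwargs follow the gen_missing_time_dim example above, and the step functions are pass-through stand-ins, not the real operations:

```python
import xarray as xr

def adjust_spatial_attrs(ds: xr.Dataset) -> xr.Dataset:
    return ds  # pass-through stand-in; see the sketch in an earlier comment

def adjust_temporal_attrs(ds: xr.Dataset) -> xr.Dataset:
    return ds  # pass-through stand-in for the assumed temporal counterpart

def normalize(ds: xr.Dataset,
              gen_missing_time_dim: bool = True,
              adjust_spatial: bool = True,
              adjust_temporal: bool = True) -> xr.Dataset:
    """One shared entry point composed of the specialized steps, which the
    open_dataset/read_netcdf and _make_local code paths listed above could
    all call instead of duplicating the individual calls."""
    if gen_missing_time_dim and "time" not in ds.dims:
        ds = ds.expand_dims("time")
    if adjust_spatial:
        ds = adjust_spatial_attrs(ds)
    if adjust_temporal:
        ds = adjust_temporal_attrs(ds)
    return ds
```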

@kbernat
Collaborator

kbernat commented May 7, 2018

@forman @papesci
I'm happy to help if you want, even if it's just refactoring or removing some code duplication.
Give me a shout with your thoughts.

@forman
Member

forman commented May 8, 2018

@kbernat Thanks Chris, much appreciated.

@forman
Member

forman commented May 8, 2018

This issue should be fixed by the last commit.

@HelenClifton
Author

@forman, @JanisGailis, @papesci
Confirmed, it is fixed in dev.11.
Closing the issue.
