Add/Implement data source mirrors #26

Open
observingClouds opened this issue Dec 23, 2020 · 7 comments

@observingClouds
Collaborator

Hi guys,
I'm just in the process of uploading a new version of the radiosonde dataset. This time, it is not a tar archive, but the level1 and level2 data can be directly accessed through the AERIS THREDDS server.

@leifdenby do you want to update your zarr files, switch to the AERIS THREDDS server (https://observations.ipsl.fr/thredds/catalog/EUREC4A/PRODUCTS/MERGED-MEASUREMENTS/RADIOSOUNDINGS/v3.0.0/level2/catalog.html), or, even better, add both sources for better availability in case one server is down?

I'll make an announcement in the data channel when the upload is final.
Cheers!

@leifdenby
Collaborator

thanks @observingClouds! I was actually thinking that maybe I should remove my zarr-based mirrors from the main repository and we just use AERIS directly instead. What do you think? I'm happy to keep my zarr-based catalog available, but maybe I'll put that on a separate repository that we can link to from this main one? Maybe in mirrors.leifdenby_zarr or something like that? What do you think @d70-t?

@observingClouds
Collaborator Author

Well, as long as you can keep the files up to date (and I don't expect to have to reprocess them soon) and/or make sure users can see which version they are using (DOI), it might actually be great to still have that resource in case AERIS is down. It would be great if the catalog could hold several possible sources for one dataset and intake switched between them (semi-)automatically, but I guess this is not implemented yet? You guys probably know more.
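
To make the idea of (semi-)automatic switching concrete, here is a minimal sketch that is independent of intake. It is not an existing intake feature; the function name `open_first_available` and the source labels are made up for illustration. Each source is a label plus a zero-argument callable, so that an OPeNDAP source and a zarr source can each use their own opener.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def open_first_available(sources: Iterable[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (label, opener) pair in order; return the first that opens.

    Hypothetical helper, not part of intake or eurec4a.
    """
    errors = []
    for label, open_source in sources:
        try:
            return label, open_source()
        except Exception as err:  # a real version would catch narrower errors
            errors.append(f"{label}: {err!r}")
    raise OSError("no source could be opened:\n" + "\n".join(errors))
```

Returning the label along with the dataset lets the caller record which source (and therefore which version) was actually used.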

@d70-t
Contributor

d70-t commented Jan 19, 2021

I think references to Aeris should go into the catalog. However, having an active backup is also a very good idea. There is already some progress in intake/intake#557 on providing multiple locations for one dataset, but it is not done yet.

Having a mirror structure could be an addition, but I am not so sure we really want that. As a result, users would have to specify some form of path manually again, and most likely we'd end up with a couple of scripts being passed around that only access the "mirror" tree. This becomes particularly problematic if the mirror is incomplete, so that some datasets effectively work only on the main tree while others probably work only on the mirror tree...
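
If a separate mirror tree were ever added, the incompleteness problem could at least be made visible. The sketch below compares two collections of entry names; the function name and the example names are hypothetical, and how the names are collected from the two trees is left open.

```python
def compare_entries(main_entries, mirror_entries):
    """Return (only_in_main, only_in_mirror) as sorted lists of entry names."""
    main, mirror = set(main_entries), set(mirror_entries)
    return sorted(main - mirror), sorted(mirror - main)


# Hypothetical entry names, for illustration only.
only_main, only_mirror = compare_entries(
    ["radiosondes", "dropsondes", "ATR"],
    ["radiosondes", "ATR"],
)
print("missing from mirror:", only_main)
print("missing from main:", only_mirror)
```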

@leifdenby
Collaborator

So, in the meantime (before mirroring is available) we could just go ahead and replace the entry backed by my server with the data on AERIS? I think adding a data_mirrors for now might be quite nice to keep this "backup" available. Does that sound ok?

@d70-t
Contributor

d70-t commented Mar 15, 2021

Phew... I really find this one hard to decide.

  • mirroring is absolutely something we should have. The OPeNDAP endpoint at Aeris had an uptime of 67% during the last two weeks.
  • having more than one possible path to a dataset of which sometimes one and sometimes the other works kind of defeats the purpose of the catalog (which to my mind is saving the user from pasting in urls or custom root folders or the like)

I have to 🤷 and hope that others have better arguments.
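
For a rough feeling of what a mirror buys given the 67% uptime figure above: if outages of the sources were independent (an assumption that may well not hold), the chance that at least one source is up is one minus the product of the individual downtimes. The mirror uptime used below is an assumed value, not a measurement.

```python
def combined_availability(*uptimes: float) -> float:
    """Probability that at least one source is up, assuming independent outages."""
    all_down = 1.0
    for uptime in uptimes:
        all_down *= 1.0 - uptime
    return 1.0 - all_down


# 0.67 is the figure quoted for Aeris; the second 0.67 is an assumed mirror uptime.
print(round(combined_availability(0.67, 0.67), 2))  # about 0.89
```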

@leifdenby
Collaborator

> having more than one possible path to a dataset of which sometimes one and sometimes the other works kind of defeats the purpose of the catalog (which to my mind is saving the user from pasting in urls or custom root folders or the like)

Ah yes, you're absolutely right. I hadn't thought of that. We could instead adopt a convention of adding {product}__mirror entries in the catalog? E.g. we'd have radiosondes/bco__mirror. It's not pretty, but at least it's "nearby" in the catalog tree, so it should be easier to find.

@d70-t
Contributor

d70-t commented Mar 15, 2021

> We could instead adopt a convention of adding {product}__mirror entries in the catalog?

I don't know if this makes the situation better or worse... If we implement this, then a user would need to access the data using something like:

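import eurec4a


def reliable_to_dask(cat, entry):
    """Open `entry` from `cat`, falling back to its `__mirror` entry on failure."""
    try:
        return cat[entry].to_dask()
    except Exception:  # not a bare except, which would also swallow KeyboardInterrupt
        return cat[f"{entry}__mirror"].to_dask()


cat = eurec4a.get_intake_catalog()
# ... some more code ...
ds = reliable_to_dask(cat.ATR, "track")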

This could avoid a ton of hard-coded cat = cat.mirror lines, but it is also not exactly beautiful. And if people instead start sprinkling things like ds = cat.ATR.track__mirror18 around, this will become horrible.
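
One way to make such a helper a bit less opaque would be to warn whenever the mirror is used, so users notice that they are not reading from the primary source. This is only a sketch under assumptions: it treats the catalog as a mapping that supports `in` and `[]`, the `__mirror` suffix is the convention proposed in this thread (not something that exists in the catalog), and the name `to_dask_with_mirror` is made up.

```python
import warnings

MIRROR_SUFFIX = "__mirror"  # naming convention proposed above, not an existing one


def to_dask_with_mirror(cat, entry, mirror_suffix=MIRROR_SUFFIX):
    """Open `entry`; on failure, warn and fall back to `entry + mirror_suffix`."""
    try:
        return cat[entry].to_dask()
    except Exception as primary_error:
        mirror_entry = f"{entry}{mirror_suffix}"
        if mirror_entry not in cat:
            raise  # no mirror registered: surface the original error
        warnings.warn(
            f"'{entry}' failed ({primary_error!r}); using '{mirror_entry}' instead"
        )
        return cat[mirror_entry].to_dask()
```

Keeping the suffix in one place at least avoids scripts that hard-code mirror entry names themselves.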

d70-t added a commit to d70-t/eurec4a-intake that referenced this issue Jul 27, 2021
This includes a change from denby.io to Aeris.

see eurec4a#26 for some discussion about this

observingClouds changed the title from "New version of radiosonde dataset available" to "Add/Implement data source mirrors" on Dec 1, 2022