
versioning catalog #30

Open
leifdenby opened this issue Jan 18, 2021 · 21 comments

@leifdenby
Collaborator

to avoid breaking dependent code what do people think about versioning the catalog? It would be nice to be able to refer to a specific git-tagged version when fetching the catalog.

I think with what we've got now we could create a version v1.0.0. If we adopt semver (https://semver.org/) we could update the MAJOR version when we remove an endpoint, update the MINOR version when we add new endpoints, and update the PATCH version when we make fixes to existing endpoints. Adding/removing endpoint arguments could also be considered a breaking change I think.

Thoughts?

@d70-t
Contributor

d70-t commented Jan 19, 2021

I'll try to write down some of my thoughts on that topic.

Requirements

  • We need a form of stable references to a catalog, as changes will happen, but users certainly want to re-run their old scripts exactly as they are with exactly the original data.
  • We need to be able to release updates to the catalog quickly, such that newly available datasets can be used properly without delay or workarounds.
  • We want people to use the newest available data by default.

Semver for the intake catalog

  • If a new dataset is added to the catalog, this should be a non-breaking change which adds functionality -> MINOR change
    • unless a user enumerates the catalog and operates on the result of that enumeration, in which case this could be a breaking change -> MAJOR change
  • If a dataset is updated, this is quite probably a breaking change for some users (even if it is only correcting a typo) -> MAJOR change
  • Datasets should not be removed. If they are -> MAJOR change
  • Datasets should ideally not be renamed, if they are -> MAJOR change
  • Currently there are substantial changes happening to files on the Aeris server (even if they already have a DOI allocated). These would be a MAJOR change under semver but can't be addressed, as the catalog itself is not touched (to be fair, that's an issue which must be solved outside of the catalog)
  • I see very few opportunities for PATCH updates, but of course, that's not required 🤷

Current status

Currently there are some active efforts to fill the catalog with more and more data around HALO, so I'd expect quite a few more changes in the next days. I am also not yet sure if the catalog hierarchy is already settled firmly enough so that renaming isn't necessary anymore, but I hope that we can reach that state soon.

Thoughts

If we adopt semver, I think that we'll have to increase versions quite rapidly in order to both comply with semver and provide new datasets quickly. At the current state, it is already possible to access the catalog via the commit hash (i.e. https://raw.githubusercontent.com/eurec4a/eurec4a-intake/b6efdf3c57df9cbea014989b51a1c956d27c136c/catalog.yml), so in principle persistent identifiers are already available. However, directly referring to that url is cumbersome; probably we'd have to add support for this to get_intake_catalog. I agree that semver versions look prettier, but I am not yet convinced that they provide more benefit than the additional maintenance cost.
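As a sketch of how such pinning could be wrapped (`build_catalog_url` is a hypothetical helper, not the real `get_intake_catalog` API; opening the URL additionally requires the `intake` package and network access):

```python
def build_catalog_url(ref: str = "master") -> str:
    """Build a raw GitHub URL for catalog.yml at a given branch, tag, or commit.

    Hypothetical helper for illustration, not part of any eurec4a package.
    """
    return ("https://raw.githubusercontent.com/eurec4a/eurec4a-intake/"
            f"{ref}/catalog.yml")

# Pin to the commit hash mentioned above:
pinned = build_catalog_url("b6efdf3c57df9cbea014989b51a1c956d27c136c")

# With intake installed, the pinned catalog could then be opened via:
#   import intake
#   cat = intake.open_catalog(pinned)
```

A git tag like `v1.0.0` would slot into the same `ref` parameter, which is what makes tagged releases attractive here.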

@observingClouds
Collaborator

You might also want to read up on the possibility of packaging the catalog.

@d70-t
Contributor

d70-t commented Dec 1, 2022

Ok, I think I have to revive this thread. As @observingClouds and @fjansson mentioned, for some journals it's now mandatory to provide DOIs to reference data. Although I still doubt the technical usefulness of having a DOI on specific versions of the intake catalog (e.g. because the data gets moved away and thus old versions of the catalog will become broken), we might need them because of those requirements. On the upside, if we only need one DOI, we might just use the collection DOI, which can then be updated to point at newer versions.

In order to get some DOIs (e.g. via zenodo) we need to make releases. And for making releases, it seems to be useful to have version numbers. These days, the rate of major changes (e.g. data version changed) seems to be much lower than when this issue was opened, so I'm now thinking that applying semver might become more useful. Would you agree?


I'd try to formalize some rules for changing version numbers, my try would be:

  • add a new endpoint -> minor update
  • delete a previously existing endpoint -> major update
  • change the content returned by an endpoint (e.g. data version updates) -> major update
  • change the source location of an endpoint (e.g. from one server to another) -> minor update (❗)
  • moving / renaming an endpoint should be considered as adding and deleting -> major update
  • changing endpoint metadata (e.g. description) -> patch update
  • changes to CI -> patch update
  • changes to requirements.txt (and similar) -> patch update (❓)

☝️ As an endpoint, I'd consider anything specifying some dataset in intake language, e.g. cat["foo"]["bar"](arg="baz") would be an endpoint. A notable consequence would be that adding arguments can be a minor change if the defaults result in the same dataset being retrieved when the new arguments are not specified by the user.
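Modeling an endpoint as a plain Python function illustrates why an added argument with a compatible default can stay a minor change (the names below are purely illustrative, not actual catalog entries):

```python
# Old endpoint: one argument.
def get_data(region="all"):
    return f"dataset[{region}]"

# New endpoint: an extra `version` argument whose default reproduces the old
# behaviour, so existing calls keep retrieving the same dataset -> minor bump.
def get_data_v2(region="all", version="latest"):
    suffix = "" if version == "latest" else f"@{version}"
    return f"dataset[{region}]{suffix}"

assert get_data() == get_data_v2()            # unchanged for old callers
assert get_data("east") == get_data_v2("east")
```

Had the new default changed what old calls return, the same edit would be a major change under the rules above.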

❗ I'm wondering if moving the source location should be a major update if the dataset gets removed at the old location. But that's maybe undecidable, because removal could happen at a later point in time... So probably this is just another variant of the general problem that (without CIDs) old versions of the catalog will become broken over time due to link rot.

❓ This probably depends on how we see the requirements file. If it's a mostly internal thing (to drive the CI), I'd definitely go for the same level as other changes to CI; if we see it as a user-facing API, this might be minor or even major...
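The rules above can be encoded in a few lines; this is a purely illustrative sketch (the change-type names and the `bump` helper are hypothetical, not real tooling in this repo):

```python
# Proposed bump rules from this thread, as a lookup table.
RULES = {
    "add_endpoint": "minor",
    "delete_endpoint": "major",
    "change_endpoint_content": "major",
    "change_source_location": "minor",
    "rename_endpoint": "major",        # treated as delete + add
    "change_metadata": "patch",
    "change_ci": "patch",
    "change_requirements": "patch",
}

def bump(version: str, changes) -> str:
    """Apply the highest-severity bump implied by a set of change types."""
    major, minor, patch = map(int, version.split("."))
    levels = {RULES[c] for c in changes}
    if "major" in levels:
        return f"{major + 1}.0.0"
    if "minor" in levels:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, a release containing both a deleted endpoint and a metadata fix would bump `1.2.3` to `2.0.0`, since the most severe change wins.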


Another question would be how to decide when to do a release. I'd probably go for on-demand first, because weekly / monthly is probably too often in most cases and monthly / yearly is likely too sparse if people want to get things out...
We might want to think about requiring successful checks? 👈 I kind of like that, but that may lead to long-standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

@fjansson
Contributor

fjansson commented Dec 1, 2022

Sounds good to me, I like the thought of having the catalog citable with a collection DOI. The semver rules above sound like a reasonable way of achieving that. Link rot of old catalog versions seems somewhat unavoidable, unless the data itself is also properly, permanently archived and given a DOI.

@d70-t
Contributor

d70-t commented Dec 1, 2022

unless the data itself is also properly, permanently archived and given a DOI.

Having a DOI on the data itself unfortunately is not a solution here. According to the DataCite documentation, DOIs should resolve to a landing page (not the content itself), and furthermore, "Humans should be able to reach the item being described from the landing page"... So by design there's no way to reference the content through a DOI in a machine-readable way. Thus, to make the eurec4a-intake catalog work, we've had to circumvent the DOI system and go back to plain old links, even for data which actually has a DOI.

In our setup, DOIs really are pretty useless things 🤷‍♂️ ...

@observingClouds
Collaborator

observingClouds commented Dec 1, 2022

The rules that @d70-t is suggesting seem reasonable to me as well. To keep track of the changes between versions and ultimately decide on the increment, we should start using a CHANGELOG or whatsnew.md and make it mandatory for all PRs. Otherwise we easily lose track of the changes and have a hard time figuring out the version of the next release candidate.

Comments on the nuances of the rules

if we see it as a user facing API, this might be minor or even major

I argue that changes to the requirements (or similar) should not trigger a minor/major increase, because this catalog is not installable and we should assume that users take care of the dependencies themselves. Practically, we will likely not even release a new version just because of a change to the requirements.

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location.

This depends on whether we trust the move. Past experience has shown that these moves are often not communicated and the maintainers of this project only find out afterwards due to failing tests. It might be the easiest/safest option to increase the major version in these cases.

how to decide when to do a release? I'd probably go for on-demand first

Me too, and in the case of major changes, to prevent citations of an outdated catalog.

Usage of collection DOI for citation

This is twofold:

  • Ignoring the manifests here and encouraging citation of the collection DOI might be the better practice in this particular (!) case from a user perspective. Users will be directed to the current, working catalog and will gain access to the data. If a dataset has been removed, both a version DOI and the collection DOI will return catalogs that fail.
  • For reproducibility, the user might want to access a particular version of the data. This information is only preserved in the version-DOI.

My suggestion would be to use the collection DOI whenever the DOI is given directly, e.g. in data availability statements:
The data used in this paper and all its future versions can be accessed at doi.org/XX.XXX/XXXXXX.
When the data is cited, e.g. in the body of the manuscript, I would tend to use the exact DOI:
We use the XYZ-Dataset from the EUREC4A-Intake catalog (Kölling et al., 2022).

@leifdenby
Collaborator Author

leifdenby commented Dec 12, 2022

semver might become more useful. Would you agree?

Great! YES! I think we need to introduce a CHANGELOG though once we start versioning. Version numbers in themselves aren't very useful without a changelog. Although the commit history contains the same info, it is much more convenient to have a text file. I tend to follow something like xarray (https://github.com/pydata/xarray/blob/main/doc/whats-new.rst), for example https://github.com/EUREC4A-UK/lagtraj/blob/master/CHANGELOG.md. This is also a useful reference: https://keepachangelog.com/en/1.0.0/

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location. But that's maybe undecidable, because removal could be at a later point in time... So probably this is just another variant of the general problem, that (without CIDs) old versions of the catalog will become broken over time due to link rot.

Another option would be to add alias links when moving things and then introduce a form of deprecation (removing aliases in the next major version). I would suggest that if endpoints are moved, we view that the same way as deletions: if something is no longer in the same place, it functions equivalently to not being there.
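The alias-plus-deprecation idea could work roughly like this sketch (the entry names and the `resolve` helper are made up for illustration, not part of the actual catalog or intake API):

```python
import warnings

# Current entries and deprecated aliases kept until the next major release.
ENTRIES = {"halo/radar_new": "https://example.org/radar.zarr"}
ALIASES = {"halo/radar_old": "halo/radar_new"}

def resolve(key: str) -> str:
    """Look up an entry, following (and warning about) deprecated aliases."""
    if key in ALIASES:
        warnings.warn(
            f"'{key}' is deprecated, use '{ALIASES[key]}' instead",
            FutureWarning,
        )
        key = ALIASES[key]
    return ENTRIES[key]
```

Old scripts keep working through the alias but see a warning, and dropping the alias later is then an ordinary major change under the rules above.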

  • changes to requirements.txt (and similar) -> patch update (❓)

I don't see any harm in these being in patch updates.

Another question would be how to decide when to do a release. I'd probably go for on-demand first, because weekly / monthly is probably too often in most cases and monthly / yearly is likely too sparse if people want to get things out...
We might want to think about requiring successful checks? 👈 I kind of like that, but that may lead to long-standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

I think basing releases on demand sounds like a good idea, but really it will come down to who has time to do this maintenance work. Similarly for requiring that tests pass: I would opt for tests always needing to pass, otherwise the maintenance burden can become very large.

@observingClouds
Collaborator

Another option would be add alias-links when moving things and then introduce a form of deprecation (removing aliases in the next major version).
(@leifdenby)

This is a great idea if a dataset is moved within the catalog, e.g. from cat.X.Z to cat.X.Y. If the source location is moved, though, I don't think it will be possible. Most of the time it seems that datasets just disappear and we only notice afterwards when our tests fail. Because we do not have access to the hosts, we cannot influence the time of deletion or introduce a grace period.

I would opt for tests always needing to pass
(@leifdenby)

Ideally, yes, but what do we do with data sources that are unstable (e.g. #131 )? Shall we remove those datasets from the catalog after a grace-period? I think users might still benefit from those entries.

@observingClouds
Collaborator

observingClouds commented May 18, 2023

Regarding the removal of catalog entries, I thought we could also write an additional intake driver that allows us to add messages to the catalog. It is very much work in progress and I don't know if I'll have time to follow up on this much further, but I'm curious what you think about the idea: https://github.com/observingClouds/intake_warning

@d70-t
Contributor

d70-t commented Aug 10, 2023

I like the idea 👍.
We'd have to ensure, though, that people install the warning driver (but that should be possible, and if they don't have it, the worst that could happen is they'd get a less useful warning).

@observingClouds
Collaborator

Hi everyone (@leifdenby @d70-t),

What is hindering us from moving this forward and publishing our first version? I have a paper in the last stages before publication and I could use it as an example. As long as we only have http links in the catalog and don't have control over the linked datasets, the version might be less meaningful, but it might be a step forward?! I don't see us providing/linking all datasets in an immutable way in the near future. Those datasets in the catalog that do have a DOI I will also cite explicitly, something that we should probably encourage e.g. on the readme page and/or on howto.eurec4a.eu as well.

Any thoughts? Could we try to release a first version by the end of the week? This might help some other papers as well (e.g. publications of @fjansson, @PouriyaTUDelft).

Cheers,
Hauke

@fjansson
Contributor

fjansson commented Oct 6, 2023

I'd like to have a DOI for the intake catalog. The Cloud Botany paper is near the proof stage now; I'd happily cite the DOI there if we can have it within a few days :)

@d70-t
Contributor

d70-t commented Oct 9, 2023

I'll try to give it a shot. I'm however not yet sure what to put in fields like license and authors... (see #147)

@observingClouds
Collaborator

@d70-t before doing the first release, we should probably clean up the current CHANGELOG.md in some way. Not sure what the best solution is, but the easiest might be to empty it completely with only the section headers remaining.

@d70-t
Contributor

d70-t commented Oct 9, 2023

@d70-t before doing the first release, we should probably clean up the current CHANGELOG.md in some way. Not sure what the best solution is, but the easiest might be to empty it completely with only the section headers remaining.

Probably that comment came in too late... I just did what's been written in RELEASING...

@observingClouds
Collaborator

Yeah sorry, but I think it is fine the way you did it. Thanks @d70-t so much for this afternoon/morning hack-session. I think we got a lot of things done and moved this project forward by a good margin.

@d70-t
Contributor

d70-t commented Oct 9, 2023

Here's the collection DOI. I think if we reference any, we should use this one (as discussed above, this gives us a chance of keeping up with the movement of datasets).

https://doi.org/10.5281/zenodo.8422321

@observingClouds
Collaborator

observingClouds commented Oct 9, 2023

So next, I think we should

  • convert our discussion here into a HOW_TO_RELEASE.md including both our version semantics and the actual steps on how to do a release
  • add citation guidelines e.g. into the howto.eurec4a.eu book and link to there at other places (e.g. the README of this repo)
  • open a new issue to discuss how to move forward to achieve an immutable catalog

@d70-t
Contributor

d70-t commented Oct 10, 2023

  • convert our discussion here into a HOW_TO_RELEASE.md including both our version semantics and the actual steps on how to do a release

There's RELEASING.md, which I guess covers most of the semantics and the actual steps (I followed them while doing the 1.0.0 release). We'll probably have to do another pass over this thread and RELEASING.md to check whether it actually reflects the outcome of this discussion.

@d70-t
Contributor

d70-t commented Oct 10, 2023

We'll probably also want a more complete description text on the Zenodo page for upcoming releases (I guess at least a mention of howto.eurec4a.eu would be good).

(Screenshot: Zenodo release page, 2023-10-10)

@observingClouds
Collaborator

linking intake/intake#775
