versioning catalog #30
Comments
I'll try to write down some of my thoughts on that topic.
Requirements
Semver for the intake catalog
Current status
Currently there are some active efforts to fill the catalog with more and more data around HALO, so I'd expect quite a few more changes in the coming days. I am also not yet sure whether the catalog hierarchy is settled firmly enough that renaming is no longer necessary, but I hope that we can reach that state soon.
Thoughts
If we adopt semver, I think that we'll have to increase versions quite rapidly in order to both comply with semver and provide new datasets quickly. As things stand, it is already possible to access the catalog via the commit hash (i.e. |
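As a rough sketch of what pinning the catalog to a commit hash (or, later, a tag) could look like: the helper below only builds a raw GitHub URL for a given git ref. The repository slug `eurec4a/eurec4a-intake`, the file name `catalog.yml` and the ref `abc123` are assumptions for illustration, not something stated in this thread.

```python
def pinned_catalog_url(ref, repo="eurec4a/eurec4a-intake", path="catalog.yml"):
    """Build a raw.githubusercontent.com URL for the catalog at a given git ref.

    `ref` can be a commit hash, a branch name or a tag. The default `repo`
    and `path` are assumed values, adjust them to the actual repository layout.
    """
    return f"https://raw.githubusercontent.com/{repo}/{ref}/{path}"


# "abc123" is a placeholder, not a real commit hash.
url = pinned_catalog_url("abc123")

# With intake installed and network access, the pinned catalog could then be
# opened roughly like this:
#   import intake
#   cat = intake.open_catalog(url)
```

Code that pins a ref this way keeps working when the default branch changes, as long as the linked data itself stays where it is.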
You might want to read through the possibility of packaging as well |
Ok, I think I have to revive this thread. As @observingClouds and @fjansson mentioned, for some journals it's now mandatory to provide DOIs to reference data. Although I still doubt the technical usefulness of having a DOI on specific versions of the intake catalog (e.g. because the data gets moved away and thus old versions of the catalog will become broken), we might need them because of those requirements. On the upside, we might also use just the collection DOI if we only need one DOI, and this can then be updated to a newer version. In order to get some DOIs (e.g. via zenodo) we need to make releases. And for making releases, it seems to be useful to have version numbers. These days, the rate of major changes (e.g. data version changed) seems to be much lower than when this issue was opened, so I'm now thinking that applying semver might become more useful. Would you agree? I'd try to formalize some rules for changing version numbers, my try would be:
☝️ as an endpoint, I'd consider anything specifying some dataset in intake language, e.g.:
❗ I'm wondering if moving the source location should be a major update, if the dataset gets removed at the old location. But that's maybe undecidable, because removal could happen at a later point in time... So probably this is just another variant of the general problem that (without CIDs) old versions of the catalog will become broken over time due to link rot.
❓ this probably depends on how we see the requirements file. If it's a mostly internal thing (to drive the CI) I'd definitely go for the same level as other changes to CI; if we see it as a user-facing API, this might be minor or even major...
Another question would be how to decide when to do a release. I'd probably go for on-demand first, because weekly / monthly is probably too often in most cases and monthly / yearly is likely too sparse if people want to get things out... |
Sounds good to me, I like the thought of having the catalog citable with a collection DOI. The semver rules above sound like a reasonable way of achieving that. Link rot of old catalog versions seems somewhat unavoidable, unless the data itself is also properly, permanently archived and given a DOI. |
Having a DOI on the data itself unfortunately is not a solution here. According to the DataCite documentation, DOIs should resolve to a landing page (and not the content) and furthermore, "Humans should be able to reach the item being described from the landing page"... So by design there's no way to reference the content through a DOI in a machine-readable way. Thus, to make the eurec4a-intake catalog work, we had to circumvent the DOI system and go back to plain old links, even for data which actually has a DOI. In our setup, DOIs really are pretty useless things 🤷‍♂️ ... |
The rules that @d70-t is suggesting seem reasonable to me as well. To keep track of the changes between versions and ultimately decide on the increment, we should start using a CHANGELOG or whatsnew.md and make it mandatory for all PRs. Otherwise we easily lose track of the changes and have a hard time figuring out the version of the release candidate.
Comments on the nuances of the rules
I argue that changes to the requirements (or similar) do not warrant a minor/major increase, because this catalog is not installable and we should assume that users take care of the dependencies as well. Practically, we will likely not even release a new version just because of a change to the requirements.
This depends on whether we trust the move. Past experience has shown that these moves are often not communicated and the maintainers of this project find out only afterwards due to failing tests. It might be the easiest/safest option to increase the major version in these cases.
Me too, and also in case of major changes, to prevent citations of an outdated catalog.
Usage of collection DOI for citation
This is twofold:
My suggestion would be to use the collection DOI whenever the DOI is given directly, e.g. in data availability statements: |
Great! YES! I think we need to introduce a CHANGELOG though once we start versioning. Version numbers in themselves aren't very useful without a changelog. Although the commit history contains the same info, it is much more convenient to have a text file. I tend to follow something like xarray (https://github.com/pydata/xarray/blob/main/doc/whats-new.rst), for example https://github.com/EUREC4A-UK/lagtraj/blob/master/CHANGELOG.md. This is also a useful reference: https://keepachangelog.com/en/1.0.0/
Another option would be to add alias-links when moving things and then introduce a form of deprecation (removing aliases in the next major version). I would suggest that if endpoints are moved, we should view that in the same way as deletions. If something is no longer in the same place, it will function equivalently to not being there.
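To make the alias idea a bit more concrete, here is a minimal sketch of how an old name could keep resolving for one major version while warning users. It uses plain dictionaries and does not rely on any intake feature; the entry names, the `ENTRIES` and `ALIASES` tables and the `resolve` helper are all hypothetical.

```python
import warnings

# Hypothetical catalog content: entry name -> data location.
ENTRIES = {"new.place.dataset": "https://example.org/data.nc"}

# Hypothetical alias table: old entry name -> new entry name.
# Aliases would be dropped again in the next major version.
ALIASES = {"old.place.dataset": "new.place.dataset"}


def resolve(name, entries=ENTRIES, aliases=ALIASES):
    """Look up an entry, following a deprecated alias if necessary."""
    if name in entries:
        return entries[name]
    if name in aliases:
        target = aliases[name]
        warnings.warn(
            f"'{name}' has moved to '{target}'; "
            "the old name will be removed in the next major version",
            DeprecationWarning,
            stacklevel=2,
        )
        return entries[target]
    raise KeyError(name)
```

With something like this, moving an entry would only need a minor release, and the actual removal of the alias would be the breaking (major) change.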
I don't see any harm in these being in patch updates.
I think based on demand for releases sounds like a good idea. But really it will come down to who has time to do this maintenance work. Similarly for requiring that tests pass. I would opt for tests always needing to pass, otherwise the maintenance burden can become very big. |
This is a great idea if a dataset is moved within the catalog e.g. from
Ideally, yes, but what do we do with data sources that are unstable (e.g. #131 )? Shall we remove those datasets from the catalog after a grace-period? I think users might still benefit from those entries. |
Regarding the removal of catalog entries, I thought we could also write an additional intake driver that allows us to add messages into the catalog. It is very much work in progress and I don't know if I have time to follow-up on this much further, but I'm curious what you guys think about the idea https://github.com/observingClouds/intake_warning |
I like the idea 👍. |
Hi everyone (@leifdenby @d70-t), what is keeping us from moving this forward and publishing our first version? I have a paper in the last stages before it gets published and I could start as an example. As long as we only have http-links in the catalog and don't have control over the linked datasets, the version might be less meaningful, but it might be a step forward?! I don't see us providing/linking all datasets in an immutable way in the near future. Those datasets in the catalog that do have a DOI I will also cite explicitly, something that we probably should encourage on e.g. the readme page and/or on howto.eurec4a.eu as well. Any thoughts? Could we try to release a first version by the end of the week? This might help some other papers as well (e.g. publications of @fjansson, @PouriyaTUDelft). Cheers, |
I'd like having a DOI for the intake catalog. The Cloud Botany paper is near the proof stage now, I'd happily cite the DOI there, if we can have it within a few days :) |
I'll try to give it a shot. I'm however not yet sure what to put especially in fields like license and authors.. (see #147) |
@d70-t before doing the first release, we should probably clean up the current CHANGELOG.md in some way. Not sure what the best solution is, but the easiest might be to empty it completely, leaving only the section headers. |
Probably that comment came in too late... I just did what's been written in RELEASING... |
Yeah sorry, but I think it is fine the way you did it. Thanks @d70-t so much for this afternoon/morning hack-session. I think we got a lot of things done and moved this project forward by a good margin. |
Here's the collection DOI. I think if we reference any, we should use this one (as discussed above, this gives a chance of keeping up with the movement of datasets). |
So next, I think we should
|
There's RELEASING.md which I guess covers most of the semantics and the actual steps (I followed the steps while doing the 1.0.0 release). Probably we'll have to do another pass over this thread and the RELEASING.md to check if it actually reflects the outcome of this thread. |
linking intake/intake#775 |
To avoid breaking dependent code, what do people think about versioning the catalog? It would be nice to be able to refer to a specific git-tagged version when fetching the catalog.
I think with what we've got now we could create a version v1.0.0. If we adopt semver (https://semver.org/) we could update the MAJOR version if we remove or change an endpoint, update the MINOR version when we add new endpoints, and finally update the PATCH version when we make fixes to existing endpoints. Adding/removing endpoint arguments could also be considered a breaking change, I think. Thoughts?
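A minimal sketch of how the proposed rules could be turned into something mechanical: each change since the last release is labelled with a kind, and the largest required bump wins. The change labels and the `next_version` helper are made up for illustration, and treating argument changes as MAJOR is just one reading of the proposal above.

```python
from enum import IntEnum


class Bump(IntEnum):
    PATCH = 1
    MINOR = 2
    MAJOR = 3


# Hypothetical change labels mapped to the bump they would require.
RULES = {
    "fix_endpoint": Bump.PATCH,      # fix to an existing endpoint
    "add_endpoint": Bump.MINOR,      # new endpoint, nothing existing breaks
    "remove_endpoint": Bump.MAJOR,   # existing user code may break
    "change_arguments": Bump.MAJOR,  # assumed to be breaking as well
}


def next_version(current, changes):
    """Return the next version string given the kinds of changes made."""
    major, minor, patch = (int(p) for p in current.lstrip("v").split("."))
    if changes:
        bump = max(RULES[c] for c in changes)
        if bump == Bump.MAJOR:
            major, minor, patch = major + 1, 0, 0
        elif bump == Bump.MINOR:
            minor, patch = minor + 1, 0
        else:
            patch += 1
    return f"v{major}.{minor}.{patch}"
```

For example, `next_version("v1.0.0", ["add_endpoint", "fix_endpoint"])` would give `v1.1.0`, while a single removed endpoint would already push it to `v2.0.0`.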