Suggestion of CF compliance checker #53
I think this could be a very valuable addition, in particular if it also includes further specific tests as you suggest. I did not check it thoroughly, but there seems to be a plugin system for the IOOS Checker that we might want to build on.

What we would need is a script which goes through all the items in the catalog and runs the check on each of them. This would probably be similar to the availability check for datasets that is currently in place. If the script generated its output as a set of HTML pages, one could also show the current state of the datasets in a more readable way.

I am wondering if there's an easy option to run zarr-based datasets through the compliance checker as well, apart from converting them to local netCDF files first? Another question relevant for netCDF-based sources is whether we want to run the checks through OPeNDAP only, or also over the original netCDF files.

Another thing to keep in mind is that the catalog is often not created by the people creating the datasets. But if the compliance checker or any additional tests fail, it would most likely be the original dataset which needs to be fixed, so maybe this repository is not exactly the right place for this tool. Maybe we could also run a script which periodically checks the usual places (i.e. the Aeris server, but maybe others as well) for new datasets and runs those checks on them, to create an overview of issues within the datasets even before the effort is made to include them in the catalog?
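A rough sketch of what such a script could look like, assuming the catalog URL and the way the data URL is pulled out of each entry (both are placeholders here); the `ComplianceChecker.run_checker` call follows the usage shown in the compliance-checker README:

```python
from pathlib import Path

import intake
from compliance_checker.runner import CheckSuite, ComplianceChecker

# Load all installed checker plugins (cf, acdd, maybe a custom EUREC4A plugin later).
check_suite = CheckSuite()
check_suite.load_all_available_checkers()

# Placeholder catalog location -- in practice this would point at the catalog of this repository.
cat = intake.open_catalog("https://example.org/eurec4a-intake/catalog.yml")

Path("reports").mkdir(exist_ok=True)

for name, entry in cat.walk(depth=10).items():
    # How to get at the underlying URL depends on the driver; 'urlpath' works for
    # netCDF/OPeNDAP sources, zarr stores may need extra handling.
    urlpath = entry.describe().get("args", {}).get("urlpath")
    if not urlpath:
        continue
    passed, errors = ComplianceChecker.run_checker(
        urlpath,
        ["cf"],      # checker suites to run
        0,           # verbosity
        "normal",    # criteria
        output_filename=f"reports/{name}.html",
        output_format="html",
    )
    print(name, "OK" if passed else "FAILED")
```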
It looks like we can simply construct a
Yes, it is possible to set up periodic CI jobs: https://docs.github.com/en/actions/reference/events-that-trigger-workflows#scheduled-events, the syntax is basically a crontab. I can create a pull request for that. How often should we run? Once daily, at 3am say?
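A minimal sketch of such a scheduled workflow, running daily at 03:00 UTC; the file name, the installed packages, and `scripts/run_compliance_checks.py` are assumptions, not existing pieces of this repository:

```yaml
# .github/workflows/compliance-check.yml (name is an assumption)
name: nightly-compliance-check

on:
  schedule:
    - cron: "0 3 * * *"   # every day at 03:00 UTC
  workflow_dispatch:       # also allow manual runs

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install intake compliance-checker
      - run: python scripts/run_compliance_checks.py   # hypothetical script
```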
That would be great! However, as far as I can tell, the
That would also be nice! But are you thinking of checking the datasets from the intake catalog once per night, or of checking all datasets on Aeris and monitoring for changes?
I was just thinking of checking the intake catalog :) We don't store anything relating to the version/content of the data actually on AERIS in the catalog (I don't think?), so checking for a new version would mean implementing something for that. We could enforce that a

Properly monitoring for changes on AERIS would require being able to "walk" the entire AERIS catalog to check for new/changed/deleted files, but I don't think we can do that?
Hmm, I've lost quite a bit of trust in identifiers which are not cryptographic hashes of their contents. Things like

(paraphrased) do occur. Also, I'd expect at least that if someone is kind enough to change the version inside the dataset, then the reference to the dataset (i.e. the filename) is changed as well (checking is of course better). And if that happens, the filename inside the intake catalog will either still point to the old dataset or point to nothing anymore. That said, Aeris is providing the

The nice thing about actively searching for datasets would be that one could inform authors earlier about potential issues regarding CF (or other) conventions, and maybe increase the chance of getting them in a mood to still change some things :-) ... but probably that should also be part of ingress checking of the data archive?
One more note: if we can verify that a dataset did not change and we already know whether or not it was CF compliant, then that status will not change over time, so there is probably no need to check it over and over again.
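A small sketch of that caching idea, keying previous check results by a content hash of the file; the cache file name and the JSON layout are placeholders:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("compliance_cache.json")  # placeholder location


def sha256_of(path: Path, chunk_size: int = 2**20) -> str:
    """Return the sha256 hex digest of a local file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_check(path: Path) -> bool:
    """True if we have no cached result for this exact file content."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    return sha256_of(path) not in cache


def record_result(path: Path, passed: bool) -> None:
    """Remember the compliance status for this file content."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    cache[sha256_of(path)] = {"file": str(path), "cf_compliant": passed}
    CACHE.write_text(json.dumps(cache, indent=2))
```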
Hi,
I'm often using the IOOS compliance checker and just recently tried it on the command line. Wouldn't it be valuable to use it here (and maybe add e.g. a EUREC4A metadata compliance test) in the CI, or maybe even as a GitHub bot, so that it just shows its findings (errors and warnings) in the pull request discussion, not necessarily stopping a merge but leaving the judgement to the reviewers? This would also make it easy to find issues with e.g. `int64` or missing `units` etc. The code for the checker is available on GitHub.
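A EUREC4A-specific metadata test could be written as a plugin for the checker's plugin system mentioned above. This is only a rough skeleton, assuming the API looks like in existing cc-plugin-* packages; the class name, the checked attribute, and the entry-point registration are illustrative, not an agreed convention:

```python
# Rough skeleton of a custom checker plugin (all names are illustrative).
from compliance_checker.base import BaseCheck, BaseNCCheck, Result


class EUREC4AMetadataCheck(BaseNCCheck):
    # Identifiers the compliance checker uses to select this suite,
    # e.g. `compliance-checker --test=eurec4a ...`
    _cc_spec = "eurec4a"
    _cc_spec_version = "0.1"
    _cc_description = "EUREC4A metadata conventions (draft)"

    def check_platform_attribute(self, ds):
        """Example check: every dataset should carry a global 'platform'
        attribute (the attribute name is just an example)."""
        present = "platform" in ds.ncattrs()
        msgs = [] if present else ["global attribute 'platform' is missing"]
        return Result(BaseCheck.HIGH, present, "platform attribute", msgs)


# The plugin would additionally be registered via a 'compliance_checker.suites'
# entry point so the checker can discover it, e.g. in setup.cfg:
#   [options.entry_points]
#   compliance_checker.suites =
#       eurec4a = eurec4a_checks:EUREC4AMetadataCheck
```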