
NGFF json validation #69

Merged
merged 35 commits into from
Nov 26, 2021

Conversation

@will-moore (Member) commented Oct 25, 2021

In this PR we have evaluated 3 options for validating the JSON data within OME-NGFF files: JSON-schema, SHACL and schema salad.

This PR includes a bunch of example files, both valid and invalid samples used in the tests.
Tests for JSON-schema and shacl are run with:

$ cd 0.3
$ tox

SALAD

$ pip install schema_salad

# run tests on a bunch of valid and invalid files
# NB: some tests fail because SALAD does not treat an empty list (e.g. for axes, datasets) as invalid.
$ pytest test_salad_schema.py

# CLI validate test image json
$ schema-salad-tool image.yml image.json

# validate labels json
$ schema-salad-tool image.yml path/to/image.zarr/labels/0/.zattrs

JSON-schema

$ pip install jsonschema

# run tests on a bunch of valid and invalid files - All should be passing
$ pytest test_json_schema.py
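The same kind of check can also be run from Python with the jsonschema library. A minimal sketch — the schema fragment below is a hypothetical cut-down version of the 0.3 image schema for illustration, not the actual file in this PR:

```python
import jsonschema

# Hypothetical trimmed-down fragment of the 0.3 image schema.
schema = {
    "type": "object",
    "required": ["multiscales"],
    "properties": {
        "multiscales": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["datasets", "axes"],
                "properties": {
                    "axes": {
                        "type": "array",
                        "minItems": 2,
                        "items": {"enum": ["t", "c", "z", "y", "x"]},
                    },
                    "datasets": {
                        "type": "array",
                        "minItems": 1,
                        "items": {"type": "object", "required": ["path"]},
                    },
                },
            },
        }
    },
}

valid = {"multiscales": [{"axes": ["z", "y", "x"], "datasets": [{"path": "0"}]}]}
jsonschema.validate(valid, schema)  # passes silently

# minItems makes an empty axes list invalid (unlike SALAD, see above)
invalid = {"multiscales": [{"axes": [], "datasets": [{"path": "0"}]}]}
try:
    jsonschema.validate(invalid, schema)
except jsonschema.ValidationError as e:
    print("invalid:", e.message)
```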

This is a checklist of rules for multiscales that a valid OME-Zarr MUST obey to satisfy the current 0.3 schema (not including HCS). Checked items are covered by this validation. (NB: there are many rules that are not indicated by a MUST in the spec):

  • group level attributes must also contain "_ARRAY_DIMENSIONS" if this group directly contains multi-scale arrays.
  • All image arrays must be up to 5-dimensional with dimension order (t, c, z, y, x)
  • Each dictionary contained in the [multiscales] list MUST contain the field "datasets"
  • Each dictionary in "datasets" MUST contain the field "path"...
  • ...whose value contains the path to the array for this resolution relative to the current zarr group.
  • The "path"s MUST be ordered from largest (i.e. highest resolution) to smallest.
  • It MUST contain the field "axes"
  • The axes values MUST be one of {"t", "c", "z", "y", "x"}. (order and uniqueness are not checked)
  • The number of values MUST be the same as the number of dimensions of the arrays corresponding to this image.
  • In addition, the "axes" values MUST be repeated in the field "_ARRAY_DIMENSIONS" of all scale groups
  • It SHOULD contain the field "name".
  • It SHOULD contain the field "version", which indicates the version of the multiscale metadata of this image (current version is 0.3).
  • It SHOULD contain the field "type", which gives the type of downscaling method used to generate the multiscale image pyramid.
  • It SHOULD contain the field "metadata", which contains a dictionary with additional information about the downscaling method.
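Some of these rules (such as axes order and uniqueness, noted above as unchecked) are awkward to express in JSON Schema draft-07 but trivial in plain Python. A hypothetical sketch of such a check, not part of this PR:

```python
AXES_ORDER = ["t", "c", "z", "y", "x"]

def axes_valid(axes):
    """Check axes are non-empty, unique, drawn from {t,c,z,y,x}, and in tczyx order."""
    if not axes or len(axes) != len(set(axes)):
        return False  # empty or duplicated axes
    try:
        positions = [AXES_ORDER.index(a) for a in axes]
    except ValueError:
        return False  # unknown axis name
    return positions == sorted(positions)  # must follow t, c, z, y, x order

print(axes_valid(["z", "y", "x"]))  # True
print(axes_valid(["x", "y", "z"]))  # False: wrong order
```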

To discuss or look into...

  • Currently no warning for missing SHOULD rules, although we do check the data structure.
  • Versioning - do we handle versioning by creating a schema for each version, with e.g. ome-zarr-py picking the correct schema to validate with, based on the version number? This wouldn't be able to handle different version numbers for different parts of the spec. I don't see a way to say "if version==0.3 then axes must be a list of strings, but it's optional for versions 0.1 and 0.2, and for version 0.4 it should be a list of axes objects".
  • Splitting schema - e.g. image.yml split out omero schema

@will-moore (Member, Author) commented Oct 26, 2021

I don't see how to make a field 'optional' to fit with the SHOULD semantics of the spec.
E.g. validating against some data created by omero-cli-zarr, it fails with fields that are not essential:

$ schema-salad-tool $schema test/6001247.zarr/.zattrs 
/opt/anaconda3/envs/schema_salad/bin/schema-salad-tool Current version: 8.2.20211020114435
While validating document `test/6001247.zarr/.zattrs`:
test/6001247.zarr/.zattrs:1:1:   Invalid
                                  tried `Image` but
test/6001247.zarr/.zattrs:6:5:       * the `multiscales` field is not valid because
test/6001247.zarr/.zattrs:7:9:         item is invalid because
                                        * missing required field `metadata`
                                        * missing required field `name`
                                        * missing required field `type`
...

EDIT - fixed below

@will-moore (Member, Author) commented Oct 27, 2021

Opened issue common-workflow-language/schema_salad#460 to ask about optional fields needed at #69 (comment) - EDIT: fixed!

@will-moore (Member, Author):

That last commit was testing JSON-schema for validation (see #75).

Need to serve the schemas at their URLs. e.g. image.schema has "$ref": "http://localhost:8000/omero.schema.json" so
we need to serve the json_schema directory contents at http://localhost:8000.

$ cd json_schema
$ python -m http.server

Then in a different Terminal:

$ pip install jsonschema
$ jsonschema --instance image.json json_schema/image.schema
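If you'd rather not run a local HTTP server, jsonschema's Python API can resolve the `$ref`s from memory via a RefResolver with a pre-populated store. A sketch — the schema contents here are hypothetical stand-ins for the real image/omero schema files:

```python
import jsonschema

# Hypothetical stand-ins for image.schema and omero.schema.json.
omero_schema = {
    "$id": "http://localhost:8000/omero.schema.json",
    "type": "object",
    "required": ["channels"],
}
image_schema = {
    "$id": "http://localhost:8000/image.schema.json",
    "type": "object",
    "properties": {"omero": {"$ref": "http://localhost:8000/omero.schema.json"}},
}

# Map each $id to its schema so no server is needed to resolve the $refs.
resolver = jsonschema.RefResolver(
    base_uri=image_schema["$id"],
    referrer=image_schema,
    store={s["$id"]: s for s in (image_schema, omero_schema)},
)
jsonschema.validate({"omero": {"channels": []}}, image_schema, resolver=resolver)
```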

],
"type": "gaussian",
"metadata": {
"method": "skimage.transform.pyramid_gaussian",
Member:

Out of curiosity: since this is very Python-focused, if one is reading the file in Java, for example, a similar function will need to be found.

Member Author:

I think that's just documenting how the file was created. It doesn't affect how you'd read the data in Python or Java.

Member:

I was thinking if we should follow a similar strategy for the rendering settings e.g. Fiji.salt&Pepper cc @dominikl

@will-moore (Member, Author):

@joshmoore I added a couple more examples.
One is v0.2, which is valid without axes. I managed to get this passing with JSON-schema and SALAD, but not shacl.
Also added invalid_axes_count.json (and made it pass for shacl and JSON-schema) and invalid_axes_order.json, which doesn't have axes in tczyx order; I don't know how to make that fail with any of the methods.
Current failing tests:

jsonschema-invalid_axes_order
jsonld-image_v0.2
jsonld-invalid_axes_order
salad-invalid_axes_order
# These are all due to no min-length for arrays in SALAD
salad-invalid_axes_count
salad-no_datasets
salad-no_axes
salad-no_multiscales

@joshmoore (Member):

Branch updated so all files live under 0.3 and tests are run automatically. The general plan would be:

  • for "current versions" (i.e. 0.3 and perhaps 0.4 depending on @constantinpape's willingness to buy in to validation early), we provide https://json-schema.org/ files that can be used by end users for validation.
  • Moving forward (0.4+ or 0.5+), we add minimal jsonld markup (top-level @type and potentially @context) so that we can begin using the more powerful shacl validation but we will attempt to upgrade the json-schema to continue to be applicable.
  • Eventually a framework may be found (openminds, schema_salad) or written/adapted (ome_types/pydantic) to allow us to generate both json-schema and shacl, and perhaps also model objects for working with the extensible model.

Description will need updating.

@joshmoore joshmoore changed the title Salad schema validation NGFF json validation Nov 24, 2021
@joshmoore (Member):

Hmm... ok. I had tried to make v0.2 disallowed in the shacl ("if version != 0.3, fail()"), but I guess that was failing for some reason.

@will-moore (Member, Author):

Yes, the v0.2 file fails for shacl (which is not skipped).
I just wanted to put the v0.2 file into the valid group (since it is valid) and skip the shacl test, instead of skipping the JSON schema test which (correctly) finds the v0.2 is valid.

@joshmoore (Member):

Guess I'm not overly convinced that we want to consider a v0.2 versioned file as a valid v0.3. i.e. I'd almost (even in this PR) copy the schema to v0.2 and do that validation under 0.2/.

@will-moore (Member, Author):

OK, that comes back to my previous question of how to handle versions. Does the validation code need to inspect the file, read the version and pick a schema, or can a single top-level schema handle that?

"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "collection_schema.json",
"name": "JSON Schema for NGFF Collection",
"description": "Attempt at a JSON Schema to create valid JSON-LD descriptions. Limited to using a few schema.org properties.",
Member:

I know this was used to drive the validation, but having a collection.schema under the 0.3 namespace feels fairly incorrect. Can we split it into a separate branch, or potentially a separate directory, e.g. {draft,in-progress}?

Member Author:

I'll remove it for now and keep a local copy until we're ready to work on this again.

@@ -0,0 +1,37 @@
{
"$id": "http://localhost:8000/omero.schema.json",
Member:

Why is this not 0.3/schemas/json_schema/omero.schema to match the previous naming scheme?

Member Author:

This isn't actually used for validation any more, so I'll remove it.

@joshmoore (Member):

@will-moore Does the validation code need to inspect the file, read the version and pick a schema, or can a single top-level schema handle that?

I think it will be the former. Certainly in the JSON-LD space, you use the context to decide what version it is. For the moment, ome-zarr-py already detects and can then use the correct json-schema file.
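The "inspect, then dispatch" approach can be sketched in a few lines. This is only an illustration — the filenames, the mapping, and the fallback version are assumptions, not what ome-zarr-py actually does:

```python
# Hypothetical mapping of spec version -> schema file; names are assumptions.
SCHEMAS = {
    "0.1": "schemas/image_v0.1.schema",
    "0.2": "schemas/image_v0.2.schema",
    "0.3": "schemas/image_v0.3.schema",
}

def pick_schema(zattrs, default="0.3"):
    """Read the multiscales version from a .zattrs dict and pick a schema file."""
    multiscales = zattrs.get("multiscales", [])
    version = multiscales[0].get("version", default) if multiscales else default
    return SCHEMAS.get(version)

print(pick_schema({"multiscales": [{"version": "0.2"}]}))  # schemas/image_v0.2.schema
```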

@sbesson (Member) left a comment:

Overall, the addition of a first set of validation schemas together with valid/invalid examples as well as the validation code makes sense and I am all for getting this merged so that more can be added.

The newly introduced validation GitHub workflow is invalid as noted inline. Can we get this CI check to pass before merging?

A few other longer term thoughts:

  • are the additional validation todos from the description going to be turned into an issue?
  • I assume the <version>/schemas/{json_schema,jsonld,salad_schema} hierarchy is largely justified by the fact that we are still investigating validation frameworks. I assume we would simply put schemas under <version>/schemas/ in the long run.

As a final point, the new schemas effectively start defining an ngff:Image concept, which will probably force us to clarify certain aspects in the specification. It certainly raises a few immediate questions, e.g. must an ngff:Image always contain multiscales metadata, or are label images subclasses of ngff:Image?

.github/workflows/validation.yml (outdated; resolved)
@joshmoore (Member):

======================== 35 passed, 3 skipped in 1.71s =========================

@joshmoore (Member):

I assume we would simply put schemas under /schemas/ in the long run

Likely, unless we definitely will be keeping multiple over time.

@joshmoore (Member):

It certainly raises a few immediate questions e.g. must a ngff:Image always contain multiscales metadata or are label images subclasses of ngff:Image?

You mean from the shacl work? I think that would be introduced in 0.4+, they are only in the examples so that the validation works at all. (But generally, yes: I assume images always have multiscales and Labels are subclasses of Image)

@sbesson sbesson merged commit 37a4238 into ome:main Nov 26, 2021
github-actions bot added a commit that referenced this pull request Nov 26, 2021
NGFF json validation

SHA: 37a4238
Reason: push, by @sbesson

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@joshmoore joshmoore mentioned this pull request Nov 26, 2021
@sbesson sbesson added this to the 0.4 milestone Feb 1, 2022
@imagesc-bot:

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11
