Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validator should allow extensions to data_types which only define data_type_inc #585

Closed
5 tasks done
dsleiter opened this issue Apr 22, 2021 · 10 comments · Fixed by #609
Closed
5 tasks done

Validator should allow extensions to data_types which only define data_type_inc #585

dsleiter opened this issue Apr 22, 2021 · 10 comments · Fixed by #609
Labels
category: bug errors in the code or code behavior

Comments

@dsleiter
Copy link
Contributor

Description

This comes out of the schema discussion here: hdmf-dev/hdmf-schema-language#13
in the context of how to handle extra fields #542

If a spec group/dataset is defined with only a data_type_inc, it should be allowed to extend the data_type without defining a new data type via data_type_def. For example, a group can contain a dataset which both includes/inherits from VectorData and extends to add a new attribute without defining data_type_def.

- data_type_def: MyTable
  datasets:
  - name: new_column
    data_type_inc: VectorData
    attributes:
    - name: new_attr
      dtype: text
      doc: a new attribute

The validator should allow for these circumstances and validate against all changes/extensions to the original data_type.

Ref:

implementation notes:

The current validator creates a map of specs/validators (ValidatorMap) using the defined data_type (via data_type_def) as the key. When validating a builder against the above spec, it validates new_column against the original spec for VectorData rather than the modified version above, and not return a MissingError if the attribute is missing. Furthermore, as of #542, an ExtraFieldWarning will be returned saying that new_attr is not part of the spec.

The validator will need to keep track of where it is in the spec tree rather than relying on specific data types. Perhaps the ValidatorMap will only be used for base-level data_types.

Steps to Reproduce

The following should return a MissingError because the dataset builder does not have the new_attr attribute.

from hdmf.build import GroupBuilder, DatasetBuilder
from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace, DatasetSpec, AttributeSpec
from hdmf.validate import ValidatorMap

spec_catalog = SpecCatalog()
dataset_spec = DatasetSpec(doc='A test dataset', data_type_def='Bar')
group_spec = GroupSpec(
    doc='A test group',
    data_type_def='Foo',
    datasets=[
        DatasetSpec(
            'A modified Bar',
            data_type_inc='Bar',
            attributes=[
                AttributeSpec(name='new_attr', doc='An extra attribute', dtype='text')
            ]
        )
    ]
)
spec_catalog.register_spec(dataset_spec, 'test.yaml')
spec_catalog.register_spec(group_spec, 'test.yaml')
namespace = SpecNamespace('a test namespace', 'test_core', [{'source': 'test.yaml'}],
                          version='0.1.0', catalog=spec_catalog)
vmap = ValidatorMap(namespace)

builder = GroupBuilder('foo', attributes={'data_type': 'Foo'},
                       datasets=[DatasetBuilder('bar', attributes={'data_type': 'Bar'})])
result = vmap.validate(builder)
print(result)

Furthermore, after re-merging of #542, if the new_attr attribute is present, it should not return an ExtraFieldWarning.

Environment

Python Executable: Conda
Python Version: Python 3.8
Operating System: Linux
HDMF Version: 2.4.0

Checklist

  • Have you ensured the bug was not already reported ?
  • Have you included a brief and descriptive title?
  • Have you included a clear description of the problem you are trying to solve?
  • Have you included a minimal code snippet that reproduces the issue you are encountering?
  • Have you checked our Contributing document?
@dsleiter dsleiter added the category: bug errors in the code or code behavior label Apr 22, 2021
@dsleiter
Copy link
Contributor Author

@rly @oruebel is there any difference in what types of extensions should be allowed when a spec defines data_type_inc but not data_type_def vs when it defines both?

@oruebel
Copy link
Contributor

oruebel commented Apr 22, 2021

I think additions (i.e., new attributes, datasets, etc.) and restrictions (i.e., defining a more specific datatype or shape for a dataset that are compliant with the parent spec) should be allowed. Changes to existing fields that violate the parent schema should not be allowed, however.

For the validator, I think it could be useful if this check would be a configurable behavior (e.g., --warn-on-type-modifications). I.e., by default let this pass and allow a user to enable more strict checking to warn if a data_type is being modified without creating a data_type_def. Although for the validation of data files this may not be necessary, but is something that would be useful for schema validator not a data validator.

@dsleiter
Copy link
Contributor Author

Thinking further about this, it seems like this would essentially allow for diamond inheritance.

Example: imagine groups A and B where B inherits from A. Then you have another group Foo which has a child-group A where it adds an attribute. What validation should be done if a Foo builder has a child group of type B?

Example spec:

- data_type_def: A
  doc: group of type A
- data_type_def: B
  data_type_inc: A
  doc: group of type B
  attributes:
  - name: attr_b
    dtype: text
    doc: attribute of type B
- data_type_def: Foo
  doc: an example group
  groups:
  - data_type_inc: A
    doc: an extended A group
    attributes:
    - name: attr_a_ext
      dtype: text
      doc: attribute of type A

Now we have a GroupBuilder of type Foo which has a child-GroupBuilder of type B. What attributes should be expected on B? It seems to me that the most consistent expectation is that it should have both attr_b and attr_a_ext, do you agree @rly and @oruebel ?

What should happen now if B and the extended A define an attribute named attr_conflict but with a different dtype? The spec is valid because attr_conflict is an addition to A in both cases, but there doesn't seem to be any way to satisfy the spec when passing a subgroup of type B to Foo.

@dsleiter
Copy link
Contributor Author

I think additions (i.e., new attributes, datasets, etc.) and restrictions (i.e., defining a more specific datatype or shape for a dataset that are compliant with the parent spec) should be allowed. Changes to existing fields that violate the parent schema should not be allowed, however.

Makes sense. Is this not the case when data_type_def is used? i.e. could a spec change the dtype of a dataset when using data_type_inc as long as it includes data_type_def?

For the validator, I think it could be useful if this check would be a configurable behavior (e.g., --warn-on-type-modifications). I.e., by default let this pass and allow a user to enable more strict checking to warn if a data_type is being modified without creating a data_type_def. Although for the validation of data files this may not be necessary, but is something that would be useful for schema validator not a data validator.

Ok, I'll keep that in mind as I look for an approach.

@oruebel
Copy link
Contributor

oruebel commented Apr 22, 2021

i.e. could a spec change the dtype of a dataset when using data_type_inc as long as it includes data_type_def?

I believe the same restrictions apply with data_type_def, i.e., you can't make changes that violate the parent schema.

@oruebel
Copy link
Contributor

oruebel commented Apr 22, 2021

Thinking further about this, it seems like this would essentially allow for diamond inheritance.

I'm not sure that this example shows diamond inheritance, but I agree that this is a tricky case in that you modify a type but still want to allow other subtypes with the same modifications. I don't think this case occurs in any current schema so I think it should be fine to not allow subtypes of A to be used if A is being modified. This behavior I believe is identical to what would happen if you declared a new data_type_def C in Foo (instead of just modifying A), because once you create a subtype C of A you could no longer use B inside Foo because it does not inherit from the new type.

Ideally, we would not allow these kind of modifications without declaring a new type, but unfortunately that would break a number of existing parts of NWB.

@dsleiter
Copy link
Contributor Author

@oruebel thanks for the discussion. I'm diving into this deeper, and am uncovering even more questions. I'll lay them out below.


Schema vs builder validator

Do you agree that the builder validator should take whatever spec it is given to be the source of truth, that any problems with the spec should be caught by a schema validator, and that if there is a problem with the spec that the builder validator behavior is undefined?

eg. if dataset of type A has a dtype of int, and dataset of type B which inherits from A has a dtype of text - this should be a shema validation error and we don't need to specify behavior for the builder validator

Does a schema validator already exist?


Validation of inherited requirements

It looks like the current validator doesn't validate against inherited requirements.

e.g. I would expect that the following should return a validation error because attribute foo is not present on the builder of type G2.

from hdmf.build import GroupBuilder
from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace, AttributeSpec
from hdmf.validate import ValidatorMap

spec_catalog = SpecCatalog()
g1_spec = GroupSpec(doc='g1', data_type_def='G1',
                    attributes=[AttributeSpec(name='foo', doc='an attribute', dtype='text')])
g2_spec = GroupSpec(doc='g2', data_type_def='G2', data_type_inc='G1')

spec_catalog.register_spec(g1_spec, 'test.yaml')
spec_catalog.register_spec(g2_spec, 'test.yaml')
namespace = SpecNamespace('a test namespace', 'test_core', [{'source': 'test.yaml'}],
                          version='0.1.0', catalog=spec_catalog)
vmap = ValidatorMap(namespace)
builder = GroupBuilder('bar', attributes={'data_type': 'G2'})

result = vmap.validate(builder)
print(result)

But no error is returned. Is this expected behavior?


Two approaches

I've been thinking about two approaches to resolving this issue:

  1. treat extended data types without data_type_def as anonymous types in that they have a single well-defined spec (and data type inheritance tree) but without an explicit name. The validator will need to know how to combine the specs together into a single spec, and then validate any builders against that anonymous spec.
  2. treat the base data type spec and the extended spec as separate specs and validate a builder against both during validation

At first, I was thinking of (1) (and that's where I think the diamond inheritance question arises), but now I think (2) would be less complex.

(2) is nice because it:

  • avoids the complexity of trying to merge specs together
  • avoids logic needed to distinguish between providing a fixed name to a child (which doesn't extend a data type) and adding an attribute (which does extend a data type)

(2) could also be extended to resolve the issue with validating inherited requirements by just validating against the entire inheritance tree. So in that case the builder would be validated against both G2 and G1. Furthermore, this could also handle the above discussion on situations where you provide a child data type in a location where the spec extends the parent (eg. in the example above, the data type would be validated against the spec for A, B, and the extension for A. It would be impossible to satisfy the spec with a data type of type B, but we wouldn't have to create a special rule about providing a child data type in place of a parent with extended rules)

Downsides with (2) are:

  • it will be more difficult clearly communicate to the user the source of the error (eg. the user provided a data type of type B, but the error comes from validation against an extension of type A).
  • the validator will now need to manage the list of specs to validate against for each builder

(1) would require defining and programming rules for merging specs, but then allow each data type to be validated against just 1 spec.

Do you have any additional thoughts on the approach and tradeoffs?

@dsleiter
Copy link
Contributor Author

Ideally, we would not allow these kind of modifications without declaring a new type, but unfortunately that would break a number of existing parts of NWB.

Yeah, no worries, I understand why we need to make it work here. :-)

Is there any possibility of explicitly forbidding extension of data types without defining a new data type in the next major version of the hdmf-schema-language? I understand that would take extra work to update the existing specs to be in compliance (and new major release versions for those), but it also seems like untyped extensions create a larger surface area for bugs and for users to do unexpected things.

As a recent new spec creator, I can say that even after doing my best to read the documentation, it wasn't always clear what was the right way to go about defining a spec, and I often used the existing specs from hdmf-common-schema and nwb-schema as examples. So even if we need to support old spec versions in parts of hdmf, it might be good in the long-term to prevent future specs from being able to extend without defining a type.

@oruebel
Copy link
Contributor

oruebel commented Apr 28, 2021

Does a schema validator already exist?

Only in the sense of that there is a JSON-schema that you can use to validate and that the schema can be read by the spec reader. That however only means compliance with the schema language itself but does not validate inheritance rules. Having a true spec validator that also checks for compliance with a number of additional rules would be very useful to have, e.g., to check:

  • Are types always defined at the root (nesting specs can be problematic)
  • Are modifications in inherited types compliant with the base type
  • etc.

It looks like the current validator doesn't validate against inherited requirements.

This sounds like a bug. @rly thoughts?

I've been thinking about two approaches to resolving this issue:

I have to admit, this is a non-trivial issue. I'm not sure what the best solution for this is right now.

it might be good in the long-term to prevent future specs from being able to extend without defining a type.

I agree, that would be useful. One way to get (at least partially there) is to create a SpecValidator that we can set in the constructor of the spec classes (e.g,. GroupSpec) to optionally validate the spec on init and each time it is being modified.

Is there any possibility of explicitly forbidding extension of data types without defining a new data type in the next major version of the hdmf-schema-language?

The problem is less with the next major version of the language as it is with breaking NWB. I'm not against the idea, but I'm not sure right now how to do this without causing chaos.

There are a number of complicated issues here, and I think it may be best to plan for chat at some point to discuss options via videochat.

dsleiter added a commit to agencyenterprise/hdmf that referenced this issue May 13, 2021
* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue May 13, 2021
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue Jun 8, 2021
* Ref hdmf-dev#585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
dsleiter added a commit to agencyenterprise/pynwb that referenced this issue Jun 9, 2021
* These unit tests will begin to fail after the update to the hdmf
  validator that increase validation coverage
* Ref hdmf-dev/hdmf#585
* Ref hdmf-dev/hdmf#609
* See discussion: hdmf-dev/hdmf#609 (comment)
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue Jun 17, 2021
* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix hdmf-dev#585
rly added a commit that referenced this issue Jun 25, 2021
…ecs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <[email protected]>
@dsleiter
Copy link
Contributor Author

Closing this via #609 (I think it wasn't automatically closed because the merge was to rc/3.0.0)

rly added a commit that referenced this issue Jul 6, 2021
…ecs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: bug errors in the code or code behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants