Validation finds unexpected elements #542

Merged

Conversation

dsleiter
Contributor

Motivation

The validator now returns warnings on any extra fields that it finds.

How to test the behavior?

Simple example:

from hdmf.build import GroupBuilder
from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace
from hdmf.validate import ValidatorMap

# Register a spec for data type 'Foo' that defines no attributes of its own
spec_catalog = SpecCatalog()
group_spec = GroupSpec('A test group spec with no attributes', data_type_def='Foo')
spec_catalog.register_spec(group_spec, 'test.yaml')
namespace = SpecNamespace('a test namespace', 'test_core', [{'source': 'test.yaml'}],
                          version='0.1.0', catalog=spec_catalog)
vmap = ValidatorMap(namespace)

# Build a 'Foo' group carrying an attribute that is not in the spec;
# the validator should report the extra field as a warning
builder = GroupBuilder('foo', attributes={'data_type': 'Foo', 'extra_field': 'bar'})
result = vmap.validate(builder)
print(result)
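With this change, the printed result should include a warning-level result flagging extra_field as an unexpected attribute on the Foo group (the exact message format may differ).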

Review and run all new unit tests for full coverage.

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

* Ref hdmf-dev#288
* Starting a new PR but taking appropriate changes from the above PR
* Ref NeurodataWithoutBorders/pynwb#942
* GroupValidator returns ExtraFieldWarning for extra groups, datasets, and links
* Includes unit tests covering the new warning
* Attribute validation WIP
* Ref hdmf-dev#288
* Starting a new PR but taking appropriate changes from the above PR
* Ref NeurodataWithoutBorders/pynwb#942
* Temporary solution for handling reserved attributes, will be fixed in later commit
* Includes unit tests for attribute validation
…ation type issue

* Ref hdmf-dev#288
* Starting a new PR but taking appropriate changes from the above PR
* Ref NeurodataWithoutBorders/pynwb#942
* Validator now skips validation of reserved attributes provided by the spec
* Discovered and resolved an issue where the validator was validating a group
  child based on the parent group's spec instead of the actual type of the builder.
  This is important when an inheriting class has additional attributes.
* Added a regression test case for this issue
Contributor

@ajtritt ajtritt left a comment

Code looks great, thanks for taking this on. A few change requests. There's only one real issue (see the SpecMatch object); the rest are trivial docstring requests.

@dsleiter
Contributor Author

@ajtritt thanks for the review! Good suggestions.

In working through this, I began wondering if it could be beneficial to introduce an AttributeBuilder class. I don't know if this would help or hurt across other parts of hdmf, but in the validator here I think it would help with two things:

  1. it would provide a separation between attributes coming from the user API layer (containers) and attributes coming from the data translation layer (builders), so there would be no need to define "reserved" attributes
  2. attribute names and values would be paired together, and the class could provide functionality such as builder location and access to an attribute's parent

If you think this could be beneficial, I can create an issue to discuss it in more detail.
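For illustration only, a minimal sketch of what such a (hypothetical) AttributeBuilder could look like; the class name, fields, and location property are assumptions, not existing hdmf API:

class AttributeBuilder:
    """Hypothetical builder pairing an attribute name with its value and its owner."""

    def __init__(self, name, value, parent=None):
        self.name = name      # attribute name, e.g. 'resolution'
        self.value = value    # attribute value from the data translation layer
        self.parent = parent  # the GroupBuilder/DatasetBuilder that owns the attribute

    @property
    def location(self):
        """Dotted path of the attribute within the builder hierarchy."""
        if self.parent is None:
            return self.name
        return '{}.{}'.format(self.parent.name, self.name)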

@dsleiter dsleiter requested a review from ajtritt March 22, 2021 06:25
@dsleiter
Contributor Author

@ajtritt @rly I merged in changes from dev, but unfortunately now we have failing tests.

The cause seems to be a combination of three things:

  1. the new EnumData, which is part of the hdmf-experimental namespace (see Integrate experimental namespace #545)
  2. the existing validator can't handle multiple namespaces (were you working on something related to this recently @rly?)
  3. because EnumData is not part of the hdmf-common namespace, the validator doesn't know that it inherits from VectorData and thinks that it is an extra field when validating a DynamicTable containing an EnumData, so as of this PR it now returns a warning

I think the ideal solution is to make the ValidatorMap aware of multiple namespaces, but that is probably better handled as a separate issue. Do you both agree?

Until that issue is complete, how would you feel about adjusting H5RoundTripMixin.validate to only fail on validation errors and pass / warn on validation warnings? Or do you have any other suggestions?

@rly
Contributor

rly commented Mar 30, 2021

Nice job hunting down the source of the issue.

Until that issue is complete, how would you feel about adjusting H5RoundTripMixin.validate to only fail on validation errors and pass / warn on validation warnings? Or do you have any other suggestions?

Yes, I think that is a great idea.

@rly
Contributor

rly commented Mar 30, 2021

the existing validator can't handle multiple namespaces (were you working on something related to this recently @rly?)

I was, but it is not yet ready. If you want, you can take a stab at it, but I won't be able to get to it until at least a week post-hackathon.

* Responding to: hdmf-dev#542 (comment)
* Create severity level constants for warnings and errors
@dsleiter
Contributor Author

dsleiter commented Apr 6, 2021

Until that issue is complete, how would you feel about adjusting H5RoundTripMixin.validate to only fail on validation errors and pass / warn on validation warnings?

I updated the H5RoundTripMixin to only fail on validation errors and warn on validation warnings.

I'm not entirely certain the warning inside the unit test is valuable (it might just add noise), so let me know if you'd prefer it without the warning and I'll update.
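A rough sketch of that kind of split (not the actual H5RoundTripMixin code; how warnings are told apart from errors is left as a caller-supplied predicate because the exact API is not shown here):

import warnings

def check_validation_results(results, is_warning):
    """Fail only on validation errors; surface validation warnings as Python warnings.

    results: the list returned by ValidatorMap.validate()
    is_warning: caller-supplied predicate, e.g. an isinstance check against the
    new warning class or a severity comparison (an assumption for illustration)
    """
    errors = [r for r in results if not is_warning(r)]
    for r in results:
        if is_warning(r):
            warnings.warn(str(r))  # report but do not fail the test
    if errors:
        raise AssertionError('\n'.join(str(e) for e in errors))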

@dsleiter
Contributor Author

dsleiter commented Apr 6, 2021

I see that many pynwb tests failed due to the new validation warning. Most of them look like they'll be fixed by adding neurodata_type to the reserved attributes, but there might be other extra field warnings beyond that.

As a longer-term strategy (once we fix validation of multiple namespaces), do we want tests to fail on validator warnings or only on errors?

@t-b
Contributor

t-b commented Apr 13, 2021

I see that many pynwb tests failed due to the new validation warning. Most of them look like they'll be fixed by adding neurodata_type to the reserved attributes, but there might be other extra field warnings beyond that.

Do you have some examples about pynwb validation warnings due to this PR here?

@dsleiter
Contributor Author

@t-b the errors I was referring to are from the ci/circleci: pynwb-dev-pythons38 job here: https://app.circleci.com/pipelines/github/hdmf-dev/hdmf/2866/workflows/bb203ca4-2953-4094-8247-c80247e99006/jobs/33759

Many errors were happening because the validator was not recognizing neurodata_type as the type_key for pynwb. Thanks for your question, though: in describing the cause, I figured out how to fix the issue. It is now resolved via this commit.

There are a few remaining pynwb tests failing, but most don't appear to have anything to do with this PR. There are 4 failing tests related to extra field warnings, all of them from the integration tests in hdf5.test_misc.TestUnitsIO, but I think these are actually valid warnings:

  ERROR in test test_get_obs_intervals (hdf5.test_misc.TestUnitsIO):
  Traceback (most recent call last):
    File "/home/circleci/project/pynwb/tests/integration/hdf5/test_misc.py", line 63, in test_get_obs_intervals
      ut = self.roundtripContainer()
    File "/home/circleci/project/pynwb/src/pynwb/testing/testh5io.py", line 101, in roundtripContainer
      self.validate()
    File "/home/circleci/project/pynwb/src/pynwb/testing/testh5io.py", line 172, in validate
      raise Exception(err)
  Exception: VectorData (acquisition/UnitsTest/spike_times.resolution): encountered extra field

The schema for Units has a dataset named spike_times which is of type VectorData. spike_times has an additional attribute defined: resolution.
https://github.com/NeurodataWithoutBorders/nwb-schema/blob/e13c24ca24d6d17fa4ce8f3e6f40bd8894432f83/core/nwb.misc.yaml#L177-L200
However, when the validator runs, it compares the spike_times builder against the spec for VectorData, which does not include resolution, and therefore raises the warning.

Should specs that only define neurodata_type_inc be allowed to add fields to a data type?

If not, then I think either spike_times needs to define a new type via neurodata_type_def, or resolution needs to be moved somewhere else.

Do you agree with my evaluation?
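For reference, a simplified sketch of the pattern in question, written with hdmf.spec classes rather than the nwb-schema YAML (the UnitsLike type and doc strings are made-up placeholders, not the real Units spec):

from hdmf.spec import AttributeSpec, DatasetSpec, GroupSpec

# An inner dataset that only *includes* VectorData (data_type_inc) but adds an
# attribute of its own. When the validator checks the spike_times builder against
# the VectorData spec, the added attribute looks like an extra field.
spike_times = DatasetSpec(doc='Spike times for each unit',
                          name='spike_times',
                          data_type_inc='VectorData',
                          attributes=[AttributeSpec('resolution',
                                                    'Smallest possible difference between two spike times',
                                                    'float64')])
units_like = GroupSpec(doc='A simplified Units-like table',
                       data_type_def='UnitsLike',
                       datasets=[spike_times])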

@t-b
Contributor

t-b commented Apr 14, 2021

@t-b the errors I was referring to are from the ci/circleci: pynwb-dev-pythons38 job here: https://app.circleci.com/pipelines/github/hdmf-dev/hdmf/2866/workflows/bb203ca4-2953-4094-8247-c80247e99006/jobs/33759

There were many errors happening because the validator was not recognizing neurodata_type as the type_key for pynwb, but thanks for your question, as in describing the cause I determined how to fix the issue. It is now resolved via this commit.

Okay, question answered ;)

The schema for Units has a dataset named spike_times which is of type VectorData. spike_times has an additional attribute defined: resolution.
https://github.com/NeurodataWithoutBorders/nwb-schema/blob/e13c24ca24d6d17fa4ce8f3e6f40bd8894432f83/core/nwb.misc.yaml#L177-L200
However, when the validator runs, it compares the spike_times builder against the spec for VectorData, which does not include resolution, and therefore raises the warning.

Should specs that only define neurodata_type_inc be allowed to add fields to a data type?

If not, then I think either spike_times needs to define a new type via neurodata_type_def, or resolution needs to be moved somewhere else.

Do you agree with my evaluation?

Yes, I agree. Specifically, to the question:

Should specs that only define neurodata_type_inc be allowed to add fields to a data type?

No. This is also my understanding when rereading https://schema-language.readthedocs.io/en/latest/specification_language_description.html#group-specification-keys.

I don't know what the fallout from changing something that deep would be, though. If we fix it, I think we should already complain when parsing the spec YAML files.

@rly Any thoughts?

@dsleiter
Contributor Author

No. This is also my understanding when rereading https://schema-language.readthedocs.io/en/latest/specification_language_description.html#group-specification-keys.

Ok, then assuming the spec for Units does violate the schema language, I would propose to:

  1. create a bug issue to correct the Units spec (and associated container and object mapper)
  2. update the failing pynwb tests to only fail for validation errors and raise a warning on validation warnings, as part of NeurodataWithoutBorders/pynwb#1178 (Validation finds unexpected elements)

That way we can be unblocked from including the new warnings in hdmf and pynwb and discuss how to fix the spec separately.

If we fix that I think we want to complain already when parsing the spec YAML files.

This seems like a great idea to me!
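As a sketch of what such a parse-time complaint could look like (hypothetical; hdmf does not currently do this, and the check below only covers the inner datasets of a group spec):

def find_fields_added_to_included_types(group_spec):
    """Return messages for inner datasets that only include a type but add attributes."""
    problems = []
    for dataset in group_spec.datasets:
        only_includes = dataset.data_type_inc is not None and dataset.data_type_def is None
        if only_includes and dataset.attributes:
            problems.append('%s adds attributes to included type %s without defining a new type'
                            % (dataset.name or dataset.data_type_inc, dataset.data_type_inc))
    return problems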

@t-b
Contributor

t-b commented Apr 15, 2021

@dsleiter Sounds good!

The failing pynwb tests are validating old NWB files without a cached spec. So yes, making the error currently encountered into a warning sounds like a good approach.

EDIT:
I also just tested this branch against NWBv2 files created by MIES (https://github.com/AllenInstitute/MIES), and validation fails as expected due to additional fields. Nice!

ajtritt
ajtritt previously approved these changes Apr 21, 2021
Contributor

@ajtritt ajtritt left a comment

Looks good to me, nice work!

Also try to trigger GH Action for codecov...
@rly
Contributor

rly commented Apr 21, 2021

@dsleiter I'm reviewing this and so far, so good. For this PR and the future, could you please enable GitHub Actions in your fork of HDMF if it is not already enabled? I know we've had a lot of back and forth regarding how to trigger CI on forks, with varying success. I think doing this would allow the codecov CI run, but I am not 100% sure. I might still need to update the workflow or do something like what is mentioned here: https://securitylab.github.com/research/github-actions-preventing-pwn-requests

@rly
Contributor

rly commented Apr 21, 2021

@dsleiter Oh, never mind. It looks like GitHub Actions does run and uploads to https://app.codecov.io/gh/agencyenterprise/hdmf and probably for security reasons, codecov does not comment in this PR and the checks are not shown here. I might look into this more, but it's not that important.

@rly
Contributor

rly commented Apr 21, 2021

Looks all good to me! Thanks @dsleiter!

@rly rly merged commit 67cffec into hdmf-dev:dev Apr 21, 2021
@rly
Contributor

rly commented Apr 21, 2021

Should specs that only define neurodata_type_inc be allowed to add fields to a data type?

If not, then I think either spike_times needs to define a new type via neurodata_type_def, or resolution needs to be moved somewhere else.

I'll look into this before we release the next version of nwb-schema.

@dsleiter
Contributor Author

Thanks for the reviews and merge @ajtritt and @rly!

@rly RE: codecov - nice to see that the GitHub Actions are working and that the report is at least uploaded to our codecov. It looks like the comment on the PR is made by a codecov bot - do you have that GitHub app (https://github.com/apps/codecov) configured for this repo? I will see if we can get it installed on agencyenterprise/hdmf, in which case maybe it'll post from our codecov to any of my PRs here.

@t-b
Contributor

t-b commented Apr 21, 2021

@dsleiter Yes, nice work. Especially thanks for picking it up where I left it.

rly added a commit that referenced this pull request Apr 22, 2021
rly added a commit that referenced this pull request Apr 22, 2021
rly added a commit that referenced this pull request Apr 22, 2021
dsleiter added a commit to agencyenterprise/hdmf that referenced this pull request May 13, 2021
* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
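
For illustration, one way to make validation results comparable so duplicates can be dropped (a hypothetical stand-in, not the actual hdmf Error class):

class ComparableIssue:
    """Hypothetical validation result whose identity is its (name, reason, location)."""

    def __init__(self, name, reason, location=None):
        self.name = name
        self.reason = reason
        self.location = location

    def _key(self):
        return (self.name, self.reason, self.location)

    def __eq__(self, other):
        return isinstance(other, ComparableIssue) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

def deduplicate(results):
    """Drop duplicate results while preserving order."""
    return list(dict.fromkeys(results))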
rly added a commit that referenced this pull request Jun 25, 2021
…ecs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <[email protected]>
rly added a commit that referenced this pull request Jul 6, 2021