As a researcher, I want my uploaded provenance file to be validated so that I know it's in an expected format #4378

Closed
matthew-a-dunlap opened this issue Dec 12, 2017 · 19 comments

Comments

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Dec 12, 2017

After a researcher enters or uploads their provenance information, we should ensure that the provenance JSON matches the format we expect. We will need some form of validation to ensure our users are uploading and sharing usable provenance metadata.

It is an open question whether there will be a schema file we can validate against. If there is, we will need to load that schema and use it for validation. If not, we will likely fall back on manual, hardcoded validation based on the objects we expect.
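For illustration only, the "manual hardcoded validation" fallback could look roughly like the sketch below. It is not the Dataverse implementation. The function name `validate_prov_text` is hypothetical, and `KNOWN_SECTIONS` is an assumed, deliberately incomplete subset of PROV-JSON top-level section names. Python's standard library is used only to keep the sketch self-contained.

```python
import json

# Assumption: an illustrative subset of PROV-JSON top-level section names,
# not the complete list from the specification.
KNOWN_SECTIONS = {
    "prefix", "entity", "activity", "agent",
    "wasGeneratedBy", "used", "wasDerivedFrom",
    "wasAttributedTo", "wasAssociatedWith",
}


def validate_prov_text(text):
    """Return a list of problems; an empty list means these checks passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    for name, section in doc.items():
        if name not in KNOWN_SECTIONS:
            problems.append(f"unexpected section '{name}'")
        elif not isinstance(section, dict):
            # Each section maps identifiers to their attributes.
            problems.append(f"section '{name}' must be a JSON object")
    return problems


if __name__ == "__main__":
    print(validate_prov_text('{"entity": {"e1": {}}, "bogus": []}'))
```

The drawback of this approach is visible in the sketch: the list of expected objects has to be maintained by hand and will reject valid documents whenever it is incomplete, which is why a published schema is preferable if one exists.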

Relates to #4343

@pdurbin
Member

pdurbin commented Dec 12, 2017

There is a schema file. It's at https://www.w3.org/Submission/prov-json/schema

It's mentioned at https://www.w3.org/Submission/prov-json/#validation

Here's what it says:

A schema for PROV-JSON is provided, which defines all the valid PROV-JSON constructs described in this document. The schema was written using the schema language specified in [JSON-SCHEMA] (Version 4). It can be used for the purpose of validating PROV-JSON documents. A number of libraries for JSON schema validation are available at json-schema.org/implementations.html.

My understanding is that we should look for a suitable Java library at http://json-schema.org/implementations.html and see if we can get it to validate documents against the JSON Schema above or some other schema.
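To make the idea of schema-driven validation concrete, here is a toy checker in which the rules live in a schema document rather than in code. It is emphatically not a JSON Schema implementation: it handles only the `type` (single string), `required`, and `properties` keywords. `TOY_SCHEMA` is a hypothetical stand-in, not the PROV-JSON schema, and `check` is a made-up name. A real integration would use one of the libraries listed on json-schema.org.

```python
# Toy checker: NOT a full JSON Schema (draft 4) implementation.
# Supports only `type` (as a single string), `required`, and `properties`.
_TYPES = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}

# Hypothetical schema for illustration; not the PROV-JSON schema.
TOY_SCHEMA = {
    "type": "object",
    "required": ["entity"],
    "properties": {
        "entity": {"type": "object"},
        "prefix": {"type": "object"},
    },
}


def _matches(value, type_name):
    # In Python, bool is a subclass of int, so exclude it from numeric types.
    if type_name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, _TYPES[type_name])


def check(instance, schema, path="$"):
    """Return a list of error strings for the supported keywords."""
    expected = schema.get("type")
    if isinstance(expected, str) and not _matches(instance, expected):
        return [f"{path}: expected {expected}"]

    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(check(instance[key], subschema, f"{path}.{key}"))
    return errors


if __name__ == "__main__":
    print(check({"entity": []}, TOY_SCHEMA))
```

Because the rules are data, swapping in a different schema changes what is accepted without touching the checking code, which is the appeal of using the published schema with an off-the-shelf validator.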

I think it'll be nice to be able to validate against JSON Schemas. We've even been asked by the community to provide some sort of schema for the "native" JSON we use in Dataverse, but I can't find the thread at the moment.

@djbrooke djbrooke changed the title from "Validate provenance information added by researchers" to "As a researcher, I want my uploaded provenance file to be validated so that I know it's in an expected format" Jan 25, 2018
@matthew-a-dunlap
Contributor Author

We should maybe prioritize this issue, @djbrooke. Right now we do some validation, but nothing that actually checks against the full JSON schema.

The longer our code goes without it, the more likely we are to end up with invalid prov files in our system that could break things later on. If we do it soon enough, we may be able to reduce the complexity of checking validity at different points. It should be a quick, self-contained change.

Hearing about Prov interest during the community meeting reminded me that this was broken off from the first release of work.

@djbrooke
Contributor

@matthew-a-dunlap let's bring it to backlog grooming for estimation. I'm more inclined to wait until we see use of this feature before working on it, but if it comes back small I could be convinced. Thanks for tagging this!

@matthew-a-dunlap matthew-a-dunlap self-assigned this Jul 2, 2018
matthew-a-dunlap added a commit that referenced this issue Jul 3, 2018
Also code cleanup
Likely IT tests are now broken as some test prov we used may actually be broken...
matthew-a-dunlap added a commit that referenced this issue Jul 9, 2018
matthew-a-dunlap added a commit that referenced this issue Jul 9, 2018
@dlmurphy dlmurphy removed their assignment Jul 9, 2018
@jacksonokuhn

jacksonokuhn commented Jul 16, 2018 via email

@jacksonokuhn

I should honestly just take a look at the validation code though

@matthew-a-dunlap
Contributor Author

@jacksonokuhn The code really just checks the JSON against the schema, nothing more. We felt that we should apply some strictness on intake of the provenance; otherwise we may end up with a lot of data that has no use.

@jacksonokuhn

jacksonokuhn commented Jul 16, 2018 via email

@jacksonokuhn

jacksonokuhn commented Jul 23, 2018 via email

@matthew-a-dunlap
Contributor Author

@jacksonokuhn Thanks for looking into this! I agree that would be nice to have as well, since folks often won't be able to easily fix their broken provenance data...

@matthew-a-dunlap
Contributor Author

Note: The prov JUnit tests are commented out in this branch, as they are slow. When we take on #4896, they should be uncommented and added to a full test suite.

@kcondon kcondon self-assigned this Jul 27, 2018
kcondon added a commit that referenced this issue Jul 27, 2018
@kcondon kcondon closed this as completed Jul 27, 2018
@pdurbin pdurbin added this to the 4.9.2 - Stata Upgrades, etc. milestone Aug 5, 2018