[AG-1314] JSON Schema Validation GX Prototyping #111
Conversation
🔥 LGTM! Just a small comment.
LGTM!
Overall this looks like a reasonable way to handle validating nested data. A couple of general questions occurred to me.
- For more complex nested objects, would we generate a single complex schema or is there a way to extract various sub-sub-objects to validate them independently via multiple less complex schemas?
- Are there disadvantages to using this approach vs. the regular GE expectations? For simplicity, (sh/c)ould we consider using JSON schema based validation for all of our post-processing JSON file validation, and leverage the regular GE expectations for pre-processing validation once we get there?
src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json
...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json
I think as long as it is possible to create a schema that sets appropriate expectations for a more complex nested structure, we should stick with one schema. We considered the method of extracting sub-objects and nested structures for validation, but opted against it because it would result in many different GX reports being produced from one dataset validation, which would make them much less human-readable/friendly.
I don't see many disadvantages when talking about nested structures. I think it makes the most sense to use the built-in or simple custom expectations when the data is not nested, and the JSON Schema validation expectation when it is nested. Off the top of my head, I think that same approach may also be appropriate for pre-processing validation.
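To illustrate the single-schema approach discussed above: a nested structure can stay in one schema while remaining readable by factoring sub-objects into `definitions` and referencing them with `$ref`. This is a hypothetical sketch (the field names below are illustrative, not taken from the actual `gene_biodomains_schema.json` in this PR):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": { "$ref": "#/definitions/biodomain_entry" },
  "definitions": {
    "biodomain_entry": {
      "type": "object",
      "properties": {
        "biodomain": { "type": "string" },
        "go_terms": {
          "type": "array",
          "items": { "type": "string" }
        }
      },
      "required": ["biodomain"]
    }
  }
}
```

Because each sub-object lives under `definitions`, the whole nested structure is still validated in a single pass, producing one GX report per dataset.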
...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json
src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json
Quality Gate passed. Kudos, no new issues were introduced! 0 new issues.
Looks great!
Problem:
`agora-data-tools` processes datasets that contain nested fields. We currently have nothing in place to evaluate the content of those nested fields during our GX data validation step.

Potential Solution:
Implement the `expect_column_values_to_match_json_schema` expectation for nested data fields. This PR is a POC demonstrating the types of changes we would need to make to the codebase to support such a change.

Notes:
- Uses the `genes_biodomains` expectation suite, since it only has one nested field.
- Adds a `nested_columns` attribute and a method to convert nested fields into JSON-parseable strings to `GreatExpectationsRunner`.
- Adds a `gx_nested_columns` field to the configuration files to pass information about which columns need to be converted to the JSON string.
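A minimal sketch of the conversion step described in the notes, assuming a list-of-dicts stand-in for the DataFrame. The function name and record fields below are hypothetical, not the actual `GreatExpectationsRunner` method; the point is that `expect_column_values_to_match_json_schema` validates JSON strings, so nested values must be serialized first:

```python
import json


def convert_nested_columns(rows, nested_columns):
    """Serialize nested (dict/list) values in the named columns to JSON strings.

    Hypothetical sketch of the idea in this PR: columns listed under
    `gx_nested_columns` in the config would be serialized so that GX's
    expect_column_values_to_match_json_schema can validate them.
    """
    converted = []
    for row in rows:
        new_row = dict(row)
        for col in nested_columns:
            # Leave values that are already strings (or missing) untouched.
            if col in new_row and not isinstance(new_row[col], str):
                new_row[col] = json.dumps(new_row[col])
        converted.append(new_row)
    return converted


# Example: a genes_biodomains-style record with one nested field
# (field names are illustrative).
rows = [
    {
        "ensembl_gene_id": "ENSG00000139618",
        "gene_biodomains": [{"biodomain": "Immune Response", "n_terms": 12}],
    }
]
out = convert_nested_columns(rows, ["gene_biodomains"])
```

After conversion, `out[0]["gene_biodomains"]` is a JSON string that a schema-matching expectation can parse and validate, while non-nested columns pass through unchanged.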