[AG-1314] JSON Schema Validation GX Prototyping #111

BWMac · 2023-12-20T21:53:22Z

Problem:

agora-data-tools processes datasets that contain nested fields. We currently have nothing in place to evaluate the content of those nested fields during our GX data validation step.

Potential Solution:

Implement the expect_column_values_to_match_json_schema expectation for nested data fields. This PR is a POC demonstrating the kinds of changes we would need to make to the codebase to support it.

Notes:

- For simplicity, I added JSON schema validation to the existing genes_biodomains expectation suite, since it only has one nested field.
- I generated (and then edited) a JSON schema from here.
- To support the JSON schema validation expectation, I added to GreatExpectationsRunner a nested_columns attribute and a method that converts nested fields into JSON-parseable strings.
- I added the gx_nested_columns field to the configuration files to specify which columns need to be converted to JSON strings.
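
As a rough illustration of the conversion step described in the notes above, here is a minimal sketch on plain Python records. It is not the code in this PR: the function name `nested_to_json_strings`, the sample column names, and the use of a list of dicts (rather than whatever structure GreatExpectationsRunner actually holds) are all assumptions made for the example.

```python
import json
from typing import Any, Dict, List


def nested_to_json_strings(
    records: List[Dict[str, Any]], nested_columns: List[str]
) -> List[Dict[str, Any]]:
    """Return copies of `records` with each nested column serialized to a JSON string."""
    converted = []
    for row in records:
        new_row = dict(row)  # shallow copy so the input rows are left untouched
        for column in nested_columns:
            # Skip missing or null values; everything else becomes a JSON string.
            if new_row.get(column) is not None:
                new_row[column] = json.dumps(new_row[column])
        converted.append(new_row)
    return converted


# Hypothetical sample row; the column names are illustrative only.
rows = [{"gene_id": "G1", "nested_field": [{"name": "a", "terms": ["t1", "t2"]}]}]
converted_rows = nested_to_json_strings(rows, nested_columns=["nested_field"])
print(converted_rows[0]["nested_field"])
```

The point of serializing first is that the JSON schema expectation is evaluated against string values, so a column holding lists or dicts has to be turned into JSON text before validation.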

src/agoradatatools/process.py

src/agoradatatools/gx.py

thomasyu888

🔥 LGTM! Just a small comment.

src/agoradatatools/gx.py

BryanFauble

LGTM!

JessterB

Overall this looks like a reasonable way to handle validating nested data. A couple of general questions occurred to me.

- For more complex nested objects, would we generate a single complex schema, or is there a way to extract various sub-sub-objects and validate them independently via multiple less complex schemas?
- Are there disadvantages to using this approach vs. the regular GE expectations? For simplicity, (sh/c)ould we consider using JSON Schema based validation for all of our post-processing JSON file validation, and leverage the regular GE expectations for pre-processing validation once we get there?

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json

...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json

BWMac · 2024-01-03T20:49:30Z

@JessterB

> For more complex nested objects, would we generate a single complex schema or is there a way to extract various sub-sub-objects to validate them independently via multiple less complex schemas?

I think as long as it is possible to create a schema that sets appropriate expectations for a more complex nested structure, we should stick with one schema. We considered extracting sub-objects and nested structures for separate validation, but opted against it: a single dataset validation would then produce many different GX reports, which would make them much less human-readable.
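
To make the "one schema" idea concrete, here is a hedged sketch of a single JSON Schema that covers a nested structure (a list of objects, each with its own nested list) wrapped in one expectation entry. The column name `nested_field` and the property names are hypothetical and are not taken from the schema in this PR; the `column` and `json_schema` kwargs are the ones expect_column_values_to_match_json_schema accepts.

```python
import json

# Hypothetical schema: each cell holds a list of objects, and every object
# carries a string plus a nested list of strings. One schema describes it all.
nested_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "terms": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["name", "terms"],
    },
}

# A single expectation entry, in the shape used inside an expectation suite file.
expectation = {
    "expectation_type": "expect_column_values_to_match_json_schema",
    "kwargs": {"column": "nested_field", "json_schema": nested_schema},
    "meta": {},
}

print(json.dumps(expectation, indent=2))
```

Because the nested levels live inside one schema, the column still yields one expectation result in one report, which is the readability argument made above.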

> Are there disadvantages to using this approach vs the regular GE expectations? For simplicity, (sh/c)ould we consider using json schema based validation for all of our post-processing json file validation, and leverage the regular GE expectations for pre-processing validation once we get there?

I don't see many disadvantages when it comes to nested structures. I think it makes the most sense to use the built-in or simple custom expectations when the data is not nested, and the JSON Schema validation expectation when it is. Off the top of my head, I think the same approach may also be appropriate for pre-processing validation.

...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json

sonarcloud · 2024-01-04T16:45:50Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

jaclynbeck-sage

Looks great!

BWMac added 8 commits December 20, 2023 14:47

updates genes_biodomains expectation suite

9dd8ac4

adds nested column support in GreatExpectationsRunner class

baa81c8

adds gene_biodomains JSON schema

1f3c9c4

updates process for nested columns

91d26aa

adds nested columns config for genes_biodomains

b278a5b

updates gx unit tests

67fa5dc

clean up expectation suite

1237a89

merge dev Merge branch 'dev' into bwmac/AG-1314/nested_data_validation_prototyping

069955d

thomasyu888 reviewed Dec 21, 2023

src/agoradatatools/process.py (outdated)

thomasyu888 reviewed Dec 21, 2023

src/agoradatatools/gx.py (outdated)

BWMac added 5 commits December 21, 2023 11:12

change nested_columns to None default

f055d12

adds missing type hint

5ee80c9

properly test nested_columns

de09151

adds type hints

98dbebd

revert type hints - Python 3.8

4b1ecd1

BWMac marked this pull request as ready for review December 21, 2023 18:33

BWMac requested review from JessterB and jaclynbeck-sage December 21, 2023 18:34

adds type hints

959c5fe

thomasyu888 approved these changes Dec 22, 2023

src/agoradatatools/gx.py

BryanFauble approved these changes Dec 28, 2023

JessterB reviewed Jan 3, 2024

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json (outdated)

...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json

BWMac requested a review from JessterB January 3, 2024 20:49

BWMac added 3 commits January 3, 2024 13:52

merge pre-commit changes Merge branch 'dev' into bwmac/AG-1314/nested_data_validation_prototyping

a246c2f

addresses sonarcloud issues

7b3ddec

shortens function names for sonarcloud

384efe0

jaclynbeck-sage reviewed Jan 3, 2024

...oradatatools/great_expectations/gx/json_schemas/genes_biodomains/gene_biodomains_schema.json (outdated)

jaclynbeck-sage reviewed Jan 3, 2024

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json

BWMac requested a review from jaclynbeck-sage January 4, 2024 16:43

BWMac added 2 commits January 4, 2024 09:43

updates expectation suite

d60b182

updates for pre-commit

6862af8

jaclynbeck-sage approved these changes Jan 4, 2024

JessterB approved these changes Jan 4, 2024

BWMac merged commit 6ad9e10 into dev Jan 4, 2024
9 checks passed

BWMac deleted the bwmac/AG-1314/nested_data_validation_prototyping branch January 4, 2024 20:27
