Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AG-1314] JSON Schema Validation GX Prototyping #111

Merged
merged 19 commits into from
Jan 4, 2024

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Dec 20, 2023

Problem:

agora-data-tools processes datasets that contain nested fields. We currently have nothing in place to evaluate the content of those nested fields during our GX data validation step.

Potential Solution:

Implement the expect_column_values_to_match_json_schema expectation for nested data fields. This PR is a POC demonstrating the types of changes we would need to implement to the codebase to support such a change.

Notes:

  • For simplicity, I added JSON schema validation to the existing genes_biodomains expectation suite, since it only has one nested field.
  • I generated (and then edited) a JSON schema from here.
  • To support the JSON schema validation expectation, I added a nested_columns attribute and a method to convert nested fields into JSON-parseable strings to GreatExpectationsRunner.
  • I added the gx_nested_columns field to the configuration files to pass information about which columns need to be converted to the JSON string.

@BWMac BWMac marked this pull request as ready for review December 21, 2023 18:33
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! Just a small comment.

src/agoradatatools/gx.py Show resolved Hide resolved
Copy link
Contributor

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks like a reasonable way to handle validating nested data. A couple of general questions occurred to me.

  1. For more complex nested objects, would we generate a single complex schema or is there a way to extract various sub-sub-objects to validate them independently via multiple less complex schemas?
  2. Are there disadvantages to using this approach vs the regular GE expectations? For simplicity, (sh/c)ould we consider using json schema based validation for all of our post-processing json file validation, and leverage the regular GE expecatations for pre-processing validation once we get there?

@BWMac
Copy link
Contributor Author

BWMac commented Jan 3, 2024

@JessterB

  1. For more complex nested objects, would we generate a single complex schema or is there a way to extract various sub-sub-objects to validate them independently via multiple less complex schemas?

I think as long as it is possible to create a schema that sets appropriate expectations for a more complex nested structure, we should stick with one schema. We considered the method of extracting sub-objects and nested structures for validation but opted against it because it would result in many different GX reports being produced from one dataset validation which would make them much less human-readable/friendly.

  1. Are there disadvantages to using this approach vs the regular GE expectations? For simplicity, (sh/c)ould we consider using json schema based validation for all of our post-processing json file validation, and leverage the regular GE expecatations for pre-processing validation once we get there?

I don't see many disadvantages when talking about nested structures. I think it makes the most sense to use the built-in or simple custom expectations when the data is not nested, and the JSON Schema validation expectation when it is nested. Off the top of my head, I also think that same approach may also be appropriate for pre-processing validation.

@BWMac BWMac requested a review from JessterB January 3, 2024 20:49
@BWMac BWMac requested a review from jaclynbeck-sage January 4, 2024 16:43
Copy link

sonarcloud bot commented Jan 4, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@BWMac BWMac merged commit 6ad9e10 into dev Jan 4, 2024
9 checks passed
@BWMac BWMac deleted the bwmac/AG-1314/nested_data_validation_prototyping branch January 4, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants