
[AG-1386] Implement Error Catching for GX Data Validation #129

Merged · 10 commits · Mar 7, 2024

Conversation

@BWMac (Contributor) commented Mar 5, 2024

Problem:

Previously, failures in GX data validation did not cause pipeline failures. As a result, visibility into these failures was poor and our data validation pipeline was relatively ineffective.

Solution:

Leverage our existing error-catching strategy to capture data validation failures by introducing a new error type that is raised during GX data validation when validation for a dataset fails.
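The pattern described above can be sketched as follows. The names `ADTDataValidationError` and `process_all_files` come from this PR; the dataset loop, the stand-in validation step, and the summary format are illustrative assumptions, not the repo's actual implementation.

```python
class ADTDataValidationError(Exception):
    """Raised when GX data validation fails for a dataset."""


def validate_dataset(dataset: str) -> None:
    # Stand-in for the real GX validation step, which runs
    # Great Expectations against the processed dataset.
    if dataset.startswith("bad"):
        raise ADTDataValidationError(f"Validation failed for '{dataset}'")


def process_all_files(datasets: list) -> None:
    """Process every dataset, collecting validation failures instead of
    stopping at the first one, then fail loudly when the run concludes."""
    errors = []
    for dataset in datasets:
        try:
            validate_dataset(dataset)
        except ADTDataValidationError as exc:
            errors.append(str(exc))
    if errors:
        # Surface every failure in one formatted message so the
        # pipeline run fails visibly instead of silently succeeding.
        raise RuntimeError("Data validation errors:\n" + "\n".join(errors))
```

Collecting errors and raising once at the end lets every dataset's result reach the summary, rather than aborting on the first failed validation.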

Notes:

  • ADTDataValidationError is raised when data validation fails. This is caught by the try/except logic in process_all_files, and a formatted string is printed when the pipeline concludes. Example.
  • Tests are added for GreatExpectationsRunner.run and the new function GreatExpectationsRunner.get_failed_expectations, which parses out which expectations failed and produces the error message.
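A hypothetical sketch of how `get_failed_expectations` might parse a validation result. The nested `results` / `success` / `expectation_config` keys mirror the JSON shape that Great Expectations emits for a validation run; the exact signature and return format of the repo's method are assumptions.

```python
def get_failed_expectations(validation_result: dict) -> str:
    """Pull the names of failed expectations out of a GX validation
    result and format them into a single error message."""
    failed = [
        r["expectation_config"]["expectation_type"]
        for r in validation_result["results"]
        if not r["success"]
    ]
    return "Failed expectations: " + ", ".join(failed)


# Minimal fixture mimicking a GX validation result with one failure.
sample = {
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist"}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"}},
    ]
}

print(get_failed_expectations(sample))
# Failed expectations: expect_column_values_to_not_be_null
```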

@BWMac BWMac marked this pull request as ready for review March 6, 2024 17:46
@sonarcloud bot commented Mar 6, 2024

Quality Gate passed

Issues: 0 new issues, 0 accepted issues
Measures: 0 security hotspots, no data about coverage, 3.1% duplication on new code

See analysis details on SonarCloud

@jaclynbeck-sage (Contributor) left a comment:

Looks good!

@JessterB (Contributor) left a comment:

This is a more general issue, but I can't align the example CI run you linked with a specific file or manifest version in Synapse, at least not without doing a ton of manual digging around that no one has time for.

The Synapse UI doesn't display full timestamps (just the date portion), there were 6 different "Agora Testing Data" processing runs on 3/5/24, and there is nothing logged in the CI job's output about what version the uploaded files or manifest (and also GE reports, once we collapse the reports into versioned files) end up with.

Would it be an easy update to log the version of each file after it's uploaded?

Review thread on src/agoradatatools/process.py (resolved)
@BWMac (Contributor, Author) commented Mar 7, 2024

@JessterB There wouldn't be a manifest uploaded when that run failed. Nor would there be an output data file for neuropath_corr uploaded because that particular iteration of process_dataset would not have made it to the upload step. Therefore, I think the versioning strategy would only really be useful for GX reports.

@BWMac BWMac merged commit bbc1adc into dev Mar 7, 2024
9 checks passed
@BWMac BWMac deleted the bwmac/AG-1386/GX_CI branch March 7, 2024 16:23
3 participants