Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IBCDPE-792] Implement Great Expectations for the proteomics_distribution_data Dataset #118

Merged
merged 3 commits into from
Feb 8, 2024

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Feb 2, 2024

Description:

This PR implements a GX expectation suite for the proteomics_distribution_data dataset.

Please let me know if there are any additional expectations I should add, or if there are any adjustments to be made to the expectations I have already implemented.

Notes:

  • Values I have chosen for expectations like expect_column_values_to_be_between are based on looking at the example dataset on Synapse.
  • This change has been tested by running adt test_config.yaml. Example results can be downloaded from here.
  • I also fixed a variable naming typo in the neuropath_corr notebook which was not affecting performance/proper functioning.

Copy link

sonarcloud bot commented Feb 2, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@BWMac BWMac marked this pull request as ready for review February 5, 2024 16:28
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! Will wait for @jaclynbeck-sage or @JessterB to comment on the rules themselves

@jaclynbeck-sage
Copy link
Contributor

Looks good! I'm a little torn on having min/max values for the numerical columns. On the one hand, generically these values could be anything and may vary based on the type of proteomics data, so it doesn't entirely make sense to bound them. On the other hand, this matches our current data, and if our current data goes outside these bounds that means something is wrong. Last time this came up we decided to leave it in for that very reason so, it's probably ok?

Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @BWMac! I agree with @jaclynbeck-sage about leaving in the min/max checks on the numeric values - we don't expect these to change unless we add new data, and if we add new data we'll likely need to adjust the expectations anyway. Having a check in place that lets us know if things change unexpectedly is a good thing.

@BWMac BWMac merged commit 016acc3 into dev Feb 8, 2024
9 checks passed
@BWMac BWMac deleted the bwmac/IBCDPE-792/proteomics_distribution_data branch February 8, 2024 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants