[IBCDPE-769] Adds `gene_info` Expectation Suite #139
🔥 LGTM! I'll leave it to the Agora team to provide feedback on the expectations themselves!
Looks good, just had a few questions before approving.
Quality Gate passed
Looks good!
lgtm, thanks!
Description:
This PR implements a GX expectation suite for the `gene_info` dataset. The expectations chosen and the limit values for the expectations (where applicable) come from the rules defined in the R scripts here, with a few exceptions.
- `total_nominations` was expected to not be `null` according to the rules in the R script, but when I checked with `expect_column_values_to_not_be_null`, roughly 97.5% of the values were null, so I left that expectation off.
- `name`, `summary`, and `hgnc_symbol` were also expected to not be null, but I found that they were null to varying degrees. For these fields I introduced the `mostly` keyword, which sets a percentage threshold for the expectation to pass, rather than failing the expectation if even one record does not satisfy it. The `mostly` keyword will be used in future work to add "warning" functionality to our data validation.
- For the `target_nominations` field, approximately 1.2% of the nested values did not meet my JSON schema expectation (based on the rules here). This is caused by a combination of the `data_used_to_support_target_selection`, `predicted_therapeutic_direction`, and `target_choice_justification` fields containing invalid unicode characters, and `null` values in the `data_synapseid` and `validation_study_details` fields. So I set the `mostly` value for this expectation to `0.98`.

Please let me know if there are any questions, suggestions for adjustments to the expectations used or their configurations, or if any further discussion is needed regarding the decisions I have made here. This is by far the most complex GX suite we have implemented.
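For reviewers less familiar with `mostly`, here is a minimal sketch of the threshold idea in plain Python. It does not call GX and is not the code in this PR. The helper name `passes_with_mostly` and the sample data are hypothetical, and GX's own handling of missing values is more nuanced than this simplified version.

```python
from typing import Any, Callable, Iterable


def passes_with_mostly(
    values: Iterable[Any],
    predicate: Callable[[Any], bool],
    mostly: float = 1.0,
) -> bool:
    """Return True if at least a `mostly` fraction of values satisfy `predicate`.

    Hypothetical helper for illustration only; it mimics the idea behind the
    `mostly` keyword, not GX's actual implementation.
    """
    items = list(values)
    if not items:
        # Nothing to check, so treat the expectation as satisfied.
        return True
    successes = sum(1 for item in items if predicate(item))
    return successes / len(items) >= mostly


# Made-up validation results: 988 records match the schema, 12 do not,
# mirroring the ~1.2% failure rate described for target_nominations.
schema_matches = [True] * 988 + [False] * 12

# Without `mostly`, a single failing record fails the whole expectation.
strict_result = passes_with_mostly(schema_matches, bool)

# With mostly=0.98, a 98.8% success rate is enough to pass.
relaxed_result = passes_with_mostly(schema_matches, bool, mostly=0.98)

print(strict_result, relaxed_result)
```

With these made-up numbers the strict check fails while the `mostly=0.98` check passes, which is the behavior the `target_nominations` expectation relies on.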
Notes:
Elapsed time: 00:10:43 for `gene_info` dataset.