AG-1143/transform distribution data testing #82

jaclynbeck-sage · 2023-07-06T23:56:26Z

This PR addresses 2 JIRA tasks:
AG-1116: Removes the literature score from the distribution_data transform since it is no longer used.
AG-1143: Adds data-driven tests for the distribution_data transform.

I wrote the tests first, made sure they all passed, then removed the literature score from the transform and updated the tests accordingly. Before the bug fixes listed below, I confirmed that the output of transform_distribution_data on the real data was identical to the previous output, except that the literature score section was deleted from the JSON file. After the bug fixes, the output is no longer identical to the previous output, because dropping duplicate rows changes the distribution slightly. The new output still makes sense and looks correct when compared with the previous output.

There are 3 "passing" test cases:

Good input using the same max_score parameters as defined in config.yaml
Good input using larger max_score parameters than defined in config.yaml -- bins should widen and go up to the new max, but the distribution above the old max should all be 0's since there's no data in that range.
Imperfect input with missing values in key columns: target_risk_score, genetics_score, multi_omics_score, literaturescore (ignored), neuropathscore (ignored), isscored_genetics, isscored_omics, isscored_lit (ignored), isscored_neuropath (ignored)
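
The widened-bins case can be illustrated with a small sketch. This is not the transform's actual code: the helper name bin_counts, the n_bins parameter, and the evenly spaced edges from 0 to max_score are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def bin_counts(scores, max_score, n_bins=5):
    """Hypothetical helper: count scores in n_bins equal-width bins on [0, max_score]."""
    edges = np.linspace(0, max_score, n_bins + 1)
    binned = pd.cut(scores, bins=edges, include_lowest=True)
    # sort=False keeps the bins in ascending order and includes empty bins
    return edges, binned.value_counts(sort=False).tolist()


scores = pd.Series([0.5, 1.5, 2.5, 4.9])

# Same max_score as the (assumed) configured value: data spans the whole range
edges_cfg, counts_cfg = bin_counts(scores, max_score=5)

# Larger max_score: same number of bins, each twice as wide; the bins that
# lie entirely above the old maximum contain no data
edges_wide, counts_wide = bin_counts(scores, max_score=10)
```

With these inputs the bin count stays at 5, the upper edge moves from 5 to 10, and the bins lying entirely above 5 are empty.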

The "good" input file has rows covering the following cases:

One row is duplicated except it has a different neuropath score than the original row, which happens in the real data.
All of the isscored columns are "Y"
All of the isscored columns are "N"
A mixture of "Y" and "N" isscored values

There are 3 related "failure" cases:

For each of target_risk_score, genetics_score, and multi_omics_score, one value is a string instead of a number. Each case throws a ValueError.
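
A rough sketch of how the three failure inputs could be built and checked is below. The function coerce_scores is a hypothetical stand-in for wherever the transform converts these columns to numbers. Whether the real transform raises through astype(float) or some other call is an assumption. Only the column names and the ValueError come from this PR.

```python
import pandas as pd

SCORE_COLUMNS = ["target_risk_score", "genetics_score", "multi_omics_score"]


def coerce_scores(df):
    """Hypothetical stand-in for the numeric conversion inside the transform."""
    out = df.copy()
    for col in SCORE_COLUMNS:
        # astype(float) raises ValueError on a non-numeric string
        out[col] = out[col].astype(float)
    return out


def raises_value_error(df):
    """Return True if converting the score columns raises a ValueError."""
    try:
        coerce_scores(df)
    except ValueError:
        return True
    return False


good = pd.DataFrame({col: [0.5, 1.5, 2.5] for col in SCORE_COLUMNS})

# One bad input per score column: a single value replaced by a string
bad_inputs = {}
for col in SCORE_COLUMNS:
    bad = good.astype(object)  # object dtype so a string can be stored
    bad.loc[1, col] = "not_a_number"
    bad_inputs[col] = bad
```

In a pytest suite the same three inputs would typically be fed through a parametrized test that wraps the transform call in pytest.raises(ValueError).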

NOTE: I did find some bugs in the transform while writing these tests. I have made the following bug fixes and cleanup in this PR and adjusted the tests to match:

The distribution data transform now drops duplicate rows before calculating the distribution
First/third quartile values are rounded to 4 decimal places instead of the nearest integer value
Manual column renames in the transform were moved to column_rename in the config file
pd.cut is now called with the same arguments each time it is called (previously it was called a second time with a few args left off, but the intention seemed to be that it would produce the same bins as the first call).
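
The first, second, and fourth fixes can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the helper name summarize_scores, the gene_id column, the choice of columns used for de-duplication, and the specific pd.cut arguments are all hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_scores(df, column, max_score, n_bins=5, id_column="gene_id"):
    """Hypothetical summary of one score column: bins, counts, quartiles."""
    # Drop duplicates on the columns that matter for this score, so rows that
    # differ only in an unrelated column are counted once (assumed behavior)
    deduped = df[[id_column, column]].drop_duplicates()
    scores = deduped[column].dropna()

    # Keep the pd.cut arguments in one place so every call uses the same bins
    cut_kwargs = dict(bins=np.linspace(0, max_score, n_bins + 1),
                      include_lowest=True, precision=3)
    binned = pd.cut(scores, **cut_kwargs)
    _, edges = pd.cut(scores, retbins=True, **cut_kwargs)

    return {
        "distribution": binned.value_counts(sort=False).tolist(),
        "bins": [[float(lo), float(hi)] for lo, hi in zip(edges[:-1], edges[1:])],
        # Quartiles rounded to 4 decimal places rather than to an integer
        "first_quartile": round(float(scores.quantile(0.25)), 4),
        "third_quartile": round(float(scores.quantile(0.75)), 4),
    }


df = pd.DataFrame({
    "gene_id": ["a", "a", "b", "c", "d"],
    "target_risk_score": [1.0, 1.0, 2.0, 3.0, 4.5],
    "neuropath_score": [0.1, 0.2, 0.3, 0.4, 0.5],  # differs on the duplicate
})
result = summarize_scores(df, "target_risk_score", max_score=5)
```

Here gene "a" appears twice with different neuropath scores but is counted once in the target_risk_score distribution.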

I confirmed that after the bug fixes, all the distribution counts are less than or equal to the previous output counts (as they should be now that duplicate rows are dropped), and that the new 4-decimal first/third quartile values could reasonably be rounded to the integer values in the previous output.

jaclynbeck-sage · 2023-07-07T00:03:58Z

tests/transform/test_distribution_data.py

+        # Writing to JSON changes the "bins" entry in this dict from tuples to lists, so 
+        # output_dict and expected_dict would not be equal since expected_dict is read from JSON. 
+        # We solve this by turning output_dict into a JSON string and reading back into a dict.
+        output_dict = json.loads(json.dumps(output_dict))


I'm not sure this is the best way to handle this problem but it's all I could come up with. Open to better suggestions.
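
For reference, the mismatch and the round-trip fix can be reproduced in isolation. The dictionary shape below is an assumption. The source only states that the "bins" entry holds tuples before serialization. The tuples_to_lists helper is a hypothetical alternative, offered as one possible option rather than a recommendation.

```python
import json

# Assumed shape: the transform output holds bins as tuples
output_dict = {"bins": [(0.0, 1.0), (1.0, 2.0)], "distribution": [3, 5]}

# The expected dict is read from JSON, where arrays always become lists
expected_dict = json.loads('{"bins": [[0.0, 1.0], [1.0, 2.0]], "distribution": [3, 5]}')

# Approach from the test: round-trip through a JSON string
round_tripped = json.loads(json.dumps(output_dict))


def tuples_to_lists(value):
    """Hypothetical alternative: recursively convert tuples to lists."""
    if isinstance(value, dict):
        return {key: tuples_to_lists(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [tuples_to_lists(item) for item in value]
    return value


converted = tuples_to_lists(output_dict)
```

The round trip has the advantage of exercising the same serialization path as the real output file. The explicit conversion changes only the container type and leaves everything else untouched.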

JessterB

looks great to me!

jaclynbeck-sage added 4 commits July 6, 2023 14:22

Added some clarifying comments to the distribution_data transform

c589dda

Added data-driven test for distribution_data transform

2abf571

AG-1116: Removed literature score from distribution_data transform and updated tests to match

0f3d66d

Made a comment in the distribution_data test more clear

66b3012

jaclynbeck-sage commented Jul 7, 2023

jaclynbeck-sage added 2 commits July 10, 2023 14:07

Bug fixes: distribution data transform now drops duplicate rows and rounds the quartiles to 4 decimal places

cc069a6

Code cleanup: manual column renames in distribution data transform have been moved to column_rename in the config file. Tests have been updated to match

7ab0fc0

jaclynbeck-sage marked this pull request as ready for review July 10, 2023 21:47

jaclynbeck-sage requested review from BWMac and JessterB July 11, 2023 19:43

JessterB approved these changes Jul 11, 2023

BWMac approved these changes Jul 11, 2023

jaclynbeck-sage merged commit 8fccf16 into dev Jul 13, 2023

jaclynbeck-sage deleted the jbeck/AG-1143/transform_distribution_data_testing branch July 13, 2023 18:51
