Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AG-1143/transform distribution data testing #82

Merged

Conversation

jaclynbeck-sage
Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage commented Jul 6, 2023

This PR addresses 2 JIRA tasks:
AG-1116: Removes the literature score from the distribution_data transform since it is no longer used.
AG-1143: Adds data-driven tests for the distribution_data transform.

I wrote the tests first, made sure they all passed, then removed the literature score from the transform and updated the tests accordingly. I've confirmed that the output of transform_distribution_data on the real data was identical to the previous output except for deletion of the literature score section in the JSON file, prior to bug fixes. After bug fixes (listed below), the output of transform_distribution_data is no longer identical to the previous output because dropping duplicate rows changes the distribution slightly, however the new output makes sense and looks correct when compared with the previous output.

There are several "passing" test cases:

  • Good input using the same max_score parameters as defined in config.yaml
  • Good input using larger max_score parameters than defined in config.yaml -- bins should widen and go up to the new max, but the distribution above the old max should all be 0's since there's no data in that range.
  • Imperfect input with missing values in key columns: target_risk_score, genetics_score, multi_omics_score, literaturescore (ignored), neuropathscore (ignored), isscored_genetics, isscored_omics, isscored_lit (ignored), isscored_neuropath (ignored)

The "good" input file has rows covering the following cases:

  • One row is duplicated except it has a different neuropath score than the original row, which happens in the real data.
  • All of the isscored columns are "Y"
  • All of the isscored columns are "N"
  • A mixture of "Y" and "N" isscored values

There are 3 related "failure" cases:

  • For each of target_risk_score, genetics_score, and multi_omics_score, one value is a string instead of a number. Throws a ValueError.

NOTE: I did find some bugs in the transform while writing these tests. I have fixed the following bugs / clean up in this PR and adjusted the tests to match:

  • The distribution data transform now drops duplicate rows before calculating distribution
  • First/third quartile values are rounded to 4 decimal places instead of the nearest integer value
  • Manual column renames in the transform were moved to column_rename in the config file
  • pd.cut is now called with the same arguments each time it is called (previously it was called a second time with a few args left off, but the intention seemed to be that it would produce the same bins as the first call).

I confirmed that after bug fixes, all the distribution counts are equal or less than the previous output counts (as they should be if duplicate rows are now dropped), and that the new 4-decimal first/third quartile values could reasonably be rounded to the previous output integer values.

# Writing to JSON changes the "bins" entry in this dict from tuples to lists, so
# output_dict and expected_dict would not be equal since expected_dict is read from JSON.
# We solve this by turning output_dict into a JSON string and reading back into a dict.
output_dict = json.loads(json.dumps(output_dict))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the best way to handle this problem but it's all I could come up with. Open to better suggestions.

…ve been moved to column_rename in the config file. Tests have been updated to match
@jaclynbeck-sage jaclynbeck-sage marked this pull request as ready for review July 10, 2023 21:47
Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me!

@jaclynbeck-sage jaclynbeck-sage merged commit 8fccf16 into dev Jul 13, 2023
@jaclynbeck-sage jaclynbeck-sage deleted the jbeck/AG-1143/transform_distribution_data_testing branch July 13, 2023 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants