
Biomarkers transform for ModelAD #148

Merged
merged 41 commits into dev from beatrizsaldana/MG-99/model-ad-transform on Oct 4, 2024

Conversation

beatrizsaldana
Member

@beatrizsaldana beatrizsaldana commented Sep 25, 2024

Creates a new transform for the biomarkers dataset. The transform will restructure the data as described in this Jira ticket.

This is my first PR in this repo. Please review carefully and be as brutally honest as is necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something sub-optimally.

Expected Changes

  1. Added modelad_test_config.yaml
  2. Added a biomarkers transform function
  3. Added test cases and test data

Unexpected Changes

  1. The transform_biomarkers() function outputs the transformed data as a list instead of the expected dict or pd.DataFrame.
  2. I added a list_to_json() function in src/agoradatatools/etl/load.py to accommodate the new output type (a rough sketch of the idea is shown below).
  3. I added an elif isinstance(df, list): branch to the process_dataset() function in src/agoradatatools/process.py.
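
For context, a minimal sketch of what a list_to_json() helper along these lines could look like, assuming it mirrors the existing JSON-writing helpers in load.py (with os, json, and NumpyEncoder already imported there); per the UPDATE below, this change was ultimately not kept:

    def list_to_json(df: list, staging_path: str, filename: str) -> str:
        """Hypothetical sketch: write a list of records to a JSON file and return its path."""
        with open(os.path.join(staging_path, filename), "w+") as temp_json:
            json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
            return temp_json.name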

@BWMac what do you think about the Unexpected Changes? Would it be better for the transform_biomarkers() function to output a dict or pd.DataFrame and prevent any of these extra changes? All feedback is welcome.

UPDATE

No unexpected changes were implemented.

@thomasyu888 thomasyu888 requested a review from BWMac September 25, 2024 23:56
@beatrizsaldana beatrizsaldana self-assigned this Sep 25, 2024
@beatrizsaldana beatrizsaldana marked this pull request as draft September 26, 2024 00:01
@beatrizsaldana beatrizsaldana added the enhancement New feature or request label Sep 26, 2024
Comment on lines 177 to 180
temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
Contributor

Generally, a context-managed open is preferred, like:

    with open(os.path.join(staging_path, filename), "w+") as temp_json:
        json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
        return temp_json.name

This way you don't need to be concerned about calling .close(). Calling .close() explicitly is a valid way of accomplishing this, but if that is the approach you want to take, the .close() should be inside a finally block so it's guaranteed to execute.
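
For illustration, a minimal sketch of that finally-block variant (assuming the same os, json, and NumpyEncoder imports as the surrounding module):

    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
        return temp_json.name
    finally:
        # Runs even if json.dump raises, so the file handle is never leaked.
        temp_json.close()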

Member Author

@beatrizsaldana beatrizsaldana Sep 26, 2024

Hmm, I do like this approach better than what I was doing; I was trying to copy what the other functions do. Feedback, please: should I...

  1. Update just this one function with the preferred context-managed open
  2. Update all of the X-to-JSON functions with the preferred context-managed open
  3. Leave things as they are and create a Jira ticket for updating the functions to use the preferred context-managed open

Thoughts? @BryanFauble

Contributor

I would:

  1. Update any of the code you are already touching to follow this approach
  2. Log a tech debt ticket to go back and look at the other areas of the code

Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.

Contributor

I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?

Member Author

Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)

Member Author

@JessterB I need help figuring out where to create this Jira ticket 🙃

Contributor

I'll jump in: go to JIRA (https://sagebionetworks.jira.com/), click "Create" in the top bar, select "Model AD Explorer (MG)" as the project, set Issue type to "Task" (Jess will change it if she wants something different), assign it to Jess, and don't worry about filling in Team or Validator.

Contributor

Thank you @jaclynbeck-sage; I missed this while I was out last week. What Jaclyn said works fine, and once the ticket exists I can take it from there.

pass_test_data = [
( # Pass with good real data
"biomarkers_good_input.csv",
"biomarkers_good_output.json",
Contributor

I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.
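
For example, a rough sketch of that workflow (the source file, the derived file names, and the column being tweaked are hypothetical):

    import pandas as pd

    # Subset a handful of rows from the real data to use as the "good input" fixture.
    subset = pd.read_csv("real_biomarkers_data.csv").head(10)  # hypothetical source file
    subset.to_csv("biomarkers_good_input.csv", index=False)

    # Derive edge-case fixtures from that subset, e.g. missing values or duplicates.
    missing = subset.copy()
    missing.loc[0, "measurement"] = None  # hypothetical column name
    missing.to_csv("biomarkers_missing_input.csv", index=False)

    duplicates = pd.concat([subset, subset.head(2)], ignore_index=True)
    duplicates.to_csv("biomarkers_duplicate_input.csv", index=False)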

Member Author

Very true. I removed the "real" data test since it is the hardest for us to validate visually.

@BWMac
Contributor

BWMac commented Sep 30, 2024

@beatrizsaldana If you haven't seen it yet, check this out for using pre-commit locally.

@@ -1,5 +1,6 @@
import logging
import typing
import warnings
Contributor

This needs to be removed per pre-commit.

src/agoradatatools/process.py (review comment outdated, resolved)
Beatriz Saldana added 3 commits October 2, 2024 13:28
Member

@thomasyu888 thomasyu888 left a comment

🚀 LGTM! I'll defer to others for final review, but the code looks great. Good looking tests!


sonarcloud bot commented Oct 3, 2024

Contributor

@JessterB JessterB left a comment

LGTM!

Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Excellent! Everything looks great.

Contributor

@BWMac BWMac left a comment

🔥 LGTM

@beatrizsaldana beatrizsaldana merged commit e02c29b into dev Oct 4, 2024
9 checks passed
@beatrizsaldana beatrizsaldana deleted the beatrizsaldana/MG-99/model-ad-transform branch October 4, 2024 17:28