Biomarkers transform for ModelAD #148
…functional at the moment
…puts a list and so a new list_to_json() function was added to the load module and logic to handle this was added to the process_dataset function
src/agoradatatools/etl/load.py
Outdated
temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
Generally, a context-managed open is preferred, like:
with open(os.path.join(staging_path, filename), "w+") as temp_json:
    json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
return temp_json.name
This way you don't need to be concerned about calling .close(). Calling .close() explicitly is also a valid way of accomplishing this; however, if that is the approach you want to take, the .close() should be within a finally block so it's guaranteed to execute.
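The two options described in this comment can be sketched side by side. This is only an illustration: the helper names `write_json_with_context_manager` and `write_json_with_finally` are hypothetical and not from the PR, and the PR's `NumpyEncoder` is left out so the block runs with the standard library alone.

```python
import json
import os
import tempfile


def write_json_with_context_manager(data, staging_path, filename):
    """Hypothetical helper: the file is closed when the with block exits,
    even if json.dump raises."""
    with open(os.path.join(staging_path, filename), "w+") as temp_json:
        json.dump(data, temp_json, indent=2)
    # The name attribute is still readable after the file is closed.
    return temp_json.name


def write_json_with_finally(data, staging_path, filename):
    """Hypothetical helper: explicit close(), guarded by finally so it
    runs whether or not json.dump raises."""
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(data, temp_json, indent=2)
    finally:
        temp_json.close()
    return temp_json.name


if __name__ == "__main__":
    # Write into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as staging:
        print(write_json_with_context_manager([{"a": 1}], staging, "a.json"))
```

Both versions close the file on every code path. The context-managed form is shorter and leaves no way to forget the cleanup.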
Hmm, I do like this approach better than what I was doing. I was trying to copy what the other functions are doing. Feedback please: Should I...
- Update just this one function with the preferred context managed open
- Update all of the X to json functions with the preferred context managed open
- Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open
Thoughts? @BryanFauble
I would:
- Update any of the code you are already touching to follow this approach
- Log a tech debt ticket to go back and look at the other areas of the code
Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.
I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?
Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)
@JessterB I need help figuring out where to create this Jira ticket 🙃
I'll jump in: Go to JIRA (https://sagebionetworks.jira.com/), click on "Create" in the top bar, for project select "Model AD Explorer (MG)", Issue type = "Task" (Jess will change it if she wants something different), assign it to Jess, don't worry about filling in Team or Validator.
Thank you @jaclynbeck-sage, I missed this while I was out last week. What Jaclyn said works fine, once the ticket exists I can take it from there.
…m_biomarkers() function.
…problems with Python 3.8
tests/transform/test_biomarkers.py
Outdated
pass_test_data = [
    (  # Pass with good real data
        "biomarkers_good_input.csv",
        "biomarkers_good_output.json",
I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.
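The approach described in this comment (take a few rows of real data, then tweak them for edge cases) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the column names and values are invented, the helper names are hypothetical, and the standard library `csv` module stands in for pandas so the block runs without dependencies.

```python
import csv
import io

# Hypothetical stand-in for a real export; columns and values are invented.
REAL_DATA_CSV = """model,type,measurement
A,x,1.0
B,y,2.5
C,z,3.1
D,x,4.2
"""


def subset_rows(csv_text, n_rows):
    """Return the first n_rows data rows of csv_text as a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for _, row in zip(range(n_rows), reader)]


def with_missing_value(rows, row_index, column):
    """Copy rows, blanking one cell to simulate a missing value."""
    tweaked = [dict(row) for row in rows]
    tweaked[row_index][column] = ""
    return tweaked


def with_duplicate(rows, row_index):
    """Copy rows, appending a duplicate of one row."""
    tweaked = [dict(row) for row in rows]
    tweaked.append(dict(rows[row_index]))
    return tweaked


# One small "good" input, plus edge-case variants derived from it.
good_input = subset_rows(REAL_DATA_CSV, 2)
missing_input = with_missing_value(good_input, 0, "measurement")
duplicate_input = with_duplicate(good_input, 1)
```

Deriving each edge case from the same small subset keeps the expected output easy to check by eye.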
Very true. I removed the "real" data test since it is hardest for us to visually validate.
… process_dataset() for converting to json
@beatrizsaldana Check this out if you haven't seen it yet for using |
src/agoradatatools/process.py
Outdated
@@ -1,5 +1,6 @@
import logging
import typing
import warnings
This needs to get removed per pre-commit.
… functions that were being used to convert a list to a json
…f the apply_custom_transformations() output types
🚀 LGTM! I'll defer to others for final review, but the code looks great. Good looking tests!
Quality Gate passed
LGTM!
Excellent! Everything looks great.
🔥 LGTM
Creates a new transform for the biomarkers dataset. The transform will restructure the data as described in this Jira ticket.
This is my first PR in this repo. Please review carefully and be as brutally honest as is necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something sub-optimally.
Expected Changes
Unexpected Changes
- transform_biomarkers() function outputs the transform as a list instead of a dict or pd.DataFrame as is expected.
- list_to_json() function in src/agoradatatools/etl/load.py to accommodate the new output type
- elif isinstance(df, list): in the process_dataset() function in src/agoradatatools/process.py.
@BWMac what do you think about the Unexpected Changes? Would it be better for the transform_biomarkers() function to output a dict or pd.DataFrame and prevent any of these extra changes? All feedback is welcome.
UPDATE
No unexpected changes were implemented.