Review Feedback Consolidation #82

JennyonOort · 2024-12-12T19:47:55Z

Hi Team,

I've consolidated the feedback from TA (aka Gradescope) and from peer review and categorized them by theme for easier discussion this afternoon. To date, all 4 peer reviewers have given their feedbacks.

We can walk through these comments one by one and assign person in charge and cross-reviewer.

Note some comments may have already been incorporated along the way or no longer applicable as workflow changed.

Analysis and Code Related

Abstract & Summary: Does not clearly state the question being asked (-0.25). Does not discuss importance and limitations of findings (-0.25).
6d90b86
Introduction: Did not clearly identify the question being asked, or the question was not clear in some way. Did not clearly identify and describe the dataset that was used to answer the question. This should be done at least at a high-level in the introduction.
6d90b86
Introduction: Did not clearly identify the question being asked, or the question was not clear in some way (-1). Did not clearly identify and describe the dataset that was used to answer the question. This should be done at least at a high-level in the introduction (-1). Did not reference the data set when referring to it (-0.25).
6d90b86
Methods & Results: Important methodology descriptions missing (e.g., did not explain in narrative what metric was being used for model parameter optimization). t=The first question that came into my head and could be further specified in the report is : "is this a balanced dataset?", how many patients were recorded in each category? this needs to be stated in the method section as it's important info about the dataset. (-1)
725fc5f
Methods & Results: The data should be reproducibly downloaded from it's source using code in the analysis code document, and a copy saved in the data directory (and ideally in the data/raw subdirectory). (-1)
Amended in an earlier Milestone
https://github.com/UBC-MDS/diabetes_predictor_py/blob/b51334551073b80786aa1339960580e45465d34b/scripts/download_data.py
Methods & Results: the method and results should be listed seperately in 2 different sections, not merged as one.
371a970
EDA: Missing figure or table legends/descriptions. (-1)
https://github.com/UBC-MDS/diabetes_predictor_py/blob/1a705b08226680570fc3a69c30b76157eb03cc7d/reports/diabetes_analysis.qmd
Discussion: Key findings from the project are not presented. (-1)
https://github.com/UBC-MDS/diabetes_predictor_py/blob/1a705b08226680570fc3a69c30b76157eb03cc7d/reports/diabetes_analysis.qmd
There were many spelling or grammatical errors. They did not impact the understanding of the document. (-0.5)
https://github.com/UBC-MDS/diabetes_predictor_py/blob/530cfcb738c1e8a19c0958be6664f0f472e6f604/reports/diabetes_analysis.html

10 It would be better to use a confusion matrix to show the results instead of prediction probabilities. The prediction probabilities plot can be confusing for general audience. / Consider presenting the model results using a confusion matrix or a logistic regression curve for better clarity.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/7994a29ba2be714390120beb7700eb567d6f84e8/scripts/evaluate_predictor.py

Consider adding PR curve and ROC curve for a more detailed evaluation of the results.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/7994a29ba2be714390120beb7700eb567d6f84e8/scripts/evaluate_predictor.py
The pairwise scatter plots extend beyond the page in the PDF. Additionally, each individual plot in the pairwise comparisons is too small, making it difficult to read the axes.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/530cfcb738c1e8a19c0958be6664f0f472e6f604/reports/diabetes_analysis.pdf
Many of the references are missing their DOI link / Some references in the documentation are missing DOI information.
6d90b86
Discuss the range of C used for hyperparameter optimization, and explain the rationale behind your chosen range.
371a970
Some libraries loaded in individual scripts are not utilized within the code. Removing these unused libraries can streamline the scripts, reduce potential confusion, and improve code efficiency.
bb540cc
The text references a heatmap, but the corresponding figure is absent from the report.
Note: text was fixed and it is no longer referring the table as heatmap.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/530cfcb738c1e8a19c0958be6664f0f472e6f604/reports/diabetes_analysis.qmd
The figure 1 in the report is formatted into two columns but I believe it would be better if it was a 3 by 3 figure.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/1a705b08226680570fc3a69c30b76157eb03cc7d/scripts/eda_deepchecks.py
In Table 1 of the analysis report, the coefficients are displayed with varying decimal places. Consider standardizing the number of decimal places for clarity and consistency.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/1a705b08226680570fc3a69c30b76157eb03cc7d/reports/diabetes_analysis.qmd
Certain numbers in the analysis report appear to be hard-coded. It is recommended to replace these with inline Python code.
NOTE: this was one of the peer review feedback received. However, proofreader has walked through the entire report and noticed that the only numbers remain hard coded are those cited from research report from introduction or any factual ranges related to data validation. All results related to analysis was properly transformed and linked. Providing the final report edit commit as reference.
b14df6c

Process Related

The Dockefile could be simplified. For instance, multiple RUN instructions could be merged into one. (-1)
5e37027
the instructions for how to set up environment is missing a step to change into the cloned repository directory after cloning it. user will not be able to recreate the virtual environment if the "cd ..." part is missing (-0.5)
NOTE: already incorporated in earlier milestone. See a final commit link for final result.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/8f48f6b8df8812069e1933829a3ed2c35812d9bb/README.md
environment.yml: versions are missing from environment files(s) for some R or Python packages (-1)
NOTE: already incorporated in earlier milestone. See a final commit link for final result.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/019af1f5cd2dd1a33f9d79a0e1f90ad23c5f96fd/environment.yml
environment.yml: Programming language and/or package versions are pinned using >= instead of =. This means that each time the environment is built in the future, the most recent version of the programming language and/or package will be installed in the environment. This will lead to the environment not being able to be reproducibly built in the future. (-1)
NOTE: already incorporated in earlier milestone. See a final commit link for final result.
https://github.com/UBC-MDS/diabetes_predictor_py/blob/019af1f5cd2dd1a33f9d79a0e1f90ad23c5f96fd/environment.yml
Reproducibility Miletone 1: I cannot find the "Python [conda env:diabetes-predictor" selection for kernel after following your instructions. (-1)
NOTE: no longer applicable and already fixed in Docker.
Reproducibility Milestone 2: instructions is a bit confusing if I am a complete new user to docker "Once the container is running, access the server by opening the link shown in the terminal (e.g., http://127.0.0.1:8888/lab?token={your_token})" => essentially including an image (-0.5)
ce885de
Workflow: branch name is not meaningful: kuo4230-patch-1 (-0.5)
NOTE: no longer applicable
Github issues communication: GitHub issues were rarely used for project communication. (-4)
NOTE: fixed since Milestone 2

Structure Related

the index.html should not be living in the root directory, it belongs to the analysis folder (-0.5)
fc9a16d
CONTRIBUTING.md: the links are broken for "Project's license", "UBC-MDS Code of Conduct" (-1)
fabaabc
The some of the processed data files are simply call df.csv and test_df.csv. I cannot tell what data is in the file. A more descriptive or meaningful name would be nice.
NOTE: this involves changes in multiple scripts so instead of pointing to the scripts, I am pointing to the commits where showcase generated files with new meaningful names
7994a29
The "Option 1" header in the README is unnecessary since there is no corresponding "Option 2." Consider removing it to avoid confusion.
8b24aa9
There are two validation_errors.log files—one in the report folder and another in the root directory. One of these files may be redundant.
8822725

Summary of Points from TA for reference on ideas to how improve:

Abstract/Summary:
The summary is well-written overall, and i like how you gave lots of ideas for future work! However, I deducted points because (1) although you talked about the consequences of having lots false negatives, the summary does not explicitly list limitations in your analysis methods (2) a deeper explanation of importance of your work would be needed in the begining (3) although your analysis goals and methods are clearly listed, the analysis question is not clearly and explicitly stated in the beginning of the summary.

Introduction:
I like how you explained the importance of this project! However, I deducted points for the introduction because (1) a description of the dataset is missing, it needs to be more than just "a real-world dataset focused specifically on Pima Indian women aged 21 and older", it should include more details like how many women are included, how balanced is the dataset, what's the input variables, etc. (2) a clear research question is missing. I understand that you are trying to use ML to predict diabetes, but what are the input variables that are relevant to be used here? what are the output variables (how many levels do you have for defining diabetes )? (3) the dataset is not referenced when you mention it.

Methods:
The methods & results section is beautiful, and well-written! However, I deducted points because (1) the method and results should be listed seperately in 2 different sections, not merged as one. (2) I treat the begining of the methods & results section as the "section" part, and the this section did not explain what metric was being used for choosing model hyperparameter, or evaluating model performance (3) the first question that came into my head and could be further specified in the report is : "is this a balanced dataset?", how many patients were recorded in each category? this needs to be stated in the method section as it's important info about the dataset (4) the data was read from the data folder, not downloaded reproducibly from the source

Results:
Beautiful visualizations in your EDA! I deducted points because (1) figure descriptions are missing

Discussion:
Great work on the discussion highlighting some limitations and future work! However, i deducted points because the key findings (how accurate is your model) is not mentioned here, and did not tie back to the big picture application. the discussion jumped directly into talking about future steps, without first discussing the results from this study.

Spelling and Grammar:
Overall it's a well-written report! However, (1) there are sentences that are confusing. For example , "It has become a significant global health issue, with its prevalence nearly doubling since 1980, and in 2022, 14% of adults aged 18 and older were diagnosed with diabetes, doubling from 7% in 1990 ", i'm confused which year are you comparing against 1980 when you say "prevalence nearly doubling since 1980", because 2022's prevalence doubled compared to 1990's, but it's unclear which year you are comparing to 1980? (2) the results section is written more in a format of a code explanation rather than a formal report that is similar to a journal article. I would suggest that you do not need to explain which functions you used, and please try not to use variable names when writing paragraphs reporting findings. It's confusing for the user when you use variable names, using phrases that are commonly understandable would be more ideal. (3) there is a table where it says "mean_test_score", but there is a lack of explanation of what does that mean, the reader might confuse this score with your test dataset accuracy.

Reproducibility:
great reproducibility overall! However, I can not find the "Python [conda env:diabetes-predictor" selection for kernel after following your instructions. I was able to run everything after I conda activate your environment and run jupyter lab. So I would recommend providing instructions in that way instead.

I was able to run your docker image and analysis. However, I deducted -0.5 point because one of the instructions is a bit confusing if I am a complete new user to docker "Once the container is running, access the server by opening the link shown in the terminal (e.g., http://127.0.0.1:8888/lab?token={your_token})" you did not tell the user where to find their token. It's probably better to direct the user to find the link as what Tiffany did: https://github.com/ttimbers/breast-cancer-predictor.git . Moreover, your user might benefit from simply using "docker compose run"

Workflow:
great overflow! I see lots of pull request and collaboration which is amazing! However, i deducted points because a branch name is not meaningful: kuo4230-patch-1. Please use some easy to understand name for your branch

Github Issues for Communication:
I saw one issue opened for release 0.0.1, we would encourage more usage of github issues in the future!

Dockerfile:
great work overall on the specification of environment! However, I deducted 1 point because the RUN commands in dockerfile should be simplified. For example, the "RUN pip install deepchecks==0.18.1
RUN pip install altair_ally==0.1.1" should be merged as one: "RUN pip install deepchecks==0.18.1 altair_ally==0.1.1"

The text was updated successfully, but these errors were encountered:

JennyonOort · 2024-12-12T21:35:22Z

Based on in person meeting today, we have divided the above tasks based on following arrangements:

Analysis and Code Related:
1 - 3, 13: @kuo4230 6d90b86
4 - 6, 14: @InderKhera
10 - 11, 16: @JennyonOort
17 - 19: @javiermtzo99

9:

@kuo4230 @javiermtzo99 @InderKhera Each person should check for their respective parts. 6d90b86
Proofreader @JennyonOort for checks on entire report.
15: @InderKhera @javiermtzo99 @JennyonOort each check on scripts responsible for

Already incorporated: 7, 8
Ignore: 12 (uncontrollable)

Process Related:
6: @kuo4230 ce885de
Already incorporated: 1 - 5, 7, 8

Structure Related
1, 3, 5: @JennyonOort
2: @javiermtzo99
4: @InderKhera

JennyonOort added enhancement New feature or request communication labels Dec 12, 2024

JennyonOort assigned kuo4230, InderKhera, javiermtzo99 and JennyonOort Dec 12, 2024

JennyonOort mentioned this issue Dec 12, 2024

Review Feedback Task Assignment and Interim Deadline #83

Closed

JennyonOort closed this as completed Dec 18, 2024

JennyonOort reopened this Dec 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review Feedback Consolidation #82

Review Feedback Consolidation #82

JennyonOort commented Dec 12, 2024 •

edited

Loading

JennyonOort commented Dec 12, 2024 •

edited by kuo4230

Loading

Review Feedback Consolidation #82

Review Feedback Consolidation #82

Comments

JennyonOort commented Dec 12, 2024 • edited Loading

JennyonOort commented Dec 12, 2024 • edited by kuo4230 Loading

JennyonOort commented Dec 12, 2024 •

edited

Loading

JennyonOort commented Dec 12, 2024 •

edited by kuo4230

Loading