generated from opensafely/research-template
Stage1 data cleaning #80
Open: ZoeMZou wants to merge 32 commits into main from Stage1_data_cleaning
Further consider secondary care diagnosis
1. Remove opa diagnosis for hospital admission. 2. Revise the covid-19 severity variable to focus on primary diagnosis only.
Include diagnoses in any position in the function. Previously we included only the primary diagnosis and the first code of the secondary diagnosis. The definition of Covid-19 hospitalisation, `sub_date_covid19_hospital`, does not use a helper function because its definition is unique and not worth a dedicated function.
Dataset definition revision
…post-covid-respiratory into Stage1_data_cleaning
update the code for extracting death data
Exclude death from definition for qa_bin_prostate_cancer, as people who die from it before index date will be excluded from the study anyway
Hi @venexia, This PR is now ready for your review. It runs successfully locally. Since the revisions in this script are quite detailed, I’ve listed the key changes at the beginning of the PR to make it easier to navigate and review. Thank you very much. Best wishes,
PR: Refinements for Script Structure & Data Processing
(Based on discussions on 28th January 2025)
🔹 1. Structural Changes
📌 1.1 Naming Conventions
Removed `_stage1` from script names to improve clarity and consistency.
The old `stage1_data_cleaning.R` script handled several tasks in one place. The new structure separates these tasks into three dedicated functions:
`fn-ref.R` → Sets reference levels
`fn-qa.R` → Applies quality assurance
`fn-inex.R` → Applies inclusion/exclusion criteria
The main script `data_cleaning.R` calls these functions, ensuring that project-specific edits are made in the function scripts and `data_cleaning.R` remains clear and workflow-focused.
📌 1.2 Output Dataset Naming
The cleaned dataset output is now: input_{cohort}.rds
The preprocessed dataset output is renamed from input_{cohort}.rds to input_{cohort}_0.rds (to avoid YAML conflicts).
🔹 2. `fn-ref.R`: Reference Level Settings
📌 2.1 Data Formatting Fixes
`preprocess_data.R` was updated to ensure correct formatting.
🔗 See fix in preprocess_data.R
📌 2.2 Code Deletions from Old Repo
Index Date: Already correct—no correction needed.
🔗 Old repo reference
Deprivation (IMD): No need for re-categorization; new repo includes IMD (1–5) directly.
🔗 Old repo reference
📌 2.3 Sex Category Adjustments
Old repo: Only male (M) and female (F).
New repo: Four levels → female, male, intersex, unknown.
Fix:
Set non-male/non-female values to missing. Retain three levels: female, male, unknown.
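The recoding described above could be sketched as follows. This is an illustrative base R example, not the code in `fn-ref.R`: the vector name `sex_raw` and the choice to fold both intersex and missing values into the `unknown` level are assumptions made for the sketch.

```r
# Hypothetical raw values as they might arrive from the new repo (four levels plus NA).
sex_raw <- c("female", "male", "intersex", "unknown", NA)

# Keep female/male unchanged; every other value (including NA) becomes "unknown".
# `NA %in% ...` evaluates to FALSE, so NA also falls into the "unknown" branch.
sex_clean <- ifelse(sex_raw %in% c("female", "male"), sex_raw, "unknown")

# Fix the level order explicitly so the three retained levels are predictable.
sex_clean <- factor(sex_clean, levels = c("female", "male", "unknown"))

levels(sex_clean)
table(sex_clean)
```

Setting the levels explicitly, rather than relying on alphabetical ordering, keeps the reference level stable if new values appear in later extracts.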
🔹 3. `fn-qa.R`: Quality Assurance Fixes
📌 3.1 Fixing "Date of Death" Message
The original message incorrectly stated that NA values were being removed.
Fix: Message now correctly states:
"Quality assurance: Date of death is invalid (on or before 1/1/1900 or after current date)"
🔗 Old repo reference
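For illustration, the rule named in the corrected message could be checked as below. This is a minimal sketch under stated assumptions: the helper `is_invalid_date` and the vector `date_of_death` are hypothetical names, and missing dates are treated as not invalid, since the message concerns out-of-range dates rather than NA values.

```r
# Hypothetical helper: TRUE when a date is on or before 1900-01-01 or after `today`.
# Missing dates return FALSE, so they are not counted as invalid by this check.
is_invalid_date <- function(x, today = Sys.Date()) {
  !is.na(x) & (x <= as.Date("1900-01-01") | x > today)
}

date_of_death <- as.Date(c("1899-12-31", "1900-01-01", "2021-06-15", NA))

# `today` is fixed here only to make the example reproducible.
invalid <- is_invalid_date(date_of_death, today = as.Date("2025-01-28"))
invalid
```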
📌 3.2 Fixing Missing Data Handling
Issue: Patients with missing sex were marked as missing for the entire record, which was unintended.
Fix: The new script preserves records and only marks sex as missing.
🔗 Updated QA code
📌 3.3 Update to `qa_bin_prostate_cancer`
Issue:
The current method introduced missing values due to logic operations with NA values. NA values were assigned to alive patients who had no recorded prostate cancer diagnosis but were affected by the logic.
The check for cause of death due to prostate cancer was unnecessary. Patients who died from prostate cancer before the index date are already excluded later in the study. Removing this check ensures all alive participants receive a valid classification (TRUE/FALSE) rather than NA.
🔗 Updated method
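The NA issue comes from how R evaluates logical operators with missing operands. The sketch below uses hypothetical vectors (`has_diagnosis`, `died_of_cancer`) and is not the project's actual definition; it only shows why combining a diagnosis flag with a cause-of-death flag can yield NA for alive patients.

```r
# Hypothetical flags for three patients.
has_diagnosis  <- c(TRUE, FALSE, FALSE)
died_of_cancer <- c(NA, NA, TRUE)  # NA: patient is alive, so no cause of death recorded

# `|` returns NA when one side is FALSE and the other is NA,
# so the second (alive, undiagnosed) patient ends up unclassified.
flag_with_death <- has_diagnosis | died_of_cancer

# Dropping the cause-of-death check leaves every patient with TRUE or FALSE.
flag_diag_only <- has_diagnosis

flag_with_death
flag_diag_only
```

Here the second patient is alive with no recorded diagnosis, yet `flag_with_death` is NA for them, while `flag_diag_only` gives a definite FALSE.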
🔹 4. `fn-inex.R`: Inclusion/Exclusion Criteria Fixes
📌 4.1 Code Corrections
The new repo includes `inex_bin_alive`, so we can use it directly instead of recalculating.
🔗 Old repo reference
🔗 New implementation
📌 4.2 Variable Renaming
`death_date` → `cens_date_death`
`has_follow_up_previous_6months` → `inex_bin_6m_reg`
`deregistration_date` → `cens_date_dereg`
📌 4.3 Removal of Unnecessary Code
Issue: The old repo had an unnecessary step for active registration at index.
Fix:
Removed redundant filtering for active registration, as inex_bin_6m_reg already ensures this. I rewrote the print message to reflect this.
🔗 Old repo reference
🔗 Updated function
📌 4.4 Fix for Exclusion Criteria in the Vax Cohort
Issue: Errors in the vax cohort exclusion criteria due to incorrect variable types.
Fix:
Redefined data types to ensure they are numeric before calculating vax_mixed.
🔗 Updated fix
🔹 5. Other Updates
📌 5.1 Cause of Death Extraction Simplified
A more efficient method for extracting cause of death has become available in OpenSAFELY.
Fix: Updated our code to align with the latest OpenSAFELY documentation.
🔗 Updated function