Add treatment timeline to pancan studies #1597

Merged (24 commits) on Apr 17, 2022

@rmadupuri (Collaborator) commented Feb 1, 2022

What?

Related issue #241

Testing instances:
BRCA: https://triage.cbioportal.mskcc.org/study/summary?id=brca_tcga_pan_can_atlas_2018
COADREAD: https://triage.cbioportal.mskcc.org/study/summary?id=coadread_tcga_pan_can_atlas_2018
OV: https://triage.cbioportal.mskcc.org/study/summary?id=ov_tcga_pan_can_atlas_2018

Tracks Added:

  1. TREATMENT - Drug therapy and radiation therapy data
  2. SAMPLE ACQUISITION - Days to sample collection
  3. STATUS - Days to each of the following events (not all cancer types have all the attributes)
    - Initial Diagnosis
    - Stage_events
    - New tumor events (metastasis, recurrence, ...)
    - Death
    - Days to Last Follow Up
    - Diabetes Onset
    - Diagnostic Computed Tomography
    - Diagnostic MRI
    - FDG or CT_PET
    - First Biochemical Recurrence
    - Last Known Alive
    - Pancreatitis Onset
    - Performance Status Assessment
    - Scan
    - Stem Cell Transplantation
    - Submitted Specimen Dx

Data Source:
GDAC Firehose: https://gdac.broadinstitute.org/
File used: Merge_Clinical.Level_1.20160128 (clin.merged.txt) for each cancer type.
TCGA Forms and Documents (for info on variables collected and index dates): https://www.nationwidechildrens.org/research/areas-of-research/biopathology-center/nci-ccg-project-team/the-cancer-genome-atlas/tcga-forms-and-documents

DATA TRANSFORMATION PROCESS:
The TCGA clinical data is organized hierarchically. A custom script flattens the hierarchy into a table, and the patient nodes are then filtered by the following terms:
- drug
- radiation
- days_to
- omf
- followup
- samples
- new tumor events
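
The flatten-and-filter step could be sketched roughly as below. This is not the custom script used for this PR; it is a minimal illustration that assumes each patient is available as a nested dict and that node paths are joined with dots. The helper names (`flatten`, `filter_paths`), the sample record, and the underscore spelling `new_tumor_event` are assumptions made for the example.

```python
from typing import Any, Dict, Iterator, Tuple

# Terms from the list above. "new_tumor_event" (with underscores) is an assumed
# spelling of "new tumor events" as it would appear inside a dotted path.
FILTER_TERMS = ("drug", "radiation", "days_to", "omf", "followup", "samples", "new_tumor_event")


def flatten(node: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every leaf of a nested dict."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def filter_paths(flat: Dict[str, Any], terms=FILTER_TERMS) -> Dict[str, Any]:
    """Keep only the paths that contain at least one of the filter terms."""
    return {p: v for p, v in flat.items() if any(t in p for t in terms)}


# Hypothetical patient record, for illustration only.
record = {
    "patient": {
        "gender": "female",
        "days_to_death": "1200",
        "drugs": {"drug": {"drug_name": "Tamoxifen", "days_to_drug_therapy_start": "30"}},
    }
}
flat = dict(flatten(record))
kept = filter_paths(flat)
```

Here `kept` retains the `days_to_death` and drug paths and drops `patient.gender`, since that path matches none of the terms.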

Details on extracting the treatment data:

The treatment data for a patient (for a single drug or radiation entry) is organized as shown below (image taken from Enrico et al., medRxiv 2021).
[image]

  1. All the records starting with patient.drugs and patient.radiations are filtered for each patient, and all the nodes (spanning multiple levels) are selected.
  2. Records with no therapy start date specified are dropped.
  3. In the raw data from GDAC Firehose, the drug names were not harmonized: they contained typos, and the same drug was named in different ways. The medRxiv paper (Enrico et al., 2021) normalized these names (i.e., brand names were changed to generic names and typos were corrected; where a drug had multiple generic names, the name found in DrugBank was selected). The mapping is found here: harmonized_drug_names_mapping.txt. The same mapping was used to fix the drug names when curating the data for cBioPortal.
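
A rough sketch of these three steps, assuming the flattened data is a dict of dotted paths per patient. The attribute names (`days_to_drug_therapy_start`, `days_to_radiation_therapy_start`, `drug_name`), the two mapping entries, and the output column names are assumptions for illustration; the real mapping comes from harmonized_drug_names_mapping.txt.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical excerpt of the name mapping (raw lowercase name -> harmonized name).
DRUG_NAME_MAP = {"taxol": "Paclitaxel", "tamoxifin": "Tamoxifen"}

# Matches e.g. "patient.drugs.drug-2.drug_name" -> ("drugs", "drug-2", "drug_name").
NODE_RE = re.compile(r"^patient\.(drugs|radiations)\.([^.]+)\.(.+)$")

# Assumed names of the start-date attribute for each kind of treatment node.
START_ATTR = {
    "drugs": "days_to_drug_therapy_start",
    "radiations": "days_to_radiation_therapy_start",
}
MISSING = (None, "", "NA")


def extract_treatments(flat: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Group drug/radiation attributes by node, drop undated nodes, fix drug names."""
    nodes = defaultdict(dict)
    for path, value in flat.items():
        match = NODE_RE.match(path)
        if match:
            kind, node, attr = match.groups()
            nodes[(kind, node)][attr] = value

    events = []
    for (kind, _node), attrs in sorted(nodes.items()):
        start = attrs.get(START_ATTR[kind])
        if start in MISSING:
            continue  # step 2: no therapy start date, so the record is dropped
        event = {
            "EVENT_TYPE": "TREATMENT",
            "TREATMENT_TYPE": "Drug therapy" if kind == "drugs" else "Radiation therapy",
            "START_DATE": int(start),
        }
        if kind == "drugs":
            raw = str(attrs.get("drug_name", "")).strip()
            # step 3: fall back to the raw name when the mapping has no entry
            event["AGENT"] = DRUG_NAME_MAP.get(raw.lower(), raw)
        events.append(event)
    return events


# Hypothetical flattened patient, for illustration only.
treatment_flat = {
    "patient.drugs.drug.drug_name": "Taxol",
    "patient.drugs.drug.days_to_drug_therapy_start": "30",
    "patient.drugs.drug-2.drug_name": "Tamoxifin",
    "patient.drugs.drug-2.days_to_drug_therapy_start": "NA",
    "patient.radiations.radiation.days_to_radiation_therapy_start": "60",
}
treatment_events = extract_treatments(treatment_flat)
```

In this example the second drug node is dropped because its start date is NA, and "Taxol" is rewritten to "Paclitaxel".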

Treatment timeline test: https://triage.cbioportal.mskcc.org/patient?studyId=brca_tcga_pan_can_atlas_2018&caseId=TCGA-A2-A0EW

Details on extracting the SAMPLE ACQUISITION track data:

  1. All the records starting with patient.tumor_samples were filtered for each patient.
  2. Each patient has multiple samples (primary, normals, ...). Only the samples present in the cBioPortal pancan studies are selected.
  3. The days_to_sample_procurement attribute is used as the start_date.
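
As an illustration only, the sample selection could look something like the sketch below. The node layout under patient.tumor_samples, the `sample_id` attribute, the output column names, and the choice to skip samples with no procurement day are all assumptions, not details confirmed by the source data.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Set

# Assumed layout: "patient.tumor_samples.<node>.<attribute>".
SAMPLE_RE = re.compile(r"^patient\.tumor_samples\.([^.]+)\.(.+)$")
MISSING = (None, "", "NA")


def sample_acquisition_events(flat: Dict[str, Any], study_sample_ids: Set[str]) -> List[Dict[str, Any]]:
    """Build one SAMPLE ACQUISITION event per sample that is present in the study."""
    nodes = defaultdict(dict)
    for path, value in flat.items():
        match = SAMPLE_RE.match(path)
        if match:
            node, attr = match.groups()
            nodes[node][attr] = value

    events = []
    for node in sorted(nodes):
        attrs = nodes[node]
        sample_id = attrs.get("sample_id")  # hypothetical attribute name
        start = attrs.get("days_to_sample_procurement")
        if sample_id not in study_sample_ids or start in MISSING:
            continue  # not in the pancan study, or no procurement day (assumed rule)
        events.append({
            "EVENT_TYPE": "SAMPLE ACQUISITION",
            "SAMPLE_ID": sample_id,
            "START_DATE": int(start),
        })
    return events


# Hypothetical flattened patient with one primary and one normal sample.
sample_flat = {
    "patient.tumor_samples.tumor_sample.sample_id": "TCGA-XX-0001-01",
    "patient.tumor_samples.tumor_sample.days_to_sample_procurement": "15",
    "patient.tumor_samples.tumor_sample-2.sample_id": "TCGA-XX-0001-10",
    "patient.tumor_samples.tumor_sample-2.days_to_sample_procurement": "15",
}
sample_events = sample_acquisition_events(sample_flat, {"TCGA-XX-0001-01"})
```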

Details on extracting the STATUS track data:

  1. Deceased values: the days_to_death attribute was pulled from each patient's followup records (e.g. patient.follow_ups.follow_up-2.days_to_death), along with patient.days_to_death. If both values exist, the days_to_death from the followup records is used (the CDR paper used the death values from followup).
  2. Initial Diagnosis: the days_to_initial_pathologic_diagnosis and all the stage_event patient records are extracted.
  3. New tumor events: all the patient.new_tumor_events records were extracted (these give information on recurrence, metastasis, ...).
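
The precedence rule for the deceased values can be sketched as follows. The path pattern follows the example given in step 1; taking the value from the highest-numbered followup node when several carry a days_to_death is an assumption of this sketch, as is the function name.

```python
import re
from typing import Any, Dict, Optional

# "follow_up" is treated as node 1, "follow_up-2" as node 2, and so on.
FOLLOWUP_DEATH_RE = re.compile(r"^patient\.follow_ups\.follow_up(?:-(\d+))?\.days_to_death$")


def is_missing(value: Any) -> bool:
    return value is None or str(value).strip() in ("", "NA")


def pick_days_to_death(flat: Dict[str, Any]) -> Optional[int]:
    """Prefer days_to_death from followup records over patient.days_to_death."""
    candidates = []
    for path, value in flat.items():
        match = FOLLOWUP_DEATH_RE.match(path)
        if match and not is_missing(value):
            index = int(match.group(1) or 1)
            candidates.append((index, int(value)))
    if candidates:
        # Assumption: use the value from the latest (highest-numbered) followup node.
        return max(candidates)[1]
    top_level = flat.get("patient.days_to_death")
    return None if is_missing(top_level) else int(top_level)


# Hypothetical inputs, for illustration only.
both_values = {
    "patient.days_to_death": "1190",
    "patient.follow_ups.follow_up.days_to_death": "NA",
    "patient.follow_ups.follow_up-2.days_to_death": "1200",
}
patient_only = {"patient.days_to_death": "1190"}
death_from_followup = pick_days_to_death(both_values)
death_from_patient = pick_days_to_death(patient_only)
```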

STATUS Track Test: https://triage.cbioportal.mskcc.org/patient?studyId=brca_tcga_pan_can_atlas_2018&caseId=TCGA-A2-A04P

Details on extracting the followup data:

  1. All the records starting with patient.days_to_last_followup and patient.follow_ups are filtered for each patient.
  2. There are cases where patient.days_to_last_followup is given and followup-1, followup-2, ... are NA. In that case, patient.days_to_last_followup is used as followup-1. If followup-1, followup-2, ... timepoints are listed, patient.days_to_last_followup is ignored. All other records with no followup days specified are dropped.
  3. Only the last follow-up date for each patient is selected, and the data is added to the STATUS track.
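
A small sketch of the followup selection, under the assumption that followup nodes live under patient.follow_ups and each carries a days_to_last_followup attribute. Interpreting "last" as the largest day value is also an assumption, as is the function name.

```python
from typing import Any, Dict, Optional


def is_missing(value: Any) -> bool:
    return value is None or str(value).strip() in ("", "NA")


def last_followup_day(flat: Dict[str, Any]) -> Optional[int]:
    """Return the last followup day for one patient, or None if nothing is recorded."""
    days = [
        int(value)
        for path, value in flat.items()
        if path.startswith("patient.follow_ups.")
        and path.endswith(".days_to_last_followup")
        and not is_missing(value)
    ]
    if not days:
        # No usable followup timepoints: fall back to the patient-level value,
        # which then plays the role of followup-1.
        top_level = flat.get("patient.days_to_last_followup")
        if is_missing(top_level):
            return None  # no followup days at all, so the record is dropped
        days = [int(top_level)]
    # Assumption: "last" means the largest day value among the timepoints.
    return max(days)


# Hypothetical inputs, for illustration only.
with_followups = {
    "patient.days_to_last_followup": "100",
    "patient.follow_ups.follow_up.days_to_last_followup": "400",
    "patient.follow_ups.follow_up-2.days_to_last_followup": "900",
}
followups_all_na = {
    "patient.days_to_last_followup": "100",
    "patient.follow_ups.follow_up.days_to_last_followup": "NA",
}
day_from_followups = last_followup_day(with_followups)
day_from_patient = last_followup_day(followups_all_na)
```

In the first case the patient-level value is ignored because followup timepoints exist; in the second it is used because every followup value is NA.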

Followup timeline test: https://triage.cbioportal.mskcc.org/patient?studyId=brca_tcga_pan_can_atlas_2018&caseId=TCGA-AR-A0TQ

@yichaoS yichaoS merged commit 1cb9767 into master Apr 17, 2022
@yichaoS yichaoS deleted the pancan_treatment_data branch April 17, 2022 22:43