CDM Data Issues Tracker #1411

Open
8 of 14 tasks
callachennault opened this issue Nov 12, 2024 · 0 comments
Labels
backend-scrum items centered around engineering activities interrupt

Comments

Collaborator

callachennault commented Nov 12, 2024

Done Condition (What do we need? Why do we need it? Keep this as small as possible!)

  • Change the CDM sample list to read from the data_clinical_cvr file to minimize data lag
  • 'UNKNOWN' values for CDM elements in mskarcher_unfiltered: reached out to Chris on 11/12
  • Create a new Airflow connection for the server running SLURM. Update from Chris on 11/15: the server running the DAGs will not change for this release
  • Possibly: help get the new DAGs set up on AWS Airflow
  • Verify whether we encounter data lag depending on the time the Databricks query DAG runs each day
  • Update CDM metadata

Data issues to discuss

  • Unknown values: discuss with Chris
  • How to handle data lag in clinical files: discuss with Ben/Avery
  • The archer, hemepact, and access clinical patient and sample files are not being updated by CDM processes. Discuss with Chris
  • Should the cdm repo be updated? https://github.mskcc.org/cdsi/cdm
  • The importer does not read meta_timeline_*.txt files, only meta_timeline.txt (the data is still getting in, but the individual meta files are not being parsed)
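
To make the meta_timeline_*.txt item concrete, here is a minimal sketch of file discovery that matches both the plain name and the suffixed variants. It does not reflect the importer's actual code, which is not shown in this issue; the function name `find_timeline_meta_files` and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns for the single-file name and the per-type variants.
TIMELINE_META_PATTERNS = ("meta_timeline.txt", "meta_timeline_*.txt")


def find_timeline_meta_files(filenames):
    """Return, sorted, every name that matches one of the timeline meta patterns.

    Hypothetical helper: illustrates the matching rule only, not the importer.
    """
    return sorted(
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in TIMELINE_META_PATTERNS)
    )


if __name__ == "__main__":
    # Illustrative directory listing (file names are assumptions).
    listing = [
        "meta_timeline.txt",
        "meta_timeline_surgery.txt",
        "data_timeline_surgery.txt",
        "meta_clinical_sample.txt",
    ]
    print(find_timeline_meta_files(listing))
```

Matching on an explicit pattern list, rather than a single hard-coded file name, is one way the per-type meta files could be picked up individually.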

Process improvements

  • Only merge in a predefined list of CDM files, to prevent extra or unneeded data from being committed/imported
  • Check the number of 'unknown' values in the clinical file and do not merge if it is above a certain threshold
  • Store a backup of yesterday's data from the S3 bucket
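
The first two improvements above could be sketched roughly as follows. This is an illustration under stated assumptions, not the existing merge code: the function names, the 5% default threshold, the column names, and the file names are all hypothetical, and the file is assumed to be tab-delimited with `#`-prefixed header comment lines.

```python
import csv


def select_files_to_merge(candidate_files, allowed_files):
    """Keep only candidates that appear in the predefined allowlist (order preserved)."""
    allowed = set(allowed_files)
    return [name for name in candidate_files if name in allowed]


def unknown_fraction(lines, columns, token="UNKNOWN"):
    """Fraction of cells in `columns` whose value equals `token` (case-insensitive).

    `lines` is any iterable of text lines, e.g. an open file. Lines starting
    with '#' are skipped (assumed to be header comments).
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    total = 0
    unknown = 0
    for row in reader:
        for column in columns:
            total += 1
            value = (row.get(column) or "").strip().upper()
            if value == token:
                unknown += 1
    return unknown / total if total else 0.0


def should_merge(lines, columns, max_unknown_fraction=0.05):
    """Allow the merge only if the unknown fraction is at or below the threshold."""
    return unknown_fraction(lines, columns) <= max_unknown_fraction


if __name__ == "__main__":
    # Illustrative clinical file content (values are made up).
    sample = [
        "#Patient ID\tSmoking\n",
        "PATIENT_ID\tSMOKING\n",
        "P1\tUNKNOWN\n",
        "P2\tNever\n",
        "P3\tunknown\n",
        "P4\tFormer\n",
    ]
    print(unknown_fraction(sample, ["SMOKING"]))
    print(should_merge(sample, ["SMOKING"]))
```

The threshold and the set of columns to check would need to be agreed on with the CDM team; the values here are placeholders.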

Technical Description (How are we going to achieve the above)

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

callachennault added the backend-scrum and interrupt labels on Nov 12, 2024