Consolidate SQL files used to create CH Tables #10867

haynescd · 2024-06-26T17:49:36Z

No description provided.

sonarqubecloud · 2024-06-26T17:56:38Z

Quality Gate passed

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

sheridancbio

Looks good, and just in time (I will need to create these views soon on the new version of the genie database, so I can run this).

I've pointed out a couple of concerns below, but I don't think they need to hold up this work.

sheridancbio · 2024-06-26T18:45:54Z

src/main/resources/db-scripts/clickhouse/clickhouse.sql

+    concat(cs.cancer_study_identifier, '_', sample.stable_id) AS sample_unique_id,
+    genetic_alteration_type AS alteration_type,
+    -- If a mutation is found in a gene that is not in a gene panel we assume Whole Exome Sequencing WES
+    ifnull(gene_panel.stable_id, 'WES') AS gene_panel_id,


I looked into this recently while fixing the 4 non-panel but profiled genie samples. According to the docs, the value 'NA' is to be used "When the sample is not profiled on a gene panel, or if the sample is not profiled at all".
So I'm not sure whether 'NA' is supposed to be used for whole exome sequencing, but it sounds like it might be. In that case, I guess the 'NA' value would pass through this logic unchanged, because ifnull only replaces NULL. I do wonder whether a mixture of 'WES' and 'NA' would be confusing in the downstream logic. Perhaps our docs should be changed to reserve 'NA' for cases where a sample was not profiled, and to reserve 'WES' or something similar for whole exome sequencing.
https://docs.cbioportal.org/file-formats/#values

This comment applies to several tables below as well.

Yeah, this isn't finalized, and I'd be interested in jumping on a call in the future to discuss it and figure out the best approach.
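To make the concern in this thread concrete, here is a minimal sketch of how the `ifnull(..., 'WES')` fallback treats a literal 'NA'. It is not the PR's code: it runs the same expression in an in-memory SQLite database as a stand-in for ClickHouse, and the table name `profiled`, the column `panel_stable_id`, the panel name `PANEL_1`, and all rows are invented for illustration.

```python
import sqlite3

# Hypothetical rows, invented for this sketch (not taken from any real study).
ROWS = [
    ("sample_a", "PANEL_1"),  # profiled on a named gene panel
    ("sample_b", None),       # no gene panel matched, so the value is NULL
    ("sample_c", "NA"),       # literal 'NA', as allowed by the file-format docs
]


def resolve_gene_panel_ids(rows):
    """Apply the ifnull(..., 'WES') fallback and return (sample, panel) pairs.

    SQLite is used only so the example is runnable; the assumption is that
    ClickHouse's NULL fallback behaves the same way for these inputs.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE profiled (sample_id TEXT, panel_stable_id TEXT)"
        )
        conn.executemany("INSERT INTO profiled VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT sample_id, ifnull(panel_stable_id, 'WES') AS gene_panel_id "
            "FROM profiled ORDER BY sample_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for sample_id, gene_panel_id in resolve_gene_panel_ids(ROWS):
        print(sample_id, gene_panel_id)
```

In this sketch only the NULL row becomes 'WES'; the row that already holds 'NA' keeps it, so both values can appear side by side in `gene_panel_id`.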

sheridancbio · 2024-06-26T18:50:01Z

src/main/resources/db-scripts/clickhouse/clickhouse.sql

+    'WES' AS gene_panel_id,
+    gene.hugo_gene_symbol AS gene
+FROM gene
+WHERE gene.entrez_gene_id > 0;


Some of our databases use negative entrez_gene_id values (mainly for microRNA mature/precursor combinations related to expression profiling). I'm not sure whether that matters here. Maybe a note should be added about the possible future need to support negative values? Or maybe those are intentionally being excluded here.
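As a rough illustration of the point above, the sketch below shows which rows a `> 0` filter drops compared with a hypothetical `<> 0` alternative. It uses in-memory SQLite as a stand-in for ClickHouse; the gene symbols and IDs are invented, and the `<> 0` variant is only an assumption about what supporting negative values might look like, not a proposal taken from the PR.

```python
import sqlite3

# Invented rows; the negative ID mimics the microRNA convention described above.
GENES = [
    (101, "GENE_A"),
    (202, "GENE_B"),
    (-5, "MIR_X"),
]

# The filter used in the diff, and a hypothetical variant that keeps negatives.
POSITIVE_ONLY = (
    "SELECT hugo_gene_symbol FROM gene "
    "WHERE entrez_gene_id > 0 ORDER BY entrez_gene_id"
)
NON_ZERO = (
    "SELECT hugo_gene_symbol FROM gene "
    "WHERE entrez_gene_id <> 0 ORDER BY entrez_gene_id"
)


def select_gene_symbols(genes, include_negative=False):
    """Return gene symbols that pass the chosen entrez_gene_id filter."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE gene (entrez_gene_id INTEGER, hugo_gene_symbol TEXT)"
        )
        conn.executemany("INSERT INTO gene VALUES (?, ?)", genes)
        query = NON_ZERO if include_negative else POSITIVE_ONLY
        return [row[0] for row in conn.execute(query).fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    print(select_gene_symbols(GENES))
    print(select_gene_symbols(GENES, include_negative=True))
```

With the `> 0` filter the negative-ID row never reaches the derived table, which is fine if the exclusion is intentional but would need revisiting if those entries are ever required downstream.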

Consolidate SQL files used to create CH Tables

fdb65f5

haynescd added performance backend labels Jun 26, 2024

haynescd requested review from alisman and sheridancbio June 26, 2024 17:49

haynescd self-assigned this Jun 26, 2024

alisman approved these changes Jun 26, 2024

sheridancbio approved these changes Jun 26, 2024

haynescd merged commit 79d36e7 into demo-rfc80-poc Jun 26, 2024
16 of 19 checks passed

haynescd deleted the rfc80/Consolidate-SQL-CH-Table-Creation branch June 26, 2024 19:18

haynescd added a commit that referenced this pull request Nov 25, 2024

Consolidate SQL files used to create CH Tables (#10867)

604568f
