Simplify clinical data binning related SQL #10823

Merged 2 commits into cBioPortal:demo-rfc80-poc from count-na-sql on Jul 11, 2024

Member

@onursumer onursumer commented Jun 6, 2024

  • Use clinical data counts derived from the new clinical_data_derived table instead of fetching all the actual clinical data
  • Simplify NA count logic. NA data is now stored in the database, so we only need to get that NA count.
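
To illustrate the second bullet: once NA rows are stored like any other value, the NA count for an attribute can be read straight from the aggregated counts. A minimal Java sketch, assuming a hypothetical `Count` record in place of the project's `ClinicalDataCount` model (whose exact fields are not shown in this conversation):

```java
import java.util.List;

final class NaCountSketch {
    // Hypothetical stand-in for one aggregated row: attribute, value, count.
    record Count(String attributeId, String value, int count) {}

    // Sum the counts of all rows for the given attribute whose value is "NA".
    static int naCount(List<Count> counts, String attributeId) {
        return counts.stream()
            .filter(c -> c.attributeId().equals(attributeId))
            .filter(c -> "NA".equalsIgnoreCase(c.value()))
            .mapToInt(Count::count)
            .sum();
    }

    public static void main(String[] args) {
        List<Count> counts = List.of(
            new Count("AGE", "45", 2),
            new Count("AGE", "NA", 3),
            new Count("SEX", "NA", 1));
        System.out.println(naCount(counts, "AGE")); // prints 3
    }
}
```

The case-insensitive comparison is an assumption of this sketch, not something the PR description states.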

@onursumer onursumer force-pushed the count-na-sql branch 3 times, most recently from b34da51 to a3624e0 Compare June 21, 2024 00:16
@onursumer onursumer changed the title from "Move NA count logic into SQL" to "Simplify clinical data binning related SQL" Jun 21, 2024
@@ -310,17 +310,33 @@ public List<ClinicalDataBin> calculateStaticDataBins(
List<String> filteredUniqueSampleKeys,
List<String> filteredUniquePatientKeys
) {
return NewClinicalDataBinUtil.calculateStaticDataBins(
Member Author

This reverts to the legacy implementation for the legacy endpoint. We can no longer use NewClinicalDataBinUtil.calculateDynamicDataBins for the legacy endpoint because the underlying logic has changed significantly.

Comment on lines 39 to 65
<select id="getPatientClinicalDataCountsForBinning">
SELECT
attribute_name as attributeId,
if(attribute_value='', 'NA', attribute_value) AS value,
count(value) as count
FROM clinical_data_derived
<include refid="patientClinicalDataFromStudyViewFilter" />
AND type = 'patient'
GROUP BY attribute_name, value
</select>

<select id="getSampleClinicalDataCountsForBinning">
SELECT
attribute_name as attributeId,
if(attribute_value='', 'NA', attribute_value) AS value,
count(value) as count
FROM clinical_data_derived
<include refid="sampleClinicalDataFromStudyViewFilter" />
AND type = 'sample'
GROUP BY attribute_name, value
</select>
Member Author

These queries are very similar to the getClinicalDataCounts query. We may be able to use that one directly instead of adding new SQL.

<select id="getClinicalDataCounts" resultType="org.cbioportal.model.ClinicalDataCount">
<include refid="getCategoricalClinicalDataCountsQuerySample">
<property name="table_name_prefix" value="sample"/>
</include>
UNION ALL
<include refid="getCategoricalClinicalDataCountsQueryPatient">
<property name="table_name_prefix" value="patient"/>
</include>
</select>
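
For readers less familiar with the SQL, here is a rough in-memory Java equivalent of what the two binning queries compute: empty values are mapped to "NA", then rows are counted per attribute and value. This is only a sketch; the `Row` record and the `attribute|value` string key are illustrative assumptions, not part of the PR.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupCountsSketch {
    // Hypothetical stand-in for one row of clinical_data_derived.
    record Row(String attributeName, String attributeValue) {}

    // Mirrors if(attribute_value='', 'NA', attribute_value).
    static String normalize(String value) {
        return value.isEmpty() ? "NA" : value;
    }

    // Mirrors GROUP BY attribute_name, value with a count per group.
    static Map<String, Long> countByAttributeAndValue(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            r -> r.attributeName() + "|" + normalize(r.attributeValue()),
            TreeMap::new,
            Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("AGE", ""), new Row("AGE", ""),
            new Row("AGE", "45"), new Row("SEX", ""));
        // prints {AGE|45=1, AGE|NA=2, SEX|NA=1}
        System.out.println(countByAttributeAndValue(rows));
    }
}
```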

@onursumer onursumer marked this pull request as ready for review June 21, 2024 18:22
@onursumer onursumer requested review from alisman and haynescd June 21, 2024 18:22
@@ -26,6 +26,26 @@ public ClinicalDataBinner(
this.dataBinner = dataBinner;
}

// TODO move this to a utility class?
public List<ClinicalData> convertCountsToData(List<ClinicalDataCount> clinicalDataCounts)
Collaborator

This should be private

/**
* Calculate number of clinical data marked actually as "NA", "NAN", or "N/A"
*/
public Long countNAs(List<Binnable> clinicalData) {
Collaborator

This is cast to Long and then later it grabs just the intValue(). Do you need the wrapper class?

    bin.setCount(countNAs(clinicalData).intValue());

Member Author

Probably we don't need a wrapper here. For now, just keeping it consistent with the legacy countNAs method.

public Long countNAs(List<Binnable> clinicalData, ClinicalDataType clinicalDataType, List<String> ids) {

We can revisit this once we completely get rid of the legacy functionality.
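
As a possible shape for that later cleanup, the sketch below returns a primitive `long` and narrows it with `Math.toIntExact`, which throws on overflow instead of silently truncating the way `intValue()` does. It is a simplified illustration: it takes plain strings rather than `Binnable` objects, and the trimming and case-insensitive matching are assumptions.

```java
import java.util.List;
import java.util.Locale;

final class CountNasSketch {
    // Markers treated as missing, taken from the Javadoc quoted above.
    private static final List<String> NA_MARKERS = List.of("NA", "NAN", "N/A");

    // Returns a primitive long, so no wrapper object is created.
    static long countNAs(List<String> values) {
        return values.stream()
            .filter(v -> NA_MARKERS.contains(v.trim().toUpperCase(Locale.ROOT)))
            .count();
    }

    // Narrows to int and throws ArithmeticException if the count does not fit.
    static int countNAsAsInt(List<String> values) {
        return Math.toIntExact(countNAs(values));
    }

    public static void main(String[] args) {
        // prints 3
        System.out.println(countNAsAsInt(List.of("NA", "12", "n/a", "NaN")));
    }
}
```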

@onursumer onursumer force-pushed the count-na-sql branch 5 times, most recently from 4b73039 to 1c383c6 Compare June 27, 2024 21:51
@@ -168,10 +171,13 @@
</select>


- <sql id="getCategoricalClinicalDataCountsQuerySample">
+ <sql id="getClinicalDataCountsQuerySample">
Member Author

Renamed because this SQL returns both categorical and numeric data.

@@ -198,10 +204,13 @@
value
</sql>

- <sql id="getCategoricalClinicalDataCountsQueryPatient">
+ <sql id="getClinicalDataCountsQueryPatient">
Member Author

Renamed because this SQL returns both categorical and numeric data.

</include>
</if>
<if test="dataFilterValue.start != null || dataFilterValue.end != null">
AND match(attribute_value, '^[\d\.]+$')
Member Author

This ignores edge cases like <18, >90, etc. There is a separate ticket for this: cBioPortal/rfc80-team#32
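
To make the limitation concrete, the sketch below applies the same pattern with `java.util.regex`. The database's regex engine may differ in details, so treat this as an approximation; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class NumericFilterSketch {
    // Same pattern as in the SQL: one or more digits or dots, nothing else.
    private static final Pattern NUMERIC = Pattern.compile("^[\\d\\.]+$");

    static boolean isPlainNumeric(String value) {
        return NUMERIC.matcher(value).find();
    }

    public static void main(String[] args) {
        // "18" and "72.5" match; "<18", ">90" and "-3" do not.
        // "1.2.3" also matches, since the pattern does not limit the number of dots.
        for (String v : new String[] {"18", "72.5", "<18", ">90", "-3", "1.2.3"}) {
            System.out.println(v + " -> " + isPlainNumeric(v));
        }
    }
}
```

Besides the open-ended values mentioned above, the pattern also rejects negative numbers and accepts strings with several dots, which may be worth covering in the separate ticket.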

@onursumer onursumer force-pushed the count-na-sql branch 3 times, most recently from 00157cc to c4ae8d0 Compare July 1, 2024 20:22
List<DataBin> dataBins,
ClinicalDataType clinicalDataType,
List<Binnable> clinicalData,
List<String> ids
Contributor

Can you change the variable name "ids" to something more descriptive?

Member Author

I think this one is now only used by the legacy endpoint. The new one doesn't need to deal with sample/patient ids anymore. We should probably add comments to the methods that are now only needed for the legacy endpoint so that we can remove them later once we completely migrate to the new implementation.

@onursumer onursumer force-pushed the count-na-sql branch 3 times, most recently from 9e0d230 to b2dba31 Compare July 3, 2024 17:47
alisman previously approved these changes Jul 8, 2024

@haynescd
Collaborator

Looks like most of the sonar issues are around the deprecated fn.

except for this one. https://sonarcloud.io/project/issues?resolved=false&sinceLeakPeriod=true&pullRequest=10823&id=cBioPortal_cbioportal&open=AZCdbAVZBEMFXTbxsIZU

Collaborator

@haynescd haynescd left a comment

👍

@haynescd haynescd merged commit 0dfb346 into cBioPortal:demo-rfc80-poc Jul 11, 2024
10 of 13 checks passed
@onursumer onursumer deleted the count-na-sql branch August 22, 2024 12:53
haynescd pushed a commit that referenced this pull request Nov 25, 2024
* simplify clinical data binning related SQL

* fix numericalClinicalDataCountFilter