fix: create tnscope mnvs #1524

Merged
merged 124 commits into from
Feb 7, 2025

Conversation

mathiasbio
Collaborator

@mathiasbio mathiasbio commented Jan 20, 2025

Description

This PR adds a post-processing step to TNscope (for the standard TGA workflow) that merges SNVs and InDels sharing the same PID into MNVs, so that they can be merged more accurately with the VarDict results.

See issue: #1525

The script was taken from Sentieon here: https://github.com/Sentieon/sentieon-scripts/blob/master/merge_mnp/merge_mnp.py

The script has been slightly modified to merge variants even when they carry filters other than PASS. This allows merging of variants flagged with triallelic_site, and with any other soft filters we might add in the future. Some refactoring of the original was also done to improve readability (though it is still very messy).

The TNscope VCF is quality filtered before merging SNVs

To avoid merging low-quality variants into MNVs, the VCF is quality filtered before merging.

Merging SNVs with different filters set

One open question was how to consolidate the filters when merging SNVs that carry different filters, such as germline_risk, in_normal, triallelic_site, and PASS. This was solved with logic that is exemplified in the table below:

| v1 | v2 | v3 | filter_set |
|---|---|---|---|
| in_normal | PASS | germline_risk | MNV_CONFLICTING_FILTERS |
| in_normal,germline_risk | in_normal | | in_normal,germline_risk |
| in_normal | triallelic_site | germline_risk | MNV_CONFLICTING_FILTERS |
| in_normal,triallelic_site | triallelic_site | | MNV_CONFLICTING_FILTERS |
| in_normal,triallelic_site | germline_risk | | in_normal,triallelic_site,germline_risk |
| in_normal,triallelic_site | PASS | | MNV_CONFLICTING_FILTERS |

In summary:

  1. If the constituent variants have mixed filters, the filter is set to MNV_CONFLICTING_FILTERS.
  2. UNLESS at least one of the "normal filters" (in_normal, germline_risk) is set in ALL variants of the MNV; in that case the union of the filters is set instead.
  3. Finally, if all variants have the same filter, such as PASS, that filter is set.
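
The three rules above can be sketched in Python as follows. This is a minimal illustration and not the code in merge_mnp.py: the function name `consolidate_filters`, the constant names, and the first-seen ordering of the combined filters are assumptions made for the example.

```python
from typing import List

# Filters that point to evidence of the variant in the matched normal.
NORMAL_FILTERS = {"in_normal", "germline_risk"}
CONFLICT = "MNV_CONFLICTING_FILTERS"


def consolidate_filters(variant_filters: List[List[str]]) -> str:
    """Return the FILTER value for an MNV built from several variants.

    Each inner list holds the FILTER entries of one constituent variant.
    (Hypothetical helper, written only to illustrate the rules.)
    """
    if not variant_filters:
        raise ValueError("an MNV needs at least one constituent variant")

    filter_sets = [set(filters) for filters in variant_filters]

    # Rule 3: identical filters on every variant are carried over as-is.
    if all(fs == filter_sets[0] for fs in filter_sets):
        return ",".join(variant_filters[0])

    # Rule 2: every variant carries at least one normal filter, so the
    # union of all filters is reported (first-seen order, an assumption).
    if all(fs & NORMAL_FILTERS for fs in filter_sets):
        union: List[str] = []
        for filters in variant_filters:
            for name in filters:
                if name not in union:
                    union.append(name)
        return ",".join(union)

    # Rule 1: mixed filters without a normal filter on every variant.
    return CONFLICT
```

Called with the rows of the table, for example `consolidate_filters([["in_normal", "triallelic_site"], ["germline_risk"]])`, the sketch gives the same filter_set values as listed there.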

On top of this, a few new INFO fields are added to preserve information from the constituent variants. Originally only the FILTER was added, but this has been extended with:

  • TUMOR_DP (ref|alt)
  • NORMAL_DP (ref|alt)
  • TUMOR_AF
  • NORMAL_AF
  • variant chrom_pos_ref_alt

The values for all constituent variants are joined as a comma-separated list.
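
As a rough sketch of how such list-valued fields could be assembled, assuming a hypothetical `Constituent` record per merged variant (the real script works on its own variant objects), the tag names and the "|" and "," separators follow the example INFO field shown under Feature Tests:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Constituent:
    """Hypothetical record for one SNV/InDel that goes into an MNV."""
    chrom: str
    pos: int
    ref: str
    alt: str
    filters: List[str]
    tumor_ad: Tuple[int, int]   # (ref depth, alt depth) in the tumor
    normal_ad: Tuple[int, int]  # (ref depth, alt depth) in the normal
    tumor_af: str               # kept as text so values are written back unchanged
    normal_af: str


def build_mnv_info(variants: List[Constituent]) -> Dict[str, str]:
    """Join per-variant values with ',' and within-variant values with '|'."""
    return {
        "TNSCOPE_MNV_VARS": ",".join(
            f"{v.chrom}_{v.pos}_{v.ref}_{v.alt}" for v in variants
        ),
        "TNSCOPE_MNV_FILTERS": ",".join("|".join(v.filters) for v in variants),
        "TNSCOPE_MNV_TUMOR_ADs": ",".join(
            f"{v.tumor_ad[0]}|{v.tumor_ad[1]}" for v in variants
        ),
        "TNSCOPE_MNV_NORMAL_ADs": ",".join(
            f"{v.normal_ad[0]}|{v.normal_ad[1]}" for v in variants
        ),
        "TNSCOPE_MNV_TUMOR_AFs": ",".join(v.tumor_af for v in variants),
        "TNSCOPE_MNV_NORMAL_AFs": ",".join(v.normal_af for v in variants),
    }
```

Keeping every field in the same variant order means that position i in each list always refers to the same constituent variant.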

Regarding benchmarking the speed of the merge script:

  • For most standard TGA cases that I checked, the entire rule, which includes multiple TNscope post-processing steps, takes 7-8 seconds
  • For WES it took 22 seconds with a maximum memory peak of 50MB

Added

  • TNscope MNV merge script

Changed

  • merge SNVs into MNVs in TNscope TGA
  • change raw delivery SNV file for TGA to before any post-processing

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • balsamic_filters.rst (updated with TNscope MNV merging)

Tests

Feature Tests

Here's an example of the merged INFO field from the sheet above:

DB=.;ECNT=11.0;FS=0.0;HCNT=6.0;MAX_ED=80.0;MIN_ED=0.5;NLOD=175.255;NLODF=39.465;PV=0.40159999999999996;PV2=0.35085;RPA=.;RU=.;SOR=0.911;STR=.;TLOD=7.615;TNSCOPE_MNV_FILTERS=triallelic_site|in_normal,in_normal;TNSCOPE_MNV_NORMAL_ADs=494|10,726|10;TNSCOPE_MNV_NORMAL_AFs=0.0142857,0.013587;TNSCOPE_MNV_TUMOR_ADs=590|10,900|10;TNSCOPE_MNV_TUMOR_AFs=0.0115875,0.010989;TNSCOPE_MNV_VARS=2_158637135_GAAA_G,2_158637139_A_G
The fields added here are:

  • TNSCOPE_MNV_VARS=2_158637135_GAAA_G,2_158637139_A_G
  • TNSCOPE_MNV_TUMOR_AFs=0.0115875,0.010989
  • TNSCOPE_MNV_NORMAL_AFs=0.0142857,0.013587
  • TNSCOPE_MNV_TUMOR_ADs=590|10,900|10
  • TNSCOPE_MNV_NORMAL_ADs=494|10,726|10
  • TNSCOPE_MNV_FILTERS=triallelic_site|in_normal,in_normal
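
For downstream use, these fields can be split back into one record per constituent variant. The following is a minimal sketch using only the standard library; the function name `parse_mnv_info` is hypothetical, and it assumes that "," never occurs inside a single variant's value, as in the example above.

```python
from typing import Dict, List

PREFIX = "TNSCOPE_MNV_"


def parse_mnv_info(info: str) -> List[Dict[str, str]]:
    """Split the TNSCOPE_MNV_* INFO tags into one dict per constituent variant."""
    # INFO is a ';'-separated list of KEY=VALUE items (flags without '=' are skipped).
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)

    # Keep only the MNV tags and split each one into its per-variant values.
    mnv_fields = {
        key[len(PREFIX):]: value.split(",")
        for key, value in fields.items()
        if key.startswith(PREFIX)
    }
    if not mnv_fields:
        return []

    # Every tag should hold exactly one value per merged variant.
    lengths = {len(values) for values in mnv_fields.values()}
    if len(lengths) != 1:
        raise ValueError("TNSCOPE_MNV_* tags differ in number of entries")

    count = lengths.pop()
    return [
        {key: values[i] for key, values in mnv_fields.items()}
        for i in range(count)
    ]


example = (
    "SOR=0.911;TLOD=7.615;"
    "TNSCOPE_MNV_FILTERS=triallelic_site|in_normal,in_normal;"
    "TNSCOPE_MNV_TUMOR_ADs=590|10,900|10;"
    "TNSCOPE_MNV_VARS=2_158637135_GAAA_G,2_158637139_A_G"
)
records = parse_mnv_info(example)
```

Here `records[0]` describes 2_158637135_GAAA_G and `records[1]` describes 2_158637139_A_G, each with its own FILTERS and TUMOR_ADs entry.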

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because: users discovered the issue initially, namely that there were 2 representations of the same variant. It was then decided by the bioinformatician, in discussion with colleagues, that an MNV is the best way to represent the variants so that VEP can create the most accurate protein effect prediction.
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Validation criteria

Validation criteria to be added to validation report PR: https://github.com/Clinical-Genomics/validations/pull/285

Version specific criteria

In the VCF of any TGA case (SNV.somatic.[case-id].tnscope.research.normalised.vcf.gz), the following criteria are met:

  • Created MNVs can be found
  • AF, DP, FILTER and CHROM_POS_REF_ALT from the constituent SNVs / InDels used to construct the MNV can be found in the MNV INFO field
  • Merged MNVs with filter MNV_CONFLICTING_FILTERS exist, and constituent variants do not all have matched normal filters set (in_normal/germline_risk)
  • MNV exists with filters: in_normal,germline_risk
  • Merged TNscope MNV can be merged with VarDict MNV
  • Merged TNscope MNV can be seen in Scout

Important

One of the validation checkboxes below needs to be checked

  • Added version specific validation criteria to validation report
  • Changes validated in standard sections: [validation-section]
  • Validation criteria not necessary

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Validation criteria
    • Completed the validation criteria section of the template.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Validation criteria
    • The author has completed the validation criteria section of the template
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

@mathiasbio mathiasbio linked an issue Jan 20, 2025 that may be closed by this pull request
@mathiasbio mathiasbio linked an issue Jan 20, 2025 that may be closed by this pull request
3 tasks

codecov bot commented Jan 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.50%. Comparing base (7d529e6) to head (9d7d71e).
Report is 40 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1524      +/-   ##
===========================================
+ Coverage    99.48%   99.50%   +0.02%     
===========================================
  Files           40       40              
  Lines         1932     2036     +104     
===========================================
+ Hits          1922     2026     +104     
  Misses          10       10              
Flag Coverage Δ
unittests 99.50% <100.00%> (+0.02%) ⬆️

@mathiasbio mathiasbio marked this pull request as ready for review February 3, 2025 10:36
@mathiasbio mathiasbio requested a review from a team as a code owner February 3, 2025 10:36
@khurrammaqbool
Collaborator

khurrammaqbool commented Feb 4, 2025

Here is the summary of my comments as we discussed.

  1. It would be nice to see statistics that show the effect of merging the variants into MNVs. This method affects TGA, TO and TN cases, so please provide results from at least one representative case, with at least the following basic statistics:
    i. Number of SNVs and number of indels before merging
    ii. Number of SNVs and number of indels after merging
    iii. Number of merged variants
    iv. The ratio of affected variants to the total.
    v. The number and ratio of clinically relevant variants affected by the method.

  2. It is important for the interpretation that the filter, AF, DP along with any additional information relevant for the interpretation, e.g. QUAL, should be retained for each variant merged as MNV. I suggest the following format in the INFO field of the merged VCF for this purpose:

NEW_INFO_TAG:INFO_VAR1|ADDITIONAL_INFO_VAR1,INFO_VAR2|ADDITIONAL_INFO_VAR2; 

To keep the VCF format compatible, the original INFO tags should be preserved, and a new tag may be added that holds comma-separated values in the same order as the merged variants, with as many entries as there are merged variants, so that it can be parsed as a list.
3. Scout may be able to show the additional information from the new VCF tag of the merged variants as hover or in the variant view page.
4. The PR separates out the current raw VCF and moves the method to an intermediate VCF file that is not delivered; this should be mentioned somewhere in the documentation.
6. It is nice that the documentation is updated for the newly added VCF tags, but it may also be relevant to describe all the old and the new VCF tags, to avoid confusion and to make it possible to tell whether a tag originates from TNscope or from the method in this PR.
7. It will be nice to see some examples of the INFO fields for the merged variants and the variants that are merged.
8. Add a test that runs bcftools on the merged VCF file and performs filtering, to show that the file is compatible.
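
The statistics requested in point 1 could be gathered with a small script along these lines. This is a sketch under stated assumptions, not part of the PR: it reads uncompressed VCF text lines, classifies records only by REF/ALT length, and counts a record as merged when it carries the TNSCOPE_MNV_VARS tag. The toy records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def classify(ref: str, alt: str) -> str:
    """Classify a variant by allele lengths only (a simplification)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    # Complex merged records with unequal lengths also end up here.
    return "INDEL"


def count_variants(vcf_lines: Iterable[str]) -> Counter:
    """Count variant types, plus records that were created by merging."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        ref, alts, info = cols[3], cols[4], cols[7]
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
        if "TNSCOPE_MNV_VARS=" in info:
            counts["merged"] += 1
    return counts


header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
before = header + [
    "1\t100\t.\tA\tG\t.\tPASS\tECNT=1",
    "1\t101\t.\tC\tT\t.\tPASS\tECNT=1",
    "1\t200\t.\tGAA\tG\t.\tPASS\tECNT=1",
]
after = header + [
    "1\t100\t.\tAC\tGT\t.\tPASS\tTNSCOPE_MNV_VARS=1_100_A_G,1_101_C_T",
    "1\t200\t.\tGAA\tG\t.\tPASS\tECNT=1",
]
stats_before = count_variants(before)
stats_after = count_variants(after)
```

Comparing `stats_before` with `stats_after` gives the counts before and after merging; the ratio of affected variants follows by dividing the number of constituent variants listed in TNSCOPE_MNV_VARS by the total before merging.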

Collaborator

@khurrammaqbool khurrammaqbool left a comment

👍 🥇 This is nice work. As I mentioned in the discussion, we should keep the merged file non-redundant, i.e. all or most of the information from the variants to be merged should be retained in the merged variant, and the additional lines representing the original variants should be removed. This will also eliminate the need to keep the intermediate files and will help with the interpretation.

BALSAMIC/assets/scripts/merge_mnp.py Show resolved Hide resolved
BALSAMIC/assets/scripts/merge_mnp.py Show resolved Hide resolved
@mathiasbio
Collaborator Author

Thanks for the review @khurrammaqbool ! I think I have addressed all your comments and feedback : )

Regarding the failing Docker containers, I will test them all after the PR has been approved, in case I need to make further changes to the code that would force me to restart the Docker builds.

Collaborator

@khurrammaqbool khurrammaqbool left a comment

👍 some additional minor comments.

BALSAMIC/assets/scripts/merge_mnp.py Show resolved Hide resolved
BALSAMIC/assets/scripts/merge_mnp.py Show resolved Hide resolved
BALSAMIC/assets/scripts/merge_mnp.py Outdated Show resolved Hide resolved
BALSAMIC/assets/scripts/merge_mnp.py Show resolved Hide resolved

sonarqubecloud bot commented Feb 6, 2025

@mathiasbio mathiasbio merged commit 0d3afc8 into develop Feb 7, 2025
21 of 26 checks passed
@mathiasbio mathiasbio deleted the create_tnscope_mnvs branch February 7, 2025 16:00
@mathiasbio
Collaborator Author

Only the somalier container failed, I will open a PR to fix it!

Labels
None yet
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

[User Story] Preprocess TNscope to create MNVs
[Bug] Merging of different variants VarDict and TNscope
2 participants