-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SVAnnotate: Functional annotations for SVs called by GATK-SV #7431
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start! I have mostly stylistic comments at this point.
It does seem like there are parallels among the annotateX
methods resulting in some repeated code. You could think about loading some of that logic into an abstract annotator class. For each SV type you could then have a subclass that implements more specific logic. That is, implement a common public String annotate()
function that looks a lot like your annotateDUP()
and annotateINV()
methods but calls abstract functions getConsequenceSpanning()
, getConsequenceSingleBreakend()
, getConsequenceDoubleBreakendInExon()
, getConsequenceDoubleBreakendInUtr()
, etc. for each case. Not necessary, but something to think about as this becomes more complex.
|
||
@Argument( | ||
fullName="proteinCodingGTF", | ||
shortName="P", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically don't use short names for non-standard arguments, just so the commands are easier to interpret at a glance.
private VariantContextWriter vcfWriter = null; | ||
|
||
private OverlapDetector<GencodeGtfGeneFeature> gtfOverlapDetector; | ||
|
||
private final Set<String> MSVExonOverlapClassifications = new HashSet<>(Arrays.asList(GATKSVVCFConstants.LOF, GATKSVVCFConstants.INT_EXON_DUP, GATKSVVCFConstants.DUP_PARTIAL, GATKSVVCFConstants.PARTIAL_EXON_DUP, GATKSVVCFConstants.COPY_GAIN)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can delete the blank lines between these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use Sets.newHashSet()
|
||
@Override | ||
public void onTraversalStart() { | ||
FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final …
throughout
List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>(); | ||
proteinCodingGTFSource.forEach(gtfFeaturesList::add); // TODO: faster method? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FeatureDataSource
implements Iterable
, you can do this in one line with Google's common Lists
class like so:
List<GencodeGtfGeneFeature> gtfFeaturesList = Lists.newArrayList(proteinCodingGTFSource);
FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile); | ||
List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>(); | ||
proteinCodingGTFSource.forEach(gtfFeaturesList::add); // TODO: faster method? | ||
gtfOverlapDetector = OverlapDetector.create(gtfFeaturesList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not declare the gtfFeaturesList
since it's only used once, and just put the newArrayList()
function in this line, for the sake of code economy.
if (!variantConsequenceDict.containsKey(key)) { | ||
variantConsequenceDict.put(key, new HashSet<>()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can replace with variantConsequenceDict.putIfAbsent()
private String annotateINS(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) { | ||
return annotateDEL(variantInterval, gtfTranscript); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to have a single function annotateDeletionOrInsertion
variantConsequenceDict.get(key).add(value); | ||
} | ||
|
||
private String annotateDEL(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would spell out the type i.e. annotateDeletion
if (!variantInterval.getContig().equals(featureInterval.getContig())) { | ||
return false; | ||
} else { | ||
return variantInterval.getStart() <= featureInterval.getStart() && variantInterval.getEnd() >= featureInterval.getEnd(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just use Locatable
's contains()
method
} | ||
} | ||
|
||
protected final static int countBreakpointsInsideFeature(SimpleInterval variantInterval, SimpleInterval featureInterval) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll want countBreakendsInsideFeature
- see https://bioconductor.org/packages/devel/bioc/vignettes/StructuralVariantAnnotation/inst/doc/vignettes.html#Representing_structural_variants_in_VCF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of trivial comments from me, but this looks fine. You can merge from my point of view. A couple of more general suggestions, though, which you can feel free to ignore:
Have you carefully considered the off-by-one errors that can occur because of the differing standards for representing intervals (0-based vs. 1-based coordinates and open ended vs. closed)? I believe that the IntervalTree class on which the OverlapDetector is based makes some assumptions about intervals being half-open in testing for overlapping intervals. Do all of your input files adhere to the same standard? (Apologies if you have unit tests for this--I didn't examine them carefully.)
The OverlapDetector class is super ugly, IMHO. (But kudos for finding existing code to reuse.) You could eliminate it, and save a ton of memory, by reorganizing this tool to walk through the various sources of annotations as you walk through the VCF, rather than reading them all into memory at the beginning of operations. Alternatively, a simpler to implement strategy would be to go chromosome by chromosome: Every time you stumble into a new contig in the VCF, you build IntervalTrees for just that contig for each of your annotation sources, discarding the IntervalTrees for the previous contig. Doesn't save quite as much memory, but it's simpler. We can discuss if you're interested and these brief couple of sentences don't make the idea clear.
return consequence; | ||
} | ||
|
||
private void annotateTranscript(final SimpleInterval variantInterval, final StructuralVariantAnnotationType svType, final GencodeGtfTranscriptFeature transcript, final Map<String, Set<String>> variantConsequenceDict) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these lines are awfully long. The GATK coding standard says something about line lengths, but it's frequently ignored. Nonetheless, it would be easier to read if you wrapped some of these super-long lines.
import com.google.common.collect.Lists; | ||
import htsjdk.samtools.SAMSequenceDictionary; | ||
import htsjdk.samtools.SAMSequenceRecord; | ||
import htsjdk.samtools.util.Locatable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quite a few unused imports that could be cleaned up.
protected static void annotateNearestTranscriptionStartSite(final SimpleInterval variantInterval, final Map<String, Set<String>> variantConsequenceDict, SVIntervalTree<String> transcriptionStartSiteTree, int maxContigLength, int variantContigID) { | ||
// TODO: keep all nearest TSS for dispersed CPX / CTX or choose closest? | ||
// TODO: will start < end ever? Shouldn't at this point in the pipeline | ||
SVIntervalTree.Entry<String> nearestBefore = transcriptionStartSiteTree.max(new SVInterval(variantContigID, variantInterval.getStart(), variantInterval.getEnd())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not super important here--I don't think it will have much of an impact on performance--but just in general best not to repeat operations unnecessarily: Pull out the creation of the new SVInterval into a local variable so you don't repeat it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Now that I think it over, I believe you want to take the min on (contig, start, start) and the max on (contig, end, end) rather than on (contig, start, end) as you're currently doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the interval used - (start,end) or (start,start) & (end,end) - doesn't matter here because it's already been determined that the variant doesn't overlap any of the features, so the min on (start,start) should be the same as the min on (start,end), etc. Unless there's a performance advantage or a nuance in behavior that I'm missing re: using (start,start) and (end,end)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think you're right that if there are no overlappers you'll get what you expect.
The subtlety is that [1000, 1500) is less than [1000, 2000), so if you queried for max on [1000, 2000) and [1000, 1500) was in the tree, you'd get that as the max, which sometimes surprises people who think that the max will necessarily be disjoint from the query interval. But you don't have that problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of trivial things that I had meant to mention but forgot.
return count; | ||
} | ||
|
||
protected static boolean variantOverlapsFeature(final SimpleInterval variantInterval, final SimpleInterval featureInterval) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this method is unnecessary (and is, in fact, causing you to create an extra SimpleInterval). I think you can just call variantInterval.overlaps(someFeature), rather than variantOverlapsFeature(variantInterval, new SimpleInterval(someFeature)).
// TODO: return consequence instead? | ||
} | ||
|
||
private StructuralVariantAnnotationType getSVType(final VariantContext variant) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to use the SVTYPE info field rather than the ALT allele? Or does that not handle all the cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using built-in methods to access SVTYPE from a VariantContext class, we ran into issues because StructuralVariantType as defined in htsjdk doesn't include CPX or CTX types. Rather than reimplementing a way to access SVTYPE while circumventing that check, it was suggested to use the ALT field because SVTYPE is going to be deprecated in VCF v4.4 anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. I remember that discussion now. But you could use getAttributeAsString(VCConstants.SVTYPE, null) rather than getStructuralVariantType() if it simplified your life. I'm just worried that when we get to INS:ME:ALU that things will break.
c80b15a
to
3c89b0c
Compare
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java
Outdated
Show resolved
Hide resolved
final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID ); | ||
final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID ); | ||
// if at least one result is valid, keep one with shorter distance | ||
if (!(beforeInvalid && afterInvalid)) { | ||
// if result is invalid, set distance to longest contig length so that other TSS will be kept | ||
final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval); | ||
final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found this a little easier to follow by avoiding double-negatives:
final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID ); | |
final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID ); | |
// if at least one result is valid, keep one with shorter distance | |
if (!(beforeInvalid && afterInvalid)) { | |
// if result is invalid, set distance to longest contig length so that other TSS will be kept | |
final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval); | |
final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval()); | |
final boolean beforeValid = nearestBefore != null && nearestBefore.getInterval().getContig() == variantContigID; | |
final boolean afterValid = nearestAfter != null && nearestAfter.getInterval().getContig() == variantContigID; | |
// only update if at least one TSS is valid | |
if (beforeValid || afterValid) { | |
// set distance to closest valid TSS | |
final int distanceBefore = beforeValid ? nearestBefore.getInterval().gapLen(svInterval) : maxContigLength; | |
final int distanceAfter = afterValid ? svInterval.gapLen(nearestAfter.getInterval()) : maxContigLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also got rid of maxContigLength
and replaced with Integer.MAX_VALUE
src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/SVUtils.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateIntegrationTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @epiercehoffman ! I have a few finishing touches to suggest here:
- Use of
static
methods - this is somewhat subjective. IMO, static methods are good for small "helper" functions, particularly when exposed (public), and functions that do not access class members. I've suggested a few functions that I think are more appropriately non-static. It shouldn't be too much trouble to modify your tests this way. - Using
GTFIntervalsTreeContainer
to hold the trees would also cut down on lines of code, which is always great. I would move the container class into the engine and make itpublic
with proper getter methods. The trees don't seem to be accessed that often, it shouldn't be too burdensome to call the getter methods.
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java
Outdated
Show resolved
Hide resolved
* add TSS_DUP annotation category * fix TSS and promoter inference and overlap determination * handle 2nd breakpoint for BNDs in GATK-SV VCF format * handle BND representation of CPX and CTX events for spec-compliant VCF format * add option to annotate BNDs under a specified length as DEL or DUP if applicable * style edits * add unit tests for TSS, promoter, annotation of all SV types, and add toy GTF file for testing
* handle dDUP events: add INS interval * set INS interval to CHR:POS-POS+1 and ignore END * fixes to BND handling
* add tool docs * rename command line args * add unit tests for ClosedSVInterval * add unit test for SVInterval.toSimpleInterval
3e0b041
to
e714c7e
Compare
Add
SVAnnotate
for functional annotation of SV VCFs from GATK-SV (after preprocessing to correct END keys, etc.). Posting as a draft for pre-review of work-in-progress code.