
Add mismatches cache to improve performance on ultra-long reads #4651

Merged · 2 commits into main on Nov 13, 2024

Conversation

@cmdcolin (Collaborator) commented Nov 13, 2024

Here is a summary of the problem in Google Slides form.

When each block renders, the renderer (e.g. PileupRenderer) requests features using the data adapter's getFeatures method.

Currently, the BAM library (@gmod/bam) has a feature cache to avoid re-requesting from the remote file on every access, but JBrowse 2 re-creates new instances of e.g. BamSlightlyLazyFeature from scratch each time.

As a result, computations such as the "mismatches" on BAM and CRAM features are duplicated rather than cached across blocks. This particularly affects long reads.

This PR caches the mismatches object via a simple LRU cache that avoids re-creating an already-created feature. It is not an "async LRU", so it is easy to reason about.
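For illustration, a synchronous LRU cache along these lines can be sketched with a plain `Map` (which preserves insertion order). This is a hedged sketch of the general technique, not the actual class used in the PR:

```typescript
// Minimal synchronous LRU cache sketch. A Map iterates in insertion
// order, so the first key is always the least recently used.
class SimpleLRU<K, V> {
  private map = new Map<K, V>()
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key)
    if (value !== undefined) {
      // refresh recency by deleting and re-inserting the entry
      this.map.delete(key)
      this.map.set(key, value)
    }
    return value
  }

  set(key: K, value: V) {
    if (this.map.has(key)) {
      this.map.delete(key)
    }
    this.map.set(key, value)
    // evict the least-recently-used entry when over capacity
    if (this.map.size > this.maxSize) {
      const oldest = this.map.keys().next().value as K
      this.map.delete(oldest)
    }
  }
}
```

Because every operation completes synchronously, there is no risk of two in-flight async fills racing on the same key, which is the "easy to reason about" property mentioned above.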


@cmdcolin (Collaborator, Author) commented:

Alternative approaches considered: this PR caches the record on the JBrowse wrapper around the @gmod/bam BamFeature class. We could have cached on the @gmod/bam BamFeature class directly, but the JBrowse wrapper does some setup, such as integrating the feature with the underlying reference sequence, that @gmod/bam "has no knowledge of" and that is quite important for calculating mismatches in many cases (e.g. when there is no MD tag).
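The wrapper-level approach can be sketched as a cache consulted before constructing the wrapper. All names here (`WrappedFeature`, `wrapRecord`, `featureCache`) are hypothetical stand-ins, not the actual JBrowse API:

```typescript
// Hypothetical sketch: the adapter caches the wrapper by a per-read id,
// so the wrapper -- which holds reference-sequence context that
// @gmod/bam itself lacks -- is only constructed on a cache miss.
interface BamRecord {
  id: string
  cigar: string
}

class WrappedFeature {
  constructor(
    public record: BamRecord,
    // wrapper-level setup (e.g. reference sequence for mismatch
    // calculation when there is no MD tag) that the raw record lacks
    public refSeq?: string,
  ) {}
}

const featureCache = new Map<string, WrappedFeature>()

function wrapRecord(record: BamRecord, refSeq?: string): WrappedFeature {
  let feat = featureCache.get(record.id)
  if (!feat) {
    feat = new WrappedFeature(record, refSeq)
    featureCache.set(record.id, feat)
  }
  return feat
}
```

Returning the same wrapper object across blocks means any lazily computed, memoized properties on it (such as the mismatches array) survive between renders.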

@cmdcolin (Collaborator, Author) commented Nov 13, 2024

Note that this PR is similar to the "cacheMismatches" feature added late in the game to JBrowse 1 (cacheMismatches).

@cmdcolin cmdcolin changed the title Add LRU feature cache on BamAdapter and CramAdapter to allow re-using mismatches object across block requests Add cache to BamAdapter and CramAdapter to re-use mismatches across block requests Nov 13, 2024
@cmdcolin (Collaborator, Author) commented:

This PR has basically no effect on short reads, or even medium-length long reads, but it does have an effect on ultra-long reads.

60x nanopore ultralong reads

Ultra-long reads average around 300kb. Repeatedly parsing the large CIGAR strings of these reads to find mismatches is slow.
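To see why this is costly, here is a plain sketch of CIGAR parsing (illustrative only, not the @gmod/bam implementation). A ~300kb read can carry tens of thousands of CIGAR operations, and without a cache this whole walk is repeated for every block that renders the read:

```typescript
// Parse a CIGAR string like "10M2I5D" into [length, op] pairs.
// Each operation must be visited to locate mismatches, so the cost
// grows with read length -- hence the benefit of computing it once.
function parseCigar(cigar: string): [number, string][] {
  const ops: [number, string][] = []
  const re = /(\d+)([MIDNSHP=X])/g
  let m: RegExpExecArray | null
  while ((m = re.exec(cigar)) !== null) {
    ops.push([Number(m[1]), m[2]])
  }
  return ops
}
```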

live demo

I don't have a formal benchmark, but zooming in and out on the ultra-long reads takes roughly 4-8 seconds on main, versus about 1 second with this branch.

Notably, on other datasets, including medium-length PacBio reads and short reads, there is basically no effect. This suggests the cache could be reserved exclusively for ultra-long reads, but that may not be necessary.

@cmdcolin cmdcolin changed the title Add cache to BamAdapter and CramAdapter to re-use mismatches across block requests Add mismatches cache to BamAdapter and CramAdapter to improve performance on ultra-long reads Nov 13, 2024
@cmdcolin cmdcolin changed the title Add mismatches cache to BamAdapter and CramAdapter to improve performance on ultra-long reads Add mismatches cache to improve performance on ultra-long reads Nov 13, 2024
@cmdcolin cmdcolin merged commit 4400e43 into main Nov 13, 2024
4 checks passed
@cmdcolin cmdcolin deleted the feature_cache branch November 13, 2024 06:54
@cmdcolin cmdcolin added the enhancement New feature or request label Nov 13, 2024