-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated seqFISH Decoding Method: CheckAll Decoder #1978
Conversation
@nickeener this is really cool to see- thanks for the contribution! I'll take a look at the test/CI situation and see if I can help |
@berl Any updates on this? It appears that it is erroring out trying to download the sphinx-bootstrap-theme dependency on Python 3.9 but is otherwise passing all the initial tests. |
@nickeener yeah, this has been fixed in add this line to the |
@berl Running into this error after I add the
Any advice on this? |
@nickeener did you run
but the error is in line 103 where it checks for a virtual environment. can you create a virtual environment with python=3.9 (using Also, what's your OS and |
@berl Yes I am in the top level starfish directory when I run My OS is Would you be able to share the new correctly built requirements files with me so I could just replace the old versions with them? |
@nickeener I suspect you previously tried to repin requirements, and there are stale files left over from the attempt. Can you remove $STARFISH/ .REQUIREMENTS.txt-env ( |
@ttung That was exactly what was wrong thanks! Was able to run |
I'm cloned your changes into #1984. I have it mostly working. |
@nickeener I cherry picked most of 8752555 into my branch and fired off a build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickeener nice work! I noticed a few things to fix and then I think we should merge.
@ttung thanks for your help with this! My review was pretty basic but after the edits above, I'm inclined to merge to add this functionality to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the fixes!
This is an update to my previous PR for a new seqFISH decoding method. We discovered some problems with the method in regards to the false positive rate and have spent the past several months improving it to it's current state and think it is now fit to be published.
So recapping the previous PR, we found that starFISH's seqFISH decoding (PerRoundMaxChannel with NEAREST_NEIGHBOR or EXACT_MATCH TraceBuildingStrategy) was finding far fewer targets in our seqFISH datasets than we were expecting from previous results and the reason for this is the limited search space within which starFISH looks for barcodes. For both TraceBuildingStrategy's, spots in different rounds are only connected into barcodes if they are spatially close to a spot in an arbitrary anchor round which does not utilize the inherent error correction that can be done by comparing the barcodes found using each round as an anchor. Additionally, the EXACT_MATCH TraceBuildingStrategy requires spots in different rounds to be in the exact same pixel location in order to connected into barcodes while NEAREST_NEIGHBOR requires that they be nearest neighbors, which are both extremely sensitive to spot drift between rounds, requiring precise registration that isn't always possible. The PerRoundMaxChannel decoder also lacks the ability to find error-corrected barcodes (in codebook where all codes have at least a hamming distance of 2 each code can still be uniquely identified by a barcode that is missing one round).
To address this, I attempted to write a seqFISH decoding method that emulates the one used by the lab where seqFISH originated at Caltech. In this method, the spots in all rounds are separately used as anchors to search to spots within a search radius of the anchor spots. So for each spot, all the possible barcodes that could constructed using the spots in other rounds within the search radius are considered, which is a much larger search space than when only considering exact matches or nearest neighbors. Each anchor spot is then assigned a "best" barcode out of its total possible set (it "checks all" of the possible barcodes) based on the spatial variance and intensities of the spots that make it up and whether it has a match to codebook, though the order of those steps will vary.
This larger search space requires a few extra tricks to keep false positive calls down. One of these is a filter that removes barcodes that were not found as best for a certain number of the spots/rounds that make it up (usually one less than the total round number) which the Caltech group calls the seed filter. So if we had a 5 round barcode and that barcode was chosen as the best choice for only 3 of the spots that make it up, it would not pass the filter. I've also introduced a parameter I call strictness that sets a maximum number of possible barcodes a given spot is allowed to have before it is dropped as ambiguous. Barcodes in spot dense regions tend to be difficult to call accurately so it is useful to be able to restrict that if necessary. Another trick I found helpful was to run decoding in multiple nested stages that start off with stricter parameters which are loosened as decoding progresses. This includes the search radius (starts off with radius 0 and increments up to the user-specified value), the number of rounds that can be omitted from the barcode to make a match (only looks for full codes first then looks for partials) and the max allowed score (based on spatial variance and intensity of spots). Between each decoding, the spots that were found to be in decoded barcodes that pass all filters are removed from the original spot set before decoding the remaining spots. This allows you to call high confidence barcodes first and remove their spots, making calling adjacent barcodes easier. Running the whole decoding process repeatedly became quite time consuming so I have added in multiprocessing capability to allow users to speed things up using Python's standard multiprocessing library.
In addition to the filters and incremental decoding I also take advantage of two slightly different decoding algorithms with different precision/recall properties. In one method, which I call "filter-first", once all the possible barcodes for each spot are assembled, the "best" barcode for each spot is chosen based on a scoring function that uses the spatial variance of the spots and their intensities. The chosen barcodes are then matched to the codebook and those that have no match are dropped. The other method, called "decode-first", instead matches all the possible barcodes for each spot to the codebook first and then if there are multiple matches the best is chosen based on the distance score. The "filter-first" method tends to be higher accuracy but return fewer decoded targets (high precision/low recall) while the "decode-first" method finds more mRNA targets but at a cost to accuracy (low precision/high recall). In keeping with my incremental decoding strategy I described earlier, the first set of decoding is done using the high accuracy "filter-first" method followed by decodings done using the low accuracy version.
Below is a description of the required inputs and a step by step explanation of the algorithm.
Inputs:
Steps
(2-7 is repeated, first using the filter-first method for each incremental radius and then using the decode-first method for each radius)
One final detail worth explaining here is the mode parameter. This parameter can take three values: "high", "med", or "low" and essentially simplifies a number of parameter choices into these three presets including the seed filter value at the end of each decoding step, the max allowed score allowed for a barcode, and the strictness value for the filter-first and decode-first methods. "high" corresponds to high accuracy (lower recall), "low" corresponds to low accuracy (higher recall), and "med" is a balance between the two.
To evaluate the results of our decodings and compare them with starFISH methods, we used two different kinds of qc measures. The first is correlation with orthogonal RNA counts (either smFISH or RNAseq) and the other is the proportion of false positive calls measured by placing "blank" codes in the codebook that adhere to the hamming distance rules of the codebook but don't actually correspond to any true mRNA. This false positive metric was lacking from my previous PR and when we ran it on the previous method our false positive values were quite high. Depending on the method used to calculate the false positive rate and the specific dataset, high false positive rates can be a problem with seqFISH data but the mode parameter allows you to easily adjust the decoding parameters to get the best results.
Below are some figures showing results from 3 different seqFISH datasets comparing the starFISH PerRoundMaxChannel + NEAREST_NEIGHBORS method with each mode choice for the CheckAll decoder. Each red/blue bar pair along the x-axis represent a single cell, FOV, or image chunk (don't have segmented cells yet for this dataset so it is broken into equal sized chunks), while the height of the blue bar represents the total number of on-target (real) barcodes found in that cell/fov/chunk and the red bar height is the off-target count (blank) with the black bar showing the median on-target count. Also printed on each graph is the overall false positive (FP) rate and the pearson correlation with smFISH or RNAseq where available. For each dataset the same set of spots was used for each run.
Intron seqFISH
RNA SPOTs seqFISH
Mouse Atlas seqFISH
From these, you can see that this updated decoder is capable of finding more targets than starFISH, with lower false positive rates, and higher smFISH/RNAseq correlations. The one area where the starFISH methods have us beat is speed. Being more thorough is costly so while starFISH can decode a set of spots in 2-3 minutes on my system, this new decoder can take up to 1.5 hours even when using 16 different simultaneous processes. That time is from the intron seqFISH dataset which is quite large (5 x 12 x 29 x 2048 x 2048) so the run times are more reasonable for smaller datasets.
Running the "make all" tests for my current repo throws a mypy error but in code that I haven't touched (like the ImageStack class code) so hopefully we can work together to get those kinks worked out so we can add this to starFISH so others might use it.