Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds a new spot-based decoding method, the CheckAll decoder, based on the method described here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/). It is capable of detecting several times more true targets from seqFISH image data than current spot-based methods in starFISH (PerRoundMaxChannel) as barcodes are not restricted to exactly matching spots or only nearest neighbor spots and because it tries to put together barcodes based on spots from every round instead of a single arbitrary anchor round. It is also capable of utilizing error correction rounds in the codebook which current starFISH methods do not consider.
Summary of algorithm:
Inputs:
spots - starFISH SpotFindingResults object
codebook - starFISH Codebook object
filter_rounds - Number of rounds that a barcode must be identified in to pass filters
error_rounds - Number of error-correction rounds built into the codebook (ie number of rounds that can be dropped from a barcode and still be able to uniquely match to a single target in the codebook)
Tests of the CheckAll decoder vs starFISH's PerRoundMaxChannel method (w/ nearest neighbor trace building strategy) show improved performance with the CheckAll decoder. All the following tests used seqfISH image data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/.
Note PRMC NN = PerRoundMaxChannel Nearest Neighbor
The x-axis for each of the above images marks the value for the search radius parameter used in either decoding method (the distance that spots can be from a reference spot and be allowed to form a potential barcode). It is marked in increments of increasing symmetric neighborhood size (in 3D). The left figure shows the total number of decoded transcripts that are assigned to a cell for each method (note: for the CheckAll decoder this includes partial barcodes (codes that did not use all rounds in decoding) which the PerRoundMaxChannel method does not consider). Depending on the search radius, there is as much as a 442% increase in the total number of decoded barcodes for the CheckAll decoder vs PerRoundMaxChannel.
To assess the accuracy of either decoding method, I used orthologous smFISH data that was available from the same samples for several dozen of the same genes as were probed in the seqFISH experiment. Using this data, I calculated the Pearson correlation coefficient for the correlation between the smFISH data and the results from decoding the seqFISH data with either method (note: because the targets in this dataset were introns (see paper) the values that were correlated were the calculated burst frequencies for each gene (how often/fast transcription is cycled on/off) instead of counts). The results of this are shown in the center figure above with the right-hand figure showing the same data but zoomed out to a 0-1 range. The starFISH PerRoundMaxChannel method does achieve a higher accuracy using this test but it is not significant and comes at the cost of detecting far fewer barcodes. (Note: missing values on lower end of x-axis are due to not having enough results to calculate the burst frequency of the transcripts).
Unlike current starFISH methods, the CheckAll decoder is capable of taking advantage of error correction rounds built into the codebook. As an example, say a experiment is designed with a codebook that has 5 rounds, but the codes are designed in such a way that only any 4 of those rounds are needed to uniquely match a barcode to a target, the additional round would be considered an error correction round because you may be able to uniquely identify a barcode as a specific target with only 4 rounds, but if you can also use that fifth round you can be extra confident that the spot combination making up a barcode is correct. This method is based on a previous pull request made by a colleague of mine (ctcisar#1).
The above figures show similar results to the first figure except the results of the CheckAll decoder have been split between barcodes that were made using spots in all rounds (error correction) and those that only had a partial match (no correction). Even without considering error correction, the CheckAll decoder detects as much as 181% more barcodes than the PerRoundMaxChannel method. The smFISH correlation are as expected with error corrected barcodes achieving a higher correlation score with the smFISH data than those that were not corrected. Whether a barcode in the final DecodedIntensityTable uses an error correction round or not can be extracted from the new "rounds_used" field which shows the number of rounds used to make a barcode for each barcode in the table. This allows easy separation of data into high and lower confidence calls. Additionally, the distance field of the DecodedIntensityTable is no longer based on the intensity of the spots in each barcode but is the value that is calculated for the sum of variances of the list of spatial coordinates for each spot in a barcode. This can also be used as a filter in many cases as barcodes made of more tightly clustered spots may be more likely to be true targets.
The major downside to the CheckAll decoder is it's speed. This is no surprise, as it is searching the entire possible barcode space for every spot from all rounds instead of just nearest neighbors to spots in a single round, but the possible barcode space can become quite large as search radius increases which can significantly increase run times. In order to address this, I've added the ability to multi-thread the program and run multiple chunks simultaneously in parallel using the python module ray, though even with this added parallelization, runtimes for CheckAll are much higher than for PerRoundMaxChannel. The above figure shows the runtime in minutes for the CheckAll decoder (using 16 threads) vs PerRoundMaxChannel with nearest neighbors (note: the seqFISH dataset used here is among the larger that are available at 5 rounds, 12 channels, and over 10,000 barcodes in the codebook so for most other seqFISH datasets I expect runtimes will be considerably less than what is shown here, unfortunately I did not have access to another suitable seqFISH dataset to test on). Ongoing work is being done to optimize the method and bring runtimes down. I was unable to figure out how to correctly add ray to the requirements file so that will still need to be done.