New seqFISH Decoding Method #1960
Conversation
Currently passes the flake8 and mypy tests but the fast-test produces an error that I'm not sure how to address:
There is also an error running the
I could use some help addressing these problems so that we can get these new features added to starFISH. Thanks!
@nickeener this is likely to be related to the antique python used in travis CI right now. can you give a pip freeze for your current environment where you're running into these problems? also, I'm excited to see some new decoding algorithms- thanks for contributing!
@nickeener maybe
@berl I believe it is v3.8.8 of Python. So I should try it from an environment that uses 3.6?
@nickeener making a new environment with 3.6 will get you to the current CI testing configuration so the existing tests should run. If your code breaks in 3.6, that will be another strong motivator for getting starfish up to full 3.8 support (and probably dropping 3.6 support). In the meantime, you could also try to put together some tests for your new decoding functionality. I'm also curious to learn more about
@berl I believe I found the source of the
Hi @nickeener due to pretty large changes in upstream dependencies (Xarray, Numpy, Pandas, scikit-image, etc...) in the last ~1 year or so, a lot of tests and code in this repository ended up needing maintenance. I've put together a PR to address all those problems so it would be best if you could wait a bit until those changes can be merged into this main starfish repository. See: #1963
looks like this PR is still waiting on travis checks... do you know how to get it to build against the new CI workflow @njmei ? does it just need a new commit?
Just needs a rebase.
I'm not sure how your local repo is configured, so it's hard to tell you exactly what commands to run. I copied your changes into #1964 though. Typically, one would have two remotes configured, and then you'd need to pull the spacetx/starfish remote, rebase your changes on top of that, and then push to your remote (nickeener/starfish).
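The two-remote rebase workflow described above might look something like the following, assuming the upstream remote is named `upstream` (spacetx/starfish), your fork is `origin`, and the default branch is `master`; the remote and branch names are illustrative and depend on your local configuration.

```shell
# Assumes two remotes: 'upstream' -> spacetx/starfish and
# 'origin' -> your fork. Names and branch ('master') are illustrative.
git remote add upstream https://github.com/spacetx/starfish.git  # one-time setup
git fetch upstream                       # get the latest upstream commits
git rebase upstream/master               # replay your branch on top of them
git push --force-with-lease origin HEAD  # update the PR branch on your fork
```

`--force-with-lease` is used because a rebase rewrites history; it refuses to overwrite the remote branch if someone else has pushed to it in the meantime.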
@neuromusic It looks like this is what is preventing tests from running. I don't have the permissions: |
thanks for the ping @njmei ! approved.
@neuromusic sorry forgot to include updates for one file in the previous commit, causing the linting to fail. Should work now. Could you please rerun?
hmm.. I'm not getting these errors when I run
@nickeener This is what gets run during the linting step:
In terms of how test dependencies are installed:
@njmei Thank you so much, the
@nickeener Now that you've been approved to run tests, they should just auto-run any time you add a new commit.
@njmei awesome, thanks! That makes things simpler.
@njmei there appears to be some issue with my approval status as the tests have not auto-run after my latest commit and there doesn't appear to be any option to manually start the workflow. It still says

Also, I've added the new python library
@nickeener Hmmm.... yeah that's annoying that the test runner status looks like it reverted. Unfortunately, I'm a volunteer software engineer unaffiliated with CZI/spacetx so there's not much I can do on my end. You should be able to get the tests to run from your fork though, have you checked the

Regarding the PR itself:

There is also a question of maintenance, it looks like

Is there a way to replace

Keep in mind though, these are just my opinions and the actual owners/contributors (@neuromusic) will probably need to weigh in...
sorry about the trouble here with the workflow approvals @nickeener ... thanks to @njmei, we are now using Actions for CI, but I didn't realize this was something that needed to be configured. I found the setting and changed the permissions and hopefully 🤞 I don't need to "approve" your runs anymore

as for adding the

the maintenance question is definitely a concern... we're currently in a (very) slim "maintenance" mode, but I'm exploring ways to bring on more support for maintenance.

as for

however, it's not clear to me that that's even necessary... I'm not familiar with
@neuromusic ray is being initialized in the
@neuromusic I've updated this PR to replace
@neuromusic Hey, just wanted to give a quick overview of some of the updates I've made to this method since I originally made this PR.

I changed the way it chooses between spot combinations that use the same spot. The previous method simply chose the code that had the minimum spatial variance for its spots; the updated method treats it as a maximum independent set problem, where the goal is to find the set of spot combinations that uses each spot only once while using as many of the spots as possible. This is an NP-complete problem, but I was able to leverage the spatial variance of the spots in each possible combination to make it fast. This resulted in a ~30% increase in the total number of mRNA targets that can be identified (in my test data set) while also slightly increasing accuracy (by correlation with smFISH results).

The following figure is similar to the one in my original post, but I've added a new line for the updated results. The left figure shows the number of transcripts identified by the old and updated methods compared to starFISH's nearest neighbor decoder, while the center and right figures show the accuracy by correlation with values obtained using smFISH (the center figure is just a zoomed-in version of the right figure).

I've also made significant improvements to the run time and memory requirements of the method. This figure shows the run time improvements. The previous version took over 6 hours to run using a search radius of 2.45, while the current version now takes just 51 minutes to do the same while also using half as much memory (I don't have figures for memory, unfortunately). I also discovered that the server I've previously been running many of my tests on is somewhat dated and slow, so I tested running the decoder on my local system and saw another significant drop in run times, with that same run taking only 11 minutes. My local system isn't particularly powerful either, so I expect it'd be even less on a more modern server system.
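The greedy, variance-guided selection described above can be sketched as follows. This is a toy illustration of the idea, not the PR's actual implementation: each candidate barcode is a set of spot IDs weighted by its spatial variance, low-variance candidates are accepted first, and any later candidate that reuses an already-claimed spot is discarded.

```python
# Toy sketch of resolving spot-sharing conflicts between candidate
# barcodes: greedily keep the lowest-spatial-variance candidates whose
# spots are not already used. A greedy approximation of the maximum
# independent set problem described in the PR comment above.
def select_barcodes(candidates):
    """candidates: list of (spatial_variance, frozenset_of_spot_ids)."""
    chosen, used_spots = [], set()
    for variance, spots in sorted(candidates, key=lambda c: c[0]):
        if used_spots.isdisjoint(spots):  # each spot used at most once
            chosen.append((variance, spots))
            used_spots |= spots
    return chosen

candidates = [
    (0.10, frozenset({1, 2, 3})),
    (0.25, frozenset({3, 4, 5})),  # conflicts with the first on spot 3
    (0.40, frozenset({4, 5, 6})),
]
picked = select_barcodes(candidates)
# the conflicting middle candidate is dropped; spots 1-3 and 4-6 survive
```

Sorting by variance is what makes the greedy pass cheap: tightly clustered (low-variance) combinations are the most likely to be real barcodes, so they get first claim on their spots.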
Multiprocessing now uses the python standard library module instead of ray. If you'd like me to make any other changes to get this approved, please let me know.
It appears to be failing one of the tests now (Docker Smoketest). Are there recent updates that I need to pull to my fork?
Replaced by #1978
This PR adds a new spot-based decoding method, the CheckAll decoder, based on the method described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/. It is capable of detecting several times more true targets from seqFISH image data than current spot-based methods in starFISH (PerRoundMaxChannel), because barcodes are not restricted to exactly matching spots or only nearest-neighbor spots, and because it tries to assemble barcodes from spots in every round instead of a single arbitrary anchor round. It is also capable of utilizing error-correction rounds in the codebook, which current starFISH methods do not consider.
Summary of algorithm:
Inputs:
spots - starFISH SpotFindingResults object
codebook - starFISH Codebook object
filter_rounds - Number of rounds that a barcode must be identified in to pass filters
error_rounds - Number of error-correction rounds built into the codebook (i.e., the number of rounds that can be dropped from a barcode while still uniquely matching a single target in the codebook)
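The filter_rounds input above can be illustrated with a toy check (names here are illustrative, not the starFISH API): a candidate barcode is kept only if spots from at least `filter_rounds` of the imaging rounds contribute to it.

```python
# Toy sketch of the filter_rounds input described above: a candidate
# barcode passes only if it draws spots from enough imaging rounds.
def passes_round_filter(barcode_rounds, filter_rounds):
    """barcode_rounds: set of round indices that contributed a spot."""
    return len(barcode_rounds) >= filter_rounds

# a 5-round experiment where one error-correction round may be dropped
assert passes_round_filter({0, 1, 2, 3, 4}, filter_rounds=4)  # full match
assert passes_round_filter({0, 1, 3, 4}, filter_rounds=4)     # round 2 dropped
assert not passes_round_filter({0, 2, 4}, filter_rounds=4)    # too few rounds
```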
Tests of the CheckAll decoder vs. starFISH's PerRoundMaxChannel method (with the nearest neighbor trace-building strategy) show improved performance with the CheckAll decoder. All of the following tests used seqFISH image data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/.
Note PRMC NN = PerRoundMaxChannel Nearest Neighbor
The x-axis of each of the above images marks the value of the search radius parameter used in either decoding method (the distance that spots can be from a reference spot and still be allowed to form a potential barcode). It is marked in increments of increasing symmetric neighborhood size (in 3D). The left figure shows the total number of decoded transcripts assigned to a cell for each method (note: for the CheckAll decoder this includes partial barcodes (codes that did not use all rounds in decoding), which the PerRoundMaxChannel method does not consider). Depending on the search radius, there is as much as a 442% increase in the total number of decoded barcodes for the CheckAll decoder vs. PerRoundMaxChannel.
To assess the accuracy of either decoding method, I used orthologous smFISH data available from the same samples for several dozen of the same genes probed in the seqFISH experiment. Using this data, I calculated the Pearson correlation coefficient between the smFISH data and the results of decoding the seqFISH data with either method (note: because the targets in this dataset were introns (see paper), the values correlated were the calculated burst frequencies for each gene (how often/fast transcription cycles on and off) instead of counts). The results are shown in the center figure above, with the right-hand figure showing the same data zoomed out to a 0-1 range. The starFISH PerRoundMaxChannel method does achieve higher accuracy on this test, but the difference is not significant and comes at the cost of detecting far fewer barcodes. (Note: missing values at the lower end of the x-axis are due to not having enough results to calculate the burst frequency of the transcripts.)
Unlike current starFISH methods, the CheckAll decoder is capable of taking advantage of error-correction rounds built into the codebook. As an example, say an experiment is designed with a codebook that has 5 rounds, but the codes are designed in such a way that any 4 of those rounds are enough to uniquely match a barcode to a target. The additional round is considered an error-correction round: you may be able to uniquely identify a barcode as a specific target with only 4 rounds, but if you can also match that fifth round, you can be extra confident that the spot combination making up the barcode is correct. This method is based on a previous pull request made by a colleague of mine (ctcisar#1).
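The error-correction idea above can be sketched as a toy decoder: with a 5-round codebook designed so that any 4 rounds uniquely identify a target, a barcode missing one round is decoded by checking every 4-round subsequence against the codebook. The codes below are made up for illustration and are not from a real codebook.

```python
# Toy illustration of decoding with one error-correction round: a
# barcode missing one round can still match uniquely on any 4 of 5
# rounds. Codebook values are illustrative only.
from itertools import combinations

codebook = {"geneA": (1, 2, 3, 4, 2), "geneB": (2, 1, 4, 3, 1)}

def decode_partial(observed, n_required=4):
    """observed: per-round channel values, None where a round is missing."""
    present = [r for r, v in enumerate(observed) if v is not None]
    matches = set()
    for rounds in combinations(present, n_required):
        for gene, code in codebook.items():
            if all(observed[r] == code[r] for r in rounds):
                matches.add(gene)
    # decode only if exactly one target is consistent with the barcode
    return matches.pop() if len(matches) == 1 else None

# round 2 dropped, but the remaining four rounds still match geneA uniquely
assert decode_partial((1, 2, None, 4, 2)) == "geneA"
```

A barcode that matches on all 5 rounds passes this same check trivially, which is what makes full-round matches the higher-confidence calls.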
The above figures show results similar to the first figure, except the results of the CheckAll decoder have been split between barcodes that were made using spots in all rounds (error correction) and those that only had a partial match (no correction). Even without considering error correction, the CheckAll decoder detects as many as 181% more barcodes than the PerRoundMaxChannel method. The smFISH correlations are as expected, with error-corrected barcodes achieving a higher correlation score with the smFISH data than those that were not corrected. Whether a barcode in the final DecodedIntensityTable uses an error-correction round can be read from the new "rounds_used" field, which gives the number of rounds used to make each barcode in the table. This allows easy separation of the data into higher- and lower-confidence calls. Additionally, the distance field of the DecodedIntensityTable is no longer based on the intensity of the spots in each barcode but is instead the sum of variances of the spatial coordinates of the spots in the barcode. This can also be used as a filter, as barcodes made of more tightly clustered spots may be more likely to be true targets.
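The post-decoding filtering described above might look like the following sketch. A real DecodedIntensityTable is an xarray-based structure; a plain list of dicts is used here only to keep the example self-contained, and the threshold value is illustrative.

```python
# Toy sketch of filtering decoded barcodes on the new "rounds_used"
# field and the variance-based "distance" described above.
decoded = [
    {"target": "geneA", "rounds_used": 5, "distance": 0.08},
    {"target": "geneB", "rounds_used": 4, "distance": 0.12},  # partial match
    {"target": "geneC", "rounds_used": 5, "distance": 0.90},  # loose cluster
]

# keep high-confidence calls: all 5 rounds used AND tightly clustered
# spots (low spatial variance); 0.5 is an arbitrary illustrative cutoff
high_conf = [d for d in decoded
             if d["rounds_used"] == 5 and d["distance"] < 0.5]
```

The same two fields can be relaxed independently, e.g. keeping 4-round barcodes but with a stricter distance cutoff.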
The major downside to the CheckAll decoder is its speed. This is no surprise, as it searches the entire possible barcode space for every spot from all rounds instead of just the nearest neighbors of spots in a single round, and the possible barcode space can become quite large as the search radius increases, which significantly increases run times. To address this, I've added the ability to multi-thread the program and run multiple chunks simultaneously in parallel using the python module ray, though even with this added parallelization, runtimes for CheckAll are much higher than for PerRoundMaxChannel. The above figure shows the runtime in minutes for the CheckAll decoder (using 16 threads) vs. PerRoundMaxChannel with nearest neighbors. (Note: the seqFISH dataset used here is among the larger ones available, at 5 rounds, 12 channels, and over 10,000 barcodes in the codebook, so for most other seqFISH datasets I expect runtimes will be considerably less than shown here; unfortunately I did not have access to another suitable seqFISH dataset to test on.) Ongoing work is being done to optimize the method and bring runtimes down. I was unable to figure out how to correctly add ray to the requirements file, so that will still need to be done.