
New seqFISH Decoding Method #1960

Closed
wants to merge 17 commits into from

Conversation

nickeener
Contributor

This PR adds a new spot-based decoding method, the CheckAll decoder, based on the method described here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/). It can detect several times more true targets from seqFISH image data than starFISH's current spot-based methods (PerRoundMaxChannel), because barcodes are not restricted to exactly matching spots or only nearest-neighbor spots, and because it builds barcodes from spots in every round instead of from a single arbitrary anchor round. It can also take advantage of error-correction rounds in the codebook, which current starFISH methods do not consider.

Summary of algorithm:

Inputs:
spots - starFISH SpotFindingResults object
codebook - starFISH Codebook object
filter_rounds - Number of rounds that a barcode must be identified in to pass filters
error_rounds - Number of error-correction rounds built into the codebook (i.e., the number of rounds that can be dropped from a barcode while still uniquely matching a single target in the codebook)

1. For each spot in each round, find all neighbors in other rounds that are within the search radius
2. For each spot in each round, build all possible full length barcodes based on the channel labels of the spot's 
neighbors and itself
3. Drop barcodes that don't have a matching target in the codebook
4. Choose the "best" barcode among each spot's possible target-matching barcodes by calculating, for each candidate, the sum of variances of the spatial coordinates of its spots and keeping the candidate with the minimum value (if there is a tie, all tied candidates are dropped as ambiguous). Each spot is assigned a "best" barcode in this way (see the sketch after this list).
    sum( var(x1,x2,...,xn), var(y1,y2,...,yn), var(z1,z2,...,zn) ), where n = # spots in barcode
5. Only keep barcodes/targets that were found as "best" in a certain number of the rounds (determined by filter_rounds
parameter)
6. If a specific spot is used in more than one of the remaining barcodes, drop the barcode with the higher spatial variance among its spots (this ensures each spot is only used once)
(End here if number of error_rounds = 0)
7. Remove all spots used in decoded targets that passed the previous filtering steps from the original set of spots
8. Rerun steps 2-5 for barcodes that use less than the full set of rounds for codebook matching (how many rounds can be
dropped determined by error_rounds parameter)
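
For illustration only, here is a minimal sketch of the spatial-variance score used in step 4 (the helper name here is hypothetical; the PR's actual implementation lives in check_all_funcs.py):

```python
import numpy as np

def barcode_spatial_variance(coords):
    """Score a candidate barcode by the summed variance of its spots'
    z/y/x coordinates; lower means the spots are more tightly clustered.

    coords: array-like of shape (n_spots, 3) holding (z, y, x) per spot.
    """
    coords = np.asarray(coords, dtype=float)
    return float(coords.var(axis=0).sum())

# A tightly clustered spot combination scores lower than a spread-out one.
tight = [(0, 10.0, 10.0), (0, 10.5, 9.8), (1, 10.2, 10.1)]
loose = [(0, 10.0, 10.0), (0, 25.0, 3.0), (2, 4.0, 30.0)]
assert barcode_spatial_variance(tight) < barcode_spatial_variance(loose)
```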

Tests of the CheckAll decoder vs starFISH's PerRoundMaxChannel method (with the nearest neighbor trace-building strategy) show improved performance with the CheckAll decoder. All of the following tests used seqFISH image data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/.

[Figure: total decoded transcripts per cell (left) and smFISH correlation (center, right) as a function of search radius, for CheckAll vs PRMC NN]

Note PRMC NN = PerRoundMaxChannel Nearest Neighbor

The x-axis of each of the above figures shows the search radius parameter used by either decoding method (the distance a spot can be from a reference spot and still be allowed to form a potential barcode), marked in increments of increasing symmetric neighborhood size (in 3D). The left figure shows the total number of decoded transcripts assigned to a cell for each method (note: for the CheckAll decoder this includes partial barcodes, i.e. codes that did not use all rounds in decoding, which the PerRoundMaxChannel method does not consider). Depending on the search radius, the CheckAll decoder decodes up to 442% more barcodes than PerRoundMaxChannel.

To assess the accuracy of either decoding method, I used orthologous smFISH data that was available from the same samples for several dozen of the same genes probed in the seqFISH experiment. Using this data, I calculated the Pearson correlation coefficient between the smFISH data and the results of decoding the seqFISH data with either method (note: because the targets in this dataset were introns (see paper), the values that were correlated were the calculated burst frequencies for each gene, i.e. how often/fast transcription is cycled on and off, rather than counts). The results are shown in the center figure above, with the right-hand figure showing the same data zoomed out to a 0-1 range. The starFISH PerRoundMaxChannel method does achieve a higher accuracy on this test, but the difference is not significant and comes at the cost of detecting far fewer barcodes. (Note: missing values at the lower end of the x-axis are due to not having enough results to calculate burst frequencies for those transcripts.)

Unlike current starFISH methods, the CheckAll decoder is capable of taking advantage of error-correction rounds built into the codebook. As an example, say an experiment is designed with a 5-round codebook, but the codes are designed in such a way that any 4 of those rounds are enough to uniquely match a barcode to a target. The additional round is an error-correction round: you may be able to uniquely identify a barcode as a specific target with only 4 rounds, but if the fifth round also matches you can be extra confident that the spot combination making up the barcode is correct. This method is based on a previous pull request made by a colleague of mine (ctcisar#1).
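
To make the idea concrete, here is a hypothetical sketch of matching a barcode against a codebook when up to one round may be dropped (not the PR's actual code; the codebook here is just a plain dict rather than a starFISH Codebook):

```python
from itertools import combinations

def match_with_error_round(barcode, codebook, max_dropped=1):
    """Return the unique target whose code matches `barcode` after dropping
    up to `max_dropped` rounds, or None if the match is ambiguous or absent.

    barcode:  tuple of channel labels, one per round, e.g. (2, 0, 3, 1, 2)
    codebook: dict mapping target name -> full-length code tuple
    """
    n_rounds = len(barcode)
    for n_drop in range(max_dropped + 1):
        hits = set()
        for dropped in combinations(range(n_rounds), n_drop):
            kept = [r for r in range(n_rounds) if r not in dropped]
            hits |= {
                target for target, code in codebook.items()
                if all(code[r] == barcode[r] for r in kept)
            }
        if len(hits) == 1:  # unique match at this error level
            return hits.pop()
    return None

codebook = {"geneA": (2, 0, 3, 1, 2), "geneB": (1, 1, 0, 3, 2)}
print(match_with_error_round((2, 0, 3, 1, 0), codebook))  # geneA, matched on 4 of 5 rounds
```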

[Figure: decoded barcode counts and smFISH correlation, with CheckAll results split into error-corrected (all rounds used) and partial (uncorrected) barcodes]

The above figures show results similar to the first figure, except the CheckAll results have been split between barcodes that were made using spots in all rounds (error correction) and those that only had a partial match (no correction). Even without considering error correction, the CheckAll decoder detects up to 181% more barcodes than the PerRoundMaxChannel method. The smFISH correlations are as expected, with error-corrected barcodes achieving a higher correlation with the smFISH data than those that were not corrected. Whether a barcode in the final DecodedIntensityTable uses an error-correction round can be read from the new "rounds_used" field, which records the number of rounds used to build each barcode in the table. This allows the data to be easily split into higher- and lower-confidence calls. Additionally, the distance field of the DecodedIntensityTable is no longer based on the intensity of the spots in each barcode; it is now the sum of variances of the spatial coordinates of the spots in each barcode. This can also be used as a filter in many cases, as barcodes made of more tightly clustered spots may be more likely to be true targets.
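
As an illustration of how that might be used downstream (assuming, for this sketch, that "rounds_used" and "distance" are exposed as per-feature coordinates on the DecodedIntensityTable and that the round dimension is named "r"):

```python
import xarray as xr

def split_by_confidence(decoded: xr.DataArray, max_distance: float = 1.0):
    """Split decoded barcodes into higher- and lower-confidence calls.

    Assumes `decoded` is the DecodedIntensityTable returned by the CheckAll
    decoder, with per-feature "rounds_used" and "distance" coordinates and a
    round dimension named "r" (assumptions for this sketch).
    """
    n_rounds = decoded.sizes["r"]
    # Barcodes built from spots in every round (error-corrected, higher confidence).
    full = decoded.where(decoded.rounds_used == n_rounds, drop=True)
    # Barcodes that only had a partial match (lower confidence).
    partial = decoded.where(decoded.rounds_used < n_rounds, drop=True)
    # Optionally tighten the high-confidence set using the new spatial-variance
    # based "distance" field introduced by this PR.
    tight = full.where(full.distance < max_distance, drop=True)
    return tight, partial
```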

[Figure: runtime in minutes for the CheckAll decoder (16 threads) vs PerRoundMaxChannel with nearest neighbors]

The major downside to the CheckAll decoder is its speed. This is no surprise, as it searches the entire possible barcode space for every spot from all rounds instead of just the nearest neighbors of spots in a single round, and the possible barcode space can become quite large as the search radius increases, which significantly increases run times. To address this, I've added the ability to multi-thread the program and run multiple chunks simultaneously in parallel using the python module ray, though even with this added parallelization, runtimes for CheckAll are much higher than for PerRoundMaxChannel. The above figure shows the runtime in minutes for the CheckAll decoder (using 16 threads) vs PerRoundMaxChannel with nearest neighbors. (Note: the seqFISH dataset used here is among the larger available, at 5 rounds, 12 channels, and over 10,000 barcodes in the codebook, so for most other seqFISH datasets I expect runtimes will be considerably less than what is shown here; unfortunately I did not have access to another suitable seqFISH dataset to test on.) Ongoing work is being done to optimize the method and bring runtimes down. I was unable to figure out how to correctly add ray to the requirements file, so that will still need to be done.
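
For reference, the chunk-and-dispatch pattern with ray looks roughly like this (a toy stand-in for the per-chunk work, not the decoder's actual functions):

```python
import numpy as np
import ray

ray.init(num_cpus=4)  # the benchmarks above used 16 threads

@ray.remote
def score_chunk(coords_chunk):
    # Toy stand-in for the per-chunk decoding work: score each candidate
    # barcode (an (n_spots, 3) coordinate array) by its summed spatial variance.
    return [float(np.asarray(c).var(axis=0).sum()) for c in coords_chunk]

# Split the candidate barcodes into chunks and evaluate them in parallel.
candidates = [np.random.rand(5, 3) for _ in range(1000)]
chunks = [candidates[i::4] for i in range(4)]
scores = ray.get([score_chunk.remote(chunk) for chunk in chunks])
ray.shutdown()
```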

@nickeener
Contributor Author

Currently it passes the flake8 and mypy tests, but the fast-test produces an error that I'm not sure how to address:

pytest -v -n 8 --cov starfish -m 'not (slow or napari)'
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: -n --cov starfish
  inifile: None
  rootdir: /home/nick/projects/spacetx/decoder/starfish

There is also an error running the make -C docs html command that appears to be a versioning problem with Sphinx:

Running Sphinx v3.5.3
/home/nick/projects/spacetx/decoder/starfish/docs/source/conf.py:160: RemovedInSphinx40Warning: The app.add_stylesheet() is deprecated. Please use app.add_css_file() instead.
  app.add_stylesheet("my-styles.css")
generating gallery...

Configuration error:
Unknown key(s) in sphinx_gallery_conf:
'download_section_examples', did you mean 'download_all_examples'?
make[1]: *** [Makefile:25: html] Error 2
make[1]: Leaving directory '/home/nick/projects/spacetx/decoder/starfish/docs'
make: *** [Makefile:72: docs-html] Error 2

I could use some help addressing these problems so that we can get these new features added to starFISH. Thanks!

@berl
Collaborator

berl commented Oct 13, 2021

@nickeener this is likely to be related to the antique Python used in Travis CI right now. Can you give a pip freeze for your current environment where you're running into these problems?

Also, I'm excited to see some new decoding algorithms - thanks for contributing!

@nickeener

This comment has been minimized.

@berl
Collaborator

berl commented Oct 13, 2021

@nickeener maybe conda list will be more interpretable for me, sorry! I'm looking for the Python version where you ran into the problems - we at least know it's not Python 3.6 because of the scikit-image version! @njmei has some work on removing Python 3.6 support, but I'm not sure it's ready for prime time.

@nickeener
Contributor Author

@berl I believe it is v3.8.8 of Python. So I should try it from an environment that uses 3.6?

@berl
Collaborator

berl commented Oct 14, 2021

@nickeener making a new environment with 3.6 will get you to the current CI testing configuration so the existing tests should run. If your code breaks in 3.6, that will be another strong motivator for getting starfish up to full 3.8 support (and probably dropping 3.6 support). In the meantime, you could also try to put together some tests for your new decoding functionality.

I'm also curious to learn more about ray - thanks for the introduction!

@nickeener
Contributor Author

@berl I believe I found the source of the pytest error. I had to install pytest-cov in order for it to recognize the --cov flag and pytest-xdist for it to recognize the -n flag. It now runs the tests when I execute make all, but a number of them fail, even when I run it on a fresh clone of the main starfish repo. Changing to a Python 3.6 environment also does not appear to make a difference. I still need to write some tests for my new additions (I'm new to writing official unit tests like this), but how should I handle these errors in scripts I haven't even touched?

@njmei
Collaborator

njmei commented Oct 16, 2021

Hi @nickeener, due to pretty large changes in upstream dependencies (Xarray, Numpy, Pandas, scikit-image, etc.) in the last ~1 year or so, a lot of tests and code in this repository ended up needing maintenance. I've put together a PR to address all those problems, so it would be best if you could wait a bit until those changes can be merged into the main starfish repository.

See: #1963

@neuromusic
Collaborator

Looks like this PR is still waiting on travis checks... do you know how to get it to build against the new CI workflow @njmei? Does it just need a new commit?

@ttung
Collaborator

ttung commented Oct 17, 2021

Just needs a rebase.

@nickeener
Contributor Author

@njmei no problem. Just let me know whenever it is ready so I can pull the changes and retry the tests.

@ttung is rebasing something I need to do? Sorry, not super familiar with github stuff.

@ttung
Collaborator

ttung commented Oct 18, 2021

I'm not sure how your local repo is configured, so it's hard to tell you exactly what commands to run. I copied your changes into #1964 though.

Typically, one would have two remotes configured, and then you'd need to pull the spacetx/starfish remote, rebase your changes on top of that, and then push to your remote (nickeener/starfish).

@njmei
Collaborator

njmei commented Oct 19, 2021

@neuromusic It looks like this is what is preventing tests from running. I don't have the permissions:

[Screenshot: the workflow-approval prompt blocking the test run]

@neuromusic
Collaborator

thanks for the ping @njmei! Approved.

@nickeener
Contributor Author

@neuromusic sorry, I forgot to include updates for one file in the previous commit, causing the linting to fail. It should work now. Could you please rerun?

@nickeener
Contributor Author

hmm.. I'm not getting these errors when I run make all. @neuromusic are there additional mypy plugins that are being used here besides the default? I had to install a number for flake8 and pytest that were not included in the default pip install so I assume this is similar.

@njmei
Collaborator

njmei commented Oct 19, 2021

> hmm.. I'm not getting these errors when I run make all. @neuromusic are there additional mypy plugins that are being used here besides the default? I had to install a number for flake8 and pytest that were not included in the default pip install so I assume this is similar.

@nickeener This is what gets run during the linting step:

In terms of how test dependencies are installed:

@nickeener
Contributor Author

nickeener commented Oct 19, 2021

@njmei Thank you so much, the make install-dev command was exactly what I needed and I was able to fix those linting errors. make all runs all the way through without errors again. @neuromusic fingers crossed, but it should pass linting and tests now. Could you please rerun it? Thanks!

@njmei
Collaborator

njmei commented Oct 19, 2021

@nickeener Now that you've been approved to run tests, they should just auto-run anytime you add a new commit.

@nickeener
Contributor Author

@njmei awesome thanks! That makes things simpler.

@nickeener
Contributor Author

@njmei there appears to be some issue with my approval status, as the tests have not auto-run after my latest commit and there doesn't appear to be any option to manually start the workflow. It still says "First-time contributors need a maintainer to approve running workflows."

Also, I've added the new python library ray to the REQUIREMENTS.txt file in the main starfish directory. Do I also need to add it to the requirements/REQUIREMENTS-CI.txt.in and requirements/REQUIREMENTS-CI.txt files?

@njmei
Collaborator

njmei commented Oct 20, 2021

@nickeener Hmmm.... yeah that's annoying that the test runner status looks like it reverted. Unfortunately, I'm a volunteer software engineer unaffiliated with CZI/spacetx so there's not much I can do on my end. You should be able to get the tests to run from your fork though, have you checked the actions menu in your fork of Starfish?

Regarding the PR itself:
I didn't realize that ray was a necessary dependency for this method. Adding new dependencies gives me some pause since starfish (to the best of my knowledge) is intended to work on Windows, mac, and *nix systems. Reading up a bit more on ray (https://docs.ray.io/en/latest/installation.html#windows-support) makes me wonder if the tests you currently have will play well with the Windows test runners...

There is also a question of maintenance: it looks like ray has its own set of complicated dependencies that for now seem okay (though maybe the scipy requirement could clash with what starfish now has pinned), but given that there is no dedicated engineering support for starfish, what I worry about the most is a repeat of #1963 half a year or a year down the line.

Is there a way to replace ray with multiprocessing?

Keep in mind though, these are just my opinions and the actual owners/contributors (@neuromusic) will probably need to weigh in...

@neuromusic
Collaborator

sorry about the trouble here with the workflow approvals @nickeener ... thanks to @njmei, we are now using Actions for CI, but I didn't realize this was something that needed to be configured. I found the setting and changed the permissions, so hopefully 🤞 I don't need to "approve" your runs anymore.

as for adding the ray requirement... in addition to REQUIREMENTS.txt, to get it to run on CI, you'll also need to run make requirements/REQUIREMENTS-CI.txt to update that file which is used to pin the requirements for CI

the maintenance question is definitely a concern... we're currently in a (very) slim "maintenance" mode, but I'm exploring ways to bring on more support for maintenance. as for ray specifically... I agree with @njmei that if multiprocessing will do the job, that's preferable since we already use multiprocessing extensively.

however, it's not clear to me that that's even necessary... I'm not familiar with ray, but after a quick glance at the docs, it's actually not clear to me how ray is being used here... I see it initialized and shut down, but I don't see it used otherwise (no decorators etc.)? is it necessary?

@nickeener
Contributor Author

@neuromusic ray is being initialized in the CheckAll class's run function and is then used by several of the functions that are imported from the check_all_funcs.py script. There are several points where it's advantageous to chunk the data and run it in parallel. I should be able to replace it with multiprocessing, though it may take a speed hit: ray was designed as a kind of optimized version of multiprocessing, so I assume it will be faster, but I haven't tested this. That isn't as much of a problem as it was a week ago, though, as I've made some significant speed improvements to the algorithm since making this PR and the graph showing the run times is now out of date. It might take me some time to adjust everything, as I'm not familiar with multiprocessing.

@neuromusic
Collaborator

> used by several of the functions that are imported from the check_all_funcs.py script

aha! that's why my CTRL-F failed me :P

thank you!!

@nickeener
Contributor Author

@neuromusic I've updated this PR to replace ray with multiprocessing. Passes all checks now.
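
The chunked steps are now dispatched with a standard multiprocessing.Pool, roughly like this (a simplified toy sketch of the chunk-and-dispatch pattern, not the exact decoder code):

```python
from multiprocessing import Pool

import numpy as np

def score_chunk(coords_chunk):
    # Toy stand-in for the per-chunk decoding work: score each candidate
    # barcode (an (n_spots, 3) coordinate array) by its summed spatial variance.
    return [float(np.asarray(c).var(axis=0).sum()) for c in coords_chunk]

if __name__ == "__main__":
    candidates = [np.random.rand(5, 3) for _ in range(1000)]
    chunks = [candidates[i::16] for i in range(16)]
    with Pool(processes=16) as pool:
        scores = pool.map(score_chunk, chunks)
```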

@nickeener
Contributor Author

@neuromusic Hey just wanted to give a quick overview of some of the updates I've made to this method since I originally made this PR.

I changed the way it chooses between spot combinations that use the same spot. The previous method simply chose the code that had the minimum spatial variance for its spots; the updated method treats it as a maximum independent set problem, where the goal is to find the set of spot combinations that uses each spot only once while using as many of the spots as possible. This is an NP-complete problem, but I was able to leverage the spatial variance of the spots in each possible combination to make it fast. This resulted in a ~30% increase in the total number of mRNA targets that can be identified (in my test data set) while also slightly increasing accuracy (by correlation with smFISH results).
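
As an illustration only, one greedy, variance-ordered approximation of that spot-disjoint selection could look like this (my own sketch and naming, not necessarily the exact approach used in the PR):

```python
def select_barcodes(candidates):
    """Greedy spot-disjoint selection of candidate barcodes.

    candidates: list of (spatial_variance, spot_ids) pairs, where spot_ids is a
    set of globally unique spot identifiers used by that candidate barcode.
    Accepts candidates in order of increasing spatial variance, skipping any
    that reuse an already-claimed spot, so each spot is used at most once.
    """
    used_spots = set()
    accepted = []
    for variance, spot_ids in sorted(candidates, key=lambda c: c[0]):
        if used_spots.isdisjoint(spot_ids):
            accepted.append((variance, spot_ids))
            used_spots |= spot_ids
    return accepted

# The first two candidates share no spots and are kept; the third reuses spot 2
# and is dropped.
print(select_barcodes([(0.1, {1, 2, 3}), (0.2, {4, 5, 6}), (0.3, {2, 7, 8})]))
```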

The following figure is similar to the one in my original post, but I've added a new line for the updated results. The left figure shows the number of transcripts identified by the old and updated methods compared to starFISH's nearest neighbor decoder, while the center and right figures show the accuracy by correlation with values obtained using smFISH (the center figure is just a zoomed-in version of the right figure).

[Figure: transcripts identified and smFISH correlation vs search radius for the old and updated CheckAll methods and the starFISH nearest neighbor decoder]

I've also made significant improvements to the run time and memory requirements of the method. This figure shows the run time improvements.

[Figure: run time comparison between the previous and updated versions of the CheckAll decoder]

The previous version took over 6 hours to run with a search radius of 2.45, while the current version takes just 51 minutes to do the same, while also using half as much memory (I don't have figures for memory, unfortunately). I also discovered that the server I've been running many of my tests on is somewhat dated and slow, so I tested running the decoder on my local system and saw another significant drop in run time, with that same run taking only 11 minutes. My local system isn't particularly powerful either, so I expect it would be even faster on a more modern server system.

The parallelization now uses the Python standard library multiprocessing module instead of ray. If you'd like me to make any other changes to get this approved, please let me know.

@nickeener
Contributor Author

It appears to be failing one of the tests now (Docker Smoketest). Are there recent updates that I need to pull to my fork?

@nickeener
Contributor Author

Replaced by #1978

nickeener closed this Jul 27, 2022