zUMIs detects fewer cells than CellRanger #128

Closed
johnchamberlin opened this issue Aug 1, 2019 · 8 comments

Comments

@johnchamberlin

Hello,

I've tested zUMIs on a few datasets from the 10X Chromium website:

https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/nuclei_2k
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/hgmm_1k_v3

For the PBMC 1k dataset, CellRanger reports 1222 cells, while zUMIs reports only 606.
For the PBMC 10k dataset, 11769 vs 10894.
For the mouse nuclei dataset, 2352 vs 1789.
For the mixed-species dataset, 1046 vs 940.
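
To put the four comparisons on a common scale, here is a minimal sketch that computes the ratio of the zUMIs cell count to the CellRanger cell count for each dataset. The numbers are the ones quoted above; the dictionary keys and the function name `count_ratio` are only illustrative labels. Note that this compares counts only and says nothing about whether the same barcodes were called.

```python
# Cell counts quoted in the post above: (CellRanger, zUMIs).
counts = {
    "pbmc_1k_v3": (1222, 606),
    "pbmc_10k_v3": (11769, 10894),
    "nuclei_2k": (2352, 1789),
    "hgmm_1k_v3": (1046, 940),
}


def count_ratio(cellranger: int, zumis: int) -> float:
    """Ratio of the zUMIs cell count to the CellRanger cell count."""
    return zumis / cellranger


for name, (cr, zu) in counts.items():
    print(f"{name}: {zu}/{cr} = {count_ratio(cr, zu):.1%}")
```

The gap is largest for the smallest PBMC dataset (roughly half the CellRanger count) and much smaller for the 10k PBMC and mixed-species datasets.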

I am using the default 'nReadsperCell: 100', which seems extremely conservative. I am providing the barcode whitelist from CellRanger and also using automatic barcode detection. UMI filtering is left at the defaults: num_bases: 1 and phred: 20.

Have you encountered this pattern before? Perhaps I should turn off automatic barcode detection? It seems to take the intersection of the whitelist barcodes and the automatically detected ones, but the output isn't very detailed for that step.
I've pasted one .yaml file below.

Thanks,
John

###########################################
#Welcome to zUMIs
#below, please fill the mandatory inputs
#We expect full paths for all files.
###########################################

#define a project name that will be used to name output files
project: pbmc1k

#Sequencing File Inputs:
#For each input file, make one list object & define path and barcode ranges
#base definition vocabulary: BC(n) UMI(n) cDNA(n).
#Barcode range definition needs to account for all ranges. You can give several comma-separated ranges for BC & UMI sequences, eg. BC(1-6,20-26)
#you can specify between 1 and 4 input files
sequence_files:
  file1:
    name: /scratch/general/lustre/u1212967/singlecell_premrna/pbmc_1k/pbmc_1k_v3_fastqs/R1.fastq.gz
    base_definition:
      - BC(1-16)
      - UMI(17-28)
  file2:
    name: /scratch/general/lustre/u1212967/singlecell_premrna/pbmc_1k/pbmc_1k_v3_fastqs/R2.fastq.gz
    base_definition:
      - cDNA(1-91)

#reference genome setup
reference:
  STAR_index: /scratch/general/lustre/u1212967/sc_refs/zumis_refs/GRCh38_STARidx_noGTF_2.7.1a/
  GTF_file: /scratch/general/lustre/u1212967/sc_refs/GRCh38/genes/genes.gtf
  additional_files: #Optional parameter. It is possible to give additional reference sequences here, eg ERCC.fa
  additional_STAR_params: #Optional parameter. you may add custom mapping parameters to STAR here

#output
out_dir: /scratch/general/lustre/u1212967/zumi_1k_pbmc/

###########################################
#below, you may optionally change default parameters
###########################################

#number of processors to use
num_threads: 32
mem_limit: 96 #Memory limit in Gigabytes, null meaning unlimited RAM usage.

#barcode & UMI filtering options
#number of bases under the base quality cutoff that should be filtered out.
#Phred score base-cutoff for quality control.
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20

#Options for Barcode handling
#You can give either number of top barcodes to use or give an annotation of cell barcodes.
#If you leave both barcode_num and barcode_file empty, zUMIs will perform automatic cell barcode selection for you!
barcodes:
  barcode_num: null
  barcode_file: /scratch/general/lustre/u1212967/sc_refs/3M-february-2018.txt
  automatic: yes #Give yes/no to this option. If the cell barcodes should be detected automatically. If the barcode file is given in combination with automatic barcode detection, the list of given barcodes will be used as whitelist.
  BarcodeBinning: 1 #Hamming distance binning of close cell barcode sequences.
  nReadsperCell: 100 #Keep only the cell barcodes with atleast n number of reads.

#Options related to counting of reads towards expression profiles
counting_opts:
  introns: yes #can be set to no for exon-only counting.
  downsampling: 0 #Number of reads to downsample to. This value can be a fixed number of reads (e.g. 10000) or a desired range (e.g. 10000-20000) Barcodes with less than will not be reported. 0 means adaptive downsampling. Default: 0.
  strand: 1 #Is the library stranded? 0 = unstranded, 1 = positively stranded, 2 = negatively stranded
  Ham_Dist: 1 #Hamming distance collapsing of UMI sequences.
  velocyto: no #Would you like velocyto to do counting of intron-exon spanning reads
  primaryHit: no #Do you want to count the primary Hits of multimapping reads towards gene expression levels?
  twoPass: yes #perform basic STAR twoPass mapping

#produce stats files and plots?
make_stats: yes

#Start zUMIs from stage. Possible TEXT(Filtering, Mapping, Counting, Summarising). Default: Filtering.
which_Stage: Filtering

#define dependencies program paths
samtools_exec: samtools #samtools executable
Rscript_exec: Rscript #Rscript executable
STAR_exec: STAR #STAR executable
pigz_exec: pigz #pigz executable

#below, fqfilter will add a read_layout flag defining SE or PE
zUMIs_directory: /uufs/chpc.utah.edu/common/home/u1212967/software/bin/zUMIs
read_layout: SE

@cziegenhain
Collaborator

Hi John,

Nice that you quantified this a bit!
I think you don't need to pay too much attention to the nReadsperCell: 100 parameter - this just controls which sequences get immediately discarded before the actual barcode detection starts.

In general, I agree with your observation. In my experience, CellRanger is very inclusive with extremely small cells that border on being background noise. I am attaching an (ATAC-seq) CellRanger report to illustrate my point:
[Image: newplot]

My recommendation would also be to look closely at the read-barcode distribution plots that zUMIs should output (https://github.com/sdparekh/zUMIs/blob/master/ExampleData/zUMIs_output/stats/Example.detected_cells.pdf).
I would be happy to discuss making the barcode detection more user-tunable if you think that's a useful feature for our pipeline!

Best,
Christoph
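
For readers following along, here is a minimal sketch of the two pre-detection steps discussed above: discarding barcodes with fewer than `nReadsperCell` reads, and restricting candidates to a whitelist. This is not the zUMIs code; the function name `candidate_barcodes` and the toy barcodes are made up for illustration, and the actual automatic detection that runs afterwards is not shown.

```python
from collections import Counter


def candidate_barcodes(read_barcodes, whitelist, min_reads):
    """Return {barcode: read count} for barcodes that are on the whitelist
    and have at least min_reads reads (hypothetical helper, not zUMIs code)."""
    reads_per_barcode = Counter(read_barcodes)
    return {
        bc: n
        for bc, n in reads_per_barcode.items()
        if n >= min_reads and bc in whitelist
    }


# Toy data: hypothetical 4-base barcodes, not taken from any real run.
reads = ["AAAC"] * 5 + ["CCGT"] * 2 + ["GGTA"] * 4
whitelist = {"AAAC", "CCGT"}

kept = candidate_barcodes(reads, whitelist, min_reads=3)
print(kept)  # {'AAAC': 5}
```

In this toy case `CCGT` is dropped for having too few reads and `GGTA` for not being on the whitelist, so a low read threshold on its own would not explain a large difference in the final number of cells.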

@johnchamberlin
Author

Thanks Christoph. I took a second look at your paper and see that the cell filtering is more complex than I realized. One issue I did notice is that the project.detected_cells.pdf isn't generated if I provide a whitelist, even when automatic filtering is also used.

@cziegenhain
Collaborator

Thanks for letting me know - that must have escaped some if-condition then.
I will push an update in the next few days to make sure the detected_cells.pdf also appears when using automatic+whitelist!

@johnchamberlin
Author

Thanks Christoph. I just have one more quick question that doesn't deserve a separate thread: how can I optimize the speed of zUMIs with larger datasets? I'm using slurm on a university HPC cluster and so far everything seems to be working on a single, 32 core node, with 196 gb of memory, but I wonder if it could be completed more quickly? For example, is there any reason for me to use multiple nodes? Or is more threads the only answer?

The 10K PBMC dataset ran out of memory on a 128 gb node so my expectation is that if I want to process datasets much larger than this, then 196 gb will also become insufficient. There are some 384 gb nodes available as well but not as many.

Cheers,
John

@cziegenhain
Collaborator

Hi John,

Here it gets a tad more complicated:

I have observed that the UMI hamming distance collapse does tend to exceed the set RAM limit.
Unfortunately it's a bit hard for me to see the pattern currently, as it really varies with the number of reads per cell, memory and CPU threads.
Anyway, it usually helps to set a more stringent RAM limit, e.g. going from the 96 GB you set down to 60. zUMIs should then break the work into fewer cells at a time, easing the memory load! That way you shouldn't need bigger instances for larger datasets per se. Not sure which zUMIs version you have here, but I changed the chunking behavior two days ago to be better on RAM...

Runtime-wise, I have a pretty big improvement in the hamming distance collapse of close UMIs coded up. Depending on the dataset, it could speed that step up by 3x. I'm currently testing it and will hopefully push it to GitHub tomorrow!

Hope that helps,
Christoph
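
As background on the step mentioned above, here is a minimal sketch of collapsing UMIs that lie within a given Hamming distance of each other (compare `Ham_Dist: 1` in the config). It uses a simple greedy strategy chosen for illustration and should not be read as the zUMIs implementation; the function names and the toy 4-base UMIs are assumptions.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_dist=1):
    """Greedy collapse: visit UMIs from most to least abundant and merge each
    into the first already-kept UMI within max_dist, otherwise keep it."""
    counts = Counter(umis)
    kept = {}
    # Sort by descending count, then alphabetically so ties are deterministic.
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for parent in kept:
            if hamming(umi, parent) <= max_dist:
                kept[parent] += n
                break
        else:
            kept[umi] = n
    return kept


# Toy UMIs for a single cell and gene (hypothetical sequences).
umis = ["AACG"] * 10 + ["AACT"] * 2 + ["GGTT"] * 3 + ["GGTA"]
print(collapse_umis(umis))  # {'AACG': 12, 'GGTT': 4}
```

A pairwise approach like this compares each distinct UMI against the ones already kept, so the work grows quickly with the number of distinct UMIs per cell and gene, which is at least consistent with this step being sensitive to reads per cell.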

@johnchamberlin
Author

Great! Thanks for your continued efforts. I will update to the new version and try to set mem_limit to about 2/3 of the node RAM limit. Will zUMIs tell me if I set the mem_limit too low?
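
A quick sketch of the 2/3 rule of thumb for the node sizes mentioned in this thread. The helper name `suggested_mem_limit` is hypothetical, and the 2/3 fraction is just the heuristic proposed here, not a documented zUMIs recommendation.

```python
def suggested_mem_limit(node_ram_gb: int, num: int = 2, den: int = 3) -> int:
    """Whole-gigabyte mem_limit as num/den of node RAM, rounded down.
    Integer arithmetic avoids floating-point rounding surprises."""
    return node_ram_gb * num // den


for ram in (128, 196, 384):
    print(f"node RAM {ram} GB -> mem_limit: {suggested_mem_limit(ram)}")
```

For the 196 GB node this gives a mem_limit of 130.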

@cziegenhain
Collaborator

No, unfortunately zUMIs will not tell you if the memory limit is set too low.
I just pushed the latest optimizations; the barcode detection plot should also appear now.

@cziegenhain
Collaborator

Seems like this is solved, closing the issue.
