zUMIs detects fewer cells than CellRanger #128
Comments
Hi John, Nice that you quantified this a bit! In general, I agree with your observation. In my experience, CellRanger is very inclusive of extremely small cells that border on being background noise. I am attaching an (ATAC-seq) CellRanger report to illustrate my point. My recommendation would also be to look closely at the read-barcode distribution plots that zUMIs should output ( https://github.com/sdparekh/zUMIs/blob/master/ExampleData/zUMIs_output/stats/Example.detected_cells.pdf ). Best,
Thanks Christoph. I took a second look at your paper and see that the cell filtering is more complex than I realized. One issue I did notice is that the project.detected_cells.pdf isn't generated if I provide a whitelist, even when automatic filtering is also used.
Thanks for letting me know - that must have escaped some if-condition then.
Thanks Christoph. I just have one more quick question that doesn't deserve a separate thread: how can I optimize the speed of zUMIs with larger datasets? I'm using Slurm on a university HPC cluster, and so far everything seems to be working on a single 32-core node with 196 GB of memory, but I wonder if it could be completed more quickly. For example, is there any reason for me to use multiple nodes? Or are more threads the only answer? The 10K PBMC dataset ran out of memory on a 128 GB node, so my expectation is that if I want to process datasets much larger than this, then 196 GB will also become insufficient. There are some 384 GB nodes available as well, but not as many. Cheers,
Hi John, Here it gets a tad more complicated: I have observed that the UMI Hamming distance collapse does tend to exceed the set RAM limit. Runtime-wise, I have coded up a pretty big improvement to the Hamming distance collapse of close UMIs. Depending on the dataset, it could speed that step up by 3x. I'm currently testing it and hope to push it to GitHub tomorrow! Hope that helps,
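For readers unfamiliar with the step discussed above, here is a minimal Python sketch of what a Hamming-distance collapse of close UMIs can look like. It is an illustration only, not the zUMIs implementation: the function names `hamming` and `collapse_umis` are hypothetical, and the greedy "merge into the most abundant neighbour" rule is an assumption chosen for simplicity.

```python
from __future__ import annotations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts: dict[str, int], max_dist: int = 1) -> dict[str, int]:
    """Greedy collapse (hypothetical rule, not the zUMIs algorithm).

    UMIs are visited from most to least abundant. Each UMI is merged into
    the first already-kept UMI within max_dist mismatches; otherwise it is
    kept as a new molecule.
    """
    kept: dict[str, int] = {}
    # Sort by descending count, then alphabetically so the result is deterministic.
    for umi, n in sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for parent in kept:
            if hamming(umi, parent) <= max_dist:
                kept[parent] += n
                break
        else:
            kept[umi] = n
    return kept


# Toy UMI read counts for one hypothetical cell/gene combination.
example = {"AAAA": 10, "AAAT": 2, "CCCC": 5, "CCCG": 1, "GGGG": 1}
print(collapse_umis(example, max_dist=1))
```

In this naive version every UMI is compared against all kept UMIs, so the number of comparisons grows roughly quadratically with the number of distinct UMIs in a group, which is one reason such a step can become costly on large datasets.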
Great! Thanks for your continued efforts. I will update to the new version and try setting mem_limit to about 2/3 of the node RAM limit. Will zUMIs tell me if I set the mem_limit too low?
No, unfortunately you will not get a message telling you to change the memory limit.
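The two-thirds rule of thumb mentioned above can be written down as a tiny helper. This is only a sketch of that heuristic from the thread; the function name `suggested_mem_limit_gb` is hypothetical, and the node sizes are the ones quoted in the question (128, 196 and 384 GB).

```python
def suggested_mem_limit_gb(node_ram_gb: int, num: int = 2, den: int = 3) -> int:
    """Return num/den of the node RAM, rounded down to whole gigabytes.

    Integer arithmetic is used so the result is exact (no float rounding).
    """
    if node_ram_gb <= 0:
        raise ValueError("node_ram_gb must be positive")
    if not 0 < num <= den:
        raise ValueError("num/den must be a fraction in (0, 1]")
    return node_ram_gb * num // den


# Node sizes mentioned in the thread.
for ram in (128, 196, 384):
    print(f"{ram} GB node -> mem_limit: {suggested_mem_limit_gb(ram)}")
```

Since zUMIs does not report when the limit is too low, the headroom left by such a fraction is the only safety margin; the right fraction for a given dataset is a judgment call.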
Seems like this is solved, closing the issue. |
Hello,
I've tested zUMIs on a few datasets from the 10X Chromium website:
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/nuclei_2k
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/hgmm_1k_v3
For the PBMC1K dataset, CellRanger reports 1222 cells, while zUMIs reports only 606
For the PBMC10K dataset, 11769 vs 10894
For the mouse nuclei dataset, 2352 vs 1789
For the mixed species dataset, 1046 vs 940
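To put the four comparisons on a common scale, the counts above can be turned into the fraction of CellRanger cells that zUMIs also reports. The snippet below is a small sketch using only the numbers quoted in this post; the dictionary keys and the function name `recovery_table` are illustrative labels.

```python
# (CellRanger cells, zUMIs cells) as quoted in the post.
counts = {
    "pbmc_1k_v3": (1222, 606),
    "pbmc_10k_v3": (11769, 10894),
    "nuclei_2k": (2352, 1789),
    "hgmm_1k_v3": (1046, 940),
}


def recovery_table(counts):
    """Return (dataset, missing cells, percent recovered) for each dataset."""
    rows = []
    for name, (cellranger, zumis) in counts.items():
        missing = cellranger - zumis
        percent = round(100 * zumis / cellranger, 1)
        rows.append((name, missing, percent))
    return rows


for name, missing, percent in recovery_table(counts):
    print(f"{name:12s} missing={missing:5d} recovered={percent:5.1f}%")
```

On these numbers the gap is largest for the smallest dataset (about half of the CellRanger cells for PBMC1K) and smallest for PBMC10K.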
I am using the default 'nReadsperCell: 100', which seems extremely conservative. I am providing the barcode whitelist from CellRanger and also using automatic barcode detection. UMI filtering is left as default: num_bases: 1 and phred: 20
Have you encountered this pattern before? Perhaps I should turn off automatic barcode detection? It seems to take the intersection of the whitelist barcodes and the automatically detected ones, but the output isn't very detailed for that step.
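The behaviour guessed at above (keep a barcode only if it passes the read threshold, is on the whitelist, and is also picked by automatic detection) can be sketched as a set intersection. This reflects the reading of the output described in this post, not confirmed zUMIs internals; `select_barcodes` and its arguments are hypothetical names.

```python
def select_barcodes(read_counts, whitelist=None, auto_detected=None, min_reads=100):
    """Return the barcodes that survive every filter that is supplied.

    read_counts:   dict mapping barcode -> number of reads
    whitelist:     optional iterable of allowed barcodes
    auto_detected: optional iterable of barcodes chosen by automatic detection
    min_reads:     minimum reads per barcode (cf. nReadsperCell)
    """
    kept = {bc for bc, n in read_counts.items() if n >= min_reads}
    if whitelist is not None:
        kept &= set(whitelist)
    if auto_detected is not None:
        kept &= set(auto_detected)
    return kept


# Toy data: four barcodes with made-up read counts.
reads = {"AAAC": 5000, "AAAG": 150, "AAAT": 90, "CCCC": 4000}
whitelist = ["AAAC", "AAAG", "AAAT"]
auto = ["AAAC", "CCCC"]

print(sorted(select_barcodes(reads, whitelist=whitelist)))
print(sorted(select_barcodes(reads, whitelist=whitelist, auto_detected=auto)))
```

Under this assumption, adding automatic detection on top of a whitelist can only shrink the set of reported cells, which would be consistent with zUMIs reporting fewer cells than CellRanger.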
I've pasted one .yaml file below
Thanks,
John
###########################################
#Welcome to zUMIs
#below, please fill the mandatory inputs
#We expect full paths for all files.
###########################################

#define a project name that will be used to name output files
project: pbmc1k

#Sequencing File Inputs:
#For each input file, make one list object & define path and barcode ranges
#base definition vocabulary: BC(n) UMI(n) cDNA(n).
#Barcode range definition needs to account for all ranges. You can give several comma-separated ranges for BC & UMI sequences, eg. BC(1-6,20-26)
#you can specify between 1 and 4 input files
sequence_files:
  file1:
    name: /scratch/general/lustre/u1212967/singlecell_premrna/pbmc_1k/pbmc_1k_v3_fastqs/R1.fastq.gz
    base_definition:
      - BC(1-16)
      - UMI(17-28)
  file2:
    name: /scratch/general/lustre/u1212967/singlecell_premrna/pbmc_1k/pbmc_1k_v3_fastqs/R2.fastq.gz
    base_definition:
      - cDNA(1-91)

#reference genome setup
reference:
  STAR_index: /scratch/general/lustre/u1212967/sc_refs/zumis_refs/GRCh38_STARidx_noGTF_2.7.1a/
  GTF_file: /scratch/general/lustre/u1212967/sc_refs/GRCh38/genes/genes.gtf
  additional_files: #Optional parameter. It is possible to give additional reference sequences here, eg ERCC.fa
  additional_STAR_params: #Optional parameter. you may add custom mapping parameters to STAR here

#output
out_dir: /scratch/general/lustre/u1212967/zumi_1k_pbmc/

###########################################
#below, you may optionally change default parameters
###########################################

#number of processors to use
num_threads: 32
mem_limit: 96 #Memory limit in Gigabytes, null meaning unlimited RAM usage.

#barcode & UMI filtering options
#number of bases under the base quality cutoff that should be filtered out.
#Phred score base-cutoff for quality control.
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20

#Options for Barcode handling
#You can give either number of top barcodes to use or give an annotation of cell barcodes.
#If you leave both barcode_num and barcode_file empty, zUMIs will perform automatic cell barcode selection for you!
barcodes:
  barcode_num: null
  barcode_file: /scratch/general/lustre/u1212967/sc_refs/3M-february-2018.txt
  automatic: yes #Give yes/no to this option. If the cell barcodes should be detected automatically. If the barcode file is given in combination with automatic barcode detection, the list of given barcodes will be used as whitelist.
  BarcodeBinning: 1 #Hamming distance binning of close cell barcode sequences.
  nReadsperCell: 100 #Keep only the cell barcodes with at least n number of reads.

#Options related to counting of reads towards expression profiles
counting_opts:
  introns: yes #can be set to no for exon-only counting.
  downsampling: 0 #Number of reads to downsample to. This value can be a fixed number of reads (e.g. 10000) or a desired range (e.g. 10000-20000). Barcodes with fewer reads than this will not be reported. 0 means adaptive downsampling. Default: 0.
  strand: 1 #Is the library stranded? 0 = unstranded, 1 = positively stranded, 2 = negatively stranded
  Ham_Dist: 1 #Hamming distance collapsing of UMI sequences.
  velocyto: no #Would you like velocyto to do counting of intron-exon spanning reads
  primaryHit: no #Do you want to count the primary Hits of multimapping reads towards gene expression levels?
  twoPass: yes #perform basic STAR twoPass mapping

#produce stats files and plots?
make_stats: yes

#Start zUMIs from stage. Possible TEXT(Filtering, Mapping, Counting, Summarising). Default: Filtering.
which_Stage: Filtering

#define dependencies program paths
samtools_exec: samtools #samtools executable
Rscript_exec: Rscript #Rscript executable
STAR_exec: STAR #STAR executable
pigz_exec: pigz #pigz executable

#below, fqfilter will add a read_layout flag defining SE or PE
zUMIs_directory: /uufs/chpc.utah.edu/common/home/u1212967/software/bin/zUMIs
read_layout: SE
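To make the base_definition vocabulary in the config above concrete, here is a short Python sketch that parses strings such as BC(1-16) or BC(1-6,20-26) and slices the corresponding bases out of a read. It is an illustration of the 1-based, inclusive range notation only, not code from zUMIs; the names `parse_base_definition` and `extract` are hypothetical.

```python
import re

# One tag (BC, UMI or cDNA) followed by one or more comma-separated ranges.
_DEFINITION = re.compile(r"^(BC|UMI|cDNA)\((\d+-\d+(?:,\d+-\d+)*)\)$")


def parse_base_definition(definition):
    """Parse e.g. 'BC(1-6,20-26)' into ('BC', [(1, 6), (20, 26)])."""
    match = _DEFINITION.match(definition.strip())
    if match is None:
        raise ValueError(f"unrecognised base definition: {definition!r}")
    kind, range_text = match.groups()
    ranges = []
    for part in range_text.split(","):
        start, end = (int(x) for x in part.split("-"))
        if start < 1 or end < start:
            raise ValueError(f"invalid range {part!r} in {definition!r}")
        ranges.append((start, end))
    return kind, ranges


def extract(read, definition):
    """Return (kind, bases) where bases joins all ranges, 1-based inclusive."""
    kind, ranges = parse_base_definition(definition)
    return kind, "".join(read[start - 1:end] for start, end in ranges)


# A made-up 28-base read 1 matching the layout in the config above.
read1 = "AAAACCCCGGGGTTTT" + "ACGTACGTACGT"
print(extract(read1, "BC(1-16)"))
print(extract(read1, "UMI(17-28)"))
```

With this layout, the first 16 bases of read 1 are the cell barcode and the next 12 are the UMI, while read 2 contributes only cDNA.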