Current development is awaiting a release use minnow-velocity
for using the recent version of minnow
Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or UMI deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-seq (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as PCR amplification, CB (cellular barcodes) and UMI (Unique Molecule Identifiers) selection, and sequence fragmentation and sequencing.
Minnow is a read level simulator for droplet based single cell RNA-seq data. Minnow simulates the reads by sampling sequences from the underlying de-Bruijn graph (using --dbg
) of the reference transcriptome or alternatively just samples sequences from the reference transcriptome. As the --dbg
option also enables other features of the software, it is useful to describe those.
Minnow-velocity is currently available on Minnow's github repo under the minnow-velocity branch, for downloading, please
- Open your Terminal and go to the folder you want to download minnow in.
- Run the following codes:
git clone --single-branch --branch minnow-velocity https://github.com/COMBINE-lab/minnow.git
cd minnow
mkdir build
cd build
cmake ..
make
NOTICE: In the following steps, we assume that your working directory is in the minnow/build
folder.
As Minnow need the BFH file dumpped from Alevin to make its probability files, we need to run alevin before simulation with --dumpBfh
option specified.
Download the reference files from GENCODE, and build Salmon index.
mkdir data
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.pc_transcripts.fa.gz
gunzip data/gencode.vM25.pc_transcripts.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
gunzip data/gencode.vM25.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
gunzip data/GRCm38.primary_assembly.genome.fa.gz
grep ">" GRCm38.primary_assembly.genome.fa | cut -d ">" -f 2 | cut -d " " -f 1 > GRCm38.primary_assembly.genome.chrnames.txt
cd ..
salmon index \
-t gencode.vM25.pc_transcripts.fa \
-i gencode.vM25.annotation.sidx --gencode -p 128 \
-d GRCm38.primary_assembly.genome.chrnames.txt
- Get data set from Hermann et al., 2018. Here we use bamtofastq from 10X Genomics, please load before you run the code.
cd data
wget https://sra-pub-src-1.s3.amazonaws.com/SRR6459157/AdultMouse_Rep3_possorted_genome_bam.bam.1
mv AdultMouse_Rep3_possorted_genome_bam.bam.1 AdultMouse_Rep3_possorted_genome_bam.bam
bamtofastq --reads-per-fastq=500000000 AdultMouse_Rep3_possorted_genome_bam.bam FASTQtmp
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_I1_001.fastq.gz AdultMouseRep3_S1_L001_I1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R1_001.fastq.gz AdultMouseRep3_S1_L001_R1_001.fastq.gz
mv FASTQtmp/Ad-Ms-Total-Sorted_20k_count_MissingLibrary_1_HK2GNBBXX/bamtofastq_S1_L006_R2_001.fastq.gz AdultMouseRep3_S1_L001_R2_001.fastq.gz
cd ..
- Run Alevin to quantify the gene abundances based on the index generated above.
awk -F "\t" '$3 == "transcript" { print $9 }' data/gencode.vM25.annotation.gtf | tr -d ";\"" | awk '{print $4"\t"$2}' > data/gencode.vM25.annotation.tx2gene.tsv
salmon alevin -l ISR -i gencode.vM25.annotation.sidx \
-1 data/AdultMouseRep3_S1_L001_R1_001.fastq.gz \
-2 data/AdultMouseRep3_S1_L001_R2_001.fastq.gz \
-o alevin_out -p 36 \
--tgMap data/gencode.vM25.annotation.tx2gene.tsv \
--chromium \
--dumpFeatures --expectCells 1850 \
--dumpBfh
By specifying --dumpBfh
, alevin will dump the BFH file in its result, which will be used in the following steps.
The indexing phase of minnow is mostly concerned with creating de-Bruijn graph for the transcriptome, additionally to run minnow we need a transcript to gene mapping which can be obtained from gtf file. At the end, minnow index command will create a direcory with updated fasta file that has the following features:
- Any sequence in the fasta file with smaller length than the read length will be removed.
- PolyA tails will be clipped.
N
s will be replaced by randomA/T/G/C
.
Create the de-Buijn graph on your transcriptome file using the following command, note that do not pass your genome file into minnow index
.
src/minnow index -r data/gencode.vM25.pc_transcripts.fa -k 101 -f 20 --tmpdir tmp -p 10 -o minnow_index
There will be a ref_k101_fixed.fa
in the minnow index
folder, this fasta file is the updated fasta file we talked above.
Followings are the options,
SYNOPSIS
minnow index -r <ref_file>... -o <output_dir> [-f <filt_size>] [--tmpdir <twopaco_tmp_dir>] [-k <kmer_length>] [-p <threads>]
OPTIONS
-r, --ref <ref_file>
path to the reference transcriptome fasta file
-o, --output <output_dir>
directory where index is written
-f, --filt-size <filt_size>
filter size to pass to TwoPaCo when building the reference dBG
--tmpdir <twopaco_tmp_dir>
temporary work directory to pass to TwoPaCo when building the reference dBG
-k, --klen <kmer_length>
length of the k-mer with which the dBG was built (default = 101)
-p, --threads <threads>
total number of threads to use for building MPHF (default = 16)
While creating the index, -k
option should be treated as read length
for rest of the simulation.
If you just want to see the transcript name in your fasta file, please run the following option to clean the header in this file.
sed -e '/^>/ s/|.*//' data/gencode.vM25.pc_transcripts.fa > \
gencode.vM25.pc_transcripts_cleanHeader.fa
You can find the bfh.txt
file in the alevin_out/alevin
folder, this file is what we need for minnow estimate
to get geneProb and cellProb files, which specify the multimapping probability.
To run minnow estimate, please run
src/minnow estimate -o minnow_estimate -r data/gencode.vM25.pc_transcripts.fa --g2t data/gencode.vM25.annotation.tx2gene.tsv --bfh alevin_out/alevin/bfh.txt
The countProb.txt
file will be used in the following steps.
Here we show how to use splatter package in R to simulate the read count matrix, you can download this package from Bioconductor.
- create 100 genes with 150 cells data Splatter is a package for the simulation of single-cell RNA sequencing count data, following codes are ran in R.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("splatter")
library(splatter)
out_dir <- "data/test_splatter_data"
num_genes <- 100
num_cells <- 150
sim <- splatSimulate(
nGenes=num_genes,
batchCells=num_cells,
verbose = FALSE
)
dir.create(out_dir, showWarnings=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")
- Specify gene names for simulated data.
As we want to simulate data with real gene names, while splatter only returns meaningless gene names (for instance, gene1, gene2, etc), we need to randomly select gene names from our reference files such that the selected genes have at least one isoform with length longer than read length. Fortunately, if you have run
minnow estimate
, you will have a fixed reference fileref_k101_fixed.fa
which contains only transcripts that meet this criteria.
cut -f 2 data/gencode.vM25.annotation.tx2gene.tsv | sort | uniq | shuf -n 100 - > data/test_splatter_data/quants_mat_rows.txt
If you are using Mac OS and shuf
is not found, please download it from Homebrew by brew install coreutils
or use a Linux machine.
Now we can start our simulation process. There are two mode in minnow, here we give the example of ``--splatter-mode`.
We can run minnow simulate in --splatter-mode
by running the following codes:
src/minnow simulate --splatter-mode \
--g2t data/gencode.vM25.annotation.tx2gene.tsv \
--inputdir data/test_splatter_data \
--PCR 4 \
-r minnow_index/ref_k101_fixed.fa \
-e 0.01 \
-p 16 \
-o minnow_simulate \
--dbg \
--gfa minnow_index/dbg.gfa \
-w data/737K-august-2016.txt \
--countProb minnow_estimate/countProb.txt \
--custom \
--gencode
Minnow generate two fastq
files as the simulated read files, which can be analyzed by Alevin directly!
we will use alevin to re-estimate the read count matrix, all required files have already been downloaded in the data
folder.
salmon alevin -l ISR -i gencode.vM25.annotation.sidx \
-1 minnow_simulate/hg_100_S1_L001_R1_001.fastq.gz \
-2 minnow_simulate/hg_100_S1_L001_R2_001.fastq.gz \
-o minnow_simulate/alevin_out -p 16 \
--tgMap data/gencode.vM25.annotation.tx2gene.tsv \
--chromium \
--dumpFeatures --expectCells 100 \
Now we are done with the process!
The truth read count matrix is in the minnow_simulate/alevin
folder, and the re-estimated read count matrix is in the minnow_simulate/alevin_out/alevin
folder, check it now!
git clone --single-branch --branch minnow-velocity https://github.com/COMBINE-lab/minnow.git
cd minnow
mkdir build
cd build
cmake ..
make
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.transcripts.fa.gz
awk -F "\t" '$3 == "transcript" { print $9 }' <( zcat gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz ) | tr -d ";\"" | awk '{print $4"\t"$2}' > tx2gene.tsv
src/minnow index -r ../mouse/GRCm38.p6/gencode.vM25.transcripts.fa.gz -k 101 -f 20 --tmpdir tmp -p 10 -o minnow_ind
BiocManager::install("splatter")
library(splatter)
library(scater)
set.seed(1)
num_genes<- 1000
num_cells <- 1000
sim <- splatSimulate(
nGenes=num_genes,
batchCells=num_cells,
verbose = FALSE
)
out_dir<-"splatter_out"
write.table(rownames(sim), file= file.path(out_dir, "quants_mat_rows.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")
Assigning unique gene names (same number of unique gene names as your rows in the count matrix generated in splatter)
cut -f 2 ../mouse/GRCm38.p6/tx2gene.tsv | sort | uniq > ../mouse/GRCm38.p6/gene.list
shuf -n 1000 ../mouse/GRCm38.p6/gene.list > splatter_out/quants_mat_rows.txt
Now you can run minnow with the files generated above
src/minnow simulate --splatter-mode \
--g2t tx2gene.tsv \
--inputdir splatter_out \
--PCR 6 \
-r minnow_ind/ref_k101_fixed.fa \
-e 0.01 -p 10 -o minnow_splatter_out \
--dbg --gfa minnow_ind/dbg.gfa \
-w ../data/737K-august-2016.txt --countProb ../data/hg/countProb_pbmc_4k.txt --custom
To check the edit distance distribution and the distribution of distance from end
please run the following
src/validate validate -f <( gunzip -c hg_100_S1_L001_R2_001.fastq.gz ) -t minnow_ind/ref_k101_fixed.fa -o splatter_out/edit.dist -m 10
-m
option is for edit distance, if you use a PCR cycle of 6, it's unlikely to see a sequence which is more than 6 edit distance apart, if you use a higher
PCR cycle then use higher values to find the edit-distance distribution with arbitarary high edit distances,
The edit distance distribution will be followd by a distribution of the distance from end
. A snapshot of the out put from the above command is as follows,
➜ build git:(minnow-velocity) ✗ head -20 splatter_out/edit.dist
0 606398
1 304145
2 75481
3 12264
4 1537
5 157
6 18
--
100 6522
101 6490
102 6637
103 6604
104 6161
105 6131
106 5891
107 5887
108 5213
109 5448
110 5405
111 5295
...
- Doublets
- Empty-drops
- Retained intron
- Clusters