---
title: "Introduction to scImpute"
author: "Wei Vivian Li, Jingyi Jessica Li"
# author:
# - name: Wei Vivian Li, Jingyi Jessica Li
# affiliation:
# - Department of Statistics, University of California, Los Angeles
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
#output: pdf_document
vignette: >
  %\VignetteIndexEntry{scImpute: accurate and robust imputation for scRNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of the transcriptomic landscape at single-cell resolution. However, scRNA-seq analysis is complicated by the excess of zero or near-zero counts in the data, the so-called dropouts, which result from the low amounts of mRNA within each individual cell. Consequently, downstream analysis of scRNA-seq data would be severely biased if the dropout events are not properly corrected. `scImpute` is developed to accurately and efficiently impute the dropout values in scRNA-seq data.
`scImpute` can be applied to the raw count data before users perform downstream analyses such as
- dimension reduction of scRNA-seq data
- normalization of scRNA-seq data
- clustering of cell populations
- differential gene expression analysis
- time-series analysis of gene expression dynamics
## Quick start
`scImpute` can be easily incorporated into existing scRNA-seq analysis pipelines.
Its only input is the raw count matrix, with rows representing genes and columns representing cells. It outputs an imputed count matrix with the same dimensions.
In the simplest case, the imputation task can be done with one single function `scimpute`:
```{r eval = FALSE}
scimpute(# full path to raw count matrix
         count_path = system.file("extdata", "raw_count.csv", package = "scImpute"),
         infile = "csv",     # format of input file
         outfile = "csv",    # format of output file
         out_dir = "./",     # full path to output directory
         labeled = FALSE,    # cell type labels not available
         drop_thre = 0.5,    # threshold set on dropout probability
         Kcluster = 2,       # 2 cell subpopulations
         ncores = 10)        # number of cores used in parallel computation
```
This function returns the column indices of outlier cells, and creates a new file `scimpute_count.csv` in `out_dir` to store the imputed count matrix.
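Because the return value is a vector of column indices, it can be mapped back to cell names. The sketch below is only an illustration: `cell_names` and `outlier_idx` are made-up stand-ins (the latter for what `scimpute` would return), not values produced by the package.

```{r eval = FALSE}
# Made-up column names of a count matrix.
cell_names <- paste0("cell", 1:6)

# Hypothetical return value of scimpute(): column indices of outlier cells.
outlier_idx <- c(2L, 5L)

# Names of the outlier cells.
outlier_cells <- cell_names[outlier_idx]

# Names of the remaining cells. setdiff() is used instead of negative
# indexing so that an empty outlier_idx keeps every cell.
kept_cells <- cell_names[setdiff(seq_along(cell_names), outlier_idx)]
```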
## Step-by-step description
The input file can be a `.csv` file, `.txt` file, or `.rds` file. In all cases, the **first column** should give the gene names and the **first row** should give the cell names. We use the example files in the package for illustration. If the raw counts are stored in a `.csv` file, and we also want to write the imputed matrix to a `.csv` file, then specify this information with
```{r eval = FALSE}
# full path of the input file
count_path = system.file("extdata", "raw_count.csv", package = "scImpute")
infile = "csv"
outfile = "csv"
```
Similarly, if the raw counts are stored in a `.txt` file, and we also want to write the imputed matrix to a `.txt` file, then specify this information with
```{r eval = FALSE}
# full path of the input file
count_path = system.file("extdata", "raw_count.txt", package = "scImpute")
infile = "txt"
outfile = "txt"
```
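If the counts live in an R matrix rather than in a file, they first have to be written out in the layout described above (gene names in the first column, cell names in the first row). The following is a minimal sketch using base R only; the matrix `toy_counts` and the path `toy_path` are made up for illustration, and it assumes that the default `write.csv` layout is an acceptable input.

```{r eval = FALSE}
# Toy count matrix: 4 genes (rows) x 3 cells (columns); all names are made up.
toy_counts <- matrix(
  c(0, 3, 12,
    5, 0,  0,
    1, 8,  2,
    0, 0,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# write.csv() puts the row names (genes) in the first column and
# the column names (cells) in the first row.
toy_path <- file.path(tempdir(), "toy_count.csv")
write.csv(toy_counts, file = toy_path)

# Read the file back to confirm that the layout survives a round trip.
toy_check <- as.matrix(read.csv(toy_path, row.names = 1))
dim(toy_check)
```

The resulting `toy_path` could then be passed as `count_path` together with `infile = "csv"`.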
Next, we need to set up the directory to store all the temporary and final outputs:
```{r eval = FALSE}
# a '/' sign is necessary at the end of the path
out_dir = "~/output/"
```
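Since the trailing `/` is easy to forget, a small helper step can normalize the path and create the directory before `scimpute` is called. This is a sketch in base R; the location under `tempdir()` is a placeholder, not a path used by the package.

```{r eval = FALSE}
# Placeholder output location; replace with your own path.
out_dir <- file.path(tempdir(), "scimpute_output")

# Create the directory (and any missing parents) if it does not exist yet.
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Append the trailing '/' if it is missing.
if (!endsWith(out_dir, "/")) {
  out_dir <- paste0(out_dir, "/")
}
```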
We highly recommend using parallel computing with `scImpute`, which significantly reduces the computation time. For example, to use 20 cores, run the `scimpute` function with `ncores = 20`.
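One way to choose a value for `ncores` is to ask R how many cores the machine reports. The sketch below uses `detectCores()` from the `parallel` package that ships with R; leaving one core free is a convention assumed here, not a requirement of `scImpute`.

```{r eval = FALSE}
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system, but use at least one.
ncores <- max(1L, available - 1L)
```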
`scImpute` has two statistical parameters.
The **first parameter is `Kcluster`**, which determines the **number of initial clusters** used to help identify candidate neighbors of each cell. The imputation results do not heavily rely on the choice of `Kcluster`, since `scImpute` uses a model-based method to select similar cells in a later stage. `Kcluster` can be specified based on the number of known cell types and users' biological expertise, and it may also be learned by clustering the raw data and inspecting the clustering results.
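As an illustration of clustering the raw data to pick a candidate `Kcluster`, the sketch below applies base R hierarchical clustering to simulated counts. The simulation, the log transformation, and the use of `hclust` are assumptions made for this example; this is not the clustering procedure that `scImpute` runs internally.

```{r eval = FALSE}
set.seed(1)

# Simulated counts: 200 genes x 40 cells, two groups of 20 cells each.
# The first 50 genes are more highly expressed in the second group.
n_genes <- 200
group1 <- matrix(rpois(n_genes * 20, lambda = 5), nrow = n_genes)
group2 <- matrix(rpois(n_genes * 20, lambda = 5), nrow = n_genes)
group2[1:50, ] <- matrix(rpois(50 * 20, lambda = 30), nrow = 50)
sim_counts <- cbind(group1, group2)

# Log-transform, then cluster the cells (columns) hierarchically.
log_counts <- log10(sim_counts + 1)
cell_tree <- hclust(dist(t(log_counts)))

# Cut the tree at a candidate number of clusters and inspect the group sizes.
candidate_clusters <- cutree(cell_tree, k = 2)
table(candidate_clusters)
```

Calling `plot(cell_tree)` draws the dendrogram, which helps judge whether the chosen number of clusters is reasonable.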
The **second parameter** is `drop_thre`. Only the values that have **dropout probability** larger than `drop_thre` are imputed by `scImpute`. A default threshold `drop_thre = 0.5` is sufficient for most scRNA-seq data.
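To make the role of the threshold concrete, the toy example below marks which entries of a matrix would be candidates for imputation. The dropout probabilities are made up for illustration; in practice they are estimated by `scImpute` and are not supplied by the user.

```{r eval = FALSE}
# Made-up dropout probabilities for 3 genes x 4 cells (illustration only).
drop_prob <- matrix(
  c(0.95, 0.10, 0.60, 0.02,
    0.40, 0.85, 0.05, 0.51,
    0.50, 0.99, 0.30, 0.70),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)

drop_thre <- 0.5

# TRUE marks entries whose dropout probability is larger than the threshold.
to_impute <- drop_prob > drop_thre
to_impute

# Number of entries that exceed the threshold.
sum(to_impute)
```

Raising `drop_thre` makes the mask more conservative, so fewer values are replaced.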
Now, to get the imputed matrix, all we need is the main `scimpute` function:
```{r eval = FALSE}
Kcluster = 2
drop_thre = 0.5
ncores = 10
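# name every argument so that none is matched by position
scimpute(count_path = count_path, infile = infile, outfile = outfile,
         out_dir = out_dir, labeled = FALSE, drop_thre = drop_thre,
         Kcluster = Kcluster, ncores = ncores)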
```
If `outfile = "csv"`, this function will create a new file `scimpute_count.csv` in `out_dir` to store the imputed count matrix; if `outfile = "txt"`, this function will create a new file `scimpute_count.txt` in `out_dir`.
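After the run, a quick comparison of the raw and imputed matrices shows how many entries were changed. The sketch below uses made-up stand-in matrices so that it runs on its own; the commented `read.csv` calls show how the real files might be loaded, assuming the output keeps gene names in the first column.

```{r eval = FALSE}
# In practice the two matrices would be read from disk, for example:
#   raw     <- as.matrix(read.csv(count_path, row.names = 1))
#   imputed <- as.matrix(read.csv(paste0(out_dir, "scimpute_count.csv"), row.names = 1))
# Made-up stand-ins are used here instead.
raw <- matrix(c(0, 4, 0, 9,
                7, 0, 3, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
imputed <- raw
imputed["geneA", "cell1"] <- 3.2   # pretend this zero was imputed
imputed["geneB", "cell4"] <- 2.5   # pretend this zero was imputed

# The imputed matrix should have the same dimensions and names as the input.
stopifnot(identical(dim(raw), dim(imputed)),
          identical(dimnames(raw), dimnames(imputed)))

# Count how many entries changed, and how many zeros were present before and after.
n_changed <- sum(raw != imputed)
zeros_before <- sum(raw == 0)
zeros_after <- sum(imputed == 0)
c(changed = n_changed, zeros_before = zeros_before, zeros_after = zeros_after)
```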
Note that unnamed arguments in R are matched by position, so we suggest using the named format in **Quick start** to specify parameters and avoid mistakes. If users would like to apply `scImpute` to data from homogeneous cells, they can set `Kcluster = 1` and `labeled = FALSE`.
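The difference between positional and named argument matching can be seen with a toy function. `describe` below is a made-up helper written for this illustration and is not part of `scImpute`.

```{r eval = FALSE}
# Toy function (not part of scImpute) that reports how its arguments were matched.
describe <- function(input, format = "csv", threshold = 0.5) {
  paste0("input=", input, ", format=", format, ", threshold=", threshold)
}

# Positional call: values are assigned in the order of the formal arguments.
by_position <- describe("counts", "txt", 0.8)

# Named call: the order in which the arguments are written no longer matters.
by_name <- describe(threshold = 0.8, input = "counts", format = "txt")

identical(by_position, by_name)
```

With named arguments, a later change in the order of a function's parameters cannot silently assign a value to the wrong one.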
## Apply `scImpute` with cell type information
Sometimes users have the cell type (or subpopulation) information of the single cells, and `scimpute` can take advantage of this information to impute within each cell type. To do this, we need a character vector `labels` specifying the cell type of each column in the raw count matrix. In other words, the length of `labels` equals the number of cells, and the order of elements in `labels` should match the order of columns in the raw count matrix. Then we just need to specify `labeled = TRUE` in `scimpute` (default is `FALSE`) and supply the `labels` argument. `Kcluster` is not used when `labeled = TRUE`.
```{r eval = FALSE}
labels = readRDS(system.file("extdata", "labels.rds", package = "scImpute"))
labels[1:5]
# [1] "c1" "c1" "c1" "c2" "c2"
scimpute(count_path,
         infile = "csv",
         outfile = "csv",
         out_dir = out_dir,
         labeled = TRUE,
         drop_thre = 0.5,
         labels = labels,
         ncores = 10)
```
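Before passing `labels` to `scimpute`, it may be worth checking that the vector satisfies the requirements described above. The following sketch uses made-up cell names and labels; in practice `cell_names` would be the column names of the raw count matrix.

```{r eval = FALSE}
# Made-up example: 5 cells and one label per cell.
cell_names <- paste0("cell", 1:5)
labels <- c("c1", "c1", "c1", "c2", "c2")

# One label per column of the count matrix, stored as a character vector
# without missing values.
stopifnot(is.character(labels),
          length(labels) == length(cell_names),
          !anyNA(labels))

# Number of cells per cell type.
label_counts <- table(labels)
label_counts
```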
## Apply `scImpute` to TPM values
We strongly suggest using `scImpute` on count matrices. However, if only TPM values are available, users can apply `scImpute` with gene lengths supplied. `scImpute` will use the gene lengths (sum of exon lengths) to scale the data, which ensures a good fit of the mixture models. In this case, users need to specify `type = "TPM"` (`type = "count"` by default) and supply a vector `genelen` of gene lengths. The order of genes in `genelen` should match the order in the expression matrix. For example:
```{r eval = FALSE}
genelen[1:3]
# ENSMUSG00000021252 ENSMUSG00000007777 ENSMUSG00000024442
#               4235                998               2404
scimpute(count_path,
         infile = "csv",
         outfile = "csv",
         out_dir = out_dir,
         type = "TPM",
         genelen = genelen,
         drop_thre = 0.5,
         ncores = 10)
```
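If the gene lengths come from an annotation in a different order than the expression matrix, they can be reordered by name. The sketch below assumes the lengths are stored as a named vector; `matrix_genes` stands in for the row names of the expression matrix, and `genelen_named` is a hypothetical name for the unordered lengths.

```{r eval = FALSE}
# Stand-in for the row names (gene order) of the expression matrix.
matrix_genes <- c("ENSMUSG00000007777", "ENSMUSG00000024442", "ENSMUSG00000021252")

# Gene lengths stored as a named vector, in a different order.
genelen_named <- c(ENSMUSG00000021252 = 4235,
                   ENSMUSG00000007777 = 998,
                   ENSMUSG00000024442 = 2404)

# Fail early if any gene in the matrix has no length.
stopifnot(all(matrix_genes %in% names(genelen_named)))

# Index by name so that the lengths follow the row order of the matrix.
genelen <- genelen_named[matrix_genes]
genelen
```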
## How to save computation time with `scImpute`
`scImpute` benefits from parallel computation, and the memory cost per processor is modest. `scimpute` completes computation in seconds when applied to a dataset with 10,000 genes and 100 cells, running with 10 cores. The memory requirement for this dataset is around 2 GB. The running time mostly depends on
* number of processors (`ncores`)
* number of cells in the scRNA-seq data
When the number of cells is extremely large, filtering the cells before imputation can reduce the computation time of `scImpute`.
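One possible filtering step, shown here only as a sketch, removes cells in which very few genes are detected. The simulated data and the cutoff `min_genes` are assumptions chosen for illustration; an appropriate criterion depends on the dataset.

```{r eval = FALSE}
set.seed(1)

# Simulated counts: 100 genes x 30 cells; the last 5 cells are nearly empty.
counts <- matrix(rpois(100 * 30, lambda = 4), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:30)))
counts[, 26:30] <- rpois(100 * 5, lambda = 0.05)

# Number of genes detected (count > 0) in each cell.
genes_detected <- colSums(counts > 0)

# Keep cells with at least `min_genes` detected genes (hypothetical cutoff).
min_genes <- 50
filtered <- counts[, genes_detected >= min_genes, drop = FALSE]
dim(filtered)
```

The filtered matrix would then be written to a file and passed to `scimpute` through `count_path`.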