-
Notifications
You must be signed in to change notification settings - Fork 2
/
preprocess.Rmd
121 lines (96 loc) · 3.55 KB
/
preprocess.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
# Preprocessing
```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
knitr::opts_chunk$set(dpi=300, dev="png", dev.args=list(pointsize=15))
```
```{r, echo=FALSE, results="hide"}
library(BiocStyle)
```
## Loading in the data
The aim here is to convert the 10X data set into a `HDF5Matrix` object.
We first obtain the data using the `r Biocpkg("TENxBrainData")` package:
```{r}
#library(TENxBrainData)
#sce <- TENxBrainData() # or we would, if it was working properly.
library(SingleCellExperiment)
library(HDF5Array)
sce <- SingleCellExperiment(
list(counts=HDF5Array("rawdata/counts.h5", "counts")),
rowData=readRDS("rawdata/rowdata.rds"),
colData=readRDS("rawdata/coldata.rds")
)
sce
```
We have a look at some of the cell-level metadata.
Data were obtained from multiple mice, which were captured and sequenced in multiple libraries.
Note that the libraries and nested within the mice.
```{r}
table(sce$Library, sce$Mouse)
```
We also add some gene-level annotation.
There's already gene symbols, so we just add the chromosome location.
```{r}
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
chr.loc <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rowData(sce)$Ensembl,
keytype="GENEID", column="CDSCHROM")
rowData(sce)$Chr <- chr.loc
head(rowData(sce))
```
## Performing cell-based quality control
We use `r Biocpkg("scater")` to calculate quality control summaries for each cell and gene.
```{r}
library(scater)
system.time({
sce <- calculateQCMetrics(sce,
feature_controls=list(Mito=which(rowData(sce)$Chr=="chrM")),
compact=TRUE, BPPARAM=BPPARAM)
})
head(colnames(rowData(sce)))
```
We have a look at the three relevant metrics, plotted against the batch of origin for each cell.
```{r qchist, fig.height=6, fig.width=12}
par(mfrow=c(1,3))
hist(sce$scater_qc$all$log10_total_counts,
xlab=expression(Log[10]~"library size"), col="grey80")
hist(sce$scater_qc$all$log10_total_features_by_counts,
xlab=expression(Log[10]~"number of genes expressed"), col="grey80")
hist(sce$scater_qc$feature_control_Mito$pct_counts,
xlab="Percentage of mitochondrial reads", col="grey80", breaks=50)
```
We use some of these metrics for quality control on the cells.
This is done within each batch to avoid discarding cells, e.g., if one batch was sequenced at lower depth.
```{r}
low.libsize <- isOutlier(sce$scater_qc$all$log10_total_counts,
nmad=3, batch=sce$Library, type="lower")
low.ngenes <- isOutlier(sce$scater_qc$all$log10_total_features_by_counts,
nmad=3, batch=sce$Library, type="lower")
discard <- low.libsize | low.ngenes
data.frame(LowLib=sum(low.libsize), LowGenes=sum(low.ngenes), Lost=sum(discard))
```
Low-quality cells are discarded from the object.
```{r}
sce <- sce[,!discard]
```
## Looking at gene-based metrics
For each feature, we plot the average count against the percentage of cells in which expression was detected.
```{r ave-per-gene}
ave.count <- rowData(sce)$scater_qc$all$mean_counts
pct.cells <- rowData(sce)$scater_qc$all$n_cells_by_counts/ncol(sce) * 100
smoothScatter(log10(ave.count), pct.cells, ylab="Percentage of cells",
xlab=expression("Log"[10]~"average count"))
```
We also inspect the identities of the top most-highly-expressed genes.
```{r}
top.genes <- rowData(sce)[,c("Ensembl", "Symbol")]
top.genes$AveCount <- ave.count
top.genes$PctCells <- pct.cells
top.genes <- top.genes[order(ave.count, decreasing=TRUE),]
head(as.data.frame(top.genes), 20)
```
<!--
Serializing the SCE object:
```{r}
dir.create("objects")
saveRDS(sce, file="objects/sce.rds")
```
-->