diff --git a/_episodes/03-Overview-scRNA-seq-Data.md b/_episodes/03-Overview-scRNA-seq-Data.md
index 035fc5c..62cb608 100644
--- a/_episodes/03-Overview-scRNA-seq-Data.md
+++ b/_episodes/03-Overview-scRNA-seq-Data.md
@@ -1,65 +1,42 @@
----
-# Please do not edit this file directly; it is auto generated.
-# Instead, please edit 03-Overview-scRNA-seq-Data.md in _episodes_rmd/
-source: Rmd
-title: "Overview of scRNA-seq Data"
-teaching: 90
-exercises: 30
-questions:
-- "What does single cell RNA-Seq data look like?"
-objectives:
-- "Understand the types of files provided by CellRanger."
-- "Understand the structure of files provided by CellRanger."
-- "Describe a sparse matrix and explain why it is useful."
-- "Read in a count matrix using Seurat."
-keypoints:
-- "CellRanger produces a gene expression count matrix that can be read in using Seurat."
-- "The count matrix is stored as a sparse matrix with features in rows and cells in columns."
----
-
-
-
-
-
## Overview of Single Cell Analysis Process
## Open Project File
-In the Setup section of this workshop, you created an RStudio Project.
+In the Setup section of this workshop, you created an RStudio Project.
Open this project now, by:
-1. selecting File --> Open Project... from the Menu
-2. choosing "scRNA.Rproj"
-3. opening the project file.
+1. selecting File –> Open Project… from the Menu
+2. choosing “scRNA.Rproj”
+3. opening the project file.
## What do raw scRNA-Seq data look like?
-The raw data for an scRNA-Seq experiment typically consists of two FASTQ files.
-One file contains the sequences of the cell barcode and molecular
+The raw data for an scRNA-Seq experiment typically consists of two FASTQ
+files. One file contains the sequences of the cell barcode and molecular
barcode (UMI), while the other file contains the sequences derived from
-the transcript. The reads in file one are approximately 28bp long
-(16bp cell barcode, 12bp UMI), while the reads in file two are
-approximately 90bp long.
+the transcript. The reads in file one are approximately 28bp long (16bp
+cell barcode, 12bp UMI), while the reads in file two are approximately
+90bp long.
The Single Cell Biology Laboratory at JAX additionally provides output
-of running the
-[10X CellRanger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
-(see below).
+of running the [10X CellRanger
+pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
+(see below).
## Typical pre-processing pipeline
### 10X CellRanger
-[10X CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
-is "a set of analysis pipelines that process Chromium single cell
-data to align reads, generate feature-barcode matrices" and perform
-various other downstream analyses.
-In this course we will work with data that has been preprocessed
-using CellRanger.
-All you need to remember is that we used CellRanger to obtain
-gene expression counts for each gene within each cell.
+[10X
+CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
+is “a set of analysis pipelines that process Chromium single cell data
+to align reads, generate feature-barcode matrices” and perform various
+other downstream analyses. In this course we will work with data that
+has been preprocessed using CellRanger. All you need to remember is that
+we used CellRanger to obtain gene expression counts for each gene within
+each cell.
### CellRanger alternatives
@@ -67,693 +44,610 @@ There are several alternatives to CellRanger. Each of these alternatives
has appealing properties that we encourage you to read about but do not
have the time to discuss in this course. Alternatives include:
- * `alevin` [Srivastava et al. 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y),
- from the developers of the `salmon` aligner
- * `kallisto | bustools` [Melsted et al. 2021](https://doi.org/10.1038/s41587-021-00870-2),
- from the developers of the `kallisto` aligner
- * `STARsolo` [Kaminow et al 2021](https://doi.org/10.1101/2021.05.05.442755),
- from the developers of the `STAR` aligner
-
-While you should be aware that these alternatives
-exist and in some cases there may be very compelling reasons to use them,
-broadly speaking CellRanger is the most widely used tool for
-processing 10X Chromium scRNA-Seq data.
-
-
-## Introduction to two major single cell analysis ecosystems:
-
-At the time that this workshop was created, there were many different software
-packages designed for analyzing scRNA-seq data in a variety of scenarios. The
-two scRNA-seq software "ecosystems" that were most widely in use were:
-
-* R/Seurat : The Seurat ecosystem is the tool of choice for this workshop. The
-biggest strength of Seurat is its straightforward vignettes and ease of
-visualization/exploration.
- * [Seurat](https://www.nature.com/articles/nbt.3192) was released in 2015
- by the [Regev lab](https://biology.mit.edu/profile/aviv-regev/).
- * The first author, Rahul Satija, now has a faculty position and has
- maintained and improved Seurat.
- * Currently at [version 4](https://www.cell.com/cell/fulltext/S0092-8674(21)00583-3).
- * Source code available on [Github](https://www.github.com/satijalab/seurat).
- * Each version of Seurat adds new functionality:
- * Seurat v1: Infers cellular localization by integrating scRNA-seq
- with *in situ* hybridization.
- * Seurat v2: Integrates multiple scRNA-seq data sets using shared
- correlation structure.
- * Seurat v3: Integrates data from multiple technologies, i.e. scRNA-seq,
- scATAC-seq, proteomics, *in situ* hybridization.
- * Seurat v4: Integrative multimodal analysis and mapping of user data
- sets to cell identity reference database.
-
-* Python/[scanpy](https://scanpy.readthedocs.io/en/stable/) and [anndata](https://anndata.readthedocs.io/en/latest/)
- * Scanpy is a python toolkit for analyzing single-cell gene expression data.
- * Scanpy is built jointly with anndata, which is a file format specification
- and accompanying API for efficiently storing and accessing single cell data.
- * Like Seurat, scanpy is under active development as well. Scanpy has an
- advantage of being a somewhat larger and more diverse community than
- Seurat, where developement is centered around a single lab group.
- * This software has been used in a very large number of single cell projects. We
- encourage you to check it out and consider using it for your own work.
-
-For this course we will not use scanpy because we view R/Seurat as
-having a slight edge over scanpy when it comes to visualization and
+- `alevin` [Srivastava et
+ al. 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y),
+ from the developers of the `salmon` aligner
+- `kallisto | bustools` [Melsted et
+ al. 2021](https://doi.org/10.1038/s41587-021-00870-2), from the
+ developers of the `kallisto` aligner
+- `STARsolo` [Kaminow et al
+ 2021](https://doi.org/10.1101/2021.05.05.442755), from the
+ developers of the `STAR` aligner
+
+While you should be aware that these alternatives exist and in some
+cases there may be very compelling reasons to use them, broadly speaking
+CellRanger is the most widely used tool for processing 10X Chromium
+scRNA-Seq data.
+
+## Introduction to two major single cell analysis ecosystems:
+
+At the time that this workshop was created, there were many different
+software packages designed for analyzing scRNA-seq data in a variety of
+scenarios. The two scRNA-seq software “ecosystems” that were most widely
+in use were:
+
+- R/Seurat : The Seurat ecosystem is the tool of choice for this
+ workshop. The biggest strength of Seurat is its straightforward
+ vignettes and ease of visualization/exploration.
+ - [Seurat](https://www.nature.com/articles/nbt.3192) was released
+ in 2015 by the [Regev
+ lab](https://biology.mit.edu/profile/aviv-regev/).
+ - The first author, Rahul Satija, now has a faculty position and
+ has maintained and improved Seurat.
+ - Currently at [version
+ 4](https://www.cell.com/cell/fulltext/S0092-8674(21)00583-3).
+ - Source code available on
+ [Github](https://www.github.com/satijalab/seurat).
+ - Each version of Seurat adds new functionality:
+ - Seurat v1: Infers cellular localization by integrating
+ scRNA-seq with *in situ* hybridization.
+ - Seurat v2: Integrates multiple scRNA-seq data sets using
+ shared correlation structure.
+ - Seurat v3: Integrates data from multiple technologies,
+ i.e. scRNA-seq, scATAC-seq, proteomics, *in situ*
+ hybridization.
+ - Seurat v4: Integrative multimodal analysis and mapping of
+ user data sets to cell identity reference database.
+- Python/[scanpy](https://scanpy.readthedocs.io/en/stable/) and
+ [anndata](https://anndata.readthedocs.io/en/latest/)
+ - Scanpy is a python toolkit for analyzing single-cell gene
+ expression data.
+ - Scanpy is built jointly with anndata, which is a file format
+ specification and accompanying API for efficiently storing and
+ accessing single cell data.
+ - Like Seurat, scanpy is under active development as well. Scanpy
+ has an advantage of being a somewhat larger and more diverse
+ community than Seurat, where developement is centered around a
+ single lab group.
+ - This software has been used in a very large number of single
+ cell projects. We encourage you to check it out and consider
+ using it for your own work.
+
+For this course we will not use scanpy because we view R/Seurat as
+having a slight edge over scanpy when it comes to visualization and
interactive exploration of single cell data.
-
## Reading in CellRanger Data
As described above,
-[CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
-is software which preprocesses Chromium single cell data to
-align reads, generate feature-bar code matrices, and perform other downstream
-analyses.
-We will not be using any of CellRanger's downstream analyses,
-but we *will* be using the feature-barcode matrix produced by CellRanger.
-A feature-barcode matrix -- in the context of scRNA-Seq -- is a
-matrix that gives gene expression counts for each gene in each single cell.
-In a feature-barcode matrix, the
-genes (rows) are the features, and the cells (columns) are each identified
-by a barcode.
-The name feature-barcode matrix is a generalized term for the
-gene expression matrix. For example, feature-barcode could also refer
-to a matrix of single cell protein expression or single cell
-chromatin accessibility.
-In this workshop, we will read in the
-feature-barcode matrix produced by CellRanger and will perform the downstream
-analysis using Seurat.
+[CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
+is software which preprocesses Chromium single cell data to align reads,
+generate feature-bar code matrices, and perform other downstream
+analyses. We will not be using any of CellRanger’s downstream analyses,
+but we *will* be using the feature-barcode matrix produced by
+CellRanger. A feature-barcode matrix – in the context of scRNA-Seq – is
+a matrix that gives gene expression counts for each gene in each single
+cell. In a feature-barcode matrix, the genes (rows) are the features,
+and the cells (columns) are each identified by a barcode. The name
+feature-barcode matrix is a generalized term for the gene expression
+matrix. For example, feature-barcode could also refer to a matrix of
+single cell protein expression or single cell chromatin accessibility.
+In this workshop, we will read in the feature-barcode matrix produced by
+CellRanger and will perform the downstream analysis using Seurat.
### Liver Atlas
#### Cell Ranger Files
-In this lesson, we will read in a subset of data from the
-[Liver Atlas](https://livercellatlas.org/index.php), which is described in their
-[Cell paper](https://www.cell.com/cell/fulltext/S0092-8674(21)01481-1).
-Briefly, the authors performed scRNASeq on liver cells from mice and humans,
-identified cell types, clustered them, and made the data publicly available.
-We will be working with a subset of the *mouse* liver data.
-We split the data into two sets, one to use in the lesson and one for
-you to work with independently as a challenge.
-
-Before the workshop, you should have downloaded the data from
-Box and placed it in your `data` directory.
-Go to the [Setup](../setup) page for instructions on how to download the data
-files.
-
-Open a file browser and look in the `data` subdirectory `mouseStSt_invivo` and
-you should see three files. Each file ends with 'gz', which indicates that it
-has been compressed (or 'zipped') using
-[gzip](https://www.gnu.org/software/gzip/). You **don't** need to unzip them;
-the software that we use will uncompress the files as it reads them in. The
-files are:
-
- * matrix.mtx.gz: The feature-barcode matrix, i.e. a two-dimensional
- matrix containing the counts for each gene in each cell.
- * Genes are in rows and cells are in columns.
- * This file is in a special sparse matrix format which reduces disk space
- and memory usage.
- * barcodes.tsv.gz: DNA barcodes for each cell. Used as column names in counts matrix.
- * features.tsv.gz: Gene symbols for each gene. Used as row names in counts matrix.
+In this lesson, we will read in a subset of data from the [Liver
+Atlas](https://livercellatlas.org/index.php), which is described in
+their [Cell
+paper](https://www.cell.com/cell/fulltext/S0092-8674(21)01481-1).
+Briefly, the authors performed scRNASeq on liver cells from mice and
+humans, identified cell types, clustered them, and made the data
+publicly available. We will be working with a subset of the *mouse*
+liver data. We split the data into two sets, one to use in the lesson
+and one for you to work with independently as a challenge.
+
+Before the workshop, you should have downloaded the data from Box and
+placed it in your `data` directory. Go to the [Setup](../setup) page for
+instructions on how to download the data files.
+
+Open a file browser and look in the `data` subdirectory
+`mouseStSt_invivo` and you should see three files. Each file ends with
+‘gz’, which indicates that it has been compressed (or ‘zipped’) using
+[gzip](https://www.gnu.org/software/gzip/). You **don’t** need to unzip
+them; the software that we use will uncompress the files as it reads
+them in. The files are:
+
+- matrix.mtx.gz: The feature-barcode matrix, i.e. a two-dimensional
+ matrix containing the counts for each gene in each cell.
+ - Genes are in rows and cells are in columns.
+ - This file is in a special sparse matrix format which reduces
+ disk space and memory usage.
+- barcodes.tsv.gz: DNA barcodes for each cell. Used as column names in
+ counts matrix.
+- features.tsv.gz: Gene symbols for each gene. Used as row names in
+ counts matrix.
> ## Challenge 1
-> 1). R has a function called [file.size](https://stat.ethz.ch/R-manual/R-devel/library/base/html/file.info.html).
-Look at the help for this function and get the size of each of the files in
-the `mouseStSt_invivo` directory. Which one is the largest?
+>
+> 1). R has a function called
+> [file.size](https://stat.ethz.ch/R-manual/R-devel/library/base/html/file.info.html).
+> Look at the help for this function and get the size of each of the
+> files in the `mouseStSt_invivo` directory. Which one is the largest?
>
> > ## Solution to Challenge 1
> >
-> > 1). `file.size(file.path(data_dir, 'mouseStSt_invivo', 'barcodes.tsv.gz'))`
-> > 584346 bytes
-> > `file.size(file.path(data_dir, 'mouseStSt_invivo', 'features.tsv.gz'))`
-> > 113733 bytes
-> > `file.size(file.path(data_dir, 'mouseStSt_invivo', 'matrix.mtx.gz'))`
-> > 603248953 bytes
-> > 'matrix.mtx.gz' is the largest file.
-> {: .solution}
-{: .challenge}
+> > 1).
+> > `file.size(file.path(data_dir, 'mouseStSt_invivo', 'barcodes.tsv.gz'))`
+> > 584346 bytes
+> > `file.size(file.path(data_dir, 'mouseStSt_invivo', 'features.tsv.gz'))`
+> > 113733 bytes
+> > `file.size(file.path(data_dir, 'mouseStSt_invivo', 'matrix.mtx.gz'))`
+> > 603248953 bytes
+> > ‘matrix.mtx.gz’ is the largest file.
+> > {: .solution} {: .challenge}
#### CellRanger Quality Control Report
CellRanger also produces a Quality Control (QC) report as an HTML
-document. It produces one report for each sample we run (each
-channel of the 10X chip). We do not have
-the QC report from the Liver Atlas study, but the figure below shows an
-example report. The report highlights three numbers:
-
-1. Estimated Number of Cells: This indicates the number of cells recovered
-in your experiment. As we previously discussed this will be less than the
-number of cells you loaded. The number of cells recovered
-will almost never be the exact number of cells you had hoped to recover,
-but we might like to see a number within approximately +/-20% of your goal.
-1. Mean Reads per Cell: This indicates the number of reads in each cell.
-This will be a function of how deeply you choose to sequence your library.
-1. Median Genes per Cell: This indicates the median number of genes detected
-in each cell. Note that this is much lower than in bulk RNA-Seq. This
-number will also be lower for single nucleus than for single cell
-RNA-Seq, and is also likely to vary between cell types.
+document. It produces one report for each sample we run (each channel of
+the 10X chip). We do not have the QC report from the Liver Atlas study,
+but the figure below shows an example report. The report highlights
+three numbers:
+
+1. Estimated Number of Cells: This indicates the number of cells
+ recovered in your experiment. As we previously discussed this will
+ be less than the number of cells you loaded. The number of cells
+ recovered will almost never be the exact number of cells you had
+ hoped to recover, but we might like to see a number within
+ approximately +/-20% of your goal.
+2. Mean Reads per Cell: This indicates the number of reads in each
+ cell. This will be a function of how deeply you choose to sequence
+ your library.
+3. Median Genes per Cell: This indicates the median number of genes
+ detected in each cell. Note that this is much lower than in bulk
+ RNA-Seq. This number will also be lower for single nucleus than for
+ single cell RNA-Seq, and is also likely to vary between cell types.
-When you run a sample that has a problem of some kind, the CellRanger
+When you run a sample that has a problem of some kind, the CellRanger
report might be able to detect something anomalous about your data and
present you with a warning. Here are two examples of reports with
warning flags highlighted.
-In the report below, CellRanger notes that a low fraction of reads are within
-cells. This might be caused by, for example, very high levels of ambient
-RNA.
+In the report below, CellRanger notes that a low fraction of reads are
+within cells. This might be caused by, for example, very high levels of
+ambient RNA.
-In the report below, CellRanger notes that a low fraction of reads are
-confidently mapped to the transcriptome, and a high fraction of reads map
-antisense to genes. Note that in this sample we are seeing only 249 genes
-per cell despite a mean of over 90,000 reads per cell. This likely indicates
-a poor quality library.
+In the report below, CellRanger notes that a low fraction of reads are
+confidently mapped to the transcriptome, and a high fraction of reads
+map antisense to genes. Note that in this sample we are seeing only 249
+genes per cell despite a mean of over 90,000 reads per cell. This likely
+indicates a poor quality library.
-
### Reading a CellRanger Gene Expression Count Matrix using Seurat
-In order to read these files into memory, we will use the
-[Seurat::Read10X()](https://satijalab.org/seurat/reference/read10x) function.
-This function searches for the three files mentioned above in the directory that
-you pass in. Once it verifies that all three files are present, it reads them
-in to create a counts matrix with genes in rows and cells in columns.
+In order to read these files into memory, we will use the
+[Seurat::Read10X()](https://satijalab.org/seurat/reference/read10x)
+function. This function searches for the three files mentioned above in
+the directory that you pass in. Once it verifies that all three files
+are present, it reads them in to create a counts matrix with genes in
+rows and cells in columns.
+ library(Seurat)
+ data_dir <- 'data'
-~~~
-library(Seurat)
-data_dir <- 'data'
-~~~
{: .language-r}
-We will use the `gene.column = 1` argument to tell Seurat to use the first
-column in 'features.tsv.gz' as the gene identifier.
+We will use the `gene.column = 1` argument to tell Seurat to use the
+first column in ‘features.tsv.gz’ as the gene identifier.
-Run the following command. This may take up to three minutes to complete.
+Run the following command. This may take up to three minutes to
+complete.
+ # uses the Seurat function Read10X()
+ counts <- Read10X(file.path(data_dir, 'mouseStSt_invivo'), gene.column = 1)
-~~~
-# uses the Seurat function Read10X()
-counts <- Read10X(file.path(data_dir, 'mouseStSt_invivo'), gene.column = 1)
-~~~
{: .language-r}
`counts` now contains the sequencing read counts for each gene and cell.
How many rows and columns are there in `counts`?
+ dim(counts)
-~~~
-dim(counts)
-~~~
{: .language-r}
+ [1] 31053 47743
-
-~~~
-[1] 31053 47743
-~~~
{: .output}
-In the `counts` matrix, genes are in rows and cells are in columns. Let's look
-at the first few gene names.
+In the `counts` matrix, genes are in rows and cells are in columns.
+Let’s look at the first few gene names.
+ head(rownames(counts), n = 10)
-~~~
-head(rownames(counts), n = 10)
-~~~
{: .language-r}
+ [1] "Xkr4" "Gm1992" "Gm37381" "Rp1" "Sox17" "Gm37323" "Mrpl15" "Lypla1"
+ [9] "Gm37988" "Tcea1"
-
-~~~
- [1] "Xkr4" "Gm1992" "Gm37381" "Rp1" "Sox17" "Gm37323" "Mrpl15"
- [8] "Lypla1" "Gm37988" "Tcea1"
-~~~
{: .output}
-As you can see, the gene names are gene symbols. There is some risk that these
-may not be unique. Let's check whether any of the gene symbols are duplicated.
-We will sum the number of duplicated gene symbols.
+As you can see, the gene names are gene symbols. There is some risk that
+these may not be unique. Let’s check whether any of the gene symbols are
+duplicated. We will sum the number of duplicated gene symbols.
+ sum(duplicated(rownames(counts)))
-~~~
-sum(duplicated(rownames(counts)))
-~~~
{: .language-r}
+ [1] 0
-
-~~~
-[1] 0
-~~~
{: .output}
-The sum equals zero, so there are no duplicated gene symbols, which is good.
-As it turns out, the reference genome/annotation files that are prepared for
-use by CellRanger have already been filtered to ensure no duplicated gene
-symbols.
+The sum equals zero, so there are no duplicated gene symbols, which is
+good. As it turns out, the reference genome/annotation files that are
+prepared for use by CellRanger have already been filtered to ensure no
+duplicated gene symbols.
-Let's look at the cell identifiers in the column names.
+Let’s look at the cell identifiers in the column names.
+ head(colnames(counts), n = 10)
-~~~
-head(colnames(counts), n = 10)
-~~~
{: .language-r}
+ [1] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2"
+ [4] "AATGGCTCAACGGTAG-2" "ACACTGAAGTGCAGGT-2" "ACCACAACAGTCTCTC-2"
+ [7] "ACGATGTAGTGGTTCT-2" "ACGCACGCACTAACCA-2" "ACTGCAATCAACTCTT-2"
+ [10] "ACTGCAATCGTCACCT-2"
-
-~~~
- [1] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2"
- [4] "AATGGCTCAACGGTAG-2" "ACACTGAAGTGCAGGT-2" "ACCACAACAGTCTCTC-2"
- [7] "ACGATGTAGTGGTTCT-2" "ACGCACGCACTAACCA-2" "ACTGCAATCAACTCTT-2"
-[10] "ACTGCAATCGTCACCT-2"
-~~~
{: .output}
-Each of these barcodes identifies one cell. They should all be unique. Once
-again, let's verify this.
+Each of these barcodes identifies one cell. They should all be unique.
+Once again, let’s verify this.
+ sum(duplicated(colnames(counts)))
-~~~
-sum(duplicated(colnames(counts)))
-~~~
{: .language-r}
+ [1] 0
-
-~~~
-[1] 0
-~~~
{: .output}
-The sum of duplicated values equals zero, so all of the barcodes are unique.
-The barcode sequence is the actual sequence of the oligonucleotide tag that
-was attached to the GEM (barcoded bead) that went into each droplet. In early
-versions of 10X technology there were approximately
-[750,000 barcodes](https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-)
-per run while in the current chemistry there are
-[>3 million barcodes](https://kb.10xgenomics.com/hc/en-us/articles/360031133451-Why-is-there-a-discrepancy-in-the-3M-february-2018-txt-barcode-whitelist-).
-CellRanger attempts to correct sequencing errors in the barcodes
-and uses a "whitelist" of known barcodes (in the 10X chemistry) to help.
-
+The sum of duplicated values equals zero, so all of the barcodes are
+unique. The barcode sequence is the actual sequence of the
+oligonucleotide tag that was attached to the GEM (barcoded bead) that
+went into each droplet. In early versions of 10X technology there were
+approximately [750,000
+barcodes](https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-)
+per run while in the current chemistry there are [>3 million
+barcodes](https://kb.10xgenomics.com/hc/en-us/articles/360031133451-Why-is-there-a-discrepancy-in-the-3M-february-2018-txt-barcode-whitelist-).
+CellRanger attempts to correct sequencing errors in the barcodes and
+uses a “whitelist” of known barcodes (in the 10X chemistry) to help.
-Next, let's look at the values in `counts`.
+Next, let’s look at the values in `counts`.
+ counts[1:10, 1:20]
-~~~
-counts[1:10, 1:20]
-~~~
{: .language-r}
+ 10 x 20 sparse Matrix of class "dgCMatrix"
-
-~~~
-10 x 20 sparse Matrix of class "dgCMatrix"
-~~~
{: .output}
+ [[ suppressing 20 column names 'AAACGAATCCACTTCG-2', 'AAAGGTACAGGAAGTC-2', 'AACTTCTGTCATGGCC-2' ... ]]
-
-~~~
- [[ suppressing 20 column names 'AAACGAATCCACTTCG-2', 'AAAGGTACAGGAAGTC-2', 'AACTTCTGTCATGGCC-2' ... ]]
-~~~
{: .output}
+
+ Xkr4 . . . . . . . . . . . . . . . . . . . .
+ Gm1992 . . . . . . . . . . . . . . . . . . . .
+ Gm37381 . . . . . . . . . . . . . . . . . . . .
+ Rp1 . . . . . . . . . . . . . . . . . . . .
+ Sox17 . . 2 4 . . . 1 . 1 1 . . 2 . . 1 8 1 .
+ Gm37323 . . . . . . . . . . . . . . . . . . . .
+ Mrpl15 . . . 1 1 . . . 1 . 2 . . . . 1 . 1 1 .
+ Lypla1 . . 2 1 . 1 1 . . . 1 1 2 . 1 1 1 . . .
+ Gm37988 . . . . . . . . . . . . . . . . . . . .
+ Tcea1 . . 2 . 2 2 . . 1 2 . 2 2 . . 2 1 1 2 .
-
-~~~
-
-Xkr4 . . . . . . . . . . . . . . . . . . . .
-Gm1992 . . . . . . . . . . . . . . . . . . . .
-Gm37381 . . . . . . . . . . . . . . . . . . . .
-Rp1 . . . . . . . . . . . . . . . . . . . .
-Sox17 . . 2 4 . . . 1 . 1 1 . . 2 . . 1 8 1 .
-Gm37323 . . . . . . . . . . . . . . . . . . . .
-Mrpl15 . . . 1 1 . . . 1 . 2 . . . . 1 . 1 1 .
-Lypla1 . . 2 1 . 1 1 . . . 1 1 2 . 1 1 1 . . .
-Gm37988 . . . . . . . . . . . . . . . . . . . .
-Tcea1 . . 2 . 2 2 . . 1 2 . 2 2 . . 2 1 1 2 .
-~~~
{: .output}
-We can see the gene symbols in rows along the left. The barcodes are not shown
-to make the values easier to read. Each of the periods represents a zero. The
-'1' values represent a single read for a gene in one cell.
+We can see the gene symbols in rows along the left. The barcodes are not
+shown to make the values easier to read. Each of the periods represents
+a zero. The ‘1’ values represent a single read for a gene in one cell.
-Although `counts` looks like a matrix and you can use many matrix functions on
-it, `counts` is actually a *different* type of object. In scRNASeq, the read
-depth in each cell is quite low. So you many only get counts for a small number
-of genes in each cell. The `counts` matrix has 31053 rows and
-47743 columns, and includes 1.4825634 × 109
-entries. However, most of these entries
-(92.4930544%) are
-zeros because every gene is not detected in every cell. It would be wasteful
-to store all of these zeros in memory. It would also make it difficult to
-store all of the data in memory. So `counts` is a 'sparse matrix', which only
-stores the positions of non-zero values in memory.
+Although `counts` looks like a matrix and you can use many matrix
+functions on it, `counts` is actually a *different* type of object. In
+scRNASeq, the read depth in each cell is quite low. So you many only get
+counts for a small number of genes in each cell. The `counts` matrix has
+31053 rows and 47743 columns, and includes 1.4825634^{9} entries.
+However, most of these entries (92.4930544%) are zeros because every
+gene is not detected in every cell. It would be wasteful to store all of
+these zeros in memory. It would also make it difficult to store all of
+the data in memory. So `counts` is a ‘sparse matrix’, which only stores
+the positions of non-zero values in memory.
-Look at the structure of the `counts` matrix using [str](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/str.html).
+Look at the structure of the `counts` matrix using
+[str](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/str.html).
+ str(counts)
-~~~
-str(counts)
-~~~
{: .language-r}
+ Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
+ ..@ i : int [1:111295227] 15 19 36 38 40 61 66 67 70 93 ...
+ ..@ p : int [1:47744] 0 3264 6449 9729 13446 16990 20054 23142 26419 29563 ...
+ ..@ Dim : int [1:2] 31053 47743
+ ..@ Dimnames:List of 2
+ .. ..$ : chr [1:31053] "Xkr4" "Gm1992" "Gm37381" "Rp1" ...
+ .. ..$ : chr [1:47743] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
+ ..@ x : num [1:111295227] 1 1 1 2 1 6 1 1 2 1 ...
+ ..@ factors : list()
-
-~~~
-Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
- ..@ i : int [1:111295227] 15 19 36 38 40 61 66 67 70 93 ...
- ..@ p : int [1:47744] 0 3264 6449 9729 13446 16990 20054 23142 26419 29563 ...
- ..@ Dim : int [1:2] 31053 47743
- ..@ Dimnames:List of 2
- .. ..$ : chr [1:31053] "Xkr4" "Gm1992" "Gm37381" "Rp1" ...
- .. ..$ : chr [1:47743] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
- ..@ x : num [1:111295227] 1 1 1 2 1 6 1 1 2 1 ...
- ..@ factors : list()
-~~~
{: .output}
-We can see that the formal class name is a "dgCMatrix". There are two long
-vectors of integers which encode the positions of non-zero values. The gene
-names and cell barcodes are stored in character vectors and the non-zero values
-are an integer vector. This class saves space by not allocating memory to store
-all of the zero values.
+We can see that the formal class name is a “dgCMatrix”. There are two
+long vectors of integers which encode the positions of non-zero values.
+The gene names and cell barcodes are stored in character vectors and the
+non-zero values are an integer vector. This class saves space by not
+allocating memory to store all of the zero values.
-Let's look at small portion of `counts`. We will create a tile plot indicating
-which values are non-zero for the first 100 cells and genes in rows 400 to 600.
-For historical reasons, R plots the rows along the X-axis and columns along the
-Y-axis. We will transpose the matrix so that genes are on the Y-axis, which
-reflects the way in which we normally look at this matrix.
+Let’s look at small portion of `counts`. We will create a tile plot
+indicating which values are non-zero for the first 100 cells and genes
+in rows 400 to 600. For historical reasons, R plots the rows along the
+X-axis and columns along the Y-axis. We will transpose the matrix so
+that genes are on the Y-axis, which reflects the way in which we
+normally look at this matrix.
+ image(1:100, 400:600, t(as.matrix(counts[400:600,1:100]) > 0),
+ xlab = 'Cells', ylab = 'Genes')
-~~~
-image(1:100, 400:600, t(as.matrix(counts[400:600,1:100]) > 0),
- xlab = 'Cells', ylab = 'Genes')
-~~~
{: .language-r}
-
-
-
plot of chunk counts_image
-
+
-In the tile plot above, each row represents one gene and each column represents
-one cell. Red indicates non-zero values and yellow indicates zero values. As
-you can see, most of the matrix consists of zeros (yellow tiles) and hence is
-called 'sparse'. You can also see that some genes are expressed in most cells,
-indicated by the horizontal red lines, and that some genes are expressed in
-very few cells.
+In the tile plot above, each row represents one gene and each column
+represents one cell. Red indicates non-zero values and yellow indicates
+zero values. As you can see, most of the matrix consists of zeros
+(yellow tiles) and hence is called ‘sparse’. You can also see that some
+genes are expressed in most cells, indicated by the horizontal red
+lines, and that some genes are expressed in very few cells.
-What proportion of genes have zero counts in all samples?
+What proportion of genes have zero counts in all samples?
+ gene_sums <- data.frame(gene_id = rownames(counts),
+ sums = Matrix::rowSums(counts))
+ sum(gene_sums$sums == 0)
-~~~
-gene_sums <- data.frame(gene_id = rownames(counts),
- sums = Matrix::rowSums(counts))
-sum(gene_sums$sums == 0)
-~~~
{: .language-r}
+ [1] 7322
-
-~~~
-[1] 7322
-~~~
{: .output}
-We can see that 7322 (24%)
-genes have no reads at all associated with them. In the next lesson, we will
-remove genes that have no counts in any cells.
+We can see that 7322 (24%) genes have no reads at all associated with
+them. In the next lesson, we will remove genes that have no counts in
+any cells.
-Next, let's look at the number of counts in each cell.
+Next, let’s look at the number of counts in each cell.
+ hist(Matrix::colSums(counts))
-~~~
-hist(Matrix::colSums(counts))
-~~~
{: .language-r}
-
+ Matrix::colSums(counts) %>%
+ enframe() %>%
+ ggplot(aes(value)) +
+ geom_histogram(bins = 30) +
+ scale_x_log10() +
+ theme_bw(base_size = 16)
-The range of counts covers several orders of magnitude, from
-500 to 3.32592 × 105. We will need
-to normalize for this large difference in sequencing depth,
-which we will cover in the next lesson.
+{: .language-r}
+
+
+The range of counts covers several orders of magnitude, from 500 to
+3.32592^{5}. We will need to normalize for this large difference in
+sequencing depth, which we will cover in the next lesson.
### Sample Metadata
-Sample metadata refers to information about your samples that is not the
-"data", i.e. the gene counts. This might include information such as sex,
-tissue, or treatment. In the case of the liver atlas data, the authors provided
-a metadata file for their samples.
+Sample metadata refers to information about your samples that is not the
+“data”, i.e. the gene counts. This might include information such as
+sex, tissue, or treatment. In the case of the liver atlas data, the
+authors provided a metadata file for their samples.
-The sample metadata file is a comma-separated variable (CSV) file, We will read
-it in using the readr
-[read_csv](https://readr.tidyverse.org/reference/read_delim.html) function.
+The sample metadata file is a comma-separated variable (CSV) file, We
+will read it in using the readr
+[read\_csv](https://readr.tidyverse.org/reference/read_delim.html)
+function.
+ metadata <- read_csv(file.path(data_dir, 'mouseStSt_invivo', 'annot_metadata_first.csv'))
-~~~
-metadata <- read_csv(file.path(data_dir, 'mouseStSt_invivo', 'annot_metadata_first.csv'))
-~~~
{: .language-r}
+ Rows: 47743 Columns: 4
+ ── Column specification ────────────────────────────────────────────────────────────
+ Delimiter: ","
+ chr (4): sample, cell, digest, typeSample
+ ℹ Use `spec()` to retrieve the full column specification for this data.
+ ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
-~~~
-Rows: 47743 Columns: 4
-── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-Delimiter: ","
-chr (4): sample, cell, digest, typeSample
-
-ℹ Use `spec()` to retrieve the full column specification for this data.
-ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
-~~~
{: .output}
-Let's look at the top of the metadata.
+Let’s look at the top of the metadata.
+ head(metadata)
-~~~
-head(metadata)
-~~~
{: .language-r}
+ # A tibble: 6 × 4
+ sample cell digest typeSample
+
+ 1 CS48 AAACGAATCCACTTCG-2 inVivo scRnaSeq
+ 2 CS48 AAAGGTACAGGAAGTC-2 inVivo scRnaSeq
+ 3 CS48 AACTTCTGTCATGGCC-2 inVivo scRnaSeq
+ 4 CS48 AATGGCTCAACGGTAG-2 inVivo scRnaSeq
+ 5 CS48 ACACTGAAGTGCAGGT-2 inVivo scRnaSeq
+ 6 CS48 ACCACAACAGTCTCTC-2 inVivo scRnaSeq
-
-~~~
-# A tibble: 6 × 4
- sample cell digest typeSample
-
-1 CS48 AAACGAATCCACTTCG-2 inVivo scRnaSeq
-2 CS48 AAAGGTACAGGAAGTC-2 inVivo scRnaSeq
-3 CS48 AACTTCTGTCATGGCC-2 inVivo scRnaSeq
-4 CS48 AATGGCTCAACGGTAG-2 inVivo scRnaSeq
-5 CS48 ACACTGAAGTGCAGGT-2 inVivo scRnaSeq
-6 CS48 ACCACAACAGTCTCTC-2 inVivo scRnaSeq
-~~~
{: .output}
In the table above, you can see that there are four columns:
-1. sample: mouse identifier from which cell was derived;
-1. cell: the DNA bar code used to identify the cell;
-1. digest: cells for this liver atlas were harvested using either an *in vivo*
-or an *ex vivo* procedure. In this subset of the data we are looking only
-at *in vivo* samples;
-1. typeSample: the type of library preparation protocol, either single cell
-RNA-seq (scRnaSeq) or nuclear sequencing (nucSeq). In this subset of the data
-we are looking only at scRnaSeq samples.
+1. sample: mouse identifier from which cell was derived;
+2. cell: the DNA bar code used to identify the cell;
+3. digest: cells for this liver atlas were harvested using either an
+ *in vivo* or an *ex vivo* procedure. In this subset of the data we
+ are looking only at *in vivo* samples;
+4. typeSample: the type of library preparation protocol, either single
+ cell RNA-seq (scRnaSeq) or nuclear sequencing (nucSeq). In this
+ subset of the data we are looking only at scRnaSeq samples.
-Let's confirm that we are only looking at scRnaSeq samples from *in vivo*
-digest cells:
+Let’s confirm that we are only looking at scRnaSeq samples from *in
+vivo* digest cells:
+ dplyr::count(metadata, digest, typeSample)
-~~~
-dplyr::count(metadata, digest, typeSample)
-~~~
{: .language-r}
+ # A tibble: 1 × 3
+ digest typeSample n
+
+ 1 inVivo scRnaSeq 47743
-
-~~~
-# A tibble: 1 × 3
- digest typeSample n
-
-1 inVivo scRnaSeq 47743
-~~~
{: .output}
-We're going to explore the data further using a series of Challenges.
-You will be asked to look at the contents of some of the columns to see
-how the data are
-distributed.
+We’re going to explore the data further using a series of Challenges.
+You will be asked to look at the contents of some of the columns to see
+how the data are distributed.
> ## Challenge 2
-> How many mice were used to produce this data? The "sample" column contains
-the mouse identifier for each cell.
+>
+> How many mice were used to produce this data? The “sample” column
+> contains the mouse identifier for each cell.
>
> > ## Solution to Challenge 2
> >
-> > count(metadata, sample) %>% summarize(total = n())
-> {: .solution}
-{: .challenge}
-
+> > count(metadata, sample) %>% summarize(total = n())
+> > {: .solution} {: .challenge}
> ## Challenge 3
-> How many cells are there from each mouse?
+>
+> How many cells are there from each mouse?
>
> > ## Solution to Challenge 3
> >
-> > count(metadata, sample)
-> {: .solution}
-{: .challenge}
+> > count(metadata, sample) {: .solution} {: .challenge}
-
-In this workshop, we will attempt to reproduce some of the results of the
-[Liver Atlas](https://livercellatlas.org/index.php) using Seurat. We will
-analyze the *in-vivo* single **cell** RNA-Seq
-together.
+In this workshop, we will attempt to reproduce some of the results of
+the [Liver Atlas](https://livercellatlas.org/index.php) using Seurat. We
+will analyze the *in-vivo* single **cell** RNA-Seq together.
### Save Data for Next Lesson
-We will use the *in-vivo* data in the next lesson.
-If you plan to keep your RStudio open, we will simply continue to the next
-lesson. If you wanted to save the data you could execute a command like:
+We will use the *in-vivo* data in the next lesson. If you plan to keep
+your RStudio open, we will simply continue to the next lesson. If you
+wanted to save the data you could execute a command like:
+ save(counts, metadata, file = file.path(data_dir, 'lesson03.Rdata'))
-~~~
-save(counts, metadata, file = file.path(data_dir, 'lesson03.Rdata'))
-~~~
{: .language-r}
> ## Challenge 5
-> In the lesson above, you read in the scRNASeq data. There is
-another dataset which was created using an *ex vivo* digest in the
-`mouseStSt_exvivo`
-directory. Delete the `counts` and `metadata` objects from your environment.
-Then read in the counts and metadata from the `mouseStSt_exvivo`
-directory and save them to a file called 'lesson03_challenge.Rdata'.
+>
+> In the lesson above, you read in the scRNASeq data. There is another
+> dataset which was created using an *ex vivo* digest in the
+> `mouseStSt_exvivo` directory. Delete the `counts` and `metadata`
+> objects from your environment. Then read in the counts and metadata
+> from the `mouseStSt_exvivo` directory and save them to a file called
+> ‘lesson03\_challenge.Rdata’.
>
> > ## Solution to Challenge 5
> >
> > `# Remove exising counts and metadata.`
-> > `rm(counts, metadata)`
-> > `# Read in new counts.`
+> > `rm(counts, metadata)` `# Read in new counts.`
> > `counts <- Seurat::Read10X(file.path(data_dir, 'mouseStSt_exvivo'), gene.column = 1)`
> > `# Read in new metadata.`
> > `metadata <- read_csv(file.path(data_dir, 'mouseStSt_exvivo', 'annot_metadata.csv'))`
> > `# Save data for next lesson.`
> > `save(counts, metadata, file = file.path(data_dir, 'lesson03_challenge.Rdata'))`
-> {: .solution}
-{: .challenge}
+> > {: .solution} {: .challenge}
### Session Info
+ sessionInfo()
-~~~
-sessionInfo()
-~~~
{: .language-r}
+ R version 4.4.1 (2024-06-14)
+ Platform: x86_64-apple-darwin20
+ Running under: macOS 15.0.1
+
+ Matrix products: default
+ BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
+ LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
+
+ locale:
+ [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+
+ time zone: America/New_York
+ tzcode source: internal
+
+ attached base packages:
+ [1] stats graphics grDevices utils datasets methods base
+
+ other attached packages:
+ [1] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4 lubridate_1.9.3
+ [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
+ [9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
+ [13] tidyverse_2.0.0 knitr_1.48 rmarkdown_2.28
+
+ loaded via a namespace (and not attached):
+ [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.9
+ [4] magrittr_2.0.3 spatstat.utils_3.1-0 farver_2.1.2
+ [7] vctrs_0.6.5 ROCR_1.0-11 spatstat.explore_3.3-2
+ [10] htmltools_0.5.8.1 sctransform_0.4.1 parallelly_1.38.0
+ [13] KernSmooth_2.23-24 htmlwidgets_1.6.4 ica_1.0-3
+ [16] plyr_1.8.9 plotly_4.10.4 zoo_1.8-12
+ [19] igraph_2.0.3 mime_0.12 lifecycle_1.0.4
+ [22] pkgconfig_2.0.3 Matrix_1.7-0 R6_2.5.1
+ [25] fastmap_1.2.0 fitdistrplus_1.2-1 future_1.34.0
+ [28] shiny_1.9.1 digest_0.6.37 colorspace_2.1-1
+ [31] patchwork_1.3.0 tensor_1.5 RSpectra_0.16-2
+ [34] irlba_2.3.5.1 labeling_0.4.3 progressr_0.14.0
+ [37] fansi_1.0.6 spatstat.sparse_3.1-0 timechange_0.3.0
+ [40] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
+ [43] compiler_4.4.1 bit64_4.5.2 withr_3.0.1
+ [46] fastDummies_1.7.4 highr_0.11 R.utils_2.12.3
+ [49] MASS_7.3-60.2 tools_4.4.1 lmtest_0.9-40
+ [52] httpuv_1.6.15 future.apply_1.11.2 goftest_1.2-3
+ [55] R.oo_1.26.0 glue_1.8.0 nlme_3.1-164
+ [58] promises_1.3.0 grid_4.4.1 Rtsne_0.17
+ [61] cluster_2.1.6 reshape2_1.4.4 generics_0.1.3
+ [64] gtable_0.3.5 spatstat.data_3.1-2 tzdb_0.4.0
+ [67] R.methodsS3_1.8.2 data.table_1.16.2 hms_1.1.3
+ [70] utf8_1.2.4 spatstat.geom_3.3-3 RcppAnnoy_0.0.22
+ [73] ggrepel_0.9.6 RANN_2.6.2 pillar_1.9.0
+ [76] vroom_1.6.5 spam_2.11-0 RcppHNSW_0.6.0
+ [79] later_1.3.2 splines_4.4.1 lattice_0.22-6
+ [82] bit_4.5.0 survival_3.6-4 deldir_2.0-4
+ [85] tidyselect_1.2.1 miniUI_0.1.1.1 pbapply_1.7-2
+ [88] gridExtra_2.3 scattermore_1.2 xfun_0.48
+ [91] matrixStats_1.4.1 stringi_1.8.4 lazyeval_0.2.2
+ [94] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
+ [97] cli_3.6.3 uwot_0.2.2 xtable_1.8-4
+ [100] reticulate_1.39.0 munsell_0.5.1 Rcpp_1.0.13
+ [103] globals_0.16.3 spatstat.random_3.3-2 png_0.1-8
+ [106] spatstat.univar_3.0-1 parallel_4.4.1 dotCall64_1.2
+ [109] listenv_0.9.1 viridisLite_0.4.2 scales_1.3.0
+ [112] ggridges_0.5.6 crayon_1.5.3 leiden_0.4.3.1
+ [115] rlang_1.1.4 cowplot_1.1.3
-
-~~~
-R version 4.4.0 (2024-04-24 ucrt)
-Platform: x86_64-w64-mingw32/x64
-Running under: Windows 10 x64 (build 19045)
-
-Matrix products: default
-
-
-locale:
-[1] LC_COLLATE=English_United States.utf8
-[2] LC_CTYPE=English_United States.utf8
-[3] LC_MONETARY=English_United States.utf8
-[4] LC_NUMERIC=C
-[5] LC_TIME=English_United States.utf8
-
-time zone: America/New_York
-tzcode source: internal
-
-attached base packages:
-[1] stats graphics grDevices utils datasets methods base
-
-other attached packages:
- [1] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4 lubridate_1.9.3
- [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
- [9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
-[13] tidyverse_2.0.0 knitr_1.48
-
-loaded via a namespace (and not attached):
- [1] deldir_2.0-4 pbapply_1.7-2 gridExtra_2.3
- [4] rlang_1.1.4 magrittr_2.0.3 RcppAnnoy_0.0.22
- [7] spatstat.geom_3.3-3 matrixStats_1.4.1 ggridges_0.5.6
- [10] compiler_4.4.0 png_0.1-8 vctrs_0.6.5
- [13] reshape2_1.4.4 crayon_1.5.3 pkgconfig_2.0.3
- [16] fastmap_1.2.0 labeling_0.4.3 utf8_1.2.4
- [19] promises_1.3.0 tzdb_0.4.0 bit_4.5.0
- [22] xfun_0.44 jsonlite_1.8.9 goftest_1.2-3
- [25] highr_0.11 later_1.3.2 spatstat.utils_3.1-0
- [28] irlba_2.3.5.1 parallel_4.4.0 cluster_2.1.6
- [31] R6_2.5.1 ica_1.0-3 spatstat.data_3.1-2
- [34] stringi_1.8.4 RColorBrewer_1.1-3 reticulate_1.39.0
- [37] spatstat.univar_3.0-1 parallelly_1.38.0 lmtest_0.9-40
- [40] scattermore_1.2 Rcpp_1.0.13 tensor_1.5
- [43] future.apply_1.11.2 zoo_1.8-12 R.utils_2.12.3
- [46] sctransform_0.4.1 httpuv_1.6.15 Matrix_1.7-0
- [49] splines_4.4.0 igraph_2.0.3 timechange_0.3.0
- [52] tidyselect_1.2.1 abind_1.4-8 rstudioapi_0.16.0
- [55] spatstat.random_3.3-2 codetools_0.2-20 miniUI_0.1.1.1
- [58] spatstat.explore_3.3-2 listenv_0.9.1 lattice_0.22-6
- [61] plyr_1.8.9 shiny_1.9.1 withr_3.0.1
- [64] ROCR_1.0-11 evaluate_1.0.1 Rtsne_0.17
- [67] future_1.34.0 fastDummies_1.7.4 survival_3.7-0
- [70] polyclip_1.10-7 fitdistrplus_1.2-1 pillar_1.9.0
- [73] KernSmooth_2.23-24 plotly_4.10.4 generics_0.1.3
- [76] vroom_1.6.5 RcppHNSW_0.6.0 hms_1.1.3
- [79] munsell_0.5.1 scales_1.3.0 globals_0.16.3
- [82] xtable_1.8-4 glue_1.8.0 lazyeval_0.2.2
- [85] tools_4.4.0 data.table_1.16.2 RSpectra_0.16-2
- [88] RANN_2.6.2 leiden_0.4.3.1 dotCall64_1.2
- [91] cowplot_1.1.3 grid_4.4.0 colorspace_2.1-1
- [94] nlme_3.1-165 patchwork_1.3.0 cli_3.6.3
- [97] spatstat.sparse_3.1-0 spam_2.11-0 fansi_1.0.6
-[100] viridisLite_0.4.2 uwot_0.2.2 gtable_0.3.5
-[103] R.methodsS3_1.8.2 digest_0.6.35 progressr_0.14.0
-[106] ggrepel_0.9.6 htmlwidgets_1.6.4 farver_2.1.2
-[109] R.oo_1.26.0 htmltools_0.5.8.1 lifecycle_1.0.4
-[112] httr_1.4.7 mime_0.12 bit64_4.5.2
-[115] MASS_7.3-61
-~~~
{: .output}
-
diff --git a/_episodes/04-Quality-Control.md b/_episodes/04-Quality-Control.md
index ae67f0a..6ca5905 100644
--- a/_episodes/04-Quality-Control.md
+++ b/_episodes/04-Quality-Control.md
@@ -1,282 +1,227 @@
----
-# Please do not edit this file directly; it is auto generated.
-# Instead, please edit 04-Quality-Control.md in _episodes_rmd/
-source: Rmd
-title: "Quality Control of scRNA-Seq Data"
-teaching: 90
-exercises: 30
-questions:
-- "How do I determine if my single cell RNA-seq experiment data is high quality?"
-- "What are the common quality control metrics that I should check in my scRNA-seq data?"
-objectives:
-- "Critically examine scRNA-seq data to identify potential technical issues."
-- "Apply filters to remove cells that are largely poor quality/dead cells."
-- "Understand the implications of different filtering steps on the data."
-keypoints:
-- "It is essential to filter based on criteria including mitochondrial gene expression and number of genes expressed in a cell."
-- "Determining your filtering thresholds should be done separately for each experiment, and these values can vary dramatically in different settings."
----
-
-
-
-
-~~~
-suppressPackageStartupMessages(library(tidyverse))
-suppressPackageStartupMessages(library(Matrix))
-suppressPackageStartupMessages(library(SingleCellExperiment))
-suppressPackageStartupMessages(library(scds))
-suppressPackageStartupMessages(library(Seurat))
-~~~
-{: .language-r}
-
-
+ suppressPackageStartupMessages(library(tidyverse))
+ suppressPackageStartupMessages(library(Matrix))
+ suppressPackageStartupMessages(library(SingleCellExperiment))
+ suppressPackageStartupMessages(library(scds))
+ suppressPackageStartupMessages(library(Seurat))
+{: .language-r}
## Quality control in scRNA-seq
-There are many technical reasons why cells produced by an scRNA-seq protocol
-might not be of high quality. The goal of the quality control steps are to
-assure that only single, live cells are included in the final data set.
-Ultimately some multiplets and poor quality cells will likely escape your
-detection and make it into your final dataset; however, these quality
-control steps aim to reduce the chance of this happening.
-Failure to undertake quality control is likely to adversely impact cell type
-identification, clustering, and interpretation of the data.
+There are many technical reasons why cells produced by an scRNA-seq
+protocol might not be of high quality. The goal of the quality control
+steps are to assure that only single, live cells are included in the
+final data set. Ultimately some multiplets and poor quality cells will
+likely escape your detection and make it into your final dataset;
+however, these quality control steps aim to reduce the chance of this
+happening. Failure to undertake quality control is likely to adversely
+impact cell type identification, clustering, and interpretation of the
+data.
Some technical questions that you might ask include:
-1. Why is mitochondrial gene expression high in some cells?
-1. What is a unique molecular identifier (UMI), and why do we check numbers of UMI?
-1. What happens to make gene counts low in a cell?
+1. Why is mitochondrial gene expression high in some cells?
+2. What is a unique molecular identifier (UMI), and why do we check
+ numbers of UMI?
+3. What happens to make gene counts low in a cell?
+## Doublet detection
-## Doublet detection
-
-We will begin by discussing doublets. We have already discussed the
+We will begin by discussing doublets. We have already discussed the
concept of the doublet. Now we will try running one computational
doublet-detection approach and track predictions of doublets.
We will use the scds method. scds contains two methods for predicting
-doublets. Method cxds is based on co-expression of gene pairs, while
-method bcds uses the full count information and a binary classification
-approach using in silico doublets.
-Method cxds_bcds_hybrid combines both approaches. We will use the combined
-approach. See
-[Bais and Kostka 2020](https://academic.oup.com/bioinformatics/article/36/4/1150/5566507)
+doublets. Method cxds is based on co-expression of gene pairs, while
+method bcds uses the full count information and a binary classification
+approach using in silico doublets. Method cxds\_bcds\_hybrid combines
+both approaches. We will use the combined approach. See [Bais and Kostka
+2020](https://academic.oup.com/bioinformatics/article/36/4/1150/5566507)
for more details.
-Because this doublet prediction method takes some time and is
-a bit memory-intensive, we will run it only on cells from one mouse.
-We will return to the doublet predictions later in this lesson.
-
+Because this doublet prediction method takes some time and is a bit
+memory-intensive, we will run it only on cells from one mouse. We will
+return to the doublet predictions later in this lesson.
+ cell_ids <- filter(metadata, sample == 'CS52') %>% pull(cell)
+ sce <- SingleCellExperiment(list(counts = counts[, cell_ids]))
+ sce <- cxds_bcds_hybrid(sce)
+ doublet_preds <- colData(sce)
-~~~
-cell_ids <- filter(metadata, sample == 'CS52') %>% pull(cell)
-sce <- SingleCellExperiment(list(counts = counts[, cell_ids]))
-sce <- cxds_bcds_hybrid(sce)
-doublet_preds <- colData(sce)
-~~~
{: .language-r}
+ used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
+ Ncells 8641273 461.5 12568895 671.3 NA 12568895 671.3
+ Vcells 182791127 1394.6 746650068 5696.5 16384 1137399190 8677.7
-~~~
- used (Mb) gc trigger (Mb) max used (Mb)
-Ncells 8111351 433.2 14166661 756.6 11252338 601.0
-Vcells 181808806 1387.1 436713384 3331.9 436709910 3331.9
-~~~
{: .output}
-
## High-level overview of quality control and filtering
-First we will walk through some of the typical quantities one
-examines when conducting quality control of scRNA-Seq data.
-
+First we will walk through some of the typical quantities one examines
+when conducting quality control of scRNA-Seq data.
### Filtering Genes by Counts
-As mentioned in an earlier lesson, the counts matrix is sparse and may contain
-rows (genes) or columns (cells) with low overall counts. In the case of genes,
-we likely wish to exclude genes with zeros counts in most cells. Let's see how
-many genes have zeros counts across all cells. Note that the
-[Matrix package](https://cran.r-project.org/web/packages/Matrix/index.html)
-has a special implementation of
-[rowSums](https://rdrr.io/rforge/Matrix/man/colSums.html) which works with
-sparse matrices.
+As mentioned in an earlier lesson, the counts matrix is sparse and may
+contain rows (genes) or columns (cells) with low overall counts. In the
+case of genes, we likely wish to exclude genes with zeros counts in most
+cells. Let’s see how many genes have zeros counts across all cells. Note
+that the [Matrix
+package](https://cran.r-project.org/web/packages/Matrix/index.html) has
+a special implementation of
+[rowSums](https://rdrr.io/rforge/Matrix/man/colSums.html) which works
+with sparse matrices.
+ gene_counts <- Matrix::rowSums(counts)
+ sum(gene_counts == 0)
-~~~
-gene_counts <- Matrix::rowSums(counts)
-sum(gene_counts == 0)
-~~~
{: .language-r}
+ [1] 7322
-
-~~~
-[1] 7322
-~~~
{: .output}
-Of the 31053 genes, 7322 have zero counts across
-all cells. These genes do not inform us about the mean, variance, or covariance
-of any of the other genes and we will remove them before proceeding with
-further analysis.
+Of the 31053 genes, 7322 have zero counts across all cells. These genes
+do not inform us about the mean, variance, or covariance of any of the
+other genes and we will remove them before proceeding with further
+analysis.
+ counts <- counts[gene_counts > 0,]
-~~~
-counts <- counts[gene_counts > 0,]
-~~~
{: .language-r}
This leaves 23731 genes in the counts matrix.
-We could also set some other threshold for filtering genes. Perhaps we should
-look at the number of genes that have different numbers of counts. We will use
-a histogram to look at the distribution of overall gene counts. Note that, since
-we just resized the counts matrix, we need to recalculate `gene_counts`.
-
-We will count the number of cells in which each gene was detected. Because
-`counts` is a sparse matrix, we have to be careful not to perform operations
-that would convert the entire matrix into a non-sparse matrix. This might
-happen if we wrote code like:
-
-```{}
-gene_counts <- rowSums(counts > 0)
-```
-
-The expression `counts > 0` would create a logical matrix that takes up much
-more memory than the sparse matrix. We might be tempted to try
-`rowSums(counts == 0)`, but this would also result in a non-sparse matrix
-because most of the values would be `TRUE`. However, if we evaluate
-`rowSums(counts != 0)`, then most of the values would be `FALSE`, which can be
-stored as 0 and so the matrix would still be sparse. The `Matrix` package has
-an implementation of 'rowSums()' that is efficient, but you may have to specify
-that you want to used the `Matrix` version of 'rowSums()' explicitly.
-
-The number of cells in which each gene is detected spans several
-orders of magnitude and this makes it difficult to interpret the plot. Some
-genes are detected in all cells while others are detected in only one cell.
-Let's zoom in on the part with lower gene counts.
-
-
-~~~
-gene_counts <- tibble(counts = Matrix::rowSums(counts > 0))
-
-gene_counts %>%
- dplyr::count(counts) %>%
- ggplot(aes(counts, n)) +
- geom_col() +
- labs(title = 'Histogram of Number of Cells in which Gene was Detected',
- x = 'Number of Genes',
- y = 'Number of Cells in which Gene was Detected') +
- lims(x = c(0, 50)) +
- theme_bw(base_size = 14) +
- annotate('text', x = 2, y = 1596, hjust = 0,
- label = str_c(sum(gene_counts == 1), ' genes were detected in only one cell')) +
- annotate('text', x = 3, y = 924, hjust = 0,
- label = str_c(sum(gene_counts == 2), ' genes were detected in two cells'))
-~~~
-{: .language-r}
+We could also set some other threshold for filtering genes. Perhaps we
+should look at the number of genes that have different numbers of
+counts. We will use a histogram to look at the distribution of overall
+gene counts. Note that, since we just resized the counts matrix, we need
+to recalculate `gene_counts`.
+
+We will count the number of cells in which each gene was detected.
+Because `counts` is a sparse matrix, we have to be careful not to
+perform operations that would convert the entire matrix into a
+non-sparse matrix. This might happen if we wrote code like:
+
+ gene_counts <- rowSums(counts > 0)
+
+The expression `counts > 0` would create a logical matrix that takes up
+much more memory than the sparse matrix. We might be tempted to try
+`rowSums(counts == 0)`, but this would also result in a non-sparse
+matrix because most of the values would be `TRUE`. However, if we
+evaluate `rowSums(counts != 0)`, then most of the values would be
+`FALSE`, which can be stored as 0 and so the matrix would still be
+sparse. The `Matrix` package has an implementation of ‘rowSums()’ that
+is efficient, but you may have to specify that you want to used the
+`Matrix` version of ‘rowSums()’ explicitly.
+
+The number of cells in which each gene is detected spans several orders
+of magnitude and this makes it difficult to interpret the plot. Some
+genes are detected in all cells while others are detected in only one
+cell. Let’s zoom in on the part with lower gene counts.
+
+ gene_counts <- tibble(counts = Matrix::rowSums(counts > 0))
+
+ gene_counts %>%
+ dplyr::count(counts) %>%
+ ggplot(aes(counts, n)) +
+ geom_col() +
+ labs(title = 'Histogram of Number of Cells in which Gene was Detected',
+ x = 'Number of Genes',
+ y = 'Number of Cells in which Gene was Detected') +
+ lims(x = c(0, 50)) +
+ theme_bw(base_size = 14) +
+ annotate('text', x = 2, y = 1596, hjust = 0,
+ label = str_c(sum(gene_counts == 1), ' genes were detected in only one cell')) +
+ annotate('text', x = 3, y = 924, hjust = 0,
+ label = str_c(sum(gene_counts == 2), ' genes were detected in two cells'))
+{: .language-r}
+ Warning: Removed 9335 rows containing missing values or values outside the scale range
+ (`geom_col()`).
-~~~
-Warning: Removed 9335 rows containing missing values or values outside the
-scale range (`geom_col()`).
-~~~
{: .warning}
-
-
-
plot of chunk gene_count_hist_2
-
+
-In the plot above, we can see that there are 1596
-genes that were detected in only one cell, 924
-genes detected in two cells, etc.
+In the plot above, we can see that there are 1596 genes that were
+detected in only one cell, 924 genes detected in two cells, etc.
Making a decision to keep or remove a gene based on its expression being
-detected in a certain
-number of cells can be tricky.
-If you retain all genes, you may consume more computational resources and add
-noise to your analysis. If you discard too many genes, you may miss rare
-but important cell types.
-
-Consider this example: You have a total of 10,000 cells in your scRNA-seq
-results. There is a rare cell population consisting of 100 cells that
-expresses 20 genes which are not expressed in any other cell type. If you
-only retain genes that are detected in more than 100 cells, you will miss
-this cell population.
+detected in a certain number of cells can be tricky. If you retain all
+genes, you may consume more computational resources and add noise to
+your analysis. If you discard too many genes, you may miss rare but
+important cell types.
+Consider this example: You have a total of 10,000 cells in your
+scRNA-seq results. There is a rare cell population consisting of 100
+cells that expresses 20 genes which are not expressed in any other cell
+type. If you only retain genes that are detected in more than 100 cells,
+you will miss this cell population.
> ## Challenge 1
-> How would filtering genes too strictly affect your results?
-How would this affect your ability to discriminate between cell types?
+>
+> How would filtering genes too strictly affect your results? How would
+> this affect your ability to discriminate between cell types?
>
> > ## Solution to Challenge 1
-> >
-> > Filtering too strictly would make it more difficult to distinguish between
-cell types. The degree to which this problem affects your analyses depends on
-the degree of strictness of your filtering. Let's take the situation to its
-logical extreme -- what if we keep only genes expressed in at least 95% of cells.
-If we did this, we would end up with only 41
-genes! By definition these genes will be highly expressed in all cell types,
-therefore eliminating our ability to clearly distinguish between cell types.
-> {: .solution}
-{: .challenge}
-
+> >
+> > Filtering too strictly would make it more difficult to distinguish
+> > between cell types. The degree to which this problem affects your
+> > analyses depends on the degree of strictness of your filtering.
+> > Let’s take the situation to its logical extreme – what if we keep
+> > only genes expressed in at least 95% of cells. If we did this, we
+> > would end up with only 41 genes! By definition these genes will be
+> > highly expressed in all cell types, therefore eliminating our
+> > ability to clearly distinguish between cell types. {: .solution} {:
+> > .challenge}
> ## Challenge 2
-> What total count threshold would you choose to filter genes? Remember that
-there are 47,743 cells.
+>
+> What total count threshold would you choose to filter genes? Remember
+> that there are 47,743 cells.
>
> > ## Solution to Challenge 2
> >
-> > This is a question that has a somewhat imprecise answer. Following from
-challenge one, we do not want to be *too* strict in our filtering. However,
-we do want to remove genes that will not provide much information about
-gene expression variability among the cells in our dataset. Our recommendation
-would be to filter genes expressed in < 5 cells, but one could reasonably
-justify a threshold between, say, 3 and 20 cells.
-> {: .solution}
-{: .challenge}
+> > This is a question that has a somewhat imprecise answer. Following
+> > from challenge one, we do not want to be *too* strict in our
+> > filtering. However, we do want to remove genes that will not provide
+> > much information about gene expression variability among the cells
+> > in our dataset. Our recommendation would be to filter genes
+> > expressed in < 5 cells, but one could reasonably justify a
+> > threshold between, say, 3 and 20 cells. {: .solution} {: .challenge}
### Filtering Cells by Counts
-Next we will look at the number of genes expressed in each cell.
-If a cell lyses and leaks RNA,the total number of reads in a
-cell may be low, which leads to lower gene counts. Furthermore, each
-single cell suspension likely contains some amount of so-called "ambient"
-RNA from damaged/dead/dying cells. This ambient RNA comes along for the ride
-in every droplet. Therefore even droplets that do not contain cells
-(empty droplets) can have some reads mapping to transcripts that look
-like gene expression.
-Filtering out these kinds of cells is a quality control step that
-should improve your final results.
-
-We will explicitly use the `Matrix` package's implementation of 'colSums()'.
-
-
-~~~
-tibble(counts = Matrix::colSums(counts > 0)) %>%
- ggplot(aes(counts)) +
- geom_histogram(bins = 500) +
- labs(title = 'Histogram of Number of Genes per Cell',
- x = 'Number of Genes with Counts > 0',
- y = 'Number of Cells')
-~~~
+Next we will look at the number of genes expressed in each cell. If a
+cell lyses and leaks RNA,the total number of reads in a cell may be low,
+which leads to lower gene counts. Furthermore, each single cell
+suspension likely contains some amount of so-called “ambient” RNA from
+damaged/dead/dying cells. This ambient RNA comes along for the ride in
+every droplet. Therefore even droplets that do not contain cells (empty
+droplets) can have some reads mapping to transcripts that look like gene
+expression. Filtering out these kinds of cells is a quality control step
+that should improve your final results.
+
+We will explicitly use the `Matrix` package’s implementation of
+‘colSums()’.
+
+ tibble(counts = Matrix::colSums(counts > 0)) %>%
+ ggplot(aes(counts)) +
+ geom_histogram(bins = 500) +
+ labs(title = 'Histogram of Number of Genes per Cell',
+ x = 'Number of Genes with Counts > 0',
+ y = 'Number of Cells')
+
{: .language-r}
-
-
-
plot of chunk sum_cell_counts
-
+
Cells with way more genes expressed than the typical cell might be
doublets/multiplets and should also be removed.
@@ -285,678 +230,567 @@ doublets/multiplets and should also be removed.
-In order to use Seurat, we must take the sample metadata and gene counts and
-create a
-[Seurat Object](https://rdrr.io/cran/SeuratObject/man/Seurat-class.html).
-This is a data structure which organizes the data and metadata and will
-store aspects of the analysis as we progress through the workshop.
-
-Below, we will create a Seurat object for the liver data. We must first
-convert the cell metadata into a data.frame and place the barcodes
-in rownames. The we will pass the counts and metadata into the
-[CreateSeuratObject](https://search.r-project.org/CRAN/refmans/SeuratObject/html/CreateSeuratObject.html)
-function to create the Seurat object.
-
-In the section above, we examined the counts across genes and cells and
-proposed filtering using thresholds. The CreateSeuratObject function
-contains two arguments, 'min.cells' and 'min.features', that allow us to
-filter the genes and cells by counts. Although we might use these arguments
-for convenience in a typical analysis, for this course we will look more
-closely at these quantities on a per-library basis to decide on
-our filtering thresholds. We will us the 'min.cells' argument to filter out
-genes that occur in less than 5 cells.
-
-
-~~~
-# set a seed for reproducibility in case any randomness used below
-set.seed(1418)
-~~~
+In order to use Seurat, we must take the sample metadata and gene counts
+and create a [Seurat
+Object](https://rdrr.io/cran/SeuratObject/man/Seurat-class.html). This
+is a data structure which organizes the data and metadata and will store
+aspects of the analysis as we progress through the workshop.
+
+Below, we will create a Seurat object for the liver data. We must first
+convert the cell metadata into a data.frame and place the barcodes in
+rownames. The we will pass the counts and metadata into the
+[CreateSeuratObject](https://search.r-project.org/CRAN/refmans/SeuratObject/html/CreateSeuratObject.html)
+function to create the Seurat object.
+
+In the section above, we examined the counts across genes and cells and
+proposed filtering using thresholds. The CreateSeuratObject function
+contains two arguments, ‘min.cells’ and ‘min.features’, that allow us to
+filter the genes and cells by counts. Although we might use these
+arguments for convenience in a typical analysis, for this course we will
+look more closely at these quantities on a per-library basis to decide
+on our filtering thresholds. We will us the ‘min.cells’ argument to
+filter out genes that occur in less than 5 cells.
+
+ # set a seed for reproducibility in case any randomness used below
+ set.seed(1418)
+
{: .language-r}
+ metadata <- as.data.frame(metadata) %>%
+ column_to_rownames('cell')
+ liver <- CreateSeuratObject(counts = counts,
+ project = 'liver: scRNA-Seq',
+ meta.data = metadata,
+ min.cells = 5)
-~~~
-metadata <- as.data.frame(metadata) %>%
- column_to_rownames('cell')
-liver <- CreateSeuratObject(counts = counts,
- project = 'liver: scRNA-Seq',
- meta.data = metadata,
- min.cells = 5)
-~~~
{: .language-r}
We now have a Seurat object with 20,120 genes and 47,743 cells.
-We will remove the counts object to save some memory because it is now stored
-inside of the Seurat object.
+We will remove the counts object to save some memory because it is now
+stored inside of the Seurat object.
+ rm(counts)
+ gc()
-~~~
-rm(counts)
-gc()
-~~~
{: .language-r}
+ used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
+ Ncells 8663357 462.7 12568895 671.3 NA 12568895 671.3
+ Vcells 183145832 1397.3 716848065 5469.2 16384 1137399190 8677.7
-
-~~~
- used (Mb) gc trigger (Mb) max used (Mb)
-Ncells 8269315 441.7 14166661 756.6 12309998 657.5
-Vcells 182529989 1392.6 596761610 4553.0 629553052 4803.2
-~~~
{: .output}
-
Add on doublet predictions that we did earlier in this lesson.
+ liver <- AddMetaData(liver, as.data.frame(doublet_preds))
-~~~
-liver <- AddMetaData(liver, as.data.frame(doublet_preds))
-~~~
{: .language-r}
-Let's briefly look at the structure of the Seurat object. The counts are stored
-as an [assay](https://github.com/satijalab/seurat/wiki/Assay), which we can query using the `Assays()` function.
+Let’s briefly look at the structure of the Seurat object. The counts are
+stored as an [assay](https://github.com/satijalab/seurat/wiki/Assay),
+which we can query using the `Assays()` function.
+ Seurat::Assays(liver)
-~~~
-Seurat::Assays(liver)
-~~~
{: .language-r}
+ [1] "RNA"
-
-~~~
-[1] "RNA"
-~~~
{: .output}
-The output of this function tells us that we have data in an "RNA assay. We can access this using the [GetAssayData](https://mojaveazure.github.io/seurat-object/reference/AssayData.html) function.
+The output of this function tells us that we have data in an “RNA assay.
+We can access this using the
+[GetAssayData](https://mojaveazure.github.io/seurat-object/reference/AssayData.html)
+function.
+ tmp <- GetAssayData(object = liver, layer = 'counts')
+ tmp[1:5,1:5]
-~~~
-tmp <- GetAssayData(object = liver, layer = 'counts')
-tmp[1:5,1:5]
-~~~
{: .language-r}
+ 5 x 5 sparse Matrix of class "dgCMatrix"
+ AAACGAATCCACTTCG-2 AAAGGTACAGGAAGTC-2 AACTTCTGTCATGGCC-2 AATGGCTCAACGGTAG-2
+ Xkr4 . . . .
+ Rp1 . . . .
+ Sox17 . . 2 4
+ Mrpl15 . . . 1
+ Lypla1 . . 2 1
+ ACACTGAAGTGCAGGT-2
+ Xkr4 .
+ Rp1 .
+ Sox17 .
+ Mrpl15 1
+ Lypla1 .
-
-~~~
-5 x 5 sparse Matrix of class "dgCMatrix"
- AAACGAATCCACTTCG-2 AAAGGTACAGGAAGTC-2 AACTTCTGTCATGGCC-2
-Xkr4 . . .
-Rp1 . . .
-Sox17 . . 2
-Mrpl15 . . .
-Lypla1 . . 2
- AATGGCTCAACGGTAG-2 ACACTGAAGTGCAGGT-2
-Xkr4 . .
-Rp1 . .
-Sox17 4 .
-Mrpl15 1 1
-Lypla1 1 .
-~~~
{: .output}
-As you can see the data that we retrieved is a sparse matrix, just like the counts that we provided to the Seurat object.
+As you can see the data that we retrieved is a sparse matrix, just like
+the counts that we provided to the Seurat object.
What about the metadata? We can access the metadata as follows:
+ head(liver)
-~~~
-head(liver)
-~~~
{: .language-r}
+ orig.ident nCount_RNA nFeature_RNA sample digest
+ AAACGAATCCACTTCG-2 liver: scRNA-Seq 8476 3264 CS48 inVivo
+ AAAGGTACAGGAAGTC-2 liver: scRNA-Seq 8150 3185 CS48 inVivo
+ AACTTCTGTCATGGCC-2 liver: scRNA-Seq 8139 3280 CS48 inVivo
+ AATGGCTCAACGGTAG-2 liver: scRNA-Seq 10083 3716 CS48 inVivo
+ ACACTGAAGTGCAGGT-2 liver: scRNA-Seq 9517 3543 CS48 inVivo
+ ACCACAACAGTCTCTC-2 liver: scRNA-Seq 7189 3064 CS48 inVivo
+ ACGATGTAGTGGTTCT-2 liver: scRNA-Seq 7437 3088 CS48 inVivo
+ ACGCACGCACTAACCA-2 liver: scRNA-Seq 8162 3277 CS48 inVivo
+ ACTGCAATCAACTCTT-2 liver: scRNA-Seq 7278 3144 CS48 inVivo
+ ACTGCAATCGTCACCT-2 liver: scRNA-Seq 9584 3511 CS48 inVivo
+ typeSample cxds_score bcds_score hybrid_score
+ AAACGAATCCACTTCG-2 scRnaSeq NA NA NA
+ AAAGGTACAGGAAGTC-2 scRnaSeq NA NA NA
+ AACTTCTGTCATGGCC-2 scRnaSeq NA NA NA
+ AATGGCTCAACGGTAG-2 scRnaSeq NA NA NA
+ ACACTGAAGTGCAGGT-2 scRnaSeq NA NA NA
+ ACCACAACAGTCTCTC-2 scRnaSeq NA NA NA
+ ACGATGTAGTGGTTCT-2 scRnaSeq NA NA NA
+ ACGCACGCACTAACCA-2 scRnaSeq NA NA NA
+ ACTGCAATCAACTCTT-2 scRnaSeq NA NA NA
+ ACTGCAATCGTCACCT-2 scRnaSeq NA NA NA
-
-~~~
- orig.ident nCount_RNA nFeature_RNA sample digest
-AAACGAATCCACTTCG-2 liver: scRNA-Seq 8476 3264 CS48 inVivo
-AAAGGTACAGGAAGTC-2 liver: scRNA-Seq 8150 3185 CS48 inVivo
-AACTTCTGTCATGGCC-2 liver: scRNA-Seq 8139 3280 CS48 inVivo
-AATGGCTCAACGGTAG-2 liver: scRNA-Seq 10083 3716 CS48 inVivo
-ACACTGAAGTGCAGGT-2 liver: scRNA-Seq 9517 3543 CS48 inVivo
-ACCACAACAGTCTCTC-2 liver: scRNA-Seq 7189 3064 CS48 inVivo
-ACGATGTAGTGGTTCT-2 liver: scRNA-Seq 7437 3088 CS48 inVivo
-ACGCACGCACTAACCA-2 liver: scRNA-Seq 8162 3277 CS48 inVivo
-ACTGCAATCAACTCTT-2 liver: scRNA-Seq 7278 3144 CS48 inVivo
-ACTGCAATCGTCACCT-2 liver: scRNA-Seq 9584 3511 CS48 inVivo
- typeSample cxds_score bcds_score hybrid_score
-AAACGAATCCACTTCG-2 scRnaSeq NA NA NA
-AAAGGTACAGGAAGTC-2 scRnaSeq NA NA NA
-AACTTCTGTCATGGCC-2 scRnaSeq NA NA NA
-AATGGCTCAACGGTAG-2 scRnaSeq NA NA NA
-ACACTGAAGTGCAGGT-2 scRnaSeq NA NA NA
-ACCACAACAGTCTCTC-2 scRnaSeq NA NA NA
-ACGATGTAGTGGTTCT-2 scRnaSeq NA NA NA
-ACGCACGCACTAACCA-2 scRnaSeq NA NA NA
-ACTGCAATCAACTCTT-2 scRnaSeq NA NA NA
-ACTGCAATCGTCACCT-2 scRnaSeq NA NA NA
-~~~
{: .output}
-Notice that there are some columns that were not in our original metadata file;
-specifically the 'nCount_RNA' and 'nFeature_RNA' columns.
-
-* *nCount_RNA* is the total counts for each cell.
-* *nFeature_RNA* is the number of genes with counts > 0 in each cell.
+Notice that there are some columns that were not in our original
+metadata file; specifically the ‘nCount\_RNA’ and ‘nFeature\_RNA’
+columns.
-These were calculated by Seurat when the Seurat object was created. We will use
-these later in the lesson.
+- *nCount\_RNA* is the total counts for each cell.
+- *nFeature\_RNA* is the number of genes with counts > 0 in each
+ cell.
+These were calculated by Seurat when the Seurat object was created. We
+will use these later in the lesson.
## Typical filters for cell quality
Here we briefly review these filters and decide what thresholds we will
use for these data.
-
### Filtering by Mitochondrial Gene Content
-During apoptosis, the cell membrane may break and release transcripts into
-the surrounding media. However, the mitochondrial transcripts may remain inside
-of the mitochondria. This will lead to an apparent, but spurious, increase in
-mitochondrial gene expression. As a result, we use the percentage of
-mitochondrial-encoded reads to filter out cells that were not healthy during
-sample processing. See
-[this link](https://kb.10xgenomics.com/hc/en-us/articles/360001086611)
-from 10X Genomics for additional information.
+During apoptosis, the cell membrane may break and release transcripts
+into the surrounding media. However, the mitochondrial transcripts may
+remain inside of the mitochondria. This will lead to an apparent, but
+spurious, increase in mitochondrial gene expression. As a result, we use
+the percentage of mitochondrial-encoded reads to filter out cells that
+were not healthy during sample processing. See [this
+link](https://kb.10xgenomics.com/hc/en-us/articles/360001086611) from
+10X Genomics for additional information.
-First we compute the percentage mitochondrial gene expression in each cell.
+First we compute the percentage mitochondrial gene expression in each
+cell.
+ liver <- liver %>%
+ PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")
-~~~
-liver <- liver %>%
- PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")
-~~~
{: .language-r}
-Different cell types may have different levels of mitochondrial RNA content.
-Therefore we must use our knowledge of the particular biological system
-that we are profiling in order to choose an appropriate threshold.
-If we are profiling single nuclei instead of single cells we might
-consider a very low threshold for MT content. If we are profiling a tissue
-where we anticipate broad variability in levels of mitochondrial RNA
-content between cell types, we might use a very lenient threshold
-to start and then return to filter out additional cells after we
-obtain tentative cell type labels that we have obtained by carrying
-out normalization and clustering. In this course we will filter only once
-
+Different cell types may have different levels of mitochondrial RNA
+content. Therefore we must use our knowledge of the particular
+biological system that we are profiling in order to choose an
+appropriate threshold. If we are profiling single nuclei instead of
+single cells we might consider a very low threshold for MT content. If
+we are profiling a tissue where we anticipate broad variability in
+levels of mitochondrial RNA content between cell types, we might use a
+very lenient threshold to start and then return to filter out additional
+cells after we obtain tentative cell type labels that we have obtained
+by carrying out normalization and clustering. In this course we will
+filter only once
+ VlnPlot(liver, features = "percent.mt", group.by = 'sample')
-~~~
-VlnPlot(liver, features = "percent.mt", group.by = 'sample')
-~~~
{: .language-r}
+ Warning: Default search for "data" layer in "RNA" assay yielded no results;
+ utilizing "counts" layer instead.
-
-~~~
-Warning: Default search for "data" layer in "RNA" assay yielded no results;
-utilizing "counts" layer instead.
-~~~
{: .warning}
-
-
-
plot of chunk seurat_counts_plots
-
+
-It is hard to see with so many dots! Let's try another version where we just
-plot the violins:
+It is hard to see with so many dots! Let’s try another version where we
+just plot the violins:
+ VlnPlot(liver, features = "percent.mt", group.by = 'sample', pt.size = 0)
-~~~
-VlnPlot(liver, features = "percent.mt", group.by = 'sample', pt.size = 0)
-~~~
{: .language-r}
+ Warning: Default search for "data" layer in "RNA" assay yielded no results;
+ utilizing "counts" layer instead.
-
-~~~
-Warning: Default search for "data" layer in "RNA" assay yielded no results;
-utilizing "counts" layer instead.
-~~~
{: .warning}
-
-
-
plot of chunk seurat_counts_plots2
-
+
-Library "CS89" (and maybe CS144) have a "long tail" of cells with high
-mitochondrial
-gene expression. We may wish to monitor these libraries throughout QC
-and decide whether it has problems worth ditching the sample.
+Library “CS89” (and maybe CS144) have a “long tail” of cells with high
+mitochondrial gene expression. We may wish to monitor these libraries
+throughout QC and decide whether it has problems worth ditching the
+sample.
-
-
-In most cases it would be ideal to determine separate filtering
+In most cases it would be ideal to determine separate filtering
thresholds on each scRNA-Seq sample. This would account for the fact
-that the characteristics of each sample might vary
-(for many possible reasons) even if the
-same tissue is profiled. However, in this course we will see if we can
-find a single threshold that works decently well across all samples.
-As you can see, the samples we are examining do not look
-drastically different so this may not be such an unrealistic simplification.
-
-We will use a threshold of 14% mitochondrial gene expression which will
-remove the "long tail" of cells with high `percent.mt` values. We could
-also perhaps justify going as low as 10% to be more conservative,
-but we likely would not want to go down to 5%, which would
-remove around half the cells.
-
-
-~~~
-# Don't run yet, we will filter based on several criteria below
-#liver <- subset(liver, subset = percent.mt < 14)
-~~~
+that the characteristics of each sample might vary (for many possible
+reasons) even if the same tissue is profiled. However, in this course we
+will see if we can find a single threshold that works decently well
+across all samples. As you can see, the samples we are examining do not
+look drastically different so this may not be such an unrealistic
+simplification.
+
+We will use a threshold of 14% mitochondrial gene expression which will
+remove the “long tail” of cells with high `percent.mt` values. We could
+also perhaps justify going as low as 10% to be more conservative, but we
+likely would not want to go down to 5%, which would remove around half
+the cells.
+
+ # Don't run yet, we will filter based on several criteria below
+ #liver <- subset(liver, subset = percent.mt < 14)
+
{: .language-r}
### Filtering Cells by Total Gene Counts
-Let's look at how many genes are expressed in each cell.
-Again we'll split by
-the mouse ID so we can see if there are particular samples that are
-very different from the rest.
-Again we will show only the violins for clarity.
+Let’s look at how many genes are expressed in each cell. Again we’ll
+split by the mouse ID so we can see if there are particular samples that
+are very different from the rest. Again we will show only the violins
+for clarity.
+ VlnPlot(liver, 'nFeature_RNA', group.by = 'sample', pt.size = 0)
-~~~
-VlnPlot(liver, 'nFeature_RNA', group.by = 'sample', pt.size = 0)
-~~~
{: .language-r}
+ Warning: Default search for "data" layer in "RNA" assay yielded no results;
+ utilizing "counts" layer instead.
-
-~~~
-Warning: Default search for "data" layer in "RNA" assay yielded no results;
-utilizing "counts" layer instead.
-~~~
{: .warning}
-
-
-
plot of chunk filter_gene_counts
-
-
-Like with the mitochondrial expression percentage, we will strive
-to find a threshold that works reasonably well across all samples.
-For the number of genes expressed we will want to filter out both cells
-that express to *few* genes and cells that express too *many* genes.
-As noted above, damaged or dying cells may leak RNA, resulting in a low
-number of genes expressed, and we want to filter out these cells to
-ignore their "damaged" transcriptomes. On the other hand, cells with way
-more genes expressed than the typical cell might be
-doublets/multiplets and should also be removed.
-
-It looks like filtering out cells that express less than 400 or
-greater than 5,000 genes is a reasonable compromise across our samples.
-(Note the log scale in this plot, which is necessary for seeing the violin
+
+
+Like with the mitochondrial expression percentage, we will strive to
+find a threshold that works reasonably well across all samples. For the
+number of genes expressed we will want to filter out both cells that
+express to *few* genes and cells that express too *many* genes. As noted
+above, damaged or dying cells may leak RNA, resulting in a low number of
+genes expressed, and we want to filter out these cells to ignore their
+“damaged” transcriptomes. On the other hand, cells with way more genes
+expressed than the typical cell might be doublets/multiplets and should
+also be removed.
+
+It looks like filtering out cells that express less than 400 or greater
+than 5,000 genes is a reasonable compromise across our samples. (Note
+the log scale in this plot, which is necessary for seeing the violin
densities at low numbers of genes expressed).
+ VlnPlot(liver, 'nFeature_RNA', group.by = 'sample', pt.size = 0) +
+ scale_y_log10() +
+ geom_hline(yintercept = 600) +
+ geom_hline(yintercept = 5000)
-~~~
-VlnPlot(liver, 'nFeature_RNA', group.by = 'sample', pt.size = 0) +
- scale_y_log10() +
- geom_hline(yintercept = 600) +
- geom_hline(yintercept = 5000)
-~~~
{: .language-r}
+ Warning: Default search for "data" layer in "RNA" assay yielded no results;
+ utilizing "counts" layer instead.
-
-~~~
-Warning: Default search for "data" layer in "RNA" assay yielded no results;
-utilizing "counts" layer instead.
-~~~
{: .warning}
+ Scale for y is already present.
+ Adding another scale for y, which will replace the existing scale.
-
-~~~
-Scale for y is already present.
-Adding another scale for y, which will replace the existing scale.
-~~~
{: .output}
-
-
-
plot of chunk filter_gene_counts_5k
-
+
-~~~
-#liver <- subset(liver, nFeature_RNA > 600 & nFeature_RNA < 5000)
-~~~
-{: .language-r}
+ #liver <- subset(liver, nFeature_RNA > 600 & nFeature_RNA < 5000)
+{: .language-r}
### Filtering Cells by UMI
-A UMI -- unique molecular identifier -- is like a molecular barcode for
+A UMI – unique molecular identifier – is like a molecular barcode for
each RNA molecule in the cell. UMIs are short, distinct oligonucleotides
-attached during the initial
-preparation of cDNA from RNA. Therefore each UMI is unique to a single RNA
-molecule.
-
-Why are UMI useful? The amount of RNA in a single cell is quite low
-(approximately 10-30pg according to
-[this link](https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111205&ver=2&trm=amount+of+rna+per+cell&org=)).
-Thus single cell transcriptomics profiling usually includes a PCR
-amplification step. PCR amplification is fairly "noisy" because small
-stochastic sampling differences can propagate through exponential
+attached during the initial preparation of cDNA from RNA. Therefore each
+UMI is unique to a single RNA molecule.
+
+Why are UMI useful? The amount of RNA in a single cell is quite low
+(approximately 10-30pg according to [this
+link](https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111205&ver=2&trm=amount+of+rna+per+cell&org=)).
+Thus single cell transcriptomics profiling usually includes a PCR
+amplification step. PCR amplification is fairly “noisy” because small
+stochastic sampling differences can propagate through exponential
amplification. Using UMIs, we can throw out all copies of the molecule
-except one (the copies we throw out are called "PCR duplicates").
+except one (the copies we throw out are called “PCR duplicates”).
-![UMI](../fig/lexogen.png)
+
-[Image credit](https://www.lexogen.com/rna-lexicon-what-are-unique-molecular-identifiers-umis-and-why-do-we-need-them/)
+[Image
+credit](https://www.lexogen.com/rna-lexicon-what-are-unique-molecular-identifiers-umis-and-why-do-we-need-them/)
-Several papers (e.g. [Islam et al](https://doi.org/10.1038/nmeth.2772))
+Several papers (e.g. [Islam et al](https://doi.org/10.1038/nmeth.2772))
have demonstrated that UMIs reduce amplification noise in single cell
transcriptomics and thereby increase data fidelity. The only downside of
-UMIs is that they cause us to throw away a lot of our data
-(perhaps as high as 90% of our sequenced reads). Nevertheless, we don't
-want those reads if they are not giving us new information about gene
-expression, so we tolerate this inefficiency.
+UMIs is that they cause us to throw away a lot of our data (perhaps as
+high as 90% of our sequenced reads). Nevertheless, we don’t want those
+reads if they are not giving us new information about gene expression,
+so we tolerate this inefficiency.
CellRanger will automatically process your UMIs and the feature-barcode
-matrix it produces will be free from PCR duplicates.
-Thus, we can think of the number of UMIs as the sequencing depth of
-each cell.
+matrix it produces will be free from PCR duplicates. Thus, we can think
+of the number of UMIs as the sequencing depth of each cell.
Typically the number of genes and number of UMI are highly correlated
and this is mostly the case in our liver dataset:
-~~~
-ggplot(liver@meta.data, aes(x = nCount_RNA, y = nFeature_RNA)) +
- geom_point() +
- theme_bw(base_size = 16) +
- xlab("nUMI") + ylab("nGenes") +
- scale_x_log10() + scale_y_log10()
-~~~
-{: .language-r}
-
-
-
-
plot of chunk genes_umi
-
-
+ ggplot(liver@meta.data, aes(x = nCount_RNA, y = nFeature_RNA)) +
+ geom_point() +
+ theme_bw(base_size = 16) +
+ xlab("nUMI") + ylab("nGenes") +
+ scale_x_log10() + scale_y_log10()
+{: .language-r}
+
+ VlnPlot(liver, 'nCount_RNA', group.by = 'sample', pt.size = 0) +
+ scale_y_log10() +
+ geom_hline(yintercept = 900) +
+ geom_hline(yintercept = 25000)
-~~~
-VlnPlot(liver, 'nCount_RNA', group.by = 'sample', pt.size = 0) +
- scale_y_log10() +
- geom_hline(yintercept = 900) +
- geom_hline(yintercept = 25000)
-~~~
{: .language-r}
+ Warning: Default search for "data" layer in "RNA" assay yielded no results;
+ utilizing "counts" layer instead.
-
-~~~
-Warning: Default search for "data" layer in "RNA" assay yielded no results;
-utilizing "counts" layer instead.
-~~~
{: .warning}
+ Scale for y is already present.
+ Adding another scale for y, which will replace the existing scale.
-
-~~~
-Scale for y is already present.
-Adding another scale for y, which will replace the existing scale.
-~~~
{: .output}
-
-
-
plot of chunk filter_umi
-
+
-~~~
-# Don't run yet, we will filter based on several criteria below
-#liver <- subset(liver, nCount_RNA > 900 & nCount_RNA < 25000)
-~~~
-{: .language-r}
+ # Don't run yet, we will filter based on several criteria below
+ #liver <- subset(liver, nCount_RNA > 900 & nCount_RNA < 25000)
-Again we try to select thresholds that remove most of the strongest outliers
-in all samples.
+{: .language-r}
+Again we try to select thresholds that remove most of the strongest
+outliers in all samples.
> ## Challenge 2
-> List two technical issues that can lead to poor scRNA-seq data quality and which filters we use to detect each one.
+>
+> List two technical issues that can lead to poor scRNA-seq data quality
+> and which filters we use to detect each one.
>
> > ## Solution to Challenge 2
> >
-> > 1 ). Cell membranes may rupture during the disassociation protocol,
-which is indicated
-by high mitochondrial gene expression because the mitochondrial
-transcripts are contained within the mitochondria, while other transcripts
-in the cytoplasm may leak out. Use the mitochondrial percentage filter to
-try to remove these cells.
+> > 1 ). Cell membranes may rupture during the disassociation protocol,
+> > which is indicated by high mitochondrial gene expression because the
+> > mitochondrial transcripts are contained within the mitochondria,
+> > while other transcripts in the cytoplasm may leak out. Use the
+> > mitochondrial percentage filter to try to remove these cells.
> > 2 ). Cells may be doublets of two different cell types. In this case
-they might express many more genes than either cell type alone. Use the
-"number of genes expressed" filter to try to remove these cells.
-> {: .solution}
-{: .challenge}
-
-
-
-
-
+> > they might express many more genes than either cell type alone. Use
+> > the “number of genes expressed” filter to try to remove these
+> > cells.
+> > {: .solution} {: .challenge}
## Doublet detection revisited
-Let's go back to our doublet predictions. How many of the cells that
-are going to be filtered out of our data were predicted to be doublets
-by scds?
+Let’s go back to our doublet predictions. How many of the cells that are
+going to be filtered out of our data were predicted to be doublets by
+scds?
+ liver$keep <- liver$percent.mt < 14 &
+ liver$nFeature_RNA > 600 &
+ liver$nFeature_RNA < 5000 &
+ liver$nCount_RNA > 900 &
+ liver$nCount_RNA < 25000
-~~~
-liver$keep <- liver$percent.mt < 14 &
- liver$nFeature_RNA > 600 &
- liver$nFeature_RNA < 5000 &
- liver$nCount_RNA > 900 &
- liver$nCount_RNA < 25000
-~~~
{: .language-r}
-Using the scds hybrid_score method, the scores range between 0 and 2.
+Using the scds hybrid\_score method, the scores range between 0 and 2.
Higher scores should be more likely to be doublets.
+ ggplot(mutate(liver[[]], class = ifelse(keep, 'QC singlet', 'QC doublet')),
+ aes(x = class, y = hybrid_score)) +
+ geom_violin() +
+ theme_bw(base_size = 18) +
+ xlab("") +
+ ylab("SCDS hybrid score")
-~~~
-ggplot(mutate(liver[[]], class = ifelse(keep, 'QC singlet', 'QC doublet')),
- aes(x = class, y = hybrid_score)) +
- geom_violin() +
- theme_bw(base_size = 18) +
- xlab("") +
- ylab("SCDS hybrid score")
-~~~
{: .language-r}
+ Warning: Removed 42388 rows containing non-finite outside the scale range
+ (`stat_ydensity()`).
-
-~~~
-Warning: Removed 42388 rows containing non-finite outside the scale range
-(`stat_ydensity()`).
-~~~
{: .warning}
-
-
-
plot of chunk doublet_plot
-
-
-Somewhat unsatisfyingly, the scds hybrid scores aren't wildly
-different between the cells we've used QC thresholds to call as doublets
-vs singlets.
-There does seem to be an enrichment of cells with score >0.75 among
-the QC doublets.
-If we had run scds doublet prediction on all cells we might
-compare results with *no* scds score filtering to those with an
-scds score cutoff of, say, 1.0.
-One characteristic of the presence of doublet cells
-is a cell cluster located between two large and well-defined clusters
-that expresses markers of both of them (don't worry, we will learn how
-to cluster and visualize data soon).
-Returning to the scds doublet scores, we could cluster our cells with
-and without doublet score filtering, and see if we note any
-putative doublet clusters.
+
+
+Somewhat unsatisfyingly, the scds hybrid scores aren’t wildly different
+between the cells we’ve used QC thresholds to call as doublets vs
+singlets. There does seem to be an enrichment of cells with score
+>0.75 among the QC doublets. If we had run scds doublet prediction on
+all cells we might compare results with *no* scds score filtering to
+those with an scds score cutoff of, say, 1.0. One characteristic of the
+presence of doublet cells is a cell cluster located between two large
+and well-defined clusters that expresses markers of both of them (don’t
+worry, we will learn how to cluster and visualize data soon). Returning
+to the scds doublet scores, we could cluster our cells with and without
+doublet score filtering, and see if we note any putative doublet
+clusters.
## Subset based on %MT, number of genes, and number of UMI thresholds
+ liver <- subset(liver, subset = percent.mt < 14 &
+ nFeature_RNA > 600 &
+ nFeature_RNA < 5000 &
+ nCount_RNA > 900 &
+ nCount_RNA < 25000)
-~~~
-liver <- subset(liver, subset = percent.mt < 14 &
- nFeature_RNA > 600 &
- nFeature_RNA < 5000 &
- nCount_RNA > 900 &
- nCount_RNA < 25000)
-~~~
{: .language-r}
-
## Batch correction
-We might want to correct for batch effects. This can be difficult
-to do because batch effects are complicated (in general), and may
-affect different cell types in different ways. Although correcting
-for batch effects is an important aspect of
-quality control, we will discuss this procedure in lesson 06 with
-some biological context.
-
-
-
+We might want to correct for batch effects. This can be difficult to do
+because batch effects are complicated (in general), and may affect
+different cell types in different ways. Although correcting for batch
+effects is an important aspect of quality control, we will discuss this
+procedure in lesson 06 with some biological context.
> ## Challenge 3
-> Delete the existing counts and metadata objects. Read in the *ex-vivo* data
-that you saved at the end of Lesson 03 (lesson03_challenge.Rdata) and create
-a Seurat object called 'liver_2'. Look at the filtering quantities and
-decide whether you can use the same cell and feature filters
-that were used to create the Seurat object above.
>
-> > ## Solution to Challenge 3
+> Delete the existing counts and metadata objects. Read in the *ex-vivo*
+> data that you saved at the end of Lesson 03
+> (lesson03\_challenge.Rdata) and create a Seurat object called
+> ‘liver\_2’. Look at the filtering quantities and decide whether you
+> can use the same cell and feature filters that were used to create the
+> Seurat object above.
+>
+> > ## Solution to Challenge 3
+> >
> > `# Remove the existing counts and metadata.`
> > `rm(counts, metadata)`
> > `# Read in citeseq counts & metadata.`
> > `load(file = file.path(data_dir, 'lesson03_challenge.Rdata'))`
> > `# Create Seurat object.`
> > `metadata = as.data.frame(metadata) %>%`
-> > ` column_to_rownames('cell')`
-> > `liver_2 = CreateSeuratObject(count = counts, `
-> > ` project = 'liver: ex-vivo',`
-> > ` meta.data = metadata)`
-> > ``
-> {: .solution}
-{: .challenge}
+> > `column_to_rownames('cell')`
+> > `liver_2 = CreateSeuratObject(count = counts,`
+> > `project = 'liver: ex-vivo',`
+> > `meta.data = metadata)`
+> > \`\` {: .solution} {: .challenge}
> ## Challenge 4
-> Estimate the proportion of mitochondrial genes.
-Create plots of the proportion of features, cells, and mitochondrial genes.
-Filter the Seurat object by mitochondrial gene expression.
>
-> > ## Solution to Challenge 4
+> Estimate the proportion of mitochondrial genes. Create plots of the
+> proportion of features, cells, and mitochondrial genes. Filter the
+> Seurat object by mitochondrial gene expression.
+>
+> > ## Solution to Challenge 4
+> >
> > `liver_2 = liver_2 %>%`
-> > ` PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")`
+> > `PercentageFeatureSet(pattern = "^mt-", col.name = "percent.mt")`
> > `VlnPlot(liver_2, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)`
> > `liver_2 = subset(liver_2, subset = percent.mt < 10)`
-> {: .solution}
-{: .challenge}
-
+> > {: .solution} {: .challenge}
## Session Info
+ sessionInfo()
-~~~
-sessionInfo()
-~~~
{: .language-r}
+ R version 4.4.1 (2024-06-14)
+ Platform: x86_64-apple-darwin20
+ Running under: macOS 15.0.1
+
+ Matrix products: default
+ BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
+ LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
+
+ locale:
+ [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+
+ time zone: America/New_York
+ tzcode source: internal
+
+ attached base packages:
+ [1] stats4 stats graphics grDevices utils datasets methods base
+
+ other attached packages:
+ [1] scds_1.20.0 SingleCellExperiment_1.26.0
+ [3] SummarizedExperiment_1.34.0 Biobase_2.64.0
+ [5] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
+ [7] IRanges_2.38.1 S4Vectors_0.42.1
+ [9] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
+ [11] matrixStats_1.4.1 Matrix_1.7-0
+ [13] Seurat_5.1.0 SeuratObject_5.0.2
+ [15] sp_2.1-4 lubridate_1.9.3
+ [17] forcats_1.0.0 stringr_1.5.1
+ [19] dplyr_1.1.4 purrr_1.0.2
+ [21] readr_2.1.5 tidyr_1.3.1
+ [23] tibble_3.2.1 ggplot2_3.5.1
+ [25] tidyverse_2.0.0 knitr_1.48
+ [27] rmarkdown_2.28
+
+ loaded via a namespace (and not attached):
+ [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
+ [4] R.oo_1.26.0 polyclip_1.10-7 pROC_1.18.5
+ [7] fastDummies_1.7.4 lifecycle_1.0.4 globals_0.16.3
+ [10] lattice_0.22-6 vroom_1.6.5 MASS_7.3-60.2
+ [13] magrittr_2.0.3 plotly_4.10.4 yaml_2.3.10
+ [16] httpuv_1.6.15 sctransform_0.4.1 spam_2.11-0
+ [19] spatstat.sparse_3.1-0 reticulate_1.39.0 cowplot_1.1.3
+ [22] pbapply_1.7-2 RColorBrewer_1.1-3 abind_1.4-8
+ [25] zlibbioc_1.50.0 Rtsne_0.17 R.utils_2.12.3
+ [28] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1
+ [31] listenv_0.9.1 spatstat.utils_3.1-0 goftest_1.2-3
+ [34] RSpectra_0.16-2 spatstat.random_3.3-2 fitdistrplus_1.2-1
+ [37] parallelly_1.38.0 leiden_0.4.3.1 codetools_0.2-20
+ [40] DelayedArray_0.30.1 tidyselect_1.2.1 UCSC.utils_1.0.0
+ [43] farver_2.1.2 spatstat.explore_3.3-2 jsonlite_1.8.9
+ [46] progressr_0.14.0 ggridges_0.5.6 survival_3.6-4
+ [49] tools_4.4.1 ica_1.0-3 Rcpp_1.0.13
+ [52] glue_1.8.0 gridExtra_2.3 SparseArray_1.4.8
+ [55] xfun_0.48 withr_3.0.1 fastmap_1.2.0
+ [58] fansi_1.0.6 digest_0.6.37 timechange_0.3.0
+ [61] R6_2.5.1 mime_0.12 colorspace_2.1-1
+ [64] scattermore_1.2 tensor_1.5 spatstat.data_3.1-2
+ [67] R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
+ [70] data.table_1.16.2 httr_1.4.7 htmlwidgets_1.6.4
+ [73] S4Arrays_1.4.1 uwot_0.2.2 pkgconfig_2.0.3
+ [76] gtable_0.3.5 lmtest_0.9-40 XVector_0.44.0
+ [79] htmltools_0.5.8.1 dotCall64_1.2 scales_1.3.0
+ [82] png_0.1-8 spatstat.univar_3.0-1 rstudioapi_0.16.0
+ [85] tzdb_0.4.0 reshape2_1.4.4 nlme_3.1-164
+ [88] zoo_1.8-12 KernSmooth_2.23-24 vipor_0.4.7
+ [91] parallel_4.4.1 miniUI_0.1.1.1 ggrastr_1.0.2
+ [94] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
+ [97] RANN_2.6.2 promises_1.3.0 xtable_1.8-4
+ [100] cluster_2.1.6 beeswarm_0.4.0 evaluate_1.0.1
+ [103] cli_3.6.3 compiler_4.4.1 rlang_1.1.4
+ [106] crayon_1.5.3 future.apply_1.11.2 labeling_0.4.3
+ [109] plyr_1.8.9 ggbeeswarm_0.7.2 stringi_1.8.4
+ [112] viridisLite_0.4.2 deldir_2.0-4 munsell_0.5.1
+ [115] lazyeval_0.2.2 spatstat.geom_3.3-3 RcppHNSW_0.6.0
+ [118] hms_1.1.3 patchwork_1.3.0 bit64_4.5.2
+ [121] future_1.34.0 shiny_1.9.1 highr_0.11
+ [124] ROCR_1.0-11 igraph_2.0.3 bit_4.5.0
+ [127] xgboost_1.7.8.1
-
-~~~
-R version 4.4.0 (2024-04-24 ucrt)
-Platform: x86_64-w64-mingw32/x64
-Running under: Windows 10 x64 (build 19045)
-
-Matrix products: default
-
-
-locale:
-[1] LC_COLLATE=English_United States.utf8
-[2] LC_CTYPE=English_United States.utf8
-[3] LC_MONETARY=English_United States.utf8
-[4] LC_NUMERIC=C
-[5] LC_TIME=English_United States.utf8
-
-time zone: America/New_York
-tzcode source: internal
-
-attached base packages:
-[1] stats4 stats graphics grDevices utils datasets methods
-[8] base
-
-other attached packages:
- [1] Seurat_5.1.0 SeuratObject_5.0.2
- [3] sp_2.1-4 scds_1.20.0
- [5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
- [7] Biobase_2.64.0 GenomicRanges_1.56.2
- [9] GenomeInfoDb_1.40.1 IRanges_2.38.1
-[11] S4Vectors_0.42.1 BiocGenerics_0.50.0
-[13] MatrixGenerics_1.16.0 matrixStats_1.4.1
-[15] Matrix_1.7-0 lubridate_1.9.3
-[17] forcats_1.0.0 stringr_1.5.1
-[19] dplyr_1.1.4 purrr_1.0.2
-[21] readr_2.1.5 tidyr_1.3.1
-[23] tibble_3.2.1 ggplot2_3.5.1
-[25] tidyverse_2.0.0 knitr_1.48
-
-loaded via a namespace (and not attached):
- [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.9
- [4] magrittr_2.0.3 ggbeeswarm_0.7.2 spatstat.utils_3.1-0
- [7] farver_2.1.2 zlibbioc_1.50.0 vctrs_0.6.5
- [10] ROCR_1.0-11 spatstat.explore_3.3-2 htmltools_0.5.8.1
- [13] S4Arrays_1.4.1 xgboost_1.7.8.1 SparseArray_1.4.8
- [16] pROC_1.18.5 sctransform_0.4.1 parallelly_1.38.0
- [19] KernSmooth_2.23-24 htmlwidgets_1.6.4 ica_1.0-3
- [22] plyr_1.8.9 plotly_4.10.4 zoo_1.8-12
- [25] igraph_2.0.3 mime_0.12 lifecycle_1.0.4
- [28] pkgconfig_2.0.3 R6_2.5.1 fastmap_1.2.0
- [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.2-1 future_1.34.0
- [34] shiny_1.9.1 digest_0.6.35 colorspace_2.1-1
- [37] patchwork_1.3.0 tensor_1.5 RSpectra_0.16-2
- [40] irlba_2.3.5.1 labeling_0.4.3 progressr_0.14.0
- [43] spatstat.sparse_3.1-0 fansi_1.0.6 timechange_0.3.0
- [46] polyclip_1.10-7 httr_1.4.7 abind_1.4-8
- [49] compiler_4.4.0 withr_3.0.1 fastDummies_1.7.4
- [52] highr_0.11 MASS_7.3-61 DelayedArray_0.30.1
- [55] tools_4.4.0 vipor_0.4.7 lmtest_0.9-40
- [58] beeswarm_0.4.0 httpuv_1.6.15 future.apply_1.11.2
- [61] goftest_1.2-3 glue_1.8.0 nlme_3.1-165
- [64] promises_1.3.0 grid_4.4.0 Rtsne_0.17
- [67] cluster_2.1.6 reshape2_1.4.4 generics_0.1.3
- [70] spatstat.data_3.1-2 gtable_0.3.5 tzdb_0.4.0
- [73] data.table_1.16.2 hms_1.1.3 utf8_1.2.4
- [76] XVector_0.44.0 spatstat.geom_3.3-3 RcppAnnoy_0.0.22
- [79] ggrepel_0.9.6 RANN_2.6.2 pillar_1.9.0
- [82] spam_2.11-0 RcppHNSW_0.6.0 later_1.3.2
- [85] splines_4.4.0 lattice_0.22-6 deldir_2.0-4
- [88] survival_3.7-0 tidyselect_1.2.1 miniUI_0.1.1.1
- [91] pbapply_1.7-2 gridExtra_2.3 scattermore_1.2
- [94] xfun_0.44 stringi_1.8.4 UCSC.utils_1.0.0
- [97] lazyeval_0.2.2 evaluate_1.0.1 codetools_0.2-20
-[100] cli_3.6.3 uwot_0.2.2 xtable_1.8-4
-[103] reticulate_1.39.0 munsell_0.5.1 Rcpp_1.0.13
-[106] spatstat.random_3.3-2 globals_0.16.3 png_0.1-8
-[109] ggrastr_1.0.2 spatstat.univar_3.0-1 parallel_4.4.0
-[112] dotCall64_1.2 listenv_0.9.1 viridisLite_0.4.2
-[115] scales_1.3.0 ggridges_0.5.6 leiden_0.4.3.1
-[118] crayon_1.5.3 rlang_1.1.4 cowplot_1.1.3
-~~~
{: .output}
diff --git a/_episodes/05-Common-Analyses.md b/_episodes/05-Common-Analyses.md
index 7a9b48c..574ff96 100644
--- a/_episodes/05-Common-Analyses.md
+++ b/_episodes/05-Common-Analyses.md
@@ -1,363 +1,308 @@
----
-# Please do not edit this file directly; it is auto generated.
-# Instead, please edit 05-Common-Analyses.md in _episodes_rmd/
-source: Rmd
-title: "Common Analyses"
-teaching: 120
-exercises: 10
-questions:
-- "What are the most common single cell RNA-Seq analyses?"
-objectives:
-- "Understand the form of common Seurat commands and how to chain them together."
-- "Learn how to normalize, find variable features, reduce dimensionality, and cluster your scRNA-Seq data."
-keypoints:
-- "Seurat has flexible functionality to normalize, process, and visualize your scRNA-Seq data."
----
-
-
-
-~~~
-suppressPackageStartupMessages(library(tidyverse))
-suppressPackageStartupMessages(library(Seurat))
-
-data_dir <- 'data'
-~~~
+ suppressPackageStartupMessages(library(tidyverse))
+ suppressPackageStartupMessages(library(Seurat))
+
+ data_dir <- 'data'
+
{: .language-r}
+ # set a seed for reproducibility in case any randomness used below
+ set.seed(1418)
-~~~
-# set a seed for reproducibility in case any randomness used below
-set.seed(1418)
-~~~
{: .language-r}
+## A Note on Seurat Functions
+
+The Seurat package is set up so that we primarily work with a Seurat
+object containing our single cell data and metadata. Let’s say we are
+working with our Seurat object `liver`. The usual way we might call a
+function to do something with our data looks like:
+ liver <- DoSomething(liver, param1 = TRUE, param2 = 0.3)
+However, since the `DoSomething()` function returns the modified Seurat
+object, we can also pipe together multiple commands to do multiple
+things to our object. That could look something like:
-## A Note on Seurat Functions
+ liver <- DoSomething(liver, param1 = TRUE, param2 = 0.3) %>%
+ DoSomethingElse(param1 = 3) %>%
+ DoAThirdThing(param1 = c(1, 4, 6))
+
+The pipe operator passes the results of the first command as the
+argument to the next command.
-The Seurat package is set up so that we primarily work with a
-Seurat object containing our single cell data and metadata.
-Let's say we are working with our Seurat object `liver`.
-The usual way we might call a function to do something with our
-data looks like:
-```
-liver <- DoSomething(liver, param1 = TRUE, param2 = 0.3)
-```
-
-However, since the `DoSomething()` function returns the modified
-Seurat object, we can also pipe together multiple commands to do
-multiple things to our object. That could look something like:
-```
-liver <- DoSomething(liver, param1 = TRUE, param2 = 0.3) %>%
- DoSomethingElse(param1 = 3) %>%
- DoAThirdThing(param1 = c(1, 4, 6))
-```
-
-The pipe operator passes the results of the first command as the argument
-to the next command.
-
-We can just as well use the piping operator `%>%` even if
-we are calling only one function:
-```
-liver <- liver %>%
- DoSomething(param1 = TRUE, param2 = 0.3)
-```
-
-Recently, base `R` added its own pipe operator `|>`. This does essentially the
-same operation, with different behavior only in a few somewhat rare
-instances. You can read more about the differences between
-`%>%` and `|>` at the
-[this link](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/).
+We can just as well use the piping operator `%>%` even if we are calling
+only one function:
+
+ liver <- liver %>%
+ DoSomething(param1 = TRUE, param2 = 0.3)
+
+Recently, base `R` added its own pipe operator `|>`. This does
+essentially the same operation, with different behavior only in a few
+somewhat rare instances. You can read more about the differences between
+`%>%` and `|>` at the [this
+link](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/).
In this lesson (and elsewhere in the course) we may alternate between
these slightly different coding styles. Please ask us for clarification
-if you are having difficulty seeing how our example code is
-doing what it is supposed to do.
+if you are having difficulty seeing how our example code is doing what
+it is supposed to do.
### Normalization
-Instead of working with raw count data measured across cells
-that were sequenced to highly variable depths, we conduct
-normalization to try to make gene expression values follow
-a more stable distribution as well as being more comparable
-between cells.
-
-Single cell gene expression
-count data is usually approximately log-normally distributed.
-Many statistical methods work best when the data is normally distributed.
-We also would like to correct for variability in sequencing depth
-between cells, the nature of which is purely technical.
-Log normalization will give us normalized gene expression which represents
-the log of the number of counts per 10,000 reads.
-
-
-~~~
-liver <- liver %>%
- NormalizeData(normalization.method = "LogNormalize")
-~~~
+Instead of working with raw count data measured across cells that were
+sequenced to highly variable depths, we conduct normalization to try to
+make gene expression values follow a more stable distribution as well as
+being more comparable between cells.
+
+Single cell gene expression count data is usually approximately
+log-normally distributed. Many statistical methods work best when the
+data is normally distributed. We also would like to correct for
+variability in sequencing depth between cells, the nature of which is
+purely technical. Log normalization will give us normalized gene
+expression which represents the log of the number of counts per 10,000
+reads.
+
+ liver <- liver %>%
+ NormalizeData(normalization.method = "LogNormalize")
+
{: .language-r}
-This method of normalizing is pretty simple. The way it works is
-to follow a simple formula like
-`norm_count = log((count + 1)/total_counts) * 10000`
-where `total_counts` is the total number of reads in each cell.
+This method of normalizing is pretty simple. The way it works is to
+follow a simple formula like
+`norm_count = log((count + 1)/total_counts) * 10000` where
+`total_counts` is the total number of reads in each cell.
+
+There are other normalization methods that are more complicated and may
+outperform the log normalization method. Two examples are:
-There are other normalization methods that are more complicated and
-may outperform the log normalization method. Two examples are:
+- Normalization based on multinomial methods as outlined by [Townes et
+ al. 2019](https://pubmed.ncbi.nlm.nih.gov/31870412/)
+- Normalization using regularized negative binomial regression
+ [Hafemeister and Satija
+ 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1),
+ with a [Seurat vignette
+ here](https://satijalab.org/seurat/articles/sctransform_vignette.html)
- * Normalization based on multinomial methods as outlined by
- [Townes et al. 2019](https://pubmed.ncbi.nlm.nih.gov/31870412/)
- * Normalization using regularized negative binomial regression
- [Hafemeister and Satija 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1),
- with a [Seurat vignette here](https://satijalab.org/seurat/articles/sctransform_vignette.html)
-
However, no normalization method has been demonstrated to be universally
and unambiguously better than simple log normalization.
> ## Challenge 1
-> Where is the log-normaliztion stored? Try using the str() command to look at
-> the structure of the liver object (i.e. str(liver)).
+>
+> Where is the log-normaliztion stored? Try using the str() command to
+> look at the structure of the liver object (i.e. str(liver)).
>
> > ## Solution to Challenge 1
-> >
-> > str(liver)
-> > Formal class 'Seurat' [package "SeuratObject"] with 13 slots
-> > ..@ assays :List of 1
-> > ...
-> > ...
-> > ...
-> > ..@ commands :List of 1
-> > .. ..$ NormalizeData.RNA:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
-> > .. .. .. ..@ name : chr "NormalizeData.RNA"
-> > .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2023-09-12 09:43:16"
-> > .. .. .. ..@ assay.used : chr "RNA"
-> > .. .. .. ..@ call.string: chr "NormalizeData(., normalization.method = \"LogNormalize\")"
-> > .. .. .. ..@ params :List of 5
-> > .. .. .. .. ..$ assay : chr "RNA"
-> > .. .. .. .. ..$ normalization.method: chr "LogNormalize"
-> > .. .. .. .. ..$ scale.factor : num 10000
-> > .. .. .. .. ..$ margin : num 1
-> > .. .. .. .. ..$ verbose : logi TRUE
-> > ..@ tools : list()
-> >
-> > This is a lot to absorb! Look for a line containing "@ commands" toward the
-> > bottom of the object which you printed out. Notice that the next line
-> > says "NomalizeData". Further down, you can see a line which says
-> > "$ normalization.method: chr "LogNormalize". This is the line which tells
-> > you that the liver object has stored the log-normalized information.
-> {: .solution}
-{: .challenge}
-
+> >
+> > str(liver) Formal class ‘Seurat’ \[package “SeuratObject”\] with 13
+> > slots ..@ assays :List of 1 … … … ..@ commands :List of 1 .. ..$
+> > NormalizeData.RNA:Formal class ‘SeuratCommand’ \[package
+> > “SeuratObject”\] with 5 slots .. .. .. ..@ name : chr
+> > “NormalizeData.RNA” .. .. .. ..@ time.stamp : POSIXct\[1:1\],
+> > format: “2023-09-12 09:43:16” .. .. .. ..@ assay.used : chr “RNA” ..
+> > .. .. ..@ call.string: chr “NormalizeData(., normalization.method =
+> > "LogNormalize")” .. .. .. ..@ params :List of 5 .. .. .. .. ..$
+> > assay : chr “RNA” .. .. .. .. ..$ normalization.method: chr
+> > “LogNormalize” .. .. .. .. ..$ scale.factor : num 10000 .. .. .. ..
+> > ..$ margin : num 1 .. .. .. .. ..$ verbose : logi TRUE ..@ tools :
+> > list()
+> >
+> > This is a lot to absorb! Look for a line containing “@ commands”
+> > toward the bottom of the object which you printed out. Notice that
+> > the next line says “NomalizeData”. Further down, you can see a line
+> > which says “$ normalization.method: chr”LogNormalize”. This is the
+> > line which tells you that the liver object has stored the
+> > log-normalized information. {: .solution} {: .challenge}
### Finding Variable Features
-Next we will find a subset of features showing high cell-to-cell variation
-in the dataset (that is, they are highly expressed in some cells and lowly
-expressed in others).
+Next we will find a subset of features showing high cell-to-cell
+variation in the dataset (that is, they are highly expressed in some
+cells and lowly expressed in others).
+ liver <- liver %>%
+ FindVariableFeatures(nfeatures = 2000)
-~~~
-liver <- liver %>%
- FindVariableFeatures(nfeatures = 2000)
+ # Identify the 25 most highly variable genes
+ top25 <- head(VariableFeatures(liver), 25)
-# Identify the 25 most highly variable genes
-top25 <- head(VariableFeatures(liver), 25)
+ plot1 <- VariableFeaturePlot(liver)
+ LabelPoints(plot = plot1, points = top25, xnudge = 0,
+ ynudge = 0, repel = TRUE)
-plot1 <- VariableFeaturePlot(liver)
-LabelPoints(plot = plot1, points = top25, xnudge = 0,
- ynudge = 0, repel = TRUE)
-~~~
{: .language-r}
-
-
-
plot of chunk var_features
-
+
-Once again, let's look at the Seurat liver object and see how it stores the variable
-genes.
+Once again, let’s look at the Seurat liver object and see how it stores
+the variable genes.
+ str(liver)
-~~~
-str(liver)
-~~~
{: .language-r}
+ Formal class 'Seurat' [package "SeuratObject"] with 13 slots
+ ..@ assays :List of 1
+ .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
+ .. .. .. ..@ layers :List of 2
+ .. .. .. .. ..$ counts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
+ .. .. .. .. .. .. ..@ i : int [1:103377764] 10 14 28 30 32 49 52 53 56 67 ...
+ .. .. .. .. .. .. ..@ p : int [1:44254] 0 3264 6449 9729 13445 16988 20052 23140 26417 29561 ...
+ .. .. .. .. .. .. ..@ Dim : int [1:2] 20120 44253
+ .. .. .. .. .. .. ..@ Dimnames:List of 2
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. ..@ x : num [1:103377764] 1 1 1 2 1 6 1 1 2 1 ...
+ .. .. .. .. .. .. ..@ factors : list()
+ .. .. .. .. ..$ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
+ .. .. .. .. .. .. ..@ i : int [1:103377764] 10 14 28 30 32 49 52 53 56 67 ...
+ .. .. .. .. .. .. ..@ p : int [1:44254] 0 3264 6449 9729 13445 16988 20052 23140 26417 29561 ...
+ .. .. .. .. .. .. ..@ Dim : int [1:2] 20120 44253
+ .. .. .. .. .. .. ..@ Dimnames:List of 2
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. ..@ x : num [1:103377764] 0.779 0.779 0.779 1.212 0.779 ...
+ .. .. .. .. .. .. ..@ factors : list()
+ .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
+ .. .. .. .. .. ..@ .Data: logi [1:44253, 1:2] TRUE TRUE TRUE TRUE TRUE TRUE ...
+ .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
+ .. .. .. .. .. .. .. ..$ : chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
+ .. .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
+ .. .. .. .. .. ..$ dim : int [1:2] 44253 2
+ .. .. .. .. .. ..$ dimnames:List of 2
+ .. .. .. .. .. .. ..$ : chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
+ .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
+ .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
+ .. .. .. .. .. ..@ .Data: logi [1:20120, 1:2] TRUE TRUE TRUE TRUE TRUE TRUE ...
+ .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
+ .. .. .. .. .. .. .. ..$ : chr [1:20120] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
+ .. .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
+ .. .. .. .. .. ..$ dim : int [1:2] 20120 2
+ .. .. .. .. .. ..$ dimnames:List of 2
+ .. .. .. .. .. .. ..$ : chr [1:20120] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
+ .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
+ .. .. .. ..@ default : int 1
+ .. .. .. ..@ assay.orig: chr(0)
+ .. .. .. ..@ meta.data :'data.frame': 20120 obs. of 8 variables:
+ .. .. .. .. ..$ vf_vst_counts_mean : num [1:20120] 0.000136 0.000429 0.295528 0.547918 0.447337 ...
+ .. .. .. .. ..$ vf_vst_counts_variance : num [1:20120] 0.000136 0.001062 0.749957 0.900155 0.549817 ...
+ .. .. .. .. ..$ vf_vst_counts_variance.expected : num [1:20120] 0.000142 0.000476 0.412618 0.907597 0.689343 ...
+ .. .. .. .. ..$ vf_vst_counts_variance.standardized: num [1:20120] 0.955 2.045 1.818 0.992 0.798 ...
+ .. .. .. .. ..$ vf_vst_counts_variable : logi [1:20120] FALSE TRUE TRUE FALSE FALSE FALSE ...
+ .. .. .. .. ..$ vf_vst_counts_rank : int [1:20120] NA 999 1228 NA NA NA NA NA NA 1428 ...
+ .. .. .. .. ..$ var.features : chr [1:20120] NA "Rp1" "Sox17" NA ...
+ .. .. .. .. ..$ var.features.rank : int [1:20120] NA 999 1228 NA NA NA NA NA NA 1428 ...
+ .. .. .. ..@ misc : Named list()
+ .. .. .. ..@ key : chr "rna_"
+ ..@ meta.data :'data.frame': 44253 obs. of 11 variables:
+ .. ..$ orig.ident : Factor w/ 1 level "liver: scRNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
+ .. ..$ nCount_RNA : num [1:44253] 8476 8150 8139 10083 9517 ...
+ .. ..$ nFeature_RNA: int [1:44253] 3264 3185 3280 3716 3543 3064 3088 3277 3144 3511 ...
+ .. ..$ sample : chr [1:44253] "CS48" "CS48" "CS48" "CS48" ...
+ .. ..$ digest : chr [1:44253] "inVivo" "inVivo" "inVivo" "inVivo" ...
+ .. ..$ typeSample : chr [1:44253] "scRnaSeq" "scRnaSeq" "scRnaSeq" "scRnaSeq" ...
+ .. ..$ cxds_score : num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
+ .. ..$ bcds_score : num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
+ .. ..$ hybrid_score: num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
+ .. ..$ percent.mt : num [1:44253] 3.04 5.35 4.67 4.75 3.89 ...
+ .. ..$ keep : logi [1:44253] TRUE TRUE TRUE TRUE TRUE TRUE ...
+ ..@ active.assay: chr "RNA"
+ ..@ active.ident: Factor w/ 1 level "liver: scRNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
+ .. ..- attr(*, "names")= chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
+ ..@ graphs : list()
+ ..@ neighbors : list()
+ ..@ reductions : list()
+ ..@ images : list()
+ ..@ project.name: chr "liver: scRNA-Seq"
+ ..@ misc : list()
+ ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
+ .. ..$ : int [1:3] 5 0 2
+ ..@ commands :List of 2
+ .. ..$ NormalizeData.RNA :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
+ .. .. .. ..@ name : chr "NormalizeData.RNA"
+ .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2024-10-15 20:17:40"
+ .. .. .. ..@ assay.used : chr "RNA"
+ .. .. .. ..@ call.string: chr "NormalizeData(., normalization.method = \"LogNormalize\")"
+ .. .. .. ..@ params :List of 5
+ .. .. .. .. ..$ assay : chr "RNA"
+ .. .. .. .. ..$ normalization.method: chr "LogNormalize"
+ .. .. .. .. ..$ scale.factor : num 10000
+ .. .. .. .. ..$ margin : num 1
+ .. .. .. .. ..$ verbose : logi TRUE
+ .. ..$ FindVariableFeatures.RNA:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
+ .. .. .. ..@ name : chr "FindVariableFeatures.RNA"
+ .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2024-10-15 20:17:51"
+ .. .. .. ..@ assay.used : chr "RNA"
+ .. .. .. ..@ call.string: chr "FindVariableFeatures(., nfeatures = 2000)"
+ .. .. .. ..@ params :List of 12
+ .. .. .. .. ..$ assay : chr "RNA"
+ .. .. .. .. ..$ selection.method : chr "vst"
+ .. .. .. .. ..$ loess.span : num 0.3
+ .. .. .. .. ..$ clip.max : chr "auto"
+ .. .. .. .. ..$ mean.function :function (mat, display_progress)
+ .. .. .. .. ..$ dispersion.function:function (mat, display_progress)
+ .. .. .. .. ..$ num.bin : num 20
+ .. .. .. .. ..$ binning.method : chr "equal_width"
+ .. .. .. .. ..$ nfeatures : num 2000
+ .. .. .. .. ..$ mean.cutoff : num [1:2] 0.1 8
+ .. .. .. .. ..$ dispersion.cutoff : num [1:2] 1 Inf
+ .. .. .. .. ..$ verbose : logi TRUE
+ ..@ tools : list()
-
-~~~
-Formal class 'Seurat' [package "SeuratObject"] with 13 slots
- ..@ assays :List of 1
- .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
- .. .. .. ..@ layers :List of 2
- .. .. .. .. ..$ counts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
- .. .. .. .. .. .. ..@ i : int [1:103377764] 10 14 28 30 32 49 52 53 56 67 ...
- .. .. .. .. .. .. ..@ p : int [1:44254] 0 3264 6449 9729 13445 16988 20052 23140 26417 29561 ...
- .. .. .. .. .. .. ..@ Dim : int [1:2] 20120 44253
- .. .. .. .. .. .. ..@ Dimnames:List of 2
- .. .. .. .. .. .. .. ..$ : NULL
- .. .. .. .. .. .. .. ..$ : NULL
- .. .. .. .. .. .. ..@ x : num [1:103377764] 1 1 1 2 1 6 1 1 2 1 ...
- .. .. .. .. .. .. ..@ factors : list()
- .. .. .. .. ..$ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
- .. .. .. .. .. .. ..@ i : int [1:103377764] 10 14 28 30 32 49 52 53 56 67 ...
- .. .. .. .. .. .. ..@ p : int [1:44254] 0 3264 6449 9729 13445 16988 20052 23140 26417 29561 ...
- .. .. .. .. .. .. ..@ Dim : int [1:2] 20120 44253
- .. .. .. .. .. .. ..@ Dimnames:List of 2
- .. .. .. .. .. .. .. ..$ : NULL
- .. .. .. .. .. .. .. ..$ : NULL
- .. .. .. .. .. .. ..@ x : num [1:103377764] 0.779 0.779 0.779 1.212 0.779 ...
- .. .. .. .. .. .. ..@ factors : list()
- .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
- .. .. .. .. .. ..@ .Data: logi [1:44253, 1:2] TRUE TRUE TRUE TRUE TRUE TRUE ...
- .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
- .. .. .. .. .. .. .. ..$ : chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
- .. .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
- .. .. .. .. .. ..$ dim : int [1:2] 44253 2
- .. .. .. .. .. ..$ dimnames:List of 2
- .. .. .. .. .. .. ..$ : chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
- .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
- .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
- .. .. .. .. .. ..@ .Data: logi [1:20120, 1:2] TRUE TRUE TRUE TRUE TRUE TRUE ...
- .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
- .. .. .. .. .. .. .. ..$ : chr [1:20120] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
- .. .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
- .. .. .. .. .. ..$ dim : int [1:2] 20120 2
- .. .. .. .. .. ..$ dimnames:List of 2
- .. .. .. .. .. .. ..$ : chr [1:20120] "Xkr4" "Rp1" "Sox17" "Mrpl15" ...
- .. .. .. .. .. .. ..$ : chr [1:2] "counts" "data"
- .. .. .. ..@ default : int 1
- .. .. .. ..@ assay.orig: chr(0)
- .. .. .. ..@ meta.data :'data.frame': 20120 obs. of 8 variables:
- .. .. .. .. ..$ vf_vst_counts_mean : num [1:20120] 0.000136 0.000429 0.295528 0.547918 0.447337 ...
- .. .. .. .. ..$ vf_vst_counts_variance : num [1:20120] 0.000136 0.001062 0.749957 0.900155 0.549817 ...
- .. .. .. .. ..$ vf_vst_counts_variance.expected : num [1:20120] 0.000142 0.000476 0.412618 0.907597 0.689343 ...
- .. .. .. .. ..$ vf_vst_counts_variance.standardized: num [1:20120] 0.955 2.045 1.818 0.992 0.798 ...
- .. .. .. .. ..$ vf_vst_counts_variable : logi [1:20120] FALSE TRUE TRUE FALSE FALSE FALSE ...
- .. .. .. .. ..$ vf_vst_counts_rank : int [1:20120] NA 999 1228 NA NA NA NA NA NA 1428 ...
- .. .. .. .. ..$ var.features : chr [1:20120] NA "Rp1" "Sox17" NA ...
- .. .. .. .. ..$ var.features.rank : int [1:20120] NA 999 1228 NA NA NA NA NA NA 1428 ...
- .. .. .. ..@ misc : Named list()
- .. .. .. ..@ key : chr "rna_"
- ..@ meta.data :'data.frame': 44253 obs. of 11 variables:
- .. ..$ orig.ident : Factor w/ 1 level "liver: scRNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
- .. ..$ nCount_RNA : num [1:44253] 8476 8150 8139 10083 9517 ...
- .. ..$ nFeature_RNA: int [1:44253] 3264 3185 3280 3716 3543 3064 3088 3277 3144 3511 ...
- .. ..$ sample : chr [1:44253] "CS48" "CS48" "CS48" "CS48" ...
- .. ..$ digest : chr [1:44253] "inVivo" "inVivo" "inVivo" "inVivo" ...
- .. ..$ typeSample : chr [1:44253] "scRnaSeq" "scRnaSeq" "scRnaSeq" "scRnaSeq" ...
- .. ..$ cxds_score : num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
- .. ..$ bcds_score : num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
- .. ..$ hybrid_score: num [1:44253] NA NA NA NA NA NA NA NA NA NA ...
- .. ..$ percent.mt : num [1:44253] 3.04 5.35 4.67 4.75 3.89 ...
- .. ..$ keep : logi [1:44253] TRUE TRUE TRUE TRUE TRUE TRUE ...
- ..@ active.assay: chr "RNA"
- ..@ active.ident: Factor w/ 1 level "liver: scRNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
- .. ..- attr(*, "names")= chr [1:44253] "AAACGAATCCACTTCG-2" "AAAGGTACAGGAAGTC-2" "AACTTCTGTCATGGCC-2" "AATGGCTCAACGGTAG-2" ...
- ..@ graphs : list()
- ..@ neighbors : list()
- ..@ reductions : list()
- ..@ images : list()
- ..@ project.name: chr "liver: scRNA-Seq"
- ..@ misc : list()
- ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
- .. ..$ : int [1:3] 5 0 2
- ..@ commands :List of 2
- .. ..$ NormalizeData.RNA :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
- .. .. .. ..@ name : chr "NormalizeData.RNA"
- .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2024-10-15 17:17:30"
- .. .. .. ..@ assay.used : chr "RNA"
- .. .. .. ..@ call.string: chr "NormalizeData(., normalization.method = \"LogNormalize\")"
- .. .. .. ..@ params :List of 5
- .. .. .. .. ..$ assay : chr "RNA"
- .. .. .. .. ..$ normalization.method: chr "LogNormalize"
- .. .. .. .. ..$ scale.factor : num 10000
- .. .. .. .. ..$ margin : num 1
- .. .. .. .. ..$ verbose : logi TRUE
- .. ..$ FindVariableFeatures.RNA:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
- .. .. .. ..@ name : chr "FindVariableFeatures.RNA"
- .. .. .. ..@ time.stamp : POSIXct[1:1], format: "2024-10-15 17:17:39"
- .. .. .. ..@ assay.used : chr "RNA"
- .. .. .. ..@ call.string: chr "FindVariableFeatures(., nfeatures = 2000)"
- .. .. .. ..@ params :List of 12
- .. .. .. .. ..$ assay : chr "RNA"
- .. .. .. .. ..$ selection.method : chr "vst"
- .. .. .. .. ..$ loess.span : num 0.3
- .. .. .. .. ..$ clip.max : chr "auto"
- .. .. .. .. ..$ mean.function :function (mat, display_progress)
- .. .. .. .. ..$ dispersion.function:function (mat, display_progress)
- .. .. .. .. ..$ num.bin : num 20
- .. .. .. .. ..$ binning.method : chr "equal_width"
- .. .. .. .. ..$ nfeatures : num 2000
- .. .. .. .. ..$ mean.cutoff : num [1:2] 0.1 8
- .. .. .. .. ..$ dispersion.cutoff : num [1:2] 1 Inf
- .. .. .. .. ..$ verbose : logi TRUE
- ..@ tools : list()
-~~~
{: .output}
-Look in the "@ commands" section. You will notice that there are now two items in the
-liver of commands. "$ NormalizeData.RNA" is the first item and "indVariableFeatures.RNA"
-is the second one. This is where the the information about the variable genes is stored.
-You can see this in the line which says:
+Look in the “@ commands” section. You will notice that there are now two
+items in the liver of commands. “$ NormalizeData.RNA” is the first item
+and “indVariableFeatures.RNA” is the second one. This is where the the
+information about the variable genes is stored. You can see this in the
+line which says:
-'''
-@ call.string: chr "FindVariableFeatures(., nfeatures = 2000)
-'''
+’’’ @ call.string: chr “FindVariableFeatures(., nfeatures = 2000) ’’’
-## Cell Cycle Assignment
+## Cell Cycle Assignment
-We will also show how to predict cell cycle state.
-This approach is outlined in the Seurat vignette at
-[this link](https://satijalab.org/seurat/articles/cell_cycle_vignette.html).
+We will also show how to predict cell cycle state. This approach is
+outlined in the Seurat vignette at [this
+link](https://satijalab.org/seurat/articles/cell_cycle_vignette.html).
+ cc.genes <- readLines(file.path(data_dir,
+ 'regev_lab_cell_cycle_genes_mm.fixed.txt'))
+ s.genes <- cc.genes[1:43]
+ g2m.genes <- cc.genes[44:98]
+ liver <- CellCycleScoring(liver,
+ s.features = s.genes,
+ g2m.features = g2m.genes,
+ set.ident = FALSE)
-~~~
-cc.genes <- readLines(file.path(data_dir,
- 'regev_lab_cell_cycle_genes_mm.fixed.txt'))
-s.genes <- cc.genes[1:43]
-g2m.genes <- cc.genes[44:98]
-
-liver <- CellCycleScoring(liver,
- s.features = s.genes,
- g2m.features = g2m.genes,
- set.ident = FALSE)
-~~~
{: .language-r}
-Seurat will provide a quantitative estimate of the cell's chance of being
-in different phases of the cell cycle `S.Score` and `G2M.Score`, as well as
-a categorical prediction of which phase the cell is in
-(`Phase` -- G1, G2M, S).
+Seurat will provide a quantitative estimate of the cell’s chance of
+being in different phases of the cell cycle `S.Score` and `G2M.Score`,
+as well as a categorical prediction of which phase the cell is in
+(`Phase` – G1, G2M, S).
### Scale Data
-Now we apply a linear transformation that is often used in initial
+Now we apply a linear transformation that is often used in initial
scRNA-Seq processing. This transformation standardizes the expression of
-each gene, setting the mean across cells to 0 and variance to 1.
-This helps all genes to contribute to the inferred variability rather
-than just the highly-expressed genes.
-
-We will "regress out" the signals of technical confounders including
-%MT and the number of genes expressed. We might also choose to regress out
-other variables such as cell cycle stage (if we wish to examine
-cellular heterogeneity independent of cell cycle), number of UMIs,
-etc.
+each gene, setting the mean across cells to 0 and variance to 1. This
+helps all genes to contribute to the inferred variability rather than
+just the highly-expressed genes.
+We will “regress out” the signals of technical confounders including %MT
+and the number of genes expressed. We might also choose to regress out
+other variables such as cell cycle stage (if we wish to examine cellular
+heterogeneity independent of cell cycle), number of UMIs, etc.
+ liver <- liver %>%
+ ScaleData(vars.to.regress = c("percent.mt", "nFeature_RNA"))
-~~~
-liver <- liver %>%
- ScaleData(vars.to.regress = c("percent.mt", "nFeature_RNA"))
-~~~
{: .language-r}
### Principal Component Analysis (PCA)
@@ -365,291 +310,268 @@ liver <- liver %>%
Next we reduce the dimensionality of the data. You have probably heard
-of PCA as a technique for summarizing major axes of variation in a dataset.
+of PCA as a technique for summarizing major axes of variation in a
+dataset.
-Here is a brief [tutorial](https://setosa.io/ev/principal-component-analysis/)
-if PCA is new to you.
+Here is a brief
+[tutorial](https://setosa.io/ev/principal-component-analysis/) if PCA is
+new to you.
Here, we perform PCA on the single cell gene expression data in order to
-place each cell in a multidimensional space with lower dimension (say 20-40)
-than the complete expression matrix (~20,000 genes).
+place each cell in a multidimensional space with lower dimension (say
+20-40) than the complete expression matrix (~20,000 genes).
+ liver <- liver %>%
+ RunPCA(verbose = FALSE, npcs = 100)
-~~~
-liver <- liver %>%
- RunPCA(verbose = FALSE, npcs = 100)
-~~~
{: .language-r}
-It is usually not very useful to view the raw PCs themselves.
-There's nothing obvious that we can glean from the following PC plot:
+It is usually not very useful to view the raw PCs themselves. There’s
+nothing obvious that we can glean from the following PC plot:
+ DimPlot(liver, reduction = "pca")
-~~~
-DimPlot(liver, reduction = "pca")
-~~~
{: .language-r}
-
-
-
plot of chunk pcplot
-
+
Instead we will take some of the PCs and use them for a further
summarization of the data. Namely, we will use the PCs as input to the
-UMAP (or t-SNE) algorithm which projects our cells onto a 2D space
-(the computer screen).
+UMAP (or t-SNE) algorithm which projects our cells onto a 2D space (the
+computer screen).
A significant challenge in scRNA-Seq is deciding how many PCs to use.
-You can think of each PC as capturing pathway-level transcriptional variation
-across the cells in your dataset. Thus even if the transcriptome is sparse
-within each cell (it is), you can still compare different cells across major
-axes of variation. We don't want to use too *few* PCs, otherwise we
-might miss significant axes of variation (e.g. a cell subtype or minor cell
-type). We also don't want to use too *many* PCs since, as you go out in PC-space,
-the PCs increasingly represent more noise and less biological reality.
+You can think of each PC as capturing pathway-level transcriptional
+variation across the cells in your dataset. Thus even if the
+transcriptome is sparse within each cell (it is), you can still compare
+different cells across major axes of variation. We don’t want to use too
+*few* PCs, otherwise we might miss significant axes of variation (e.g. a
+cell subtype or minor cell type). We also don’t want to use too *many*
+PCs since, as you go out in PC-space, the PCs increasingly represent
+more noise and less biological reality.
We will use a very simple method to choose the number of PCs: the elbow
-method. Using this method we look for where the elbow plot stops dropping
-precipitously.
+method. Using this method we look for where the elbow plot stops
+dropping precipitously.
+
+ ElbowPlot(liver, ndims = 100)
-~~~
-ElbowPlot(liver, ndims = 100)
-~~~
{: .language-r}
-
-
-
plot of chunk elbow
-
+
-Let's zoom in more and see what things like under 50 PCs.
+Let’s zoom in more and see what things like under 50 PCs.
+
+ ElbowPlot(liver, ndims = 50)
-~~~
-ElbowPlot(liver, ndims = 50)
-~~~
{: .language-r}
-
-
-
plot of chunk elbow2
-
-
-We would say that the standard deviation in PCs really starts to stablize
-around N = 24 PCs. Let's use this value moving forward.
-For a more in-depth analysis we would try a variety of values and
-attempt to decide which value gives us the results that are most
-biologically sensible.
-
-There is also a function in Seurat, `JackStraw()`, that one may use
-to try to determine the statistical significance of principal
-component scores. Specifically, it randomly permutes a subset of data,
-and calculates projected PCA scores for these random genes. Then it
-compares the PCA scores for the random genes with the observed PCA scores
-to determine statistical signifance. We are not using this function here
-because it is a bit slow and because it often does not give any better
+
+
+We would say that the standard deviation in PCs really starts to
+stablize around N = 24 PCs. Let’s use this value moving forward. For a
+more in-depth analysis we would try a variety of values and attempt to
+decide which value gives us the results that are most biologically
+sensible.
+
+There is also a function in Seurat, `JackStraw()`, that one may use to
+try to determine the statistical significance of principal component
+scores. Specifically, it randomly permutes a subset of data, and
+calculates projected PCA scores for these random genes. Then it compares
+the PCA scores for the random genes with the observed PCA scores to
+determine statistical signifance. We are not using this function here
+because it is a bit slow and because it often does not give any better
results than the simple method of looking at an elbow plot.
-All this said, the major question here -- how many PCs to select in order
+All this said, the major question here – how many PCs to select in order
to capture the important variation within your scRNA-Seq data in a
-reduced dimension space -- is still unresolved and your best bet is
-to explore different values and see what they give you!
+reduced dimension space – is still unresolved and your best bet is to
+explore different values and see what they give you!
+ num_pc <- 24
+ ElbowPlot(liver, ndims = 40) + geom_vline(xintercept = num_pc)
-~~~
-num_pc <- 24
-ElbowPlot(liver, ndims = 40) + geom_vline(xintercept = num_pc)
-~~~
{: .language-r}
-
-
-
plot of chunk elbow3
-
+
-## Clustering
+## Clustering
Seurat uses a graph-based approach to cluster cells with similar
-transcriptomic profiles.
-We start by constructing the shared nearest-neighbor graph. This
-computes a distance metric between the cells (in PC-space) and
-constructs the shared nearest-neighbor graph by calculating the
-neighborhood overlap (Jaccard index) between every cell and its
-20 (by default) nearest neighbors.
-Then the shared nearest-neighbor graph is used to identify
-cell clusters using a modularity optimization based clustering
-algorithm.
-The Seurat help pages for the functions below,
+transcriptomic profiles. We start by constructing the shared
+nearest-neighbor graph. This computes a distance metric between the
+cells (in PC-space) and constructs the shared nearest-neighbor graph by
+calculating the neighborhood overlap (Jaccard index) between every cell
+and its 20 (by default) nearest neighbors. Then the shared
+nearest-neighbor graph is used to identify cell clusters using a
+modularity optimization based clustering algorithm. The Seurat help
+pages for the functions below,
[FindNeighbors](https://satijalab.org/seurat/reference/findneighbors)
-and
-[FindClusters](https://satijalab.org/seurat/reference/findclusters),
-provide some references if you are interested in digging into
-further details of this clustering procedure.
-
+and [FindClusters](https://satijalab.org/seurat/reference/findclusters),
+provide some references if you are interested in digging into further
+details of this clustering procedure.
+ liver <- FindNeighbors(liver, reduction = 'pca',
+ dims = 1:num_pc, verbose = FALSE) %>%
+ FindClusters(verbose = FALSE, resolution = 0.3)
-~~~
-liver <- FindNeighbors(liver, reduction = 'pca',
- dims = 1:num_pc, verbose = FALSE) %>%
- FindClusters(verbose = FALSE, resolution = 0.3)
-~~~
{: .language-r}
-
-## Dimensionality reduction (UMAP, tSNE, etc)
+## Dimensionality reduction (UMAP, tSNE, etc)
-As mentioned above, dimensionality reduction allows you to actually
-visualize your data! The two methods below are widely used in the
-single cell community.
+As mentioned above, dimensionality reduction allows you to actually
+visualize your data! The two methods below are widely used in the single
+cell community.
-> Uniform Manifold Approximation and Projection (UMAP) [van der Maaten & Hinton, 2008](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf).
+> Uniform Manifold Approximation and Projection (UMAP) [van der Maaten &
+> Hinton,
+> 2008](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf).
-> t-Distributed Stochastic Neighbor Embedding (t-SNE) [McUnnes et al](https://arxiv.org/abs/1802.03426)
+> t-Distributed Stochastic Neighbor Embedding (t-SNE) [McUnnes et
+> al](https://arxiv.org/abs/1802.03426)
-These methods generally do a very effective job of putting similar points near each other in the reduced-dimensionality space. Thus cells from
-the same clusters are likely to be placed in the same region of the UMAP/t-SNE.
+These methods generally do a very effective job of putting similar
+points near each other in the reduced-dimensionality space. Thus cells
+from the same clusters are likely to be placed in the same region of the
+UMAP/t-SNE.
-UMAP is more widely used the t-SNE at the current time.
-Note that you should caution yourself not to overinterpret UMAP plots.
-Although UMAP does optimize both local and global similarity for points
-being projected onto a 2D space, UMAP contains no guarantee that similar points
-must be near each other.
+UMAP is more widely used the t-SNE at the current time. Note that you
+should caution yourself not to overinterpret UMAP plots. Although UMAP
+does optimize both local and global similarity for points being
+projected onto a 2D space, UMAP contains no guarantee that similar
+points must be near each other.
+ liver <- RunUMAP(liver, reduction = 'pca', dims = 1:num_pc,
+ verbose = FALSE)
-~~~
-liver <- RunUMAP(liver, reduction = 'pca', dims = 1:num_pc,
- verbose = FALSE)
-~~~
{: .language-r}
-Note that we are using the principal components computed from normalized gene
-expression to compute UMAP dimensionality reduction and we are also using the
-principal components to compute a shared nearest neighbor graph and find
-clusters. These two tasks are independent and could be done in either order.
-Very often the points that are near each other in UMAP space are also near
-neighbors and belong to the same cluster, but this is not always the case.
+Note that we are using the principal components computed from normalized
+gene expression to compute UMAP dimensionality reduction and we are also
+using the principal components to compute a shared nearest neighbor
+graph and find clusters. These two tasks are independent and could be
+done in either order. Very often the points that are near each other in
+UMAP space are also near neighbors and belong to the same cluster, but
+this is not always the case.
-Finally, we plot the clusters which we found using the principal components in
-UMAP space.
+Finally, we plot the clusters which we found using the principal
+components in UMAP space.
+ UMAPPlot(liver, label = TRUE, label.size = 6)
-~~~
-UMAPPlot(liver, label = TRUE, label.size = 6)
-~~~
{: .language-r}
-
-
-
plot of chunk plot_umap
-
+
-Let's color the cells in each cluster by cell cycle phase using `UMAPPlot()`.
+Let’s color the cells in each cluster by cell cycle phase using
+`UMAPPlot()`.
+ UMAPPlot(liver, group.by = 'Phase')
-~~~
-UMAPPlot(liver, group.by = 'Phase')
-~~~
{: .language-r}
-