Eda notes #58

Merged: 5 commits, Oct 10, 2024

episodes/eda_qc.Rmd (28 changes: 15 additions & 13 deletions)

@@ -66,6 +66,14 @@ This is the same data we examined in the previous lesson.

From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.
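
As a quick sanity check, we can count the droplet barcodes directly (a minimal sketch, assuming `sce` is the `SingleCellExperiment` object loaded in the previous lesson):

```r
# Each column of the raw matrix is one droplet barcode; we expect far
# more droplets than real cells.
ncol(sce)

# Total UMI counts per droplet; empty droplets have totals near zero.
summary(colSums(counts(sce)))
```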

+::: callout
+Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single-cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. Specific details on the steps in common pipelines like [10x Genomics' CellRanger](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials) can usually be found in the documentation provided with the sequencing output.
+
+The main point is: if the sequencing outputs were provided to you by someone else, make sure to communicate with them about which pre-processing steps have been performed, if any.
+:::
+
+We can plot barcode read totals to visualize the distinction between empty droplets and properly profiled single cells in a so-called "knee plot":

```{r}
bcrank <- barcodeRanks(counts(sce))
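
# NOTE: the plotting lines of this chunk are collapsed in the diff view.
# As a hedged sketch (not necessarily the lesson's exact code), a knee
# plot is typically drawn from the barcodeRanks() output like this:
uniq <- !duplicated(bcrank$rank)   # drop duplicate ranks to speed up plotting
plot(bcrank$rank[uniq], bcrank$total[uniq], log = "xy",
     xlab = "Rank", ylab = "Total UMI count")
abline(h = metadata(bcrank)$inflection, col = "darkgreen", lty = 2)
abline(h = metadata(bcrank)$knee, col = "dodgerblue", lty = 2)
legend("bottomleft", legend = c("Inflection", "Knee"),
       col = c("darkgreen", "dodgerblue"), lty = 2)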

```

@@ -90,12 +98,6 @@ The distribution of total counts (called the unique molecular identifier or UMI

A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this may unnecessarily discard libraries derived from cell types with low RNA content.
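
For intuition, such a hard cutoff would look something like the sketch below (the threshold of 500 UMIs is an arbitrary, illustrative value, not the lesson's approach):

```r
# Keep only barcodes with at least 500 total UMIs. Simple, but this
# risks discarding real cells that naturally contain little RNA.
keep <- colSums(counts(sce)) >= 500
sce_filtered <- sce[, keep]
```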

-::: callout
-Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. Specific details on the steps in common pipelines like [10x Genomics' CellRanger](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials) can usually be found in the documentation that came with the sequencing material.
-
-The main point is: if the sequencing outputs were provided to you by someone else, make sure to communicate with them about what pre-processing steps have been performed, if any.
-:::

:::: challenge

What is the median number of total counts in the raw data?

@@ -220,13 +222,13 @@ sce$discard <- reasons$discard

:::: challenge

-Maybe our sample preparation was poor and we want the QC to be more strict. How could we change the QC filtering to use 4 MADs as the threshold for outlier calling?
+Maybe our sample preparation was poor and we want the QC to be more strict. How could we change the QC filtering to use 2.5 MADs as the threshold for outlier calling?

::: solution
-You set `nmads = 4` like so:
+You set `nmads = 2.5` like so:

```{r}
-reasons_strict <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent", nmads = 4)
+reasons_strict <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent", nmads = 2.5)
```

You would then need to reassign the `discard` column as well, but we'll stick with the 3 MADs default for now.
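
To make that concrete, the reassignment would be a one-liner (a sketch only, since we keep the default here):

```r
# Hypothetical follow-up if we adopted the stricter 2.5-MADs filter:
sce$discard <- reasons_strict$discard
```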

@@ -315,7 +317,7 @@ table(clust)
```

```{r}
-deconv.sf <- calculateSumFactors(sce, cluster = clust)
+deconv.sf <- pooledSizeFactors(sce, cluster = clust)

summary(deconv.sf)
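
# NOTE: the rest of this chunk is collapsed in the diff view. As a
# hedged sketch (not necessarily the lesson's exact code), a common
# next step is to store the deconvolution size factors and compute
# log-normalised counts:
sizeFactors(sce) <- deconv.sf
sce <- logNormCounts(sce)   # logNormCounts() is from the scuttle package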

```

@@ -388,8 +390,8 @@ dec.sce <- modelGeneVar(sce)

fit.sce <- metadata(dec.sce)

-mean_var_df = data.frame(mean = fit.sce$mean,
-                         var = fit.sce$var)
+mean_var_df <- data.frame(mean = fit.sce$mean,
+                          var = fit.sce$var)

ggplot(mean_var_df, aes(mean, var)) +
geom_point() +

@@ -450,7 +452,7 @@ As the name suggests, dimensionality reduction aims to reduce the number of dimensions

### Principal Component Analysis (PCA)

-Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) with fewer linear combinations of these variables that are orthogonal and have successively maximal variance. Such linear combinations seek to "separate out" the observations (cells) while loosing as little information as possible.
+Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) with fewer linear combinations of these variables that are orthogonal and have successively maximal variance. Such linear combinations seek to "separate out" the observations (cells) while losing as little information as possible.

Without getting into the technical details, one nice feature of PCA is that the principal components (PCs) are ordered by how much variance of the original data they "explain". Furthermore, by focusing on the top $k$ PCs we restrict ourselves to the most important directions of variability, which hopefully correspond to biological rather than technical variance. (It is nevertheless good practice to check this, e.g. by looking at the correlation between technical QC metrics and PCs.)
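
To make this concrete, here is a minimal sketch of running PCA and inspecting the variance explained per PC (assuming the lesson's `sce` object and the scater package; names here are illustrative):

```r
library(scater)

# Compute the top 50 PCs from the log-normalised counts.
sce <- runPCA(sce, ncomponents = 50)

# scater stores the percentage of variance explained by each PC as an
# attribute of the PCA matrix.
percent.var <- attr(reducedDim(sce, "PCA"), "percentVar")
plot(percent.var, log = "y", xlab = "PC", ylab = "Variance explained (%)")
```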
