From 1f86776076be4f784aa81e17c2a73833cb650e62 Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Wed, 9 Oct 2024 11:45:56 -0400 Subject: [PATCH 1/5] notes --- episodes/eda_qc.Rmd | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index b3e91f8..2ddacfe 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -66,6 +66,14 @@ This is the same data we examined in the previous lesson. From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA. +::: callout +Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. Specific details on the steps in common pipelines like [10x Genomics' CellRanger](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials) can usually be found in the documentation that came with the sequencing material. + +The main point is: if the sequencing outputs were provided to you by someone else, make sure to communicate with them about what pre-processing steps have been performed, if any. +::: + +We can plot barcode read totals to visualize the distinction between empty droplets and properly profiled single cells in a so-called "knee plot": + ```{r} bcrank <- barcodeRanks(counts(sce)) @@ -90,12 +98,6 @@ The distribution of total counts (called the unique molecular identifier or UMI A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals.
However, this may unnecessarily discard libraries derived from cell types with low RNA content. -::: callout -Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. Specific details on the steps in common pipelines like [10x Genomics' CellRanger](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials) can usually be found in the documentation that came with the sequencing material. - -The main point is: if the sequencing outputs were provided to you by someone else, make sure to communicate with them about what pre-processing steps have been performed, if any. -::: - :::: challenge What is the median number of total counts in the raw data? From 5f75f0dda9386ae939ea4074b4e49155dcfcaa65 Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Wed, 9 Oct 2024 12:39:30 -0400 Subject: [PATCH 2/5] fix challenge number --- episodes/eda_qc.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index 2ddacfe..808dac0 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -222,13 +222,13 @@ sce$discard <- reasons$discard :::: challenge -Maybe our sample preparation was poor and we want the QC to be more strict. How could we change the set the QC filtering to use 4 MADs as the threshold for outlier calling? +Maybe our sample preparation was poor and we want the QC to be stricter. How could we change the QC filtering to use 2.5 MADs as the threshold for outlier calling?
::: solution -You set `nmads = 4` like so: +You set `nmads = 2.5` like so: ```{r} -reasons_strict <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent", nmads = 4) +reasons_strict <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent", nmads = 2.5) ``` You would then need to reassign the `discard` column as well, but we'll stick with the 3 MADs default for now. From 149113a86f6286de22c606e49f9e948f44bfca9f Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Wed, 9 Oct 2024 12:52:18 -0400 Subject: [PATCH 3/5] function name --- episodes/eda_qc.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index 808dac0..adf5cb2 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -317,7 +317,7 @@ table(clust) ``` ```{r} -deconv.sf <- calculateSumFactors(sce, cluster = clust) +deconv.sf <- pooledSizeFactors(sce, cluster = clust) summary(deconv.sf) From 50cf70fe63b4d38ad8d04bf4766a71649d76e337 Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Wed, 9 Oct 2024 13:39:08 -0400 Subject: [PATCH 4/5] missed equals sign --- episodes/eda_qc.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index adf5cb2..30c50e6 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -390,8 +390,8 @@ dec.sce <- modelGeneVar(sce) fit.sce <- metadata(dec.sce) -mean_var_df = data.frame(mean = fit.sce$mean, - var = fit.sce$var) +mean_var_df <- data.frame(mean = fit.sce$mean, + var = fit.sce$var) ggplot(mean_var_df, aes(mean, var)) + geom_point() + From bb21a6760ea5f7fea1bbe1854087cf0716eeb175 Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Wed, 9 Oct 2024 13:44:30 -0400 Subject: [PATCH 5/5] typo --- episodes/eda_qc.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/eda_qc.Rmd
b/episodes/eda_qc.Rmd index 30c50e6..b4e0f5a 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -452,7 +452,7 @@ As the name suggests, dimensionality reduction aims to reduce the number of dime ### Principal Component Analysis (PCA) -Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) by fewer linear combinations of these variables, that are orthogonal and have successively maximal variance. Such linear combinations seek to "separate out" the observations (cells), while loosing as little information as possible. +Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) with fewer linear combinations of these variables that are orthogonal and have successively maximal variance. Such linear combinations seek to "separate out" the observations (cells) while losing as little information as possible. Without getting into the technical details, one nice feature of PCA is that the principal components (PCs) are ordered by how much variance of the original data they "explain". Furthermore, by focusing on the top $k$ PC we are focusing on the most important directions of variability, which hopefully correspond to biological rather than technical variance. (It is however good practice to check this by e.g. looking at correlation between technical QC metrics and PCs).
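
The PCA paragraph touched by PATCH 5/5 says that PCs are ordered by how much variance they explain. As an illustrative aside (not part of the patch series, and using base R's `prcomp()` on simulated data rather than the lesson's Bioconductor objects; the toy dimensions and variable names here are assumptions), a minimal sketch of that point:

```r
# Sketch: show that successive PCs explain a decreasing share of variance.
# 100 "cells" x 10 "genes"; the first two genes carry most of the variance.
set.seed(1)
mat <- cbind(matrix(rnorm(200, sd = 5), ncol = 2),
             matrix(rnorm(800, sd = 1), ncol = 8))

# Base R PCA; centering is standard, scaling left off for this toy example
pca <- prcomp(mat, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC, in decreasing order --
# keeping only the top k PCs keeps the largest directions of variability
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

In the lesson itself this role is played by the Bioconductor workflow (e.g. `runPCA()` on the `SingleCellExperiment`), but the variance-ordering property shown here is the same.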