Skip to content

Commit

Permalink
Cleaned up docs about the expectations on the reference data.
Browse files Browse the repository at this point in the history
  • Loading branch information
LTLA committed Nov 24, 2024
1 parent 0a49952 commit 8abb0c7
Show file tree
Hide file tree
Showing 9 changed files with 63 additions and 35 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: SingleR
Title: Reference-Based Single-Cell RNA-Seq Annotation
Version: 2.9.1
Date: 2024-11-17
Version: 2.9.2
Date: 2024-11-24
Authors@R: c(person("Dvir", "Aran", email="[email protected]", role=c("aut", "cph")),
person("Aaron", "Lun", email="[email protected]", role=c("ctb", "cre")),
person("Daniel", "Bunis", role="ctb"),
Expand Down
2 changes: 1 addition & 1 deletion R/SingleR.R
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
#' @param test A numeric matrix of single-cell expression values where rows are genes and columns are cells.
#' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
#' @inheritParams trainSingleR
#' @param ref A numeric matrix of (usually log-transformed) expression values from a reference dataset,
#' @param ref A numeric matrix of (usually normalized and log-transformed) expression values from a reference dataset,
#' or a \linkS4class{SummarizedExperiment} object containing such a matrix;
#' see \code{\link{trainSingleR}} for details.
#'
Expand Down
2 changes: 1 addition & 1 deletion R/matchReferences.R
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
#' Match labels from a pair of references, corresponding to the same underlying cell type or state
#' but with differences in nomenclature.
#'
#' @param ref1,ref2 Numeric matrices of single-cell (usually log-transformed) expression values where rows are genes and columns are cells.
#' @param ref1,ref2 Numeric matrices of single-cell (usually normalized and log-transformed) expression values where rows are genes and columns are cells.
#' Alternatively, \linkS4class{SummarizedExperiment} objects containing such matrices.
#' @param labels1,labels2 A character vector or factor of known labels for all cells in \code{ref1} and \code{ref2}, respectively.
#' @param ... Further arguments to pass to \code{\link{SingleR}}.
Expand Down
40 changes: 26 additions & 14 deletions R/trainSingleR.R
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
#'
#' @param ref A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples).
#' Each row should be named with the gene name.
#' In general, the expression values are expected to be log-transformed, see Details.
#' In general, the expression values are expected to be normalized and log-transformed, see Details.
#'
#' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
#'
Expand All @@ -28,8 +28,7 @@
#' @param sd.thresh Deprecated and ignored.
#' @param de.method String specifying how DE genes should be detected between pairs of labels.
#' Defaults to \code{"classic"}, which sorts genes by the log-fold changes and takes the top \code{de.n}.
#' Setting to \code{"wilcox"} or \code{"t"} will use Wilcoxon ranked sum test or Welch t-test between labels, respectively,
#' and take the top \code{de.n} upregulated genes per comparison.
#' Other options are \code{"wilcox"} and \code{"t"}, see Details.
#' Ignored if \code{genes} is a list of markers/DE genes.
#' @param de.n An integer scalar specifying the number of DE genes to use when \code{genes="de"}.
#' If \code{de.method="classic"}, defaults to \code{500 * (2/3) ^ log2(N)} where \code{N} is the number of unique labels.
Expand Down Expand Up @@ -73,10 +72,25 @@
#' The resulting objects can be re-used across multiple classification steps with different test data sets via \code{\link{classifySingleR}}.
#' This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.
#'
#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between labels.
#' This is done by identifying the median expression within each label, and computing differences between medians for each pair of labels.
#' For each label, the top \code{de.n} genes with the largest differences compared to another label are chosen as markers to distinguish the two labels.
#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between pairs of labels in the reference dataset.
#' The expression values are expected to be log-transformed and normalized.
#' For each pair of labels, the top \code{de.n} genes with strongest upregulation in one label are chosen as markers to distinguish it from the other label.
#' The exact ranking depends on the \code{de.method=} argument:
#' \itemize{
#' \item The default \code{de.method="classic"} will use \code{\link{getClassicMarkers}} to compute the median expression for each label and each gene.
#' Then, for each pair of labels, the top \code{de.n} genes with the largest positive differences are chosen as markers to distinguish the first label from the second.
#' This is intended for reference datasets derived from bulk transcriptomic data (e.g., microarrays) with a high density of non-zero values.
#' It is less effective for single-cell data, where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
#' \item \code{de.method="wilcox"} will rank genes based on the area under the curve (AUC) in each pairwise comparison between labels.
#' The top \code{de.n} genes with the largest AUCs above 0.5 are chosen as markers for the first label compared to the second.
#' This is analogous to ranking on significance in the Wilcoxon ranked sum test and is intended for use with single-cell data.
#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
#' \item \code{de.method="t"} will rank genes on the Cohen's d in each pairiwse comparison.
#' The top \code{de.n} genes with the largest positive Cohen's d are chosen as markers for the first label compared to the second.
#' This is roughly analogous to ranking on significance in the t-test and is faster than the AUC.
#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
#' }
#' Alternatively, users can detect markers externally and pass a list of markers to \code{genes} (see \dQuote{Custom gene specification}).
#'
#' Classification with \code{classifySingleR} assumes that the test dataset contains all marker genes that were detected from the reference.
#' If the test and reference datasets do not have the same genes in the same order, we can set \code{test.genes} to the row names of the test dataset.
Expand All @@ -88,7 +102,7 @@
#' It has the same effect as filtering out undesirable rows from \code{ref} prior to calling \code{trainSingleR}.
#' Unlike \code{test.genes}, setting \code{restrict} does not introduce further checks on \code{rownames(test)} in \code{classifySingleR}.
#'
#' @section Custom feature specification:
#' @section Custom gene specification:
#' Rather than relying on the in-built feature selection, users can pass in their own features of interest to \code{genes}.
#' The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels.
#' For example:
Expand All @@ -115,10 +129,12 @@
#' This allows the function to handle pre-defined marker lists for specific cell populations.
#' However, it obviously captures less information than marker sets for the pairwise comparisons.
#'
#' If \code{genes} is manually passed, \code{ref} can be the raw counts or any monotonic transformation thereof.
#' If \code{genes} is manually passed, \code{ref} can contain the raw counts or any monotonic transformation thereof.
#' There is no need to supply (log-)normalized expression values for the benefit of the automatic marker detection.
#' Similarly, for manual \code{genes}, the values of \code{de.method}, \code{de.n} and \code{sd.thresh} have no effect.
#'
#' Check out the Examples to see how manual \code{genes} can be passed to \code{trainSingleR}.
#'
#' @section Dealing with multiple references:
#' The default \pkg{SingleR} policy for dealing with multiple references is to perform the classification for each reference separately and combine the results
#' (see \code{?\link{combineRecomputedResults}} for an explanation).
Expand All @@ -131,12 +147,8 @@
#' where each inner list corresponds to a reference in \code{ref} and can be of any format described in \dQuote{Custom feature specification}.
#' Thus, it is possible for \code{genes} to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.
#'
#' @section Note on single-cell references:
#' The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind.
#' It may not be effective for single-cell reference data where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
#' Users are recommended to either set \code{de.method} to another DE ranking method, or detect markers externally and pass a list of markers to \code{genes} (see Examples).
#'
#' In addition, it is generally unnecessary to have single-cell resolution on the reference profiles.
#' @section Aggregating single-cell references:
#' It is generally unnecessary to have single-cell resolution on the reference profiles.
#' We can instead set \code{aggr.ref=TRUE} to aggregate per-cell references into a set of pseudo-bulk profiles using \code{\link{aggregateReference}}.
#' This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information.
#' Note that any aggregation is done \emph{after} marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation.
Expand Down
2 changes: 1 addition & 1 deletion man/SingleR.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/getClassicMarkers.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/matchReferences.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

40 changes: 26 additions & 14 deletions man/trainSingleR.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions vignettes/SingleR.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ hpca.se

Our test dataset consists of some human embryonic stem cells [@lamanno2016molecular] from the `r Biocpkg("scRNAseq")` package.
For the sake of speed, we will only label the first 100 cells from this dataset.
The test dataset does not need to be log-transformed, as only the ranks will be used by `SingleR()`.

```{r}
library(scRNAseq)
Expand Down Expand Up @@ -110,6 +111,9 @@ To speed up this demonstration, we will subset to the first 100 cells.
```{r}
sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
# Log-transformation is not actually required for SingleR itself,
# but we'll do it anyway as we need it for plotting heatmaps later.
sceG <- logNormCounts(sceG)
```

Expand Down

0 comments on commit 8abb0c7

Please sign in to comment.