Cleaned up docs about the expectations on the reference data.

SingleR-inc · Nov 24, 2024 · 8abb0c7 · 8abb0c7
1 parent 0a49952
commit 8abb0c7
Show file tree

Hide file tree

Showing 9 changed files with 63 additions and 35 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: SingleR
 Title: Reference-Based Single-Cell RNA-Seq Annotation
-Version: 2.9.1
-Date: 2024-11-17
+Version: 2.9.2
+Date: 2024-11-24
 Authors@R: c(person("Dvir", "Aran", email="[email protected]", role=c("aut", "cph")),
     person("Aaron", "Lun", email="[email protected]", role=c("ctb", "cre")),
     person("Daniel", "Bunis", role="ctb"),

diff --git a/R/SingleR.R b/R/SingleR.R
@@ -6,7 +6,7 @@
 #' @param test A numeric matrix of single-cell expression values where rows are genes and columns are cells.
 #' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 #' @inheritParams trainSingleR
-#' @param ref A numeric matrix of (usually log-transformed) expression values from a reference dataset,
+#' @param ref A numeric matrix of (usually normalized and log-transformed) expression values from a reference dataset,
 #' or a \linkS4class{SummarizedExperiment} object containing such a matrix;
 #' see \code{\link{trainSingleR}} for details.
 #'

diff --git a/R/matchReferences.R b/R/matchReferences.R
@@ -3,7 +3,7 @@
 #' Match labels from a pair of references, corresponding to the same underlying cell type or state
 #' but with differences in nomenclature.
 #' 
-#' @param ref1,ref2 Numeric matrices of single-cell (usually log-transformed) expression values where rows are genes and columns are cells.
+#' @param ref1,ref2 Numeric matrices of single-cell (usually normalized and log-transformed) expression values where rows are genes and columns are cells.
 #' Alternatively, \linkS4class{SummarizedExperiment} objects containing such matrices.
 #' @param labels1,labels2 A character vector or factor of known labels for all cells in \code{ref1} and \code{ref2}, respectively.
 #' @param ... Further arguments to pass to \code{\link{SingleR}}.

diff --git a/R/trainSingleR.R b/R/trainSingleR.R
@@ -4,7 +4,7 @@
 #' 
 #' @param ref A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples).
 #' Each row should be named with the gene name.
-#' In general, the expression values are expected to be log-transformed, see Details.
+#' In general, the expression values are expected to be normalized and log-transformed, see Details.
 #' 
 #' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 #'
@@ -28,8 +28,7 @@
 #' @param sd.thresh Deprecated and ignored.
 #' @param de.method String specifying how DE genes should be detected between pairs of labels.
 #' Defaults to \code{"classic"}, which sorts genes by the log-fold changes and takes the top \code{de.n}.
-#' Setting to \code{"wilcox"} or \code{"t"} will use Wilcoxon ranked sum test or Welch t-test between labels, respectively,
-#' and take the top \code{de.n} upregulated genes per comparison.
+#' Other options are \code{"wilcox"} and \code{"t"}, see Details.
 #' Ignored if \code{genes} is a list of markers/DE genes.
 #' @param de.n An integer scalar specifying the number of DE genes to use when \code{genes="de"}.
 #' If \code{de.method="classic"}, defaults to \code{500 * (2/3) ^ log2(N)} where \code{N} is the number of unique labels.
@@ -73,10 +72,25 @@
 #' The resulting objects can be re-used across multiple classification steps with different test data sets via \code{\link{classifySingleR}}.
 #' This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.
 #' 
-#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between labels.
-#' This is done by identifying the median expression within each label, and computing differences between medians for each pair of labels.
-#' For each label, the top \code{de.n} genes with the largest differences compared to another label are chosen as markers to distinguish the two labels.
+#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between pairs of labels in the reference dataset.
 #' The expression values are expected to be log-transformed and normalized.
+#' For each pair of labels, the top \code{de.n} genes with strongest upregulation in one label are chosen as markers to distinguish it from the other label.
+#' The exact ranking depends on the \code{de.method=} argument:
+#' \itemize{
+#' \item The default \code{de.method="classic"} will use \code{\link{getClassicMarkers}} to compute the median expression for each label and each gene.
+#' Then, for each pair of labels, the top \code{de.n} genes with the largest positive differences are chosen as markers to distinguish the first label from the second.
+#' This is intended for reference datasets derived from bulk transcriptomic data (e.g., microarrays) with a high density of non-zero values. 
+#' It is less effective for single-cell data, where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
+#' \item \code{de.method="wilcox"} will rank genes based on the area under the curve (AUC) in each pairwise comparison between labels.
+#' The top \code{de.n} genes with the largest AUCs above 0.5 are chosen as markers for the first label compared to the second.
+#' This is analogous to ranking on significance in the Wilcoxon ranked sum test and is intended for use with single-cell data.
+#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+#' \item \code{de.method="t"} will rank genes on the Cohen's d in each pairiwse comparison.
+#' The top \code{de.n} genes with the largest positive Cohen's d are chosen as markers for the first label compared to the second.
+#' This is roughly analogous to ranking on significance in the t-test and is faster than the AUC.
+#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+#' }
+#' Alternatively, users can detect markers externally and pass a list of markers to \code{genes} (see \dQuote{Custom gene specification}).
 #'
 #' Classification with \code{classifySingleR} assumes that the test dataset contains all marker genes that were detected from the reference.
 #' If the test and reference datasets do not have the same genes in the same order, we can set \code{test.genes} to the row names of the test dataset.
@@ -88,7 +102,7 @@
 #' It has the same effect as filtering out undesirable rows from \code{ref} prior to calling \code{trainSingleR}.
 #' Unlike \code{test.genes}, setting \code{restrict} does not introduce further checks on \code{rownames(test)} in \code{classifySingleR}.
 #'
-#' @section Custom feature specification:
+#' @section Custom gene specification:
 #' Rather than relying on the in-built feature selection, users can pass in their own features of interest to \code{genes}.
 #' The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels.
 #' For example:
@@ -115,10 +129,12 @@
 #' This allows the function to handle pre-defined marker lists for specific cell populations.
 #' However, it obviously captures less information than marker sets for the pairwise comparisons.
 #'
-#' If \code{genes} is manually passed, \code{ref} can be the raw counts or any monotonic transformation thereof.
+#' If \code{genes} is manually passed, \code{ref} can contain the raw counts or any monotonic transformation thereof.
 #' There is no need to supply (log-)normalized expression values for the benefit of the automatic marker detection.
 #' Similarly, for manual \code{genes}, the values of \code{de.method}, \code{de.n} and \code{sd.thresh} have no effect.
 #'
+#' Check out the Examples to see how manual \code{genes} can be passed to \code{trainSingleR}.
+#'
 #' @section Dealing with multiple references:
 #' The default \pkg{SingleR} policy for dealing with multiple references is to perform the classification for each reference separately and combine the results 
 #' (see \code{?\link{combineRecomputedResults}} for an explanation).
@@ -131,12 +147,8 @@
 #' where each inner list corresponds to a reference in \code{ref} and can be of any format described in \dQuote{Custom feature specification}.
 #' Thus, it is possible for \code{genes} to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.
 #'
-#' @section Note on single-cell references:
-#' The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind.
-#' It may not be effective for single-cell reference data where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
-#' Users are recommended to either set \code{de.method} to another DE ranking method, or detect markers externally and pass a list of markers to \code{genes} (see Examples).
-#'
-#' In addition, it is generally unnecessary to have single-cell resolution on the reference profiles.
+#' @section Aggregating single-cell references:
+#' It is generally unnecessary to have single-cell resolution on the reference profiles.
 #' We can instead set \code{aggr.ref=TRUE} to aggregate per-cell references into a set of pseudo-bulk profiles using \code{\link{aggregateReference}}.
 #' This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information.
 #' Note that any aggregation is done \emph{after} marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation.

diff --git a/man/SingleR.Rd b/man/SingleR.Rd
diff --git a/man/getClassicMarkers.Rd b/man/getClassicMarkers.Rd
diff --git a/man/matchReferences.Rd b/man/matchReferences.Rd
diff --git a/man/trainSingleR.Rd b/man/trainSingleR.Rd
diff --git a/vignettes/SingleR.Rmd b/vignettes/SingleR.Rmd
@@ -55,6 +55,7 @@ hpca.se
 
 Our test dataset consists of some human embryonic stem cells [@lamanno2016molecular] from the `r Biocpkg("scRNAseq")` package. 
 For the sake of speed, we will only label the first 100 cells from this dataset.
+The test dataset does not need to be log-transformed, as only the ranks will be used by `SingleR()`.
 
 ```{r}
 library(scRNAseq)
@@ -110,6 +111,9 @@ To speed up this demonstration, we will subset to the first 100 cells.
 ```{r}
 sceG <- GrunPancreasData()
 sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
+
+# Log-transformation is not actually required for SingleR itself,
+# but we'll do it anyway as we need it for plotting heatmaps later.
 sceG <- logNormCounts(sceG) 
 ```