Ambiguous column names in SPE objects with multiple samples #100
Comments
If |
Hi, I'm not a fan of pasting numbers since you might want to re-order the samples later and thus "sample 1" might become "sample 10". This happened to us in one project (Lukas: Visium IF AD). Hence they Initially the Best, |
@lcolladotor I don't think I have sufficient context to understand your concern, but isn't the |
Well, I guess that you could already re-run the code that pastes numbers based on the sample ids to fix the issue I have. Here's a full
## Some example data, imagine we read sample_id C, then B, then A
df <- data.frame(
sample_id = rep(LETTERS[3:1], each = 5),
barcode = rep(c("AAA", "AAC", "ABC", "ACD", "ADC"), 3)
)
## Code that appends the number based on the unique sample_id
add_key <- function(df) {
with(df, paste0(barcode, "-", as.integer(factor(sample_id, levels = unique(sample_id)))))
}
df$key_initial <- add_key(df)
## Code that adds the actual sample_id instead of a number
df$key_sample_id <- with(df, paste0(barcode, "_", sample_id))
## Note how the first sample_id is C, so it gets a -1 at the end
## then B, so it gets a -2, then A so it gets a -3 at the end.
unique(df$sample_id)
#> [1] "C" "B" "A"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#> meaning
#> sample_id 1 2 3
#> A 0 0 5
#> B 0 5 0
#> C 5 0 0
## Now we see what it looks like with sample_id
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#> meaning
#> sample_id A B C
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
## Re-order data after having created it. Maybe we want to have sample_id's
## listed as A, then B, then C now.
df <- df[c(11:15, 6:10, 1:5), ]
## After having re-ordered the data, now the first sample_id gets a -3 which
## can be confusing. Since the -1 lost the meaning of being the first
## sample_id.
unique(df$sample_id)
#> [1] "A" "B" "C"
with(df, table(sample_id, meaning = gsub(".*-", "", key_initial)))
#> meaning
#> sample_id 1 2 3
#> A 0 0 5
#> B 0 5 0
#> C 5 0 0
## Re-running the code that adds the -Number at the end restores the meaning
## of -1 being the first sample.
df$key_new <- add_key(df)
with(df, table(sample_id, meaning = gsub(".*-", "", key_new)))
#> meaning
#> sample_id 1 2 3
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
## This hasn't changed in meaning even after re-ordering the data
with(df, table(sample_id, meaning = gsub(".*_", "", key_sample_id)))
#> meaning
#> sample_id A B C
#> A 5 0 0
#> B 0 5 0
#> C 0 0 5
Created on 2022-02-22 by the reprex package (v2.0.1)
That's why I prefer to use the |
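As a complement to the reprex above, here is a minimal sketch of how a key of the form `barcode_sample_id` could be split back into its two parts. `split_key()` is a hypothetical helper, not part of any package, and it assumes that barcodes never contain an underscore (true for the barcodes shown in this thread, but an assumption in general).

```r
## Hypothetical helper: split keys of the form "<barcode>_<sample_id>".
## Assumes the barcode itself contains no underscore, so the FIRST "_" is
## treated as the separator even when sample_id contains underscores.
split_key <- function(key) {
    data.frame(
        barcode = sub("_.*$", "", key),
        sample_id = sub("^[^_]*_", "", key),
        stringsAsFactors = FALSE
    )
}

## Made-up keys; the second sample_id deliberately contains an underscore
keys <- c("AAACAACGAATAGTTC-1_section1", "AAACAACGAATAGTTC-1_Br_3873")
parts <- split_key(keys)
parts$barcode
parts$sample_id
```

Note that `gsub(".*_", "", key)`, as used in the reprex, keeps only the text after the last underscore, so it would truncate a sample ID that itself contains one. Splitting on the first underscore avoids that under the assumption stated above.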
FYI After my initial comment #100 (comment) I ended up making |
I don't much care if it's the |
Getting a little off-topic but related to @drighelli's comment:
As a user familiar with
suppressPackageStartupMessages(library(SpatialExperiment))
suppressPackageStartupMessages(library(DropletUtils))
dir <- system.file(
file.path("extdata", "10xVisium"),
package = "SpatialExperiment")
sample_ids <- c("section1", "section2")
samples <- file.path(dir, sample_ids)
# SPE uses `sample_id` whereas SCE uses `sample.names`.
spe <- read10xVisium(
samples = samples,
sample_id = sample_ids,
type = "sparse",
data = "raw",
images = "lowres",
load = FALSE)
sce <- read10xCounts(
samples = file.path(samples, "raw_feature_bc_matrix"),
sample.names = sample_ids,
# NOTE: col.names = FALSE is the default
col.names = TRUE,
type = "sparse")
# SPE has duplicated colnames, SCE doesn't
anyDuplicated(colnames(spe))
#> [1] 51
anyDuplicated(colnames(sce))
#> [1] 0
# SCE colnames are prefixed by a number (non-default option) but SPE aren't
head(colnames(spe))
#> [1] "AAACAACGAATAGTTC-1" "AAACAAGTATCTCCCA-1" "AAACAATCTACTAGCA-1"
#> [4] "AAACACCAATAACTGC-1" "AAACAGAGCGACTCCT-1" "AAACAGCTTTCAGAAG-1"
head(colnames(sce))
#> [1] "1_AAACAACGAATAGTTC-1" "1_AAACAAGTATCTCCCA-1" "1_AAACAATCTACTAGCA-1"
#> [4] "1_AAACACCAATAACTGC-1" "1_AAACAGAGCGACTCCT-1" "1_AAACAGCTTTCAGAAG-1"
tail(colnames(spe))
#> [1] "AAAGGTAAGCTGTACC-1" "AAAGGTCAACGACATG-1" "AAAGTAGCATTGCTCA-1"
#> [4] "AAAGTCACTGATGTAA-1" "AAAGTCGACCCTCAGT-1" "AAAGTGCCATCAATTA-1"
tail(colnames(sce))
#> [1] "2_AAAGGTAAGCTGTACC-1" "2_AAAGGTCAACGACATG-1" "2_AAAGTAGCATTGCTCA-1"
#> [4] "2_AAAGTCACTGATGTAA-1" "2_AAAGTCGACCCTCAGT-1" "2_AAAGTGCCATCAATTA-1"
# coldata differs (in both colnames and columns)
colData(spe)
#> DataFrame with 99 rows and 1 column
#> sample_id
#> <character>
#> AAACAACGAATAGTTC-1 section1
#> AAACAAGTATCTCCCA-1 section1
#> AAACAATCTACTAGCA-1 section1
#> AAACACCAATAACTGC-1 section1
#> AAACAGAGCGACTCCT-1 section1
#> ... ...
#> AAAGGTCAACGACATG-1 section2
#> AAAGTAGCATTGCTCA-1 section2
#> AAAGTCACTGATGTAA-1 section2
#> AAAGTCGACCCTCAGT-1 section2
#> AAAGTGCCATCAATTA-1 section2
colData(sce)
#> DataFrame with 100 rows and 2 columns
#> Sample Barcode
#> <character> <character>
#> 1_AAACAACGAATAGTTC-1 section1 AAACAACGAATAGTTC-1
#> 1_AAACAAGTATCTCCCA-1 section1 AAACAAGTATCTCCCA-1
#> 1_AAACAATCTACTAGCA-1 section1 AAACAATCTACTAGCA-1
#> 1_AAACACCAATAACTGC-1 section1 AAACACCAATAACTGC-1
#> 1_AAACAGAGCGACTCCT-1 section1 AAACAGAGCGACTCCT-1
#> ... ... ...
#> 2_AAAGGTCAACGACATG-1 section2 AAAGGTCAACGACATG-1
#> 2_AAAGTAGCATTGCTCA-1 section2 AAAGTAGCATTGCTCA-1
#> 2_AAAGTCACTGATGTAA-1 section2 AAAGTCACTGATGTAA-1
#> 2_AAAGTCGACCCTCAGT-1 section2 AAAGTCGACCCTCAGT-1
#> 2_AAAGTGCCATCAATTA-1 section2 AAAGTGCCATCAATTA-1
# rowdata differs (in both colnames and columns)
rowData(spe)
#> DataFrame with 50 rows and 1 column
#> symbol
#> <character>
#> ENSMUSG00000051951 Xkr4
#> ENSMUSG00000089699 Gm1992
#> ENSMUSG00000102343 Gm37381
#> ENSMUSG00000025900 Rp1
#> ENSMUSG00000025902 Sox17
#> ... ...
#> ENSMUSG00000025938 Slco5a1
#> ENSMUSG00000099498 Gm29283
#> ENSMUSG00000042414 Prdm14
#> ENSMUSG00000005886 Ncoa2
#> ENSMUSG00000101476 Gm29570
rowData(sce)
#> DataFrame with 50 rows and 3 columns
#> ID Symbol Type
#> <character> <character> <character>
#> ENSMUSG00000051951 ENSMUSG00000051951 Xkr4 Gene Expression
#> ENSMUSG00000089699 ENSMUSG00000089699 Gm1992 Gene Expression
#> ENSMUSG00000102343 ENSMUSG00000102343 Gm37381 Gene Expression
#> ENSMUSG00000025900 ENSMUSG00000025900 Rp1 Gene Expression
#> ENSMUSG00000025902 ENSMUSG00000025902 Sox17 Gene Expression
#> ... ... ... ...
#> ENSMUSG00000025938 ENSMUSG00000025938 Slco5a1 Gene Expression
#> ENSMUSG00000099498 ENSMUSG00000099498 Gm29283 Gene Expression
#> ENSMUSG00000042414 ENSMUSG00000042414 Prdm14 Gene Expression
#> ENSMUSG00000005886 ENSMUSG00000005886 Ncoa2 Gene Expression
#> ENSMUSG00000101476 ENSMUSG00000101476 Gm29570 Gene Expression
Created on 2022-02-23 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.2 (2021-11-01)
#> os Ubuntu 20.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en_AU:en
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2022-02-23
#> pandoc 2.17.1.1 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> beachmat 2.10.0 2021-10-26 [1] RSPM (R 4.1.2)
#> Biobase * 2.54.0 2021-10-26 [1] RSPM (R 4.1.2)
#> BiocGenerics * 0.40.0 2021-10-26 [1] RSPM (R 4.1.2)
#> BiocParallel 1.28.3 2021-12-09 [1] RSPM (R 4.1.2)
#> bitops 1.0-7 2021-04-24 [1] RSPM (R 4.1.0)
#> cli 3.2.0 2022-02-14 [1] RSPM (R 4.1.2)
#> DelayedArray 0.20.0 2021-10-26 [1] RSPM (R 4.1.2)
#> DelayedMatrixStats 1.16.0 2021-10-26 [1] RSPM (R 4.1.2)
#> digest 0.6.29 2021-12-01 [1] RSPM (R 4.1.2)
#> dqrng 0.3.0 2021-05-01 [1] RSPM (R 4.1.0)
#> DropletUtils * 1.14.2 2022-01-09 [1] RSPM (R 4.1.2)
#> edgeR 3.36.0 2021-10-26 [1] RSPM (R 4.1.2)
#> evaluate 0.15 2022-02-18 [1] RSPM (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
#> fs 1.5.2 2021-12-08 [1] RSPM (R 4.1.2)
#> GenomeInfoDb * 1.30.1 2022-01-30 [1] RSPM (R 4.1.2)
#> GenomeInfoDbData 1.2.7 2021-10-28 [1] RSPM (R 4.1.1)
#> GenomicRanges * 1.46.1 2021-11-18 [1] RSPM (R 4.1.2)
#> glue 1.6.1 2022-01-22 [1] RSPM (R 4.1.2)
#> HDF5Array 1.22.1 2021-11-14 [1] RSPM (R 4.1.2)
#> highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
#> htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0)
#> IRanges * 2.28.0 2021-10-26 [1] RSPM (R 4.1.2)
#> knitr 1.37 2021-12-16 [1] RSPM (R 4.1.0)
#> lattice 0.20-45 2021-09-22 [4] CRAN (R 4.1.1)
#> limma 3.50.1 2022-02-17 [1] Bioconductor
#> locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.1.0)
#> magick 2.7.3 2021-08-18 [1] CRAN (R 4.1.2)
#> magrittr 2.0.2 2022-01-26 [1] RSPM (R 4.1.2)
#> Matrix 1.4-0 2021-12-08 [4] CRAN (R 4.1.2)
#> MatrixGenerics * 1.6.0 2021-10-26 [1] RSPM (R 4.1.2)
#> matrixStats * 0.61.0 2021-09-17 [1] RSPM (R 4.1.1)
#> R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.1.0)
#> R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.1.0)
#> R.utils 2.11.0 2021-09-26 [1] RSPM (R 4.1.0)
#> Rcpp 1.0.8 2022-01-13 [1] RSPM (R 4.1.2)
#> RCurl 1.98-1.6 2022-02-08 [1] RSPM (R 4.1.2)
#> reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
#> rhdf5 2.38.0 2021-10-26 [1] RSPM (R 4.1.2)
#> rhdf5filters 1.6.0 2021-10-26 [1] RSPM (R 4.1.2)
#> Rhdf5lib 1.16.0 2021-10-26 [1] RSPM (R 4.1.2)
#> rjson 0.2.21 2022-01-09 [1] RSPM (R 4.1.0)
#> rlang 1.0.1 2022-02-03 [1] RSPM (R 4.1.2)
#> rmarkdown 2.11 2021-09-14 [1] RSPM (R 4.1.1)
#> rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
#> S4Vectors * 0.32.3 2021-11-21 [1] RSPM (R 4.1.2)
#> scuttle 1.4.0 2021-10-26 [1] RSPM (R 4.1.2)
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.1.2)
#> SingleCellExperiment * 1.16.0 2021-10-26 [1] RSPM (R 4.1.2)
#> sparseMatrixStats 1.6.0 2021-10-26 [1] RSPM (R 4.1.2)
#> SpatialExperiment * 1.4.0 2021-10-26 [1] Bioconductor
#> stringi 1.7.6 2021-11-29 [1] RSPM (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
#> SummarizedExperiment * 1.24.0 2021-10-26 [1] RSPM (R 4.1.2)
#> withr 2.4.3 2021-11-30 [1] RSPM (R 4.1.2)
#> xfun 0.29 2021-12-14 [1] RSPM (R 4.1.0)
#> XVector 0.34.0 2021-10-26 [1] RSPM (R 4.1.2)
#> yaml 2.3.5 2022-02-21 [1] RSPM (R 4.1.0)
#> zlibbioc 1.40.0 2021-10-26 [1] RSPM (R 4.1.2)
#>
#> [1] /home/peter/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
One other minor thing I noticed with SPE is that its tab-completion of the
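For the duplicated column names shown in the reprex above, one possible user-side workaround is to prefix the column names with the sample ID after reading the data. The sketch below uses a plain matrix as a stand-in so that it runs without Bioconductor; `prefix_colnames()` is a hypothetical helper, not a function from SpatialExperiment or DropletUtils.

```r
## Hypothetical helper: prefix column names with the sample each column
## came from. Works on any object that supports colnames() and ncol().
prefix_colnames <- function(x, sample_id, sep = "_") {
    stopifnot(length(sample_id) == ncol(x))
    colnames(x) <- paste(sample_id, colnames(x), sep = sep)
    x
}

## Toy stand-in for two samples that share the same barcodes
counts <- matrix(
    0L,
    nrow = 2, ncol = 4,
    dimnames = list(c("gene1", "gene2"), rep(c("AAA-1", "AAC-1"), 2))
)
sample_id <- rep(c("section1", "section2"), each = 2)

## Column names are duplicated before, unique after
dup_before <- anyDuplicated(colnames(counts))
counts <- prefix_colnames(counts, sample_id)
dup_after <- anyDuplicated(colnames(counts))
colnames(counts)
```

On the SPE from the reprex, the equivalent call would presumably be `prefix_colnames(spe, spe$sample_id)`, but that has not been run against the object above, so treat it as an untested assumption.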
Thanks @PeteHaitch for such an exhaustive overview of this issue. In my personal opinion, I like the idea of having consistency between the
About the
Also, I don't know if the community has "standards" for the
About the $-autocompletion, I don't know why RStudio acts like that, but when using the R console it works fine for me!
Interesting, I'd meant to check that and I now see that in command-line R it works as you observe, so it's an RStudio-specific thing but would still be good to resolve. |
Re @PeteHaitch's comment on
|
Thanks @PeteHaitch for the detailed examples. I agree that some of these differences vs. the defaults from Specifically, I think we could think about changing the following in
On the question of whether to prefix colnames with |
Thanks, @lmweber. I'd favour using the exact same The
I agree, and it's consistent with what |
This was raised by @PeteHaitch on Slack.
Currently when we create an SPE object containing multiple Visium samples with `read10xVisium()`, the column names (barcode IDs) are repeated, since 10x Genomics uses the same set of 4992 barcode IDs for each capture area.
We could think about disambiguating this in `read10xVisium()`, e.g. using something like `key_id <- paste(sample_id, barcode_id, sep = "_")` for the column names, which is how we have done it in our `spatialLIBD` Shiny apps with @lcolladotor (where column names need to be unique).
There is also the following (slightly different) precedent from `DropletUtils::read10xCounts()` from single-cell, also mentioned by @PeteHaitch on Slack:
"If col.names=TRUE and length(sample)==1, each column is named by the cell barcode. For multiple samples, the index of each sample in samples is concatenated to the cell barcode to form the column name. This avoids problems with multiple instances of the same cell barcodes in different samples."
Note that in our case Space Ranger already appends a `-1` to all barcode IDs for Visium data.
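To make the two naming schemes concrete, here is a minimal base R sketch on made-up data. The variable names `barcode_id`, `key_id` and `key_index` are illustrative only, and the index-based option is written in the spirit of the `read10xCounts()` description quoted above; it is not that function's actual implementation.

```r
## Toy data: two capture areas that share the same barcode IDs
barcode_id <- rep(c("AAACAACGAATAGTTC-1", "AAACAAGTATCTCCCA-1"), times = 2)
sample_id <- rep(c("section2", "section1"), each = 2)

## Barcodes alone are ambiguous across samples
anyDuplicated(barcode_id)
#> [1] 3

## Option 1 (the proposal in this issue): prefix with the sample ID
key_id <- paste(sample_id, barcode_id, sep = "_")

## Option 2 (index-based, as an illustration): prefix with the position of
## each sample in order of first appearance
key_index <- paste(match(sample_id, unique(sample_id)), barcode_id, sep = "_")

## Both options give unique column names
anyDuplicated(key_id)
#> [1] 0
anyDuplicated(key_index)
#> [1] 0
```

The difference is that `key_id` depends only on the sample ID, whereas `key_index` depends on the order in which the samples were read, so re-reading the same data in a different order would yield different names under Option 2.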