Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] reduce_ranges silently creates list columns and fails to do so so if nothing is reduced #101

Open
Nicolai-vKuegelgen opened this issue Mar 1, 2023 · 0 comments

Comments

@Nicolai-vKuegelgen
Copy link

This bug has two sides - one of which was reported as #92 - but there seems to be more to this, so I'm opening a new issue.

Bug description

1) Creation of List columns

Currently the reduce_ranges function implicitly creates list-columns when reducing any metadata column without any other functions (reduce_ranges(col1 = col1)).
If one tries to specifically create list columns (reduce_ranges(col1 = list(col1))), as one would do in i.e. tidyverse functions, then each entry of the list will contain the whole metadata column. This is also the case even when using group_by.

Based on the tutorial referred to in #92 I assume that this current behavior has not always been the case and is unintended or a bug. Given that plyranges::summarise does create list columns this way when used after group_by, but not when used alone, there is at least some inconsistency here, even if the current behavior is intended.

2) Trying to create list columns crashes when no ranges are reduced

Since reduce_ranges(col1 = col1) can create list-columns I tried to use it that way. However, this leads to an error when nothing would be reduced but metadata columns need to be handled somehow.

Reproducible code example

suppressMessages(library(plyranges))

gr <- data.frame(seqnames = "chr1",
                 start = c(1, 5, 25,  45, 60, 65),
                 end = c(10, 20, 35, 55, 70, 75),
                 strand = c("*", "*", "*", "*", "*", "*"),
                 gene=c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
                 genetype = c('coding', 'coding', 'coding', 
                             'coding', 'coding', 'non-coding')) %>%
    as_granges()

## This creates list columns from the overlapping genes
gr %>% reduce_ranges(gene = gene)
#> GRanges object with 4 ranges and 1 metadata column:
#>       seqnames    ranges strand |            gene
#>          <Rle> <IRanges>  <Rle> | <CharacterList>
#>   [1]     chr1      1-20      * |     geneA,geneB
#>   [2]     chr1     25-35      * |           geneC
#>   [3]     chr1     45-55      * |           geneD
#>   [4]     chr1     60-75      * |     geneE,geneF
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## This puts all genes into each list entry
gr %>% reduce_ranges(gene = list(gene))
#> GRanges object with 4 ranges and 1 metadata column:
#>       seqnames    ranges strand |                        gene
#>          <Rle> <IRanges>  <Rle> |                      <List>
#>   [1]     chr1      1-20      * | geneA,geneB,geneC,geneD,...
#>   [2]     chr1     25-35      * | geneA,geneB,geneC,geneD,...
#>   [3]     chr1     45-55      * | geneA,geneB,geneC,geneD,...
#>   [4]     chr1     60-75      * | geneA,geneB,geneC,geneD,...
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Nothing overlaps anymore: error
gr[2:5] %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE

## without metadata columns this does not happen
gr[2:5] %>% reduce_ranges()
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      5-20      *
#>   [2]     chr1     25-35      *
#>   [3]     chr1     45-55      *
#>   [4]     chr1     60-70      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## group_by works 
gr %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |    genetype            gene
#>          <Rle> <IRanges>  <Rle> | <character> <CharacterList>
#>   [1]     chr1      1-20      * |      coding     geneA,geneB
#>   [2]     chr1     25-35      * |      coding           geneC
#>   [3]     chr1     45-55      * |      coding           geneD
#>   [4]     chr1     60-70      * |      coding           geneE
#>   [5]     chr1     65-75      * |  non-coding           geneF
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## but has the same issue when explicitly creating lists
gr %>% group_by(genetype) %>% reduce_ranges(gene = list(gene))
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |    genetype                        gene
#>          <Rle> <IRanges>  <Rle> | <character>                      <List>
#>   [1]     chr1      1-20      * |      coding geneA,geneB,geneC,geneD,...
#>   [2]     chr1     25-35      * |      coding geneA,geneB,geneC,geneD,...
#>   [3]     chr1     45-55      * |      coding geneA,geneB,geneC,geneD,...
#>   [4]     chr1     60-70      * |      coding geneA,geneB,geneC,geneD,...
#>   [5]     chr1     65-75      * |  non-coding geneA,geneB,geneC,geneD,...
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## if nothing is reduced due to groups, the error is still there
gr[2:6] %>% group_by(genetype) %>% reduce_ranges(gene = gene)
#> Error in summarize_rng(.data, dots): length(ans[[i]]) == nr is not TRUE

## without metadata columns again no error
gr[2:6] %>% group_by(genetype) %>% reduce_ranges()
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |    genetype
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1      5-20      * |      coding
#>   [2]     chr1     25-35      * |      coding
#>   [3]     chr1     45-55      * |      coding
#>   [4]     chr1     60-70      * |      coding
#>   [5]     chr1     65-75      * |  non-coding
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2023-03-01 with reprex v2.0.2

R session information

suppressMessages(library(plyranges))
options(width = 120)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Ubuntu 22.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-03-01
#>  pandoc   2.19.2 @ /home/vonkunic_c/miniconda3/envs/cnv-pipeline/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  assertthat             0.2.1     2019-03-21 [1] CRAN (R 4.1.3)
#>  Biobase                2.54.0    2021-10-26 [1] Bioconductor
#>  BiocGenerics         * 0.40.0    2021-10-26 [1] Bioconductor
#>  BiocIO                 1.4.0     2021-10-26 [1] Bioconductor
#>  BiocParallel           1.28.3    2021-12-09 [1] Bioconductor
#>  Biostrings             2.62.0    2021-10-26 [1] Bioconductor
#>  bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.1.3)
#>  cli                    3.6.0     2023-01-09 [1] CRAN (R 4.1.3)
#>  crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.1.3)
#>  DBI                    1.1.3     2022-06-18 [1] CRAN (R 4.1.3)
#>  DelayedArray           0.20.0    2021-10-26 [1] Bioconductor
#>  digest                 0.6.31    2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr                  1.0.10    2022-09-01 [1] CRAN (R 4.1.3)
#>  evaluate               0.20      2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi                  1.0.4     2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap                1.1.0     2021-01-25 [1] CRAN (R 4.1.3)
#>  fs                     1.6.0     2023-01-23 [1] CRAN (R 4.1.3)
#>  generics               0.1.3     2022-07-05 [1] CRAN (R 4.1.3)
#>  GenomeInfoDb         * 1.30.1    2022-01-30 [1] Bioconductor
#>  GenomeInfoDbData       1.2.7     2023-01-26 [1] Bioconductor
#>  GenomicAlignments      1.30.0    2021-10-26 [1] Bioconductor
#>  GenomicRanges        * 1.46.1    2021-11-18 [1] Bioconductor
#>  glue                   1.6.2     2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools              0.5.4     2022-12-07 [1] CRAN (R 4.1.3)
#>  IRanges              * 2.28.0    2021-10-26 [1] Bioconductor
#>  knitr                  1.42      2023-01-25 [1] CRAN (R 4.1.3)
#>  lattice                0.20-45   2021-09-22 [1] CRAN (R 4.1.3)
#>  lifecycle              1.0.3     2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr               2.0.3     2022-03-30 [1] CRAN (R 4.1.3)
#>  Matrix                 1.5-3     2022-11-11 [1] CRAN (R 4.1.3)
#>  MatrixGenerics         1.6.0     2021-10-26 [1] Bioconductor
#>  matrixStats            0.63.0    2022-11-18 [1] CRAN (R 4.1.3)
#>  pillar                 1.8.1     2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.1.3)
#>  plyranges            * 1.14.0    2021-10-26 [1] Bioconductor
#>  R6                     2.5.1     2021-08-19 [1] CRAN (R 4.1.3)
#>  RCurl                  1.98-1.9  2022-10-03 [1] CRAN (R 4.1.3)
#>  reprex                 2.0.2     2022-08-17 [1] CRAN (R 4.1.3)
#>  restfulr               0.0.15    2022-06-16 [1] CRAN (R 4.1.3)
#>  rjson                  0.2.21    2022-01-09 [1] CRAN (R 4.1.3)
#>  rlang                  1.0.6     2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown              2.20      2023-01-19 [1] CRAN (R 4.1.3)
#>  Rsamtools              2.10.0    2021-10-26 [1] Bioconductor
#>  rstudioapi             0.14      2022-08-22 [1] CRAN (R 4.1.3)
#>  rtracklayer            1.54.0    2021-10-26 [1] Bioconductor
#>  S4Vectors            * 0.32.4    2022-03-24 [1] Bioconductor
#>  sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.1.3)
#>  SummarizedExperiment   1.24.0    2021-10-26 [1] Bioconductor
#>  tibble                 3.1.8     2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8                   1.2.2     2021-07-24 [1] CRAN (R 4.1.3)
#>  vctrs                  0.5.2     2023-01-23 [1] CRAN (R 4.1.3)
#>  withr                  2.5.0     2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun                   0.36      2022-12-21 [1] CRAN (R 4.1.3)
#>  XML                    3.99-0.13 2022-12-04 [1] CRAN (R 4.1.3)
#>  XVector                0.34.0    2021-10-26 [1] Bioconductor
#>  yaml                   2.3.7     2023-01-23 [1] CRAN (R 4.1.3)
#>  zlibbioc               1.40.0    2021-10-26 [1] Bioconductor
#> 
#>  [1] /home/vonkunic_c/miniconda3/envs/cnv-pipeline/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Created on 2023-03-01 with reprex v2.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant