Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should GenomicRanges and GenomicRangesList check that the row_names of mcols is the same as the name of the ranges in the parent object? #121

Open
DiDeoxy opened this issue Aug 20, 2024 · 4 comments

Comments

@DiDeoxy
Copy link

DiDeoxy commented Aug 20, 2024

Currently, GenomicRanges and GenomicRangesList only checks that the mcols BiocFrame has the same number of rows as there are ranges. However, it seems prudent to also ensure that mcols has the same row_names as the names of the ranges in the parent object. The lack of this guarantee makes it impossible to select mcols rows from a subset of ranges rows and many other operations. This also applies to the names of the IRanges object in GenomicRanges.

@jkanche
Copy link
Member

jkanche commented Sep 23, 2024

If I understand correctly, R does not enforce this either. Users are allowed to modify row names, and there are no built-in checks to ensure that the row names of IRanges and GenomicRanges objects match. It is assumed that there is a 1-to-1 index mapping between these two structures.

@DiDeoxy
Copy link
Author

DiDeoxy commented Oct 1, 2024

I did not look at the Bioconductor code code before this. What I see is that GRanges prevents mcols from having rownames: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L55. Probably something that BiocPy should also enforce? Also the ranges is prevented from having mcols but I am unsure if that is relevant: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L48

@jkanche
Copy link
Member

jkanche commented Oct 1, 2024

Weirdly, it doesn't seem to prevent creation of a genomic ranges object. It seems to elevate the mcols of iranges and merge it with genomicranges. I think this is confusing when objects magically change things and I prefer explicit construction of these objects.

> gr <- GRanges(
+   seqnames = c("chr1", "chr1"), 
+   ranges = IRanges(c(2, 9), width=c(6, 11), mcols=DataFrame(a = 1:2)),
+   strand = c("+", "-"))
> gr
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |   mcols.a
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1       2-7      + |         1
  [2]     chr1      9-19      - |         2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> ranges(gr)
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         2         7         6
  [2]         9        19        11

@DiDeoxy
Copy link
Author

DiDeoxy commented Oct 1, 2024

Transformation stuff happens here: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L187 through line 197.

I don't disagree, its just handling these differences when trying to read from or write to disk is very challenging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants