Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

DiDeoxy · 2024-09-10T17:14:31Z

Making SummarizedExperiments

In Bioconductor there is no RangedSummarizedExperiment that does not have row ranges, if you try to build a RangedSummarizedExperiment without rowRanges in Bioconductor you get a SummarizedExperiment back instead. I think in BiocPy the best way to handle this issue is to make row_ranges required in RangedSummarizedExperiment and/or a SummarizedExperiment/RangedSummarizedExperiment constructor function.

Row data in RangedSummarizedExperiment is incorrect:

In Bioconductor the rowData accessor calls the mcols accessor on the SummarizedExperiment: https://github.com/Bioconductor/SummarizedExperiment/blob/c5b7ca2f8d975af13a18b3b5931449f092657f5c/R/SummarizedExperiment-class.R#L121

In Bioconductor the mcols accessor of RangedSummarizedExperiment calls the mcols accessor of the rowRanges : https://github.com/Bioconductor/SummarizedExperiment/blob/c5b7ca2f8d975af13a18b3b5931449f092657f5c/R/RangedSummarizedExperiment-class.R#L215

Thus in Bioconductor the rowData of a RangedSummarizedExperiment is the mcols of the rowRanges.

In BiocPy the get_row_data method accesses the _rows BiocFrame attribute of the RangedSummarizedExperiment. It is likely necessary to remove the row_data init parameter of RangedSummarizedExperiment as part of fixing this issue, alternatively you could have any supplied row_data be column concatenated to the mcols of the required row_ranges.

There are 4+ row indices in RangedSummarizedExperiment

This relates to the issue raised about GenomicRanges: BiocPy/GenomicRanges#121

The IRanges of the GenomicRanges has a names index, the GenomicRanges object has a names index, the mcols of the GenomicRanges has a row_names index, the row_data of the RangedSummarizedExperiment has a row_names index. None of these are cross validated to ensure they have the same values in the same order, this will likely lead to confusion by users in the future. Which one of these is the correct one? It would be good to resolve this upon object construction.

jkanche · 2024-09-23T16:48:00Z

This behavior is managed differently in the R package through a dedicated constructor method. However, in the Python package, there is no equivalent constructor function. Users are expected to directly instantiate objects by invoking the appropriate class constructors.

Making SummarizedExperiments
In Bioconductor there is no RangedSummarizedExperiment that does not have row ranges, if you try to build a RangedSummarizedExperiment without rowRanges in Bioconductor you get a SummarizedExperiment back instead. I think in BiocPy the best way to handle this issue is to make row_ranges required in RangedSummarizedExperiment and/or a SummarizedExperiment/RangedSummarizedExperiment constructor function.

Given the complicated relationship between row_data and row_ranges in RangedSummarizedExperiment, we decided not to replicate this behavior in the Python versions of these classes. There is an open issue in the Bioconductor package’s GitHub repository that addresses the inconsistent behavior during mutations of these objects.

Row data in RangedSummarizedExperiment is incorrect:
...

it's not immediately clear why they require validation. I don't believe the R implementations enforce validation either. These names are primarily human-readable identifiers for intervals or regions; they are not supposed to be indices.

There are 4+ row indices in RangedSummarizedExperiment
This relates to the issue raised about GenomicRanges: BiocPy/GenomicRanges#121

The IRanges of the GenomicRanges has a names index, the GenomicRanges object has a names index, the mcols of the GenomicRanges has a row_names index, the row_data of the RangedSummarizedExperiment has a row_names index. None of these are cross validated to ensure they have the same values in the same order, this will likely lead to confusion by users in the future. Which one of these is the correct one? It would be good to resolve this upon object construction.

DiDeoxy · 2024-10-01T17:05:23Z

The seqnames of the GRanges and the names of the IRanges object in contains are not validated in Bioconductor, point taken.

However, since the mcols of RangedSummarizedExperiment are the mcols of its GRanges component and the mcols of a GRanges object is not allowed to have rownames: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L55. It could therefore make sense to prevent the mcols of RangedSummarizedExperiment in BiocPy from having row_names given the choice to not fully replicate the structure of the Bioconductor RangedSummarizedExperiment version in Biocpy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

DiDeoxy commented Sep 10, 2024 •

edited

Loading

jkanche commented Sep 23, 2024 •

edited

Loading

DiDeoxy commented Oct 1, 2024

Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

Comments

DiDeoxy commented Sep 10, 2024 • edited Loading

Making SummarizedExperiments

Row data in RangedSummarizedExperiment is incorrect:

There are 4+ row indices in RangedSummarizedExperiment

jkanche commented Sep 23, 2024 • edited Loading

DiDeoxy commented Oct 1, 2024

DiDeoxy commented Sep 10, 2024 •

edited

Loading

jkanche commented Sep 23, 2024 •

edited

Loading