Overhaul GRanges and GRangesList classes (#40)

This is a complete rewrite of both these classes following the functional paradigm from our [developer notes](https://github.com/BiocPy/developer_guide#use-functional-discipline). `GenomicRanges` is now closer to Bioconductor's GenomicRanges class both in the design and implementation. The class does not rely on pandas anymore. While we try to provide backwards compatibility to construct a GenomicRanges object from a pandas dataframe using the `from_pandas` method, please note that the default constructor to genomic ranges does not accept a pandas data frame anymore! Most range based methods have been reimplemented and the heavy lifting is done in the [IRanges package](https://github.com/BiocPy/IRanges) for interval operations. The package indirectly depends on [NCLS](https://github.com/pyranges/ncls) interval tree data structure to perform search and overlap operations. Tests, documentation and README has been updated to reflect these changes.
BiocPy · Nov 30, 2023 · 445207b · 445207b
1 parent 25f28fa
commit 445207b
Show file tree

Hide file tree

Showing 36 changed files with 3,503 additions and 3,939 deletions.
diff --git a/README.md b/README.md
@@ -4,9 +4,9 @@
 
 # GenomicRanges
 
-GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
+GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html). **_Intervals are inclusive on both ends and starts at 1._**
 
-**_Intervals are inclusive on both ends and starts at 1._**
+**Note: V0.4.0 is a complete overhaul of the package, as such the constructor to GenomicRanges has changed!**
 
 To get started, install the package from [PyPI](https://pypi.org/project/genomicranges/)
 
@@ -35,12 +35,55 @@ print(gr)
 ## ... truncating the console print ...
 ```
 
+### from `IRanges` (Preferred way)
+
+If you have all relevant information to create a GenomicRanges object
+
+```python
+from genomicranges import GenomicRanges
+from iranges import IRanges
+from biocframe import BiocFrame
+from random import random
+
+gr = GenomicRanges(
+    seqnames=[
+        "chr1",
+        "chr2",
+        "chr3",
+        "chr2",
+        "chr3",
+    ],
+    ranges=IRanges(start=[x for x in range(101, 106)], width=[11, 21, 25, 30, 5]),
+    strand=["*", "-", "*", "+", "-"],
+    mcols=BiocFrame(
+        {
+            "score": range(0, 5),
+            "GC": [random() for _ in range(5)],
+        }
+    ),
+)
+
+print(gr)
+```
+
+    ## output
+    GenomicRanges with 5 ranges and 5 metadata columns
+        seqnames    ranges           strand     score                  GC
+           <str> <IRanges> <ndarray[int64]>   <range>              <list>
+    [0]     chr1 101 - 112                * |       0  0.2593301003406461
+    [1]     chr2 102 - 123                - |       1  0.7207993213776644
+    [2]     chr3 103 - 128                * |       2 0.23391468067222065
+    [3]     chr2 104 - 134                + |       3  0.7671026589720187
+    [4]     chr3 105 - 110                - |       4 0.03355777784472458
+    ------
+    seqinfo(3 sequences): chr1 chr2 chr3
+
 ### Pandas DataFrame
 
 A common representation in Python is a pandas `DataFrame` for all tabular datasets. `DataFrame` must contain columns "seqnames", "starts", and "ends" to represent genomic intervals. Here's an example:
 
 ```python
-import genomicranges
+import genomicranges import GenomicRanges
 import pandas as pd
 
 df = pd.DataFrame(
@@ -54,20 +97,21 @@ df = pd.DataFrame(
     }
 )
 
-gr = genomicranges.from_pandas(df)
+gr = GenomicRanges.from_pandas(df)
 print(gr)
 ```
 
     ## output
-    GenomicRanges with 5 intervals & 2 metadata columns
-    ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
-    ┃ row_names ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃ GC <list>           ┃
-    ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
-    │ 0         │ chr1            │ 101           │ 112         │ *             │ 0            │ 0.22617584001235103 │
-    │ 1         │ chr2            │ 102           │ 103         │ -             │ 1            │ 0.25464256182466394 │
-    │ ...       │ ...             │ ...           │ ...         │ ...           │ ...          │ ...                 │
-    │ 4         │ chr2            │ 109           │ 111         │ -             │ 4            │ 0.5414168889911801  │
-    └───────────┴─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┴─────────────────────┘
+    GenomicRanges with 5 ranges and 5 metadata columns
+      seqnames    ranges           strand    score                  GC
+         <str> <IRanges> <ndarray[int64]>   <list>              <list>
+    0     chr1 101 - 112                * |      0  0.4862658925128007
+    1     chr2 102 - 103                - |      1 0.27948386889389953
+    2     chr1 103 - 128                * |      2  0.5162697718607901
+    3     chr3 104 - 134                + |      3  0.5979843806415466
+    4     chr2 109 - 111                - |      4 0.04740781186083798
+    ------
+    seqinfo(3 sequences): chr1 chr2 chr3
 
 ### Interval Operations
 
@@ -86,10 +130,13 @@ query = genomicranges.from_pandas(
     )
 )
 
-hits = subject.nearest(query)
+hits = subject.nearest(query, ignore_strand=True)
 print(hits)
 ```
 
+    ## output
+    [[0, 1], [1677082, 1677083, 1677084], [1003411, 1003412]]
+
 ## `GenomicRangesList`
 
 Just as it sounds, a `GenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure.
@@ -100,25 +147,18 @@ To construct a GenomicRangesList
 
 ```python
 gr1 = GenomicRanges(
-    {
-        "seqnames": ["chr1", "chr2", "chr1", "chr3"],
-        "starts": [1, 3, 2, 4],
-        "ends": [10, 30, 50, 60],
-        "strand": ["-", "+", "*", "+"],
-        "score": [1, 2, 3, 4],
-    }
+    seqnames=["chr1", "chr2", "chr1", "chr3"],
+    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
+    strand=["-", "+", "*", "+"],
+    mcols=BiocFrame({"score": [1, 2, 3, 4]}),
 )
 
 gr2 = GenomicRanges(
-    {
-        "seqnames": ["chr2", "chr4", "chr5"],
-        "starts": [3, 6, 4],
-        "ends": [30, 50, 60],
-        "strand": ["-", "+", "*"],
-        "score": [2, 3, 4],
-    }
+    seqnames=["chr2", "chr4", "chr5"],
+    ranges=IRanges([3, 6, 4], [30, 50, 60]),
+    strand=["-", "+", "*"],
+    mcols=BiocFrame({"score": [2, 3, 4]}),
 )
-
 grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
 print(grl)
 ```
@@ -127,24 +167,25 @@ print(grl)
     GenomicRangesList with 2 genomic elements
 
     Name: gene1
-                GenomicRanges with 4 intervals & 1 metadata columns
-    ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
-    ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
-    ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-    │ chr1            │ 1             │ 10          │ -             │ 1            │
-    │ chr2            │ 3             │ 30          │ +             │ 2            │
-    │ chr3            │ 4             │ 60          │ +             │ 4            │
-    └─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
+    GenomicRanges with 4 ranges and 4 metadata columns
+         seqnames    ranges            strand     score
+            <str> <IRanges>         <ndarray>    <list>
+    [0]     chr1    1 - 11                -   |      1
+    [1]     chr2    3 - 33                +   |      2
+    [2]     chr1    2 - 52                *   |      3
+    [3]     chr3    4 - 64                +   |      4
+    ------
+    seqinfo(3 sequences): chr1 chr2 chr3
 
     Name: gene2
-                GenomicRanges with 3 intervals & 1 metadata columns
-    ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
-    ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
-    ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-    │ chr2            │ 3             │ 30          │ -             │ 2            │
-    │ chr4            │ 6             │ 50          │ +             │ 3            │
-    │ chr5            │ 4             │ 60          │ *             │ 4            │
-    └─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
+    GenomicRanges with 3 ranges and 3 metadata columns
+         seqnames    ranges            strand    score
+            <str> <IRanges>         <ndarray>   <list>
+    [0]     chr2    3 - 33                -   |      2
+    [1]     chr4    6 - 56                +   |      3
+    [2]     chr5    4 - 64                *   |      4
+    ------
+    seqinfo(3 sequences): chr2 chr4 chr5
 
 ## Further information
 

diff --git a/docs/tutorial.md b/docs/tutorial.md
@@ -32,7 +32,7 @@ df = pd.DataFrame(
     }
 )
 
-gr = genomicranges.from_pandas(df)
+gr = GenomicRanges.from_pandas(df)
 ```
 
 ### From UCSC or GTF file
@@ -47,21 +47,35 @@ gr = genomicranges.read_gtf(<PATH TO GTF>)
 gr = genomicranges.read_ucsc(genome="hg19")
 ```
 
-### From a dictionary
+### from `IRanges` (Preferred way)
 
-Or simply pass in a dictionary with the required keys.
+If you have all relevant information to create a ``GenomicRanges`` object
 
 ```python
+from genomicranges import GenomicRanges
+from iranges import IRanges
+from biocframe import BiocFrame
+from random import random
+
 gr = GenomicRanges(
-    {
-        "seqnames": ["chr1", "chr2", "chr3"],
-        "starts": [100, 115, 119],
-        "ends": [103, 116, 120],
-    }
+    seqnames=[
+        "chr1",
+        "chr2",
+        "chr3",
+        "chr2",
+        "chr3",
+    ],
+    ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
+    strand=["*", "-", "*", "+", "-"],
+    mcols=BiocFrame(
+        {
+            "score": range(0, 5),
+            "GC": [random() for _ in range(5)],
+        }
+    ),
 )
 ```
 
-
 ### Set sequence information
 
 The package also provides a `SeqInfo` class to update or modify sequence information stored in the object. Read more about this class in [GenomeInfoDb package](https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html).
@@ -442,10 +456,10 @@ rank = gr.rank()
 
 ## Combine `GenomicRanges` objects by rows
 
-Use the `combine` generic from [biocgenerics](https://github.com/BiocPy/generics) to concatenate multiple GenomicRanges objects.
+Use the `combine` generic from [biocutils](https://github.com/BiocPy/generics) to concatenate multiple GenomicRanges objects.
 
 ```python
-from biocgenerics.combine import combine
+from biocutils.combine import combine
 combined_gr = combine(gr, gr1, gr2, ...)
 ```
 
@@ -482,23 +496,17 @@ Currently, this class is limited in functionality, purely a read-only class with
 
 ```python
 a = GenomicRanges(
-    {
-        "seqnames": ["chr1", "chr2", "chr1", "chr3"],
-        "starts": [1, 3, 2, 4],
-        "ends": [10, 30, 50, 60],
-        "strand": ["-", "+", "*", "+"],
-        "score": [1, 2, 3, 4],
-    }
+    seqnames=["chr1", "chr2", "chr1", "chr3"],
+    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
+    strand=["-", "+", "*", "+"],
+    mcols=BiocFrame({"score": [1, 2, 3, 4]}),
 )
 
 b = GenomicRanges(
-    {
-        "seqnames": ["chr2", "chr4", "chr5"],
-        "starts": [3, 6, 4],
-        "ends": [30, 50, 60],
-        "strand": ["-", "+", "*"],
-        "score": [2, 3, 4],
-    }
+    seqnames=["chr2", "chr4", "chr5"],
+    ranges=IRanges([3, 6, 4], [30, 50, 60]),
+    strand=["-", "+", "*"],
+    mcols=BiocFrame({"score": [2, 3, 4]}),
 )
 
 grl = GenomicRangesList(gene1=a, gene2=b)
@@ -524,7 +532,7 @@ grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"])
 grlc = grla.combine(grlb)
 
 # or use the combine generic
-from biocgenerics.combine import combine
+from biocutils.combine import combine
 cgrl = combine(grla, grlb)
 ```
 

diff --git a/setup.cfg b/setup.cfg
@@ -49,11 +49,10 @@ package_dir =
 # For more information, check out https://semver.org/.
 install_requires =
     importlib-metadata; python_version<"3.8"
-    pandas
-    biocframe==0.3.16
+    biocframe>=0.5.3,<0.6.0
+    iranges[optional]>=0.2.0,<0.3.0
+    biocutils>=0.1.3,<0.2.0
     numpy
-    biocgenerics
-    biocutils>=0.1.1
 
 [options.packages.find]
 where = src
@@ -65,12 +64,14 @@ exclude =
 # `pip install GenomicRanges[PDF]` like:
 optional =
     joblib
+    pandas
 
 # Add here test requirements (semicolon/line-separated)
 testing =
     setuptools
     pytest
     pytest-cov
+    pandas
 
 [options.entry_points]
 # Add here console scripts like: