Skip to content

Commit

Permalink
Overhaul GRanges and GRangesList classes (#40)
Browse files Browse the repository at this point in the history
This is a complete rewrite of both these classes following the functional paradigm from our [developer notes](https://github.com/BiocPy/developer_guide#use-functional-discipline).

`GenomicRanges` is now closer to Bioconductor's GenomicRanges class both in the design and implementation. 

The class does not rely on pandas anymore. While we try to provide backwards compatibility to construct a GenomicRanges object from a pandas dataframe using the `from_pandas` method, please note that the default constructor to genomic ranges does not accept a pandas data frame anymore!

Most range based methods have been reimplemented and the heavy lifting is done in the [IRanges package](https://github.com/BiocPy/IRanges) for interval operations. The package indirectly depends on [NCLS](https://github.com/pyranges/ncls) interval tree data structure to perform search and overlap operations. 

Tests, documentation and README has been updated to reflect these changes.
  • Loading branch information
jkanche authored Nov 30, 2023
1 parent 25f28fa commit 445207b
Show file tree
Hide file tree
Showing 36 changed files with 3,503 additions and 3,939 deletions.
131 changes: 86 additions & 45 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,9 @@

# GenomicRanges

GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html). **_Intervals are inclusive on both ends and starts at 1._**

**_Intervals are inclusive on both ends and starts at 1._**
**Note: V0.4.0 is a complete overhaul of the package, as such the constructor to GenomicRanges has changed!**

To get started, install the package from [PyPI](https://pypi.org/project/genomicranges/)

Expand Down Expand Up @@ -35,12 +35,55 @@ print(gr)
## ... truncating the console print ...
```

### from `IRanges` (Preferred way)

If you have all relevant information to create a GenomicRanges object

```python
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random

gr = GenomicRanges(
seqnames=[
"chr1",
"chr2",
"chr3",
"chr2",
"chr3",
],
ranges=IRanges(start=[x for x in range(101, 106)], width=[11, 21, 25, 30, 5]),
strand=["*", "-", "*", "+", "-"],
mcols=BiocFrame(
{
"score": range(0, 5),
"GC": [random() for _ in range(5)],
}
),
)

print(gr)
```

## output
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand score GC
<str> <IRanges> <ndarray[int64]> <range> <list>
[0] chr1 101 - 112 * | 0 0.2593301003406461
[1] chr2 102 - 123 - | 1 0.7207993213776644
[2] chr3 103 - 128 * | 2 0.23391468067222065
[3] chr2 104 - 134 + | 3 0.7671026589720187
[4] chr3 105 - 110 - | 4 0.03355777784472458
------
seqinfo(3 sequences): chr1 chr2 chr3

### Pandas DataFrame

A common representation in Python is a pandas `DataFrame` for all tabular datasets. `DataFrame` must contain columns "seqnames", "starts", and "ends" to represent genomic intervals. Here's an example:

```python
import genomicranges
import genomicranges import GenomicRanges
import pandas as pd

df = pd.DataFrame(
Expand All @@ -54,20 +97,21 @@ df = pd.DataFrame(
}
)

gr = genomicranges.from_pandas(df)
gr = GenomicRanges.from_pandas(df)
print(gr)
```

## output
GenomicRanges with 5 intervals & 2 metadata columns
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ row_names ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃ GC <list> ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ chr1 │ 101 │ 112 │ * │ 0 │ 0.22617584001235103 │
│ 1 │ chr2 │ 102 │ 103 │ - │ 1 │ 0.25464256182466394 │
│ ... │ ... │ ... │ ... │ ... │ ... │ ... │
│ 4 │ chr2 │ 109 │ 111 │ - │ 4 │ 0.5414168889911801 │
└───────────┴─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┴─────────────────────┘
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand score GC
<str> <IRanges> <ndarray[int64]> <list> <list>
0 chr1 101 - 112 * | 0 0.4862658925128007
1 chr2 102 - 103 - | 1 0.27948386889389953
2 chr1 103 - 128 * | 2 0.5162697718607901
3 chr3 104 - 134 + | 3 0.5979843806415466
4 chr2 109 - 111 - | 4 0.04740781186083798
------
seqinfo(3 sequences): chr1 chr2 chr3

### Interval Operations

Expand All @@ -86,10 +130,13 @@ query = genomicranges.from_pandas(
)
)

hits = subject.nearest(query)
hits = subject.nearest(query, ignore_strand=True)
print(hits)
```

## output
[[0, 1], [1677082, 1677083, 1677084], [1003411, 1003412]]

## `GenomicRangesList`

Just as it sounds, a `GenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure.
Expand All @@ -100,25 +147,18 @@ To construct a GenomicRangesList

```python
gr1 = GenomicRanges(
{
"seqnames": ["chr1", "chr2", "chr1", "chr3"],
"starts": [1, 3, 2, 4],
"ends": [10, 30, 50, 60],
"strand": ["-", "+", "*", "+"],
"score": [1, 2, 3, 4],
}
seqnames=["chr1", "chr2", "chr1", "chr3"],
ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
strand=["-", "+", "*", "+"],
mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)

gr2 = GenomicRanges(
{
"seqnames": ["chr2", "chr4", "chr5"],
"starts": [3, 6, 4],
"ends": [30, 50, 60],
"strand": ["-", "+", "*"],
"score": [2, 3, 4],
}
seqnames=["chr2", "chr4", "chr5"],
ranges=IRanges([3, 6, 4], [30, 50, 60]),
strand=["-", "+", "*"],
mcols=BiocFrame({"score": [2, 3, 4]}),
)

grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
print(grl)
```
Expand All @@ -127,24 +167,25 @@ print(grl)
GenomicRangesList with 2 genomic elements

Name: gene1
GenomicRanges with 4 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr1 │ 1 │ 10 │ - │ 1 │
│ chr2 │ 3 │ 30 │ + │ 2 │
│ chr3 │ 4 │ 60 │ + │ 4 │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
GenomicRanges with 4 ranges and 4 metadata columns
seqnames ranges strand score
<str> <IRanges> <ndarray> <list>
[0] chr1 1 - 11 - | 1
[1] chr2 3 - 33 + | 2
[2] chr1 2 - 52 * | 3
[3] chr3 4 - 64 + | 4
------
seqinfo(3 sequences): chr1 chr2 chr3

Name: gene2
GenomicRanges with 3 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list>
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr2 │ 3 │ 30 │ - │ 2
│ chr4 │ 6 │ 50 │ + │ 3 │
│ chr5 │ 4 │ 60 │ * │ 4 │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
GenomicRanges with 3 ranges and 3 metadata columns
seqnames ranges strand score
<str> <IRanges> <ndarray> <list>
[0] chr2 3 - 33 - | 2
[1] chr4 6 - 56 + | 3
[2] chr5 4 - 64 * | 4
------
seqinfo(3 sequences): chr2 chr4 chr5

## Further information

Expand Down
60 changes: 34 additions & 26 deletions docs/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ df = pd.DataFrame(
}
)

gr = genomicranges.from_pandas(df)
gr = GenomicRanges.from_pandas(df)
```

### From UCSC or GTF file
Expand All @@ -47,21 +47,35 @@ gr = genomicranges.read_gtf(<PATH TO GTF>)
gr = genomicranges.read_ucsc(genome="hg19")
```

### From a dictionary
### from `IRanges` (Preferred way)

Or simply pass in a dictionary with the required keys.
If you have all relevant information to create a ``GenomicRanges`` object

```python
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random

gr = GenomicRanges(
{
"seqnames": ["chr1", "chr2", "chr3"],
"starts": [100, 115, 119],
"ends": [103, 116, 120],
}
seqnames=[
"chr1",
"chr2",
"chr3",
"chr2",
"chr3",
],
ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
strand=["*", "-", "*", "+", "-"],
mcols=BiocFrame(
{
"score": range(0, 5),
"GC": [random() for _ in range(5)],
}
),
)
```


### Set sequence information

The package also provides a `SeqInfo` class to update or modify sequence information stored in the object. Read more about this class in [GenomeInfoDb package](https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html).
Expand Down Expand Up @@ -442,10 +456,10 @@ rank = gr.rank()

## Combine `GenomicRanges` objects by rows

Use the `combine` generic from [biocgenerics](https://github.com/BiocPy/generics) to concatenate multiple GenomicRanges objects.
Use the `combine` generic from [biocutils](https://github.com/BiocPy/generics) to concatenate multiple GenomicRanges objects.

```python
from biocgenerics.combine import combine
from biocutils.combine import combine
combined_gr = combine(gr, gr1, gr2, ...)
```

Expand Down Expand Up @@ -482,23 +496,17 @@ Currently, this class is limited in functionality, purely a read-only class with

```python
a = GenomicRanges(
{
"seqnames": ["chr1", "chr2", "chr1", "chr3"],
"starts": [1, 3, 2, 4],
"ends": [10, 30, 50, 60],
"strand": ["-", "+", "*", "+"],
"score": [1, 2, 3, 4],
}
seqnames=["chr1", "chr2", "chr1", "chr3"],
ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
strand=["-", "+", "*", "+"],
mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)

b = GenomicRanges(
{
"seqnames": ["chr2", "chr4", "chr5"],
"starts": [3, 6, 4],
"ends": [30, 50, 60],
"strand": ["-", "+", "*"],
"score": [2, 3, 4],
}
seqnames=["chr2", "chr4", "chr5"],
ranges=IRanges([3, 6, 4], [30, 50, 60]),
strand=["-", "+", "*"],
mcols=BiocFrame({"score": [2, 3, 4]}),
)

grl = GenomicRangesList(gene1=a, gene2=b)
Expand All @@ -524,7 +532,7 @@ grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"])
grlc = grla.combine(grlb)

# or use the combine generic
from biocgenerics.combine import combine
from biocutils.combine import combine
cgrl = combine(grla, grlb)
```

Expand Down
9 changes: 5 additions & 4 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -49,11 +49,10 @@ package_dir =
# For more information, check out https://semver.org/.
install_requires =
importlib-metadata; python_version<"3.8"
pandas
biocframe==0.3.16
biocframe>=0.5.3,<0.6.0
iranges[optional]>=0.2.0,<0.3.0
biocutils>=0.1.3,<0.2.0
numpy
biocgenerics
biocutils>=0.1.1

[options.packages.find]
where = src
Expand All @@ -65,12 +64,14 @@ exclude =
# `pip install GenomicRanges[PDF]` like:
optional =
joblib
pandas

# Add here test requirements (semicolon/line-separated)
testing =
setuptools
pytest
pytest-cov
pandas

[options.entry_points]
# Add here console scripts like:
Expand Down
Loading

0 comments on commit 445207b

Please sign in to comment.