Implement combine generic #29

Merged
merged 14 commits into from
Oct 17, 2023
106 changes: 89 additions & 17 deletions README.md
@@ -4,39 +4,40 @@

# GenomicRanges

GenomicRanges is a Python container class designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).

## Install
**_Intervals are inclusive on both ends and start at 1._**
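
To make this convention concrete, here is a minimal sketch in plain Python (it does not call `genomicranges`). It shows what 1-based, closed intervals imply for interval widths and for converting from 0-based, half-open coordinates such as those used in BED files. The helper names `width` and `from_zero_based_half_open` are hypothetical and not part of the package.

```python
from typing import Tuple


def width(start: int, end: int) -> int:
    """Number of positions covered by a 1-based interval closed on both ends."""
    return end - start + 1


def from_zero_based_half_open(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval [start0, end0) to 1-based, closed [start, end]."""
    # Only the start shifts; the exclusive 0-based end equals the inclusive 1-based end.
    return start0 + 1, end0


print(width(101, 112))                      # 12
print(from_zero_based_half_open(100, 112))  # (101, 112)
```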

Package is published to [PyPI](https://pypi.org/project/genomicranges/)
To get started, install the package from [PyPI](https://pypi.org/project/genomicranges/):

```shell
pip install genomicranges
```

## Usage
## `GenomicRanges`

The package provides several ways to represent genomic annotations and intervals.
`GenomicRanges` is the base class to represent and operate over genomic regions and annotations.

### Initialize a `GenomicRanges` object
### From UCSC or GTF file

#### From UCSC or GTF file

You can easily access UCSC genomes or load a genome annotation from a GTF file using the following methods:
You can download and parse genome annotations from UCSC, or load a genome annotation from a GTF file:

```python
import genomicranges

gr = genomicranges.from_gtf(<PATH TO GTF>)
gr = genomicranges.read_gtf(<PATH TO GTF>)
# OR
gr = genomicranges.from_ucsc(genome="hg19")
```
#### Pandas DataFrame
gr = genomicranges.read_ucsc(genome="hg19")

A common representation in Python is a pandas DataFrame for all tabular datasets. You can convert a DataFrame into a `GenomicRanges` object. Please note that intervals are inclusive on both ends, and your DataFrame must contain columns seqnames, starts, and ends to represent genomic coordinates.
print(gr)
## output
## GenomicRanges with 1760959 intervals & 10 metadata columns.
## ... truncating the console print ...
```

Here's an example:
### Pandas DataFrame

In Python, tabular datasets are commonly represented as a pandas `DataFrame`. To be converted, the `DataFrame` must contain the columns "seqnames", "starts", and "ends" to represent genomic intervals. Here's an example:

```python
import genomicranges
@@ -54,11 +55,23 @@ df = pd.DataFrame(
)

gr = genomicranges.from_pandas(df)
print(gr)
```

## output
GenomicRanges with 5 intervals & 2 metadata columns
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ row_names ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃ GC <list> ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ chr1 │ 101 │ 112 │ * │ 0 │ 0.22617584001235103 │
│ 1 │ chr2 │ 102 │ 103 │ - │ 1 │ 0.25464256182466394 │
│ ... │ ... │ ... │ ... │ ... │ ... │ ... │
│ 4 │ chr2 │ 109 │ 111 │ - │ 4 │ 0.5414168889911801 │
└───────────┴─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┴─────────────────────┘
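
Since the conversion depends on the "seqnames", "starts", and "ends" columns being present, it can help to check for them before converting. The following is a sketch in plain Python under that assumption; `missing_columns` is a hypothetical helper, not a `genomicranges` function, and how the package itself reports a missing column is not covered here.

```python
REQUIRED_COLUMNS = ("seqnames", "starts", "ends")


def missing_columns(columns):
    """Return the required column names absent from `columns`, in a fixed order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A dict of lists stands in for a table here; iterating it yields the column names.
table = {
    "seqnames": ["chr1", "chr2"],
    "starts": [101, 102],
    "ends": [112, 103],
    "score": [0, 1],
}

print(missing_columns(table))              # []
print(missing_columns(["chr", "starts"]))  # ['seqnames', 'ends']
```

The helper accepts any iterable of names, so passing `df.columns` from a pandas `DataFrame` works the same way.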

### Interval Operations

GenomicRanges currently supports most commonly used [interval-based operations](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
`GenomicRanges` supports most [interval-based operations](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).

```python
subject = genomicranges.from_ucsc(genome="hg38")
@@ -77,8 +90,67 @@ hits = subject.nearest(query)
print(hits)
```
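
To illustrate what a nearest-interval lookup computes, here is a simplified sketch in plain Python. It is not the package's implementation: it ignores sequence names and strand, uses a brute-force scan, and measures distance as the number of positions strictly between two 1-based closed intervals (a convention chosen for this sketch). The names `gap` and `nearest_index` are hypothetical.

```python
def gap(a, b):
    """Positions strictly between two 1-based closed intervals (0 if they overlap or touch)."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return b_start - a_end - 1
    if b_end < a_start:
        return a_start - b_end - 1
    return 0


def nearest_index(query, subjects):
    """Index of the subject interval with the smallest gap to `query` (first wins on ties)."""
    return min(range(len(subjects)), key=lambda i: gap(query, subjects[i]))


subjects = [(1, 10), (40, 50), (100, 120)]
print(gap((55, 60), (40, 50)))            # 4
print(nearest_index((55, 60), subjects))  # 1
```

A realistic implementation would restrict the comparison to intervals on the same sequence and would avoid the linear scan, for example by sorting the subject intervals first.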

For more usage examples, check out the [documentation](https://biocpy.github.io/GenomicRanges/).
## `GenomicRangesList`

As the name suggests, a `GenomicRangesList` is a named, list-like object. Why is this class needed? A `GenomicRanges` object specifies multiple genomic elements, typically the start and end of each gene. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` represents this nested structure.

**Currently, this class is limited in functionality.**

To construct a `GenomicRangesList`:

```python
from genomicranges import GenomicRanges, GenomicRangesList

gr1 = GenomicRanges(
{
"seqnames": ["chr1", "chr2", "chr1", "chr3"],
"starts": [1, 3, 2, 4],
"ends": [10, 30, 50, 60],
"strand": ["-", "+", "*", "+"],
"score": [1, 2, 3, 4],
}
)

gr2 = GenomicRanges(
{
"seqnames": ["chr2", "chr4", "chr5"],
"starts": [3, 6, 4],
"ends": [30, 50, 60],
"strand": ["-", "+", "*"],
"score": [2, 3, 4],
}
)

grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
print(grl)
```

## output
GenomicRangesList with 2 genomic elements

Name: gene1
GenomicRanges with 4 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr1 │ 1 │ 10 │ - │ 1 │
│ chr2 │ 3 │ 30 │ + │ 2 │
│ chr3 │ 4 │ 60 │ + │ 4 │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘

Name: gene2
GenomicRanges with 3 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr2 │ 3 │ 30 │ - │ 2 │
│ chr4 │ 6 │ 50 │ + │ 3 │
│ chr5 │ 4 │ 60 │ * │ 4 │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
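
The nesting can also be pictured with ordinary Python containers. The sketch below is only a stand-in for the idea of named groups of intervals; it does not use `GenomicRangesList`, and the variable names `genes`, `counts`, and `flat` are illustrative. The values mirror the construction example above.

```python
# Plain-Python stand-in for the nested structure: gene name -> list of sub-regions.
# Each sub-region is (seqname, start, end, strand).
genes = {
    "gene1": [("chr1", 1, 10, "-"), ("chr2", 3, 30, "+"), ("chr1", 2, 50, "*"), ("chr3", 4, 60, "+")],
    "gene2": [("chr2", 3, 30, "-"), ("chr4", 6, 50, "+"), ("chr5", 4, 60, "*")],
}

# Per-element summaries fall out of the grouping.
counts = {name: len(regions) for name, regions in genes.items()}
print(counts)  # {'gene1': 4, 'gene2': 3}

# Flattening keeps the grouping only if the name is carried along with each region.
flat = [(name, *region) for name, regions in genes.items() for region in regions]
print(len(flat))  # 7
print(flat[0])    # ('gene1', 'chr1', 1, 10, '-')
```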

## Further information

- [Tutorial](https://biocpy.github.io/GenomicRanges/tutorial.html)
- [API documentation](https://biocpy.github.io/GenomicRanges/api/modules.html)
- [Bioc/GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)

<!-- pyscaffold-notes -->

1 change: 1 addition & 0 deletions docs/conf.py
@@ -72,6 +72,7 @@
"sphinx.ext.ifconfig",
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
"sphinx_autodoc_typehints",
]

# Add any paths that contain templates here, relative to this directory.
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -5,3 +5,4 @@ furo
# sphinx_rtd_theme
myst-parser[linkify]
sphinx>=3.2.1
sphinx-autodoc-typehints