Enhance documentation (#95)
jkanche authored Jun 14, 2024
1 parent 594bd00 commit 52171ea
Showing 1 changed file with 47 additions and 37 deletions.
84 changes: 47 additions & 37 deletions docs/tutorial.md
@@ -16,8 +16,9 @@ Moreover, the package also provides a `SeqInfo` class to update or modify sequen

The `GenomicRanges` class is designed to seamlessly operate with upstream packages like `RangeSummarizedExperiment` or `SingleCellExperiment` representations, providing consistent and stable functionality.

:::{note}
These classes follow a functional paradigm for accessing or setting properties; see the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section for further details.

:::

## Installation

@@ -31,39 +32,6 @@ pip install genomicranges

We support multiple ways to initialize a `GenomicRanges` object.

## Preferred way

To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.

```{code-cell}
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr3",
        "chr2",
        "chr3",
    ],
    ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
    strand=["*", "-", "*", "+", "-"],
    mcols=BiocFrame(
        {
            "score": range(0, 5),
            "GC": [random() for _ in range(5)],
        }
    ),
)
print(gr)
```

The input for `mcols` is expected to be a `BiocFrame` object and will be converted to a `BiocFrame` in case a pandas `DataFrame` is supplied.

## From Bioinformatic file formats

### From `biobear`
@@ -89,8 +57,6 @@ print(len(gg), len(df))

You can also import genomes from UCSC or load a genome annotation from a GTF file. This requires installing the additional packages **pandas** and **joblib**, which are used to parse and extract attributes from the GTF file.

A future version of this package might implement or take advantage of existing genomic parser packages in Python to support various file formats.

```python
import genomicranges

@@ -102,11 +68,49 @@ human_gr = genomicranges.read_ucsc(genome="hg19")
print(human_gr)
```


## Preferred way

To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.

```{code-cell}
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr3",
        "chr2",
        "chr3",
    ],
    ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
    strand=["*", "-", "*", "+", "-"],
    mcols=BiocFrame(
        {
            "score": range(0, 5),
            "GC": [random() for _ in range(5)],
        }
    ),
)
print(gr)
```

:::{note}
The `mcols` parameter expects a `BiocFrame` object; if a pandas `DataFrame` is supplied instead, it is converted to a `BiocFrame`.
:::
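
The equivalence between the string and numeric strand representations described in this section can be sketched in plain Python. The helper `normalize_strand` is hypothetical and not part of `genomicranges`; it only illustrates the mapping stated in the text ("+" or 1, "-" or -1, "*" or 0) and makes no claim about how the package stores strand internally.

```python
# Hypothetical helper, not part of genomicranges: it only illustrates
# the string/numeric strand equivalence described in the text.
STRAND_TO_INT = {"+": 1, "-": -1, "*": 0}


def normalize_strand(strand):
    """Convert strand values ("+", "-", "*" or 1, -1, 0) to integers."""
    normalized = []
    for value in strand:
        if value in STRAND_TO_INT:
            # String representation: look up the numeric code.
            normalized.append(STRAND_TO_INT[value])
        elif value in (1, -1, 0):
            # Already numeric: keep it as a plain int.
            normalized.append(int(value))
        else:
            raise ValueError(f"unrecognized strand value: {value!r}")
    return normalized


# Both calls describe the same strands.
print(normalize_strand(["*", "-", "*", "+", "-"]))
print(normalize_strand([0, -1, 0, 1, -1]))
```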

## Pandas `DataFrame`

If your genomic coordinates are represented as a pandas `DataFrame`, you can convert it into a `GenomicRanges` object, provided it contains the necessary columns.

::: {important}
The `DataFrame` must contain columns `seqnames`, `starts` and `ends` to represent genomic coordinates. The rest of the columns are considered metadata and will be available in the `mcols` slot of the `GenomicRanges` object.
:::

```{code-cell}
from genomicranges import GenomicRanges
@@ -193,7 +197,9 @@ print(gr.mcols)

### Setters

All property-based setters are `in_place` operations, with further details discussed in [functional paradigm](../philosophy.qmd#functional-discipline) section.
:::{important}
All property-based setters are `in_place` operations; see the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section for further details.
:::
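
The difference between a functional setter and a property-based, in-place setter can be sketched with a toy class. `Track` and `set_score` are hypothetical names chosen for illustration and do not come from `genomicranges`; only the `in_place` flag is taken from the text above.

```python
from copy import copy


class Track:
    """Toy class (hypothetical, not from genomicranges) showing the pattern."""

    def __init__(self, score):
        self._score = list(score)

    def set_score(self, score, in_place=False):
        # Functional style: modify a shallow copy unless in_place is requested.
        target = self if in_place else copy(self)
        target._score = list(score)
        return target

    @property
    def score(self):
        return self._score

    @score.setter
    def score(self, score):
        # Property-based setter: always mutates the object itself.
        self.set_score(score, in_place=True)


a = Track([1, 2, 3])
b = a.set_score([4, 5, 6])  # a is unchanged, b is a new object
a.score = [7, 8, 9]         # a is modified in place
print(a.score, b.score)
```

In this sketch, the method call leaves the original object untouched, while assignment through the property changes it directly.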

```{code-cell}
modified_mcols = gr.mcols.set_column("score", range(1,6))
@@ -420,7 +426,9 @@ binned_avg_gr = subject.binned_average(bins=bins_gr, scorename="score", outname=
print(binned_avg_gr)
```

::: {tip}
You might now wonder how to generate these ***bins***.
:::

# Generate tiles or bins

@@ -544,7 +552,9 @@ query_hits = gr.follow(find_regions)
print(query_hits)
```

::: {note}
Similar to `IRanges` operations, these methods typically return a list of indices from `subject` for each interval in `query`.
:::
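
To make the shape of such a result concrete, here is a toy nearest-interval search in plain Python. It is a sketch under stated assumptions, not the package's algorithm: intervals are `(start, end)` tuples with inclusive ends, strand is ignored, ties go to the first subject interval, and `toy_nearest` and `distance` are hypothetical names.

```python
def distance(a, b):
    """Positions separating two closed intervals; 0 if they overlap or touch."""
    if a[1] < b[0]:
        return b[0] - a[1] - 1
    if b[1] < a[0]:
        return a[0] - b[1] - 1
    return 0


def toy_nearest(query, subject):
    """For each query interval, return the index of the closest subject interval."""
    hits = []
    for q in query:
        # min() keeps the first index when several subjects are equally close.
        best = min(range(len(subject)), key=lambda j: distance(q, subject[j]))
        hits.append(best)
    return hits


subject = [(1, 10), (20, 30), (50, 60)]
query = [(12, 15), (40, 45), (55, 58)]
print(toy_nearest(query, subject))
```

The result has one entry per interval in `query`, and each entry is a position in `subject`.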

# Comparison, rank and order operations
