Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is GenomicRanges object open-ended? #132

Open
kijiy opened this issue Nov 13, 2024 · 4 comments
Open

Is GenomicRanges object open-ended? #132

kijiy opened this issue Nov 13, 2024 · 4 comments

Comments

@kijiy
Copy link

kijiy commented Nov 13, 2024

Hi, it's great to have GenomicRanges in Python and really appreciated.

I've encountered a coordinate issue when loading a GENCODE gtf file with the genomicranges.read_gtf() function (I'm using v0.4.34).
I have a small GTF file that is the top 1000 lines of GENCODE v46 human GTF file:

##description: evidence-based annotation of the human genome (GRCh38), version 46 (Ensembl 112)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2024-03-26
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

When I load this in Python with the following code:

import genomicranges
gtf = genomicranges.read_gtf("test.gtf")

Which successfully reads the input:

GenomicRanges with 995 ranges and 22 metadata columns
    seqnames          ranges          strand   source    feature  score  frame                                    group            gene_id      gene_type gene_name  level                 tag     transcript_id transcript_type transcript_name
       <str>       <IRanges> <ndarray[int8]>   <list>     <list> <list> <list>                                   <list>             <list>         <list>    <list> <list>              <list>            <list>          <list>          <list>
  0     chr1   11869 - 14409               + | HAVANA       gene      .      . gene_id "ENSG00000290825.1"; gene_typ...  ENSG00000290825.1         lncRNA   DDX11L2      2 overlaps_pseudogene               nan             nan             nan
  1     chr1   11869 - 14409               + | HAVANA transcript      .      . gene_id "ENSG00000290825.1"; transcri...  ENSG00000290825.1         lncRNA   DDX11L2      2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202
  2     chr1   11869 - 12227               + | HAVANA       exon      .      . gene_id "ENSG00000290825.1"; transcri...  ENSG00000290825.1         lncRNA   DDX11L2      2   Ensembl_canonical ENST00000456328.2          lncRNA     DDX11L2-202
         ...             ...             ... |    ...        ...    ...    ...                                      ...                ...            ...       ...    ...                 ...               ...             ...             ...
992     chr1 935772 - 935896               + | HAVANA        CDS      .      1 gene_id "ENSG00000187634.13"; transcr... ENSG00000187634.13 protein_coding    SAMD11      2                CCDS ENST00000616016.5  protein_coding      SAMD11-209
993     chr1 939040 - 939129               + | HAVANA       exon      .      . gene_id "ENSG00000187634.13"; transcr... ENSG00000187634.13 protein_coding    SAMD11      2                CCDS ENST00000616016.5  protein_coding      SAMD11-209
994     chr1 939040 - 939129               + | HAVANA        CDS      .      2 gene_id "ENSG00000187634.13"; transcr... ENSG00000187634.13 protein_coding    SAMD11      2                CCDS ENST00000616016.5  protein_coding      SAMD11-209
    transcript_support_level    havana_transcript exon_number           exon_id    hgnc_id           havana_gene    ont        protein_id      ccdsid
                      <list>               <list>      <list>            <list>     <list>                <list> <list>            <list>      <list>
  0                      nan                  nan         nan               nan        nan                   nan    nan               nan         nan
  1                        1 OTTHUMT00000362751.1         nan               nan        nan                   nan    nan               nan         nan
  2                        1 OTTHUMT00000362751.1           1 ENSE00002234944.1        nan                   nan    nan               nan         nan
                         ...                  ...         ...               ...        ...                   ...    ...               ...         ...
992                        5 OTTHUMT00000316521.3           5 ENSE00003911340.1 HGNC:28706 OTTHUMG00000040719.11    nan ENSP00000478421.2 CCDS90834.1
993                        5 OTTHUMT00000316521.3           6 ENSE00003912201.1 HGNC:28706 OTTHUMG00000040719.11    nan ENSP00000478421.2 CCDS90834.1
994                        5 OTTHUMT00000316521.3           6 ENSE00003912201.1 HGNC:28706 OTTHUMG00000040719.11    nan ENSP00000478421.2 CCDS90834.1

As it's described in the manual, I thought genomicranges handles the data as 1-based, which should be close-ended. However, the widths of the ranges indicate it's open-ended:

print(gtf.ranges)
IRanges object with 995 ranges and 0 metadata columns
                 start              end            width
      <ndarray[int32]> <ndarray[int32]> <ndarray[int32]>
  [0]            11869            14409             2540
  [1]            11869            14409             2540
  [2]            11869            12227              358
                   ...              ...              ...
[992]           935772           935896              124
[993]           939040           939129               89
[994]           939040           939129               89

The same thing seems to be happening for the example code, where I tried:

from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random

gr = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr3",
        "chr2",
        "chr3",
    ],
    ranges=IRanges([x for x in range(101, 106)], [11, 21, 25, 30, 5]),
    strand=["*", "-", "*", "+", "-"],
    mcols=BiocFrame(
        {
            "score": range(0, 5),
            "GC": [random() for _ in range(5)],
        }
    ),
)

print(gr.ranges)

returning the following, which again indicates open-ended range:

IRanges object with 5 ranges and 0 metadata columns
               start              end            width
    <ndarray[int32]> <ndarray[int32]> <ndarray[int32]>
[0]              101              112               11
[1]              102              123               21
[2]              103              128               25
[3]              104              134               30
[4]              105              110                5

GTF file is 1-start for both start and end as long as I read the description, and when I load the same GTF file with R rtracklayer, it reads as 1-start close-ended.

library(rtracklayer)
library(tidyverse)

gtf <- import.gff("gencode.v46.annotation.gtf")
gtf %>% granges %>% as.data.frame %>% head %>% print

output:

seqnames start   end width strand
1     chr1 11869 14409  2541      +
2     chr1 11869 14409  2541      +
3     chr1 11869 12227   359      +
4     chr1 12613 12721   109      +
5     chr1 13221 14409  1189      +
6     chr1 12010 13670  1661      +

As seen in the above output, the widths are 2541 for the first range when loaded by R, while the same track's range is 2540 when loaded by Python genomicranges.

Please let me know if I misunderstand something. Thank you!

@jkanche
Copy link
Member

jkanche commented Nov 13, 2024

I believe this is mostly an issue when we print the objects; internally the methods work as close-ended. But will verify again this week.

@kijiy
Copy link
Author

kijiy commented Nov 14, 2024

Thanks, so Python IRanges seems to be implemented as half-open-ended as
the README shows the width is shown -1 from the actual width (if close-ended):

## output
 IRanges object with 4 ranges and 0 metadata columns
                start              end            width
 <ndarray[int32]> <ndarray[int32]> <ndarray[int32]>
 [0]                1                5                4
 [1]                2                7                5
 [2]                3                9                6
 [3]                4               11                7

but otherwise all the functions handle the ranges in a close-end manner?
If so, should I just +1 the width after loading GTF file and treat the coordinates in a close-ended 1-based manner?

@jkanche
Copy link
Member

jkanche commented Nov 14, 2024

Probably related to BiocPy/IRanges#42.

We're working on rewriting some of these in C/C++ relatively soon and i'll be more explicit on this issue.

@kijiy
Copy link
Author

kijiy commented Nov 15, 2024

Yes, that's exactly what I meant; thanks for sharing. I think the width values are confusing, and the gtf parser needs to be fixed to include the end written on the gtf file. I'm looking forward to the update!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants