Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inconsistent behaviour of join_nearest_downstream() with one unstranded range set? #106

Open
eggrandio opened this issue Dec 5, 2023 · 1 comment

Comments

@eggrandio
Copy link

I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().

I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).

I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).

Is this the correct way of doing it ?

The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).

I am working with Arabidopsis data, so as genes I am using these data:

ath_coding_genes <- read_gff("./Arabidopsis_thaliana.TAIR10.57.gff3.gz") %>% 
  filter(type == "gene")

Then I try to find the nearest gene that is downstream of my regions:

my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)

This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:

test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"

> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     score
         <Rle>         <IRanges>  <Rle> |         <character> <numeric>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1         1
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |     gene_id
         <Rle>         <IRanges>  <Rle> | <character>
  [1]        1 19696923-19698482      - |   AT1G52890
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Both regions are 120 nt apart:

> reg1 <- my_regions %>% filter(name == test_peak)
> reg2 <- ath_coding_genes %>% filter(gene_id == test_gene)
> distance(reg1, reg2)
[1] 120

If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.

> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  [2]        1 19698483-19698583      * | Merged-1-19698533-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
@eggrandio
Copy link
Author

eggrandio commented Dec 7, 2023

For reference, I have written a function using basic GRanges that does what I expected (find which genes are downstream or overlapping a set of regions).

If the observerd behaviour of join_nearest_downstream() is expected, I would modify the documentation and/or include a warning message if using unstranded GRanges.

find_preceding = function(peak_regions, gene_granges) {
  # First we find the genes that have an overlapping region in their coding sequence:
  overlapping_hits <- findOverlaps(peak_regions, gene_granges)
  
  overlapping_regions <- peak_regions[queryHits(overlapping_hits)]
  overlapping_genes <- gene_granges[subjectHits(overlapping_hits)]
  overlapping_gene_ids <- overlapping_genes$gene_id
  
  overlapping_granges <- overlapping_regions %>% 
    mutate(gene_id = overlapping_gene_ids,
           dist = distance(overlapping_regions, overlapping_genes)) %>% 
    mutate(name = paste0(name, "_", gene_id))
  
  # Then we remove the overlapping regions and search for regions preceding genes
  non_overlapping_regions <- peak_regions[-queryHits(overlapping_hits)]
  
  preceding_hits <- precede(non_overlapping_regions, gene_granges)
  preceding_genes <- gene_granges[preceding_hits]
  preceding_gene_ids <- preceding_genes$gene_id
  
  preceding_granges <- non_overlapping_regions %>% 
    mutate(gene_id = preceding_gene_ids,
           dist = distance(non_overlapping_regions, preceding_genes)) %>% 
    mutate(name = paste0(name, "_", gene_id))
  
  # Finally, we merge the overlapping and preceding granges
  final_granges <- bind_ranges(overlapping_granges, preceding_granges) %>% 
    sortSeqlevels() %>% sort()
  return(final_granges)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant