You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().
I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).
I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).
Is this the correct way of doing it ?
The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).
I am working with Arabidopsis data, so as genes I am using these data:
Then I try to find the nearest gene that is downstream of my regions:
my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)
This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:
test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"
> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 1
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
[1] 1 19696923-19698482 - | AT1G52890
-------
seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.
> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
[2] 1 19698483-19698583 * | Merged-1-19698533-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
The text was updated successfully, but these errors were encountered:
For reference, I have written a function using basic GRanges that does what I expected (find which genes are downstream or overlapping a set of regions).
If the observerd behaviour of join_nearest_downstream() is expected, I would modify the documentation and/or include a warning message if using unstranded GRanges.
find_preceding = function(peak_regions, gene_granges) {
# First we find the genes that have an overlapping region in their coding sequence:
overlapping_hits <- findOverlaps(peak_regions, gene_granges)
overlapping_regions <- peak_regions[queryHits(overlapping_hits)]
overlapping_genes <- gene_granges[subjectHits(overlapping_hits)]
overlapping_gene_ids <- overlapping_genes$gene_id
overlapping_granges <- overlapping_regions %>%
mutate(gene_id = overlapping_gene_ids,
dist = distance(overlapping_regions, overlapping_genes)) %>%
mutate(name = paste0(name, "_", gene_id))
# Then we remove the overlapping regions and search for regions preceding genes
non_overlapping_regions <- peak_regions[-queryHits(overlapping_hits)]
preceding_hits <- precede(non_overlapping_regions, gene_granges)
preceding_genes <- gene_granges[preceding_hits]
preceding_gene_ids <- preceding_genes$gene_id
preceding_granges <- non_overlapping_regions %>%
mutate(gene_id = preceding_gene_ids,
dist = distance(non_overlapping_regions, preceding_genes)) %>%
mutate(name = paste0(name, "_", gene_id))
# Finally, we merge the overlapping and preceding granges
final_granges <- bind_ranges(overlapping_granges, preceding_granges) %>%
sortSeqlevels() %>% sort()
return(final_granges)
}
I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().
I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).
I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).
Is this the correct way of doing it ?
The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).
I am working with Arabidopsis data, so as genes I am using these data:
Then I try to find the nearest gene that is downstream of my regions:
This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:
Both regions are 120 nt apart:
If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.
The text was updated successfully, but these errors were encountered: