-
Notifications
You must be signed in to change notification settings - Fork 0
/
isoform-analysis.Rmd
278 lines (229 loc) · 8.3 KB
/
isoform-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
# Isoform analysis
Objective: compare the structure of isoforms within a gene using
grouping and disjoin operations.
There are many packages in Bioconductor that allow for isoform-level
analysis across samples (also called transcript-level analysis).
Packages that facilitate differential transcript analysis include
*DEXSeq* and *DRIMSeq* (demonstrated in the *rnaseqDTU* workflow), and
newer packages *satuRn* and *fishpond*.
Once you've identified isoforms of interest within a gene, perhaps
isoforms that switch in terms of their expression after cells are
treated, one can use the
[IsoformSwitchAnalyzeR](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR)
Bioconductor package to visualize and analyze a set of isoform
switches [@Vitting-Seerup2019]. For example, one can test the
functional consequences of a set of isoform switches in terms of the
gain or loss of protein domains, or splicing-centric changes (e.g.
alternative 3' or 5' acceptor sites, alternative transcription start
or ends sites, etc.) *IsoformSwitchAnalyzeR* is a multi-feature and
mature package for this type of analysis, but we can perform some
simpler within-gene isoform comparisons using *plyranges*, mostly for
demonstration.
Here we will suppose that we have somehow identified isoforms of
interest, and we want to compare these isoforms to other isoforms of
the same gene. For simplicity, we will focus on one isoform per gene,
for a particular set of interesting genes, just picking isoforms at
random for genes on `chr1`. We could use *plyranges* to compare
various metadata about isoforms or exons e.g. RNA-seq or ChIP-seq
coverage, sequence content, etc. But here we will just compare
isoforms alone by the interval definitions. Then to reformulate:
Specific objective 1: compare one isoform per gene to the others, in
terms of the extent from TSS to TES. What makes this isoform distinct?
We will start again with the transcript database we've used before:
```{r message=FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txp <- transcripts(txdb)
```
For further operations it will be convenient to have the `tx_id` be a
character variable, and we filter now to a set of genes of interest
(here just picking those on `chr1`):
```{r message=FALSE}
library(plyranges)
txp <- txp %>%
mutate(tx_id = as.character(tx_id)) %>%
filter(seqnames == "chr1")
```
It is sometimes useful to have other identifiers, such as the gene ID
(Entrez)...
```{r message=FALSE}
txp <- txp %>%
mutate(gene_id = mapIds(
txdb, keys=tx_id,
column="GENEID", keytype="TXID")
) %>%
filter(!is.na(gene_id))
```
...and the gene symbol. For simplicity we will keep genes that have a
non-NA symbol but this step is not necessary.
```{r message=FALSE}
library(org.Hs.eg.db)
txp <- txp %>%
mutate(symbol = mapIds(
org.Hs.eg.db, keys=gene_id,
column="SYMBOL", keytype="ENTREZID")
) %>%
filter(!is.na(symbol))
```
The following is one way to identify which isoforms belong to
multi-isoform genes:
```{r message=FALSE}
# this is slow for some reason...
## txp %>%
## group_by(gene_id) %>%
## mutate(ntxp = n()) %>%
## ungroup()
# this is faster...
txp <- txp %>%
mutate(ntxp = as.integer(table(txp$gene_id)[ gene_id ]))
```
We can now filter to the multi-isoform genes:
```{r}
txp <- txp %>%
filter(ntxp > 1)
```
Here we arbitrarily pick one isoform per gene, first by identifying
those in a tibble...
```{r message=FALSE}
library(tibble)
set.seed(3)
pick_one <- txp %>%
as_tibble() %>%
group_by(gene_id) %>%
slice(sample.int(n(), size=1)) %>%
dplyr::pull(tx_id)
```
...then we can label these in our `txp` object. We will track these
with an integer, 1 for the isoform of interest, and 0 for the others.
```{r}
txp <- txp %>%
mutate(the_one = as.integer(tx_id %in% pick_one))
```
To identify which "parts" of the TSS-to-TES interval belong to which
isoform, we can use `disjoin_ranges`. This breaks up the ranges, here
grouped by gene, into distinct parts, and labels those according to
whatever metadata variables we specify. Here we specify to combine
`tx_id` into a collapsed string, but we could also perform numeric
operations, e.g. `min` or `mean` etc. And we can specify more than one
new variable to be created during the `disjoin_ranges` operation. As
with `reduce_ranges`, this operation can be `_directed` or not.
```{r}
txp %>%
group_by(gene_id) %>%
disjoin_ranges(tx_ids = paste(tx_id,collapse=","))
```
Here we try to answer the specific objective, by labeling which parts
belong exclusively to the isoform of interest by computing
`min(the_one)` (try to convince yourself that this does in fact
identify these intervals).
```{r}
txp %>%
group_by(gene_id) %>%
disjoin_ranges(the_one_parts = min(the_one)) %>%
filter(the_one_parts > 0)
```
Do the parts identified make sense if we check one gene?
```{r}
txp %>%
filter(gene_id == "9651")
```
Let's pause and consider what we've answered so far. We asked, for a
given isoform per gene, what parts (intervals) uniquely define that
isoform, when we just consider TSS-to-TES extent (ignoring the
exonic/intronic structure). We started here mostly for simplicity, but
typically we care about *transcribed* sequence, so let's repeat this
task, now considering what exonic parts are unique to one isoform per
gene.
Specific objective 2: compare one isoform per gene to the others, in
terms of the exonic intervals. What makes this isoform distinct?
To start, we will need a list of the exons, grouped by transcript.
```{r}
ebt <- exonsBy(txdb, by="tx")
ebt <- ebt[txp$tx_id] # subset to those txp/genes of interest
```
Here, we could have also used `bind_ranges` but I find that for very
large lists of ranges, `unlist` is faster:
```{r}
exons <- unlist(ebt) %>%
select(exon_id, exon_rank) %>%
mutate(tx_id = rep(names(ebt), lengths(ebt)))
exons
```
The `exons` ranges are missing some of our key metadata from `txp`. We
can add this, by first `select`-ing what we want from `txp`, turning
this into a tibble and `left_join`-ing to the `exons`. I add an
`all.equal` step to make sure we have the two tables lined up, before
we add the extra columns with `cbind`.
Some of this code wouldn't be necessary for TranscriptDb with more
details `exons` output, as with *ensembldb*.
```{r}
txp_data <- txp %>%
select(tx_id, gene_id, ntxp, the_one, .drop_ranges=TRUE) %>%
as_tibble()
ids <- dplyr::left_join(tibble(tx_id = exons$tx_id),
txp_data, by="tx_id")
ids
all.equal(exons$tx_id, ids$tx_id)
mcols(exons) <- cbind(mcols(exons), ids %>% select(-tx_id))
exons
```
We repeat similar code as performed above with `txp`, now identifying
parts of exons that are unique to the isoform of interest, per gene:
```{r}
exon_parts <- exons %>%
group_by(gene_id) %>%
disjoin_ranges(the_one_parts = min(the_one)) %>%
filter(the_one_parts > 0)
exon_parts
```
To confirm that we've identified the right parts, let's use
*plotgardener* to visualize a particular gene:
```{r}
txp %>%
filter(gene_id == "339451")
tx_to_show <- txp %>%
filter(gene_id == "339451" & the_one == 1) %>%
as_tibble() %>%
dplyr::pull(tx_name)
these_parts <- exon_parts %>%
filter(gene_id == "339451")
```
We lay out a page zooming into this gene and its isoforms:
```{r message=FALSE}
library(plotgardener)
par <- pgParams(
chrom = "chr1",
chromstart = 895.9e3, chromend = 901.2e3,
assembly = "hg19", just = c("left", "bottom")
)
```
We will highlight our isoform of interest:
```{r}
hilite <- data.frame(transcript=tx_to_show, color="magenta")
```
Finally, putting it all together:
```{r isoform-plot, fig.width=5, fig.height=2.5, message=FALSE}
pageCreate(width = 5, height = 2.5, showGuides = FALSE)
plotTranscripts(
params = par, x = 0.5, y = 1.5, width = 4, height = 1.5,
transcriptHighlights = hilite
)
plotRanges(
these_parts, fill="darkorchid",
params = par, x = 0.5, y = 1.75, width = 4, height = .25
)
label <- paste("unique parts of", tx_to_show)
plotText(
label = label, fontcolor = "darkorchid",
params = par, x = 3.1, y = 1.75,
just = c("left", "bottom"), fontsize = 8
)
plotGenomeLabel(
params = par, x = 0.5, y = 2, length = 4,
just = c("left", "top")
)
```
Questions:
- How else could we have found parts of one isoform per gene, that
do not belong to any other isoforms of the genes. Would other
approaches have any limitations?