-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbootstrap-overlap.Rmd
321 lines (265 loc) · 9.78 KB
/
bootstrap-overlap.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
# Bootstrap overlap
Objective: determine if one set of peaks are overlapping another set
of peaks more or less than expected when comparing to sets of *null*
features.
The null features can be generated in a variety of ways -- here we
will generate them by resampling large blocks of one of the peak
sets. The motivation for sampling blocks, instead of placing features
uniformly along the chromosome ("shuffling"), is to better preserve
inter-feature distances, because genomic features tend to cluster in
the genome, even after considering things like excluded regions. This
technique of generating null feature sets is called
*block bootstrap resampling*, and we will use the *nullranges*
implementation of the block bootstrapping algorithm [@Mu2023] to
generate the null features, followed by overlap analysis with
*plyranges* [@Lee2019]. The approach used in *nullranges* to generate
bootstrap ranges closely follows the GSC method proposed by
@bickel2010.
Note that an alternative approach for null hypothesis comparisons is to
define a set of *covariate-matched ranges*, an approach also implemented
in the *nullranges* package and described in @Davis2023. For examples
of using matching with ranges, see the related articles
[here](https://nullranges.github.io/nullranges/).
We start by loading the ENCODE kidney and bladder H3K27ac ChIP-seq
peaks used in the previous analysis [@encode].
```{r eval=FALSE}
library(AnnotationHub)
ah <- AnnotationHub()
kidney_pks <- ah[["AH43443"]]
bladder_pks <- ah[["AH44180"]]
```
```{r echo=FALSE}
load("data/peaks.rda")
```
We will additionally obtain an excluded region set, so that we avoid
placing bootstrap features into regions of the genome that don't
typically have features. A variety of possible exclude lists are
provided by the *excluderanges* packages and available via
*AnnotationHub*. Here we will use the
`hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable` regions, as
they are available for hg19, which is the genome used with the peak
sets.
```{r eval=FALSE}
# query(ah, c("excluderanges","hg19"))
exclude <- ah[["AH95912"]]
save("data/exclude.rda")
```
```{r echo=FALSE}
load("data/exclude.rda")
```
To make the code more generic, we will rename the kidney peaks to `x`
and the bladder peaks to `y`. We will be looking for overlaps with
features in `x` as the query set: how many of the features in `x`
overlap features in `y`?
The following code reduces our analysis to looking only at standard
chromosomes, excluding the mitochondrial genome (too small for
including in the block bootstrap).
```{r message=FALSE}
library(GenomeInfoDb)
x <- kidney_pks
y <- bladder_pks
x <- keepStandardChromosomes(x)
seqlevels(x, pruning.mode="coarse") <- setdiff(seqlevels(x), "chrM")
seqlevels(y, pruning.mode="coarse") <- seqlevels(x)
seqlevels(exclude, pruning.mode="coarse") <- seqlevels(x)
```
We are mostly concerned with avoiding placing bootstrapped features in
large regions in the exclude list, so we subset the exclude list to
features larger than 500 bp. Why do we have `plyranges::` in front of
`filter`? This is because there is a function in the *ensembldb*
package that is also called *filter*, so it's a bit safer if we are
using both packages to use the package name.
```{r message=FALSE}
library(plyranges)
exclude <- exclude %>%
plyranges::filter(width(exclude) >= 500)
```
We also subset to the peaks for kidney and bladder which have q-value
less than 0.001 and signal value greater than 9 (these are arbitrary
filter values, just for demonstration).
For further analysis, we will need the features in `y` to be sorted,
for the bootstrapping, here we sort both sets.
```{r}
q_thr <- 3
s_thr <- 9
x <- x %>%
plyranges::filter(qValue > q_thr & signalValue > s_thr) %>%
sort()
y <- y %>%
plyranges::filter(qValue > q_thr & signalValue > s_thr) %>%
sort()
```
Now we can assess how many overlaps we observed between `x` and `y`:
```{r}
obs <- x %>% mutate(n_overlaps = count_overlaps(., y))
obs %>% summarize(total = sum(n_overlaps))
table( obs$n_overlaps )
```
We can check if any of the features of `y` fall in the excluded
regions:
```{r}
y %>% mutate(n_overlaps = count_overlaps(., exclude)) %>%
summarize(total = sum(n_overlaps))
```
The following chunk of code does the bootstrapping of features in
`y`. Here we subset first to metadata columns of interest (an `id`
variable that we create, and the signal value which we rename to
`signal`).
```{r}
pks_to_boot <- y %>%
mutate(id = seq_along(.)) %>%
plyranges::select(id, signal = signalValue)
```
The `bootRanges` function returns the bootstrap feature sets
combined into one *GRanges* object -- this tidy format facilitates
downstream analysis as we will see. The bootstrap iteration is stored
in the `iter` metadata column.
```{r bootstrap}
library(nullranges)
R <- 30 # number of iterations
set.seed(5) # set seed for reproducibility
boots <- bootRanges(pks_to_boot, blockLength=5e5, R=R, exclude=exclude)
boots
```
The default use above of the `exclude` argument is to drop bootstrapped
ranges that overlap the exclude list.
We can examine properties of permuted `y` over iterations, and compare
to the original `y`. To do so, we first add the original features as
`iter=0`.
```{r}
combined <- pks_to_boot %>%
mutate(iter=0) %>%
bind_ranges(boots) %>%
plyranges::select(iter)
```
Then compute summaries:
```{r}
library(tibble)
stats <- combined %>%
group_by(iter) %>%
summarize(n = n(),
sum_width=sum(width)/1e6) %>%
as_tibble()
```
Original `y` vs bootstrap:
```{r}
stats[1,]
summary(stats[-1,])
```
We can also look at distributions of various aspects, e.g. here the
width of features, across a few of the bootstraps and the original
feature set `y`.
```{r boot-width, message=FALSE}
library(ggplot2)
library(ggridges)
combined %>% plyranges::filter(iter %in% 0:5) %>%
plyranges::select(iter) %>%
as_tibble() %>%
mutate(type = ifelse(iter == 0, "original", "boot")) %>%
ggplot(aes(log10(width), iter, fill=type)) +
geom_density_ridges(alpha = 0.75) +
geom_text(data=head(stats),
aes(x=2.25, y=iter, label=paste0("n=",n), fill=NULL),
vjust=1.5)
```
To compute overlap with the null features, we need the `complete()`
function from the *tidyr* package. We saw `complete()` before -- this
is used in the case that one of the iterations has no overlaps. In
this case, we need to record the 0 value for proper inference and
plots downstream. It is rare we would have no overlaps with so many
features as we have in `x` and `y` but it's good practice to leave the
`complete()` as part of the workflow so the code works correctly in
all cases.
The overlap per iteration of the bootstrap is accomplished by a series
of *plyranges* / *dplyr* commands (we switch to *dplyr* halfway
through, after the `as_tibble()` call).
```{r message=FALSE}
library(tidyr)
null <- x %>% join_overlap_inner(boots) %>%
group_by(iter) %>%
summarize(n_overlaps = n()) %>%
as_tibble() %>%
complete(iter, fill=list(n_overlaps = 0))
head(null)
```
```{r}
sum( obs$n_overlaps )
```
The observed number of overlaps is about two orders of magnitude more
than the bootstrapped number, which makes sense as two tissues would
be expected to share a number of similar regulatory regions (as marked
by H3K27ac) -- more so than randomly placed genomic features, even
after accounting for excluded regions and feature clustering.
```{r boot-hist}
ggplot(null, aes(n_overlaps)) +
geom_histogram(binwidth=5) +
ggtitle("bootstrap overlaps")
```
What could be improved with this analysis?
Note that in the above chunks where we count overlaps, we are doubly
(or triply, etc.) counting features in `x` if they hit more than one
feature in `y` or `boots`. We can count statistics per `x` feature by
adding another `group_by` into the stream of operations. This also
allows us to do more complex operations, such as computing the
maximum signal value for the overlapping features in `y` per feature
in `x`:
First add an ID variable to keep track of `x` features:
```{r}
x <- x %>% mutate(x_id = seq_along(x))
```
Then perform an inner join, and group by the new `x` ID:
```{r}
obs <- x %>% join_overlap_inner(pks_to_boot) %>%
group_by(x_id) %>%
summarize(num_overlaps = n(),
max_signal = max(signal))
sum( obs$num_overlaps > 0 )
summary( obs$max_signal )
```
For the bootstrap ranges overlap step, we also need to add `iter` to
the initial `group_by`, so we count per `x` feature and per iteration
of the bootstrap:
```{r}
null <- x %>% join_overlap_inner(boots) %>%
group_by(x_id, iter) %>%
summarize(num_overlaps = n()) %>%
as_tibble() %>%
group_by(iter) %>%
summarize(any_hits = sum(num_overlaps > 0)) %>%
complete(iter, fill=list(any_hits = 0))
head(null)
```
Still, we are seeing much more overlap in the observed data than in
the bootstrap data:
```{r}
sum( obs$num_overlaps > 0 ) /
mean(null$any_hits)
```
The above code chunk then avoids double counting. We could also have
made other per-`x`-feature statistics in the summarize step after the
initial `group_by`, such as maximum signal of overlapping features.
What other ways could we have done this analysis? Suppose we don't
just want the count of overlaps, but the rate of overlaps from the `y`
perspective, keeping track of the variable number of features per
bootstrap.
We demonstrate one approach to obtain this rate for the bootstraps:
```{r}
x_thin <- x %>% plyranges::select(x_id)
null <- boots %>% plyranges::select(id, iter) %>%
join_overlap_inner(x_thin) %>%
group_by(id, iter) %>%
summarize(num_overlaps = n()) %>%
as_tibble() %>%
group_by(iter) %>%
summarize(any_hits = sum(num_overlaps > 0)) %>%
complete(iter, fill=list(any_hits = 0))
```
Now combine with the per-iteration total count:
```{r}
totals <- boots %>%
group_by(iter) %>%
summarize(total=n()) %>%
as_tibble()
null %>% dplyr::left_join(totals) %>%
mutate(rate = any_hits/total)
```