-
Notifications
You must be signed in to change notification settings - Fork 22
aggregate
Usage message:
bwtool aggregate - produce plot data as averages surrounding given regions
seen in the bigWig.
usage:
bwtool aggregate up:down a.bed,b.bed,... x.bw,y.bw,... output.txt
bwtool aggregate up:down bed.lst bigWig.lst output.txt
bwtool aggregate up:meta:down a.bed,b.bed,... x.bw,y.bw,... output.txt
bwtool aggregate up:meta:down bed.lst bigWig.lst output.txt
where:
up:down is a description of the range to use surrounding the
centers/starts. A value of 200:300 will create plot points
for 200 bp upstream to 300 bp downstream from the center of
all the regions.
bed.lst/bigWig.lst are files containing lists of files to be used
in place of a comma-separated list.
options:
-starts use starts of bed regions as opposed to the middles
(unavailable when using meta)
-ends use ends of bed regions as opposed to the middles
(unavailable when using meta)
-firstbase in this case the zero base is used so output
has left+right+1 lines
-expanded output medians and standard deviations instead of just
averages
-cluster=k cluster with k-means with given parameter (2-10 are best)
-cluster-sets=file.bed
write out the original bed with the cluster label along
with the accompanying region used for the clustering
-long-form=[label1,label2,...,labeln]
output "long form" where each line is just the position
and one of the values and the first column is the name of
the file.
Suppose we have the following data in a bigWig (main.bigWig, blue graph) and a bed (agg1.bed, red regions):
The files (wig and bed) would look like:
variableStep chrom=chr span=1
1 1.0
2 2.0
3 5.0
4 6.0
5 5.0
6 3.0
7 3.0
8 5.0
9 5.0
10 5.0
11 6.0
12 6.0
13 0.0
14 2.0
15 3.0
16 3.0
17 10.0
18 4.0
19 4.0
20 2.0
21 2.0
22 2.0
23 1.0
variableStep chrom=chr span=1
28 2.0
29 3.0
30 4.0
31 6.0
32 6.0
33 4.0
34 4.0
35 4.0
36 2.0
and
chr 0 4 R1
chr 9 19 R2
chr 28 35 R3
respectively. Aggregate will align the regions on their centers by default (below left), but also possible is to align the data on the start coordinates (below center) using the the -starts option, or on the end coordinates using the -ends option (below right).
As you can see, the size of the region is unimportant. What is important is how they are aligned and the size of the left:right aggregate specification. In the case the size of the bed region is odd, the center of the region is taken as the floor() of the calculated midpoint coordinate. In this way, the alignment is always between bases. In a simple case, one could ask for the 3 bases left and right of each of these centerings as follows:
Region/Values | Left | Right | Left (-starts) | Right (-starts) | Left (-ends) | Right (-ends) |
---|---|---|---|---|---|---|
R1 bases | -1,1,2 | 3,4,5 | -3,-2,-1 | 1,2,3 | 2,3,4 | 5,6,7 |
R1 vals | NA,1.0,2.0 | 5.0,6.0,5.0 | NA,NA,NA | 1.0,2.0,5.0 | 2.0,5.0,6.0 | 5.0,3.0,3.0 |
R2 bases | 12,13,14 | 15,16,17 | 6,7,8 | 9,10,11 | 17,18,19 | 20,21,22 |
R2 vals | 6.0,0.0,2.0 | 3.0,3.0,10.0 | 3.0,5.0,5.0 | 5.0,6.0,6.0 | 10.0,4.0,4.0 | 2.0,2.0,2.0 |
R3 bases | 29,30,31 | 32,33,34 | 26,27,28 | 29,30,31 | 33,34,35 | 36,37,38 |
R3 vals | 3.0,4.0,6.0 | 6.0,4.0,4.0 | NA,NA,2.0 | 3.0,4.0,6.0 | 4.0,4.0,4.0 | 2.0,NA,NA |
The aggregate calculation at its most basic, does an averaging. In this case, when running with each type of alignment:
Alignment | Base from center | Calculation | Result |
---|---|---|---|
default | -3 | (6.0 + 3.0) ÷ 2 | 4.5 |
default | -2 | (1.0 + 0.0 + 4.0) ÷ 3 | 1.66667 |
default | -1 | (2.0 + 2.0 + 6.0) ÷ 3 | 3.33333 |
default | 1 | (5.0 + 3.0 + 6.0) ÷ 3 | 4.66667 |
default | 2 | (6.0 + 3.0 + 4.0) ÷ 3 | 4.33333 |
default | 3 | (5.0 + 10.0 + 4.0) ÷ 3 | 6.33333 |
-starts | -3 | 3.0 ÷ 1 | 3.0 |
-starts | -2 | 5.0 ÷ 1 | 5.0 |
-starts | -1 | (2.0 + 5.0) ÷ 2 | 3.5 |
-starts | 1 | (1.0 + 5.0 + 3.0) ÷ 3 | 3.0 |
-starts | 2 | (2.0 + 6.0 + 4.0) ÷ 3 | 4.0 |
-starts | 3 | (5.0 + 6.0 + 6.0) ÷ 3 | 5.66667 |
-ends | -3 | (2.0 + 10.0 + 4.0) ÷ 3 | 5.33333 |
-ends | -2 | (5.0 + 4.0 + 4.0) ÷ 3 | 4.33333 |
-ends | -1 | (6.0 + 4.0 + 4.0) ÷ 3 | 4.66667 |
-ends | 1 | (5.0 + 2.0 + 2.0) ÷ 3 | 3.0 |
-ends | 2 | (3.0 + 2.0) ÷ 2 | 2.5 |
-ends | 3 | (3.0 + 2.0) ÷ 2 | 2.5 |
After running the program, we get:
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout
-3 4.500000
-2 1.666667
-1 3.333333
1 4.666667
2 4.333333
3 6.333333
bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -starts
-3 3.000000
-2 5.000000
-1 3.500000
1 3.000000
2 4.000000
3 5.666667
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -ends
-3 5.333333
-2 4.333333
-1 4.666667
1 3.000000
2 2.500000
3 2.500000
From this point, it may be worthwhile to examine a few other options. -firstbase, does nothing really substantial other than center on a base instead of a base border. The output is almost identical to adding an additional base to the right part of the left:right specification, except the base distance numbers are slightly different, e.g.:
$ bwtool agg 3:4 agg1.bed main.bigWig /dev/stdout -ends
-3 5.333333
-2 4.333333
-1 4.666667
1 3.000000
2 2.500000
3 2.500000
4 3.000000
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -ends -firstbase
-3 5.333333
-2 4.333333
-1 4.666667
0 3.000000
1 2.500000
2 2.500000
3 3.000000
-expanded prints three additional columns to the output. These are:
- Median.
- Standard deviation.
- Size (number of values used).
For example, for the -2 values in the default 3:3 running of the program are, in sorted order, 0.0, 1.0, and 4.0. The middle value in the sorted list is 1.0 becomes the median, and the standard deviation is calculated as sqrt(((1.666667-0.0)^2+(1.666667-1)^2+(1.666667-4)^2)/3) = 1.699673.
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -expanded
-3 4.500000 4.500000 1.500000 2
-2 1.666667 1.000000 1.699673 3
-1 3.333333 2.000000 1.885618 3
1 4.666667 5.000000 1.247219 3
2 4.333333 4.000000 1.247219 3
3 6.333333 5.000000 2.624669 3
Before moving on to more advanced usage of the aggregate program, it's important to mention that when using BED6-formatted regions, i.e. stranded, in the regions that are minus strand, the data is reversed. In the example, we now have a new region R4 on the minus strand:
with an updated bed file looking like:
chr 0 4 R1 0 +
chr 9 19 R2 0 +
chr 28 35 R3 0 +
chr 13 22 R4 0 -
This means that the coordinates used for R4 are 21,20,19:18,17,16 corresponding to data values of 2.0,2.0,4.0:4.0,10.0,3.0. If you were expecting the coordinates to in fact be 20,19,18:17,16,15 instead, it's important to note that because the center of odd-sized regions is placed 1/2 a base off-center to the left in plus-strand regions, the symmetric thing to do is to put them 1/2 a base off-center in the minus-strand regions. Yes, coordinate calculus is frequently annoying in this way. Anyhow, the updated calculations are:
Base from center | Calculation | Result |
---|---|---|
-3 | (6.0 + 3.0 + 2.0) ÷ 3 | 3.66667 |
-2 | (1.0 + 0.0 + 4.0 + 2.0) ÷ 4 | 1.75 |
-1 | (2.0 + 2.0 + 6.0 + 4.0) ÷ 4 | 3.5 |
1 | (5.0 + 3.0 + 6.0 + 4.0) ÷ 4 | 4.5 |
2 | (6.0 + 3.0 + 4.0 + 10.0) ÷ 4 | 5.75 |
3 | (5.0 + 10.0 + 4.0 + 3.0) ÷ 4 | 5.5 |
which is what results when running bwtoool:
$ bwtool agg 3:3 agg2.bed main.bigWig /dev/stdout
-3 3.666667
-2 1.750000
-1 3.500000
1 4.500000
2 5.750000
3 5.500000
If you look at the following example:
it can be seen that the bed regions fall into two groups: ones that roughly contain a peak (R1, R4, R6), and those that contain a valley (R2, R3, possibly R5). bwtool is able to cluster regions using a basic k-means algorithm, with the limitation that the data from the regions (the aggregated regions, not the orignal regions) not contain missing data. Regions with missing data will be thrown out of the algorithm, and not used for further aggregation. So in this example: R5. If two or more clusters are desired, as in this case, the -cluster option is used.
$ bwtool agg 3:3 agg_clust1.bed main.bigWig /dev/stdout -cluster=2
-3 3.000000 5.500000
-2 4.000000 5.500000
-1 7.333333 1.500000
1 5.000000 2.500000
2 3.666667 4.000000
3 2.666667 4.500000
It's plain to see that the first column of the output has a low-high-low profile and the second column has an high-low-high profile. To confirm the clusters are the ones we expect (also expecting R5 to be thrown out), we can just run aggregate twice with two separate beds:
$ egrep "R1|R4|R6" agg_clust1.bed > agg_clust1_c1.bed
$ bwtool agg 3:3 tmp.bed main.bigWig /dev/stdout
-3 3.000000
-2 4.000000
-1 7.333333
1 5.000000
2 3.666667
3 2.666667
$ egrep "R2|R3" agg_clust1.bed > agg_clust1_c2.bed
$ bwtool agg 3:3 tmp.bed main.bigWig /dev/stdout
-3 5.500000
-2 5.500000
-1 1.500000
1 2.500000
2 4.000000
3 4.500000
To further confirm the clusters are correct, the -cluster-sets option can be used. This will report which cluster each region is assigned to in a file, formatted in the following tripartite way: <original bed region> <cluster assignment> <aggregated region>. This looks like:
$ bwtool agg 3:3 agg_clust1.bed main.bigWig /dev/null -cluster=2 -cluster-sets=/dev/stdout
chr 21 25 R5 0 + -1 chr 20 26 R5 0 +
chr 27 35 R6 0 - 0 chr 28 34 R6 0 -
chr 2 6 R1 0 + 0 chr 1 7 R1 0 +
chr 14 21 R4 0 + 0 chr 14 20 R4 0 +
chr 4 8 R2 0 - 1 chr 3 9 R2 0 -
chr 11 15 R3 0 + 1 chr 10 16 R3 0 +
Perhaps not very intuitively, the clusters are numbered from 0 to n-1 and not from 1 to n. Regions in the -1 cluster are the ones thrown out.
Up until now, the examples have been using one bed and one bigWig. But it's possible to use multiple bigWigs or multiple beds in the same aggregation. In the clustering example above, we made two files: agg_clust1_c1.bed and agg_clust1_c2.bed by grepping the particular regions we wanted in order to reconstitute the clusters. Using both files in the same aggregation will result in a column for the first bed, and a column for the second bed, in the same order given in the command:
$ bwtool agg 3:3 agg_clust1_c1.bed,agg_clust1_c2.bed main.bigWig /dev/stdout
-3 3.000000 5.500000
-2 4.000000 5.500000
-1 7.333333 1.500000
1 5.000000 2.500000
2 3.666667 4.000000
3 2.666667 4.500000
Using -expanded will result in four columns for each file:
$ bwtool agg 3:3 agg_clust1_c1.bed,agg_clust1_c2.bed main.bigWig /dev/stdout -expanded
-3 3.000000 3.000000 0.816497 3 5.500000 5.500000 0.500000 2
-2 4.000000 4.000000 0.816497 3 5.500000 5.500000 0.500000 2
-1 7.333333 6.000000 1.885618 3 1.500000 1.500000 1.500000 2
1 5.000000 5.000000 0.816497 3 2.500000 2.500000 0.500000 2
2 3.666667 4.000000 0.471405 3 4.000000 4.000000 1.000000 2
3 2.666667 3.000000 0.471405 3 4.500000 4.500000 1.500000 2
Multiple can also be used in a similar way. In the example below, another dataset (second.bigWig), and a bed (R0) has been added to the previous one:
variableStep chrom=chr span=1
1 4.0
2 2.0
3 3.0
4 4.0
5 4.0
6 3.0
7 3.0
8 7.0
9 8.0
10 7.0
11 7.0
12 5.0
13 1.0
14 2.0
15 3.0
16 3.0
17 4.0
18 4.0
19 4.0
20 2.0
21 2.0
22 2.0
23 1.0
24 1.0
25 1.0
26 2.0
27 1.0
28 2.0
29 3.0
30 4.0
31 4.0
32 2.0
33 2.0
34 2.0
35 2.0
36 2.0
with bed:
chr 13 20 R0
making:
Aggregating the two datasets together just echoes the data at a desired range, because with only one region, no averaging is done. But two columns are made: the first for the first dataset, and the second for the second dataset:
$ bwtool agg -starts 3:3 r0.bed main.bigWig,second.bigWig /dev/stdout
-3 6.000000 7.000000
-2 6.000000 5.000000
-1 0.000000 1.000000
1 2.000000 2.000000
2 3.000000 3.000000
3 3.000000 3.000000
Now we'll add another bed with two regions, in order to demonstrate two beds with two bigWigs. The yellow regions below:
are defined by the following (yellow.bed):
chr 2 6 R1
chr 28 32 R2
Combining two or more beds and two or more bigWigs automatically changes the output format slightly to that known as long form, as demonstrated:
$ bwtool agg 3:3 r0.bed,yellow.bed main.bigWig,second.bigWig /dev/stdout
r0.bed main.bigWig -3 2.000000
r0.bed second.bigWig -3 2.000000
yellow.bed main.bigWig -3 2.000000
yellow.bed second.bigWig -3 2.000000
r0.bed main.bigWig -2 3.000000
r0.bed second.bigWig -2 3.000000
yellow.bed main.bigWig -2 4.000000
yellow.bed second.bigWig -2 3.000000
r0.bed main.bigWig -1 3.000000
r0.bed second.bigWig -1 3.000000
yellow.bed main.bigWig -1 5.000000
yellow.bed second.bigWig -1 4.000000
r0.bed main.bigWig 1 10.000000
r0.bed second.bigWig 1 4.000000
yellow.bed main.bigWig 1 5.500000
yellow.bed second.bigWig 1 4.000000
r0.bed main.bigWig 2 4.000000
r0.bed second.bigWig 2 4.000000
yellow.bed main.bigWig 2 4.500000
yellow.bed second.bigWig 2 2.500000
r0.bed main.bigWig 3 4.000000
r0.bed second.bigWig 3 4.000000
yellow.bed main.bigWig 3 3.500000
yellow.bed second.bigWig 3 2.500000
Because two beds and two bigWigs would require 4 output columns, a more explicit way of outputting is to only have one position column and one aggregate result column and instead expand vertically. By default, the filenames are used as labels, but they may also be set using the -long-form option like:
$ bwtool agg 3:3 r0.bed,yellow.bed main.bigWig,second.bigWig /dev/stdout -long-form=red_bed,yellow_bed,blue_sig,orange_sig
red_bed blue_sig -3 2.000000
red_bed orange_sig -3 2.000000
yellow_bed blue_sig -3 2.000000
yellow_bed orange_sig -3 2.000000
red_bed blue_sig -2 3.000000
red_bed orange_sig -2 3.000000
yellow_bed blue_sig -2 4.000000
yellow_bed orange_sig -2 3.000000
red_bed blue_sig -1 3.000000
red_bed orange_sig -1 3.000000
yellow_bed blue_sig -1 5.000000
yellow_bed orange_sig -1 4.000000
red_bed blue_sig 1 10.000000
red_bed orange_sig 1 4.000000
yellow_bed blue_sig 1 5.500000
yellow_bed orange_sig 1 4.000000
red_bed blue_sig 2 4.000000
red_bed orange_sig 2 4.000000
yellow_bed blue_sig 2 4.500000
yellow_bed orange_sig 2 2.500000
red_bed blue_sig 3 4.000000
red_bed orange_sig 3 4.000000
yellow_bed blue_sig 3 3.500000
yellow_bed orange_sig 3 2.500000
Using multiple beds, multiple bigWigs, -long-form with labels, etc all together can make for some very long commands. A good way to tidy this up is to make files listing the beds and the bigWigs, which incidentally can also hold the labeling information. Making the two files beds.lst:
r0.bed red_bed
yellow.bed yellow_bed
and bigwigs.lst:
main.bigWig blue_sig
second.bigWig orange_sig
the above output can be exactly replicated using the following, shorter, command:
$ bwtool agg 3:3 beds.lst bigwigs.lst /dev/stdout
When the meta value is specified in the range argument, this means the regions are aligned at both the start and end coordinate of each bed row and the region in between is artificially set to be meta bases long. So for example, if the typical exon is 100 bases long, then setting a meta value of 100 will be reasonable. The resulting data within each bed region is expanded or contracted across the same number of bases, depending on whether the original region is larger or smaller than the meta region. Exons that are 150 bases long will result in meta bases that consist of more than one original base, averaging is done to achieve this. Exons that are smaller than the meta region will result in meta bases that use less than one original base, however averaging may still be involved if a particular meta base maps across two original bases.
For example, in the original example, R1 is 4 bases, R2 is 10 bases and R3 is 7 bases:
If we ask for a range like 2:5:2 then the aggregate alignment will look like:
U2 | U1 | M1 | M2 | M3 | M4 | M5 | D1 | D2 | |
---|---|---|---|---|---|---|---|---|---|
R1 | NA | NA | M | M | M | M | M | 5 | 3 |
R2 | 5 | 5 | M | M | M | M | M | 2 | 2 |
R3 | NA | 2 | M | M | M | M | M | 2 | NA |
Before the aggregation step, it is necessary to calculate the meta values. Because R2 and R3 are both larger than 5 bases, the data is contracted, while on the other hand, the data in R1 is expanded. For R2 it is quite easy because 10 is a multiple of 5. We can represent the rescaling of bases in the following figure:
The meta values are then:
Meta Base | R1 bases | R1 | R2 bases | R2 | R3 bases | R3 |
---|---|---|---|---|---|---|
1 | 1 (4/4) | 1.0 | 1 (5/10), 2 (5/10) | 5(5/10) + 6(5/10) = 5.5 | 1 (5/7), 2 (2/7) | 3(5/7) + 4(2/7) = 3.29 |
2 | 1 (1/4), 2 (3/4) | (1/4) + 2(3/4) = 1.75 | 3 (5/10), 4 (5/10) | 6(5/10) + 0(5/10) = 3.0 | 2 (3/7), 3 (4/7) | 4(3/7) + 6(4/7) = 5.14 |
3 | 2 (2/4), 3 (2/4) | 2(2/4) + 5(2/4) = 3.5 | 5 (5/10), 6 (5/10) | 2(5/10) + 3(5/10) = 2.5 | 3 (1/7), 4 (5/7), 5 (1/7) | 6(1/7) + 6(5/7) + 4(1/7) = 5.71 |
4 | 3 (3/4), 1 (1/4) | 5(3/4) + 6(1/4) = 5.25 | 7 (5/10), 8 (5/10) | 3(5/10) + 10(5/10) = 6.5 | 5 (4/7), 6 (3/7) | 4(4/7) + 4(3/7) = 4.0 |
5 | 4 (4/4) | 6.0 | 9 (5/10), 10 (5/10) | 4(5/10) + 4(5/10) = 4.0 | 6 (2/7), 7 (5/7) | 4(2/7) + 4(5/7) = 4.0 |
Now that we have the meta signal calculated, we can update the data alignment:
U2 | U1 | M1 | M2 | M3 | M4 | M5 | D1 | D2 | |
---|---|---|---|---|---|---|---|---|---|
R1 | NA | NA | 1.0 | 1.75 | 3.5 | 5.25 | 6.0 | 5 | 3 |
R2 | 5 | 5 | 5.5 | 3.0 | 2.5 | 6.5 | 4.0 | 2 | 2 |
R3 | NA | 2 | 3.29 | 5.14 | 5.71 | 4.0 | 4.0 | 2 | NA |
which is then ready to be aggregated accordingly:
U2 | U1 | M1 | M2 | M3 | M4 | M5 | D1 | D2 | |
---|---|---|---|---|---|---|---|---|---|
Agg | 5.0 | 3.5 | 3.26 | 3.30 | 3.90 | 5.25 | 4.67 | 3.0 | 2.5 |
and we can check this with bwtool:
$ bwtool agg 2:5:2 agg1.bed main.bw /dev/stdout
-2 5.000000
-1 3.500000
1 3.261905
2 3.297619
3 3.904762
4 5.250000
5 4.666667
# meta part ends here
6 3.000000
7 2.500000
Note that the downstream aggregate coordinates continue counting from the meta coordinate. The reason for this is that most graphing software expects there to only be one zero per dimension. In some cases, such as in R, labels may be manually changed. The comment in the output is to help with any confusion arising from this.
The meta parameter is meant to align the signal of starts/ends of variable-sized regions. By default the upstream and downstream parameters are unscaled. So specifying 200:1000:200 for gene regions will shrink/expand all genes to be 1000 bp but the 200 bases upstream and downstream are not scaled in any way. The -meta-scale-all option will cause these regions to first scale to the size of the gene based on the proportion specified by the upstream:meta:downstream parameter. Once all sizes are known, the region is treated in the same way as a meta region, and the signal is expanded or contracted to fit. In this way, the meta scale is uniform across the upstream, meta, and downstream regions.
Below is an example of how the upstream and downstream regions are captured prior to fitting meta sizes specified as 2:5:2, (modifying the classic example a bit so that there's no missing data):
In this case, for R2, the region is 10 bases long so a meta base is 10/5 = 2 bases. With a 2 base meta size, 2x2 = 4 bases are needed from the upstream and downstream if the -meta-scale-all option is used (in green). If the -meta-scale-all option is not used, then just 2 bases are used by default (purple). R3 is a different size than R2, but without the -meta-scale-all option, both use an equivalent amount of signal for aggregation in the upstream and downstream regions. With the -meta-scale-all option, R3 has a meta base size of 7/5 = 1.4 bases, and therefore uses 2x1.4 = 2.8 bases of signal for the eventual aggregation. Calculating the second upstream metabase for R3 would then take into account 80% of the signal of base 26 (0.8 x 2 = 1.6), and 60% of the signal of base 27 (0.6 x 1 = 0.6), the sum of which is 2.2, which then divided by 1.4 is 1.571. We can see this running bwtool:
$ bwtool agg 2:5:2 r3.bed main-alt.bw /dev/stdout -meta-scale-all
-2 1.571428
-1 1.714286
1 3.285714
2 5.142857
3 5.714286
4 4.000000
5 4.000000
# meta part ends here
6 2.000000
7 nan