Skip to content
Andy Pohl edited this page Oct 21, 2013 · 24 revisions

Usage message:

bwtool aggregate - produce plot data as averages surrounding given regions
   seen in the bigWig.
usage:
   bwtool aggregate left:right a.bed,b.bed,... x.bw,y.bw,... output.txt
   bwtool aggregate left:right bed.lst bigWig.lst output.txt
where:
  left:right is a description of the range to use surrounding the
         centers/starts. A value of 200:300 will create plot points
         for -200 bp to +300 bp from the center of all the regions.
  bed.lst/bigWig.lst are files containing lists of files to be used
         in place of a comma-separated list.

options:
   -starts          use starts of bed regions as opposed to the middles
   -ends            use ends of bed regions as opposed to the middles
   -firstbase       in this case the zero base is used so output
                    has left+right+1 lines
   -expanded        output medians and standard deviations instead of just
                    averages
   -cluster=k       cluster with k-means with given parameter (2-10 are best)
   -cluster-sets=file.bed
                    write out the original bed with the cluster label along
                    with the accompanying region used for the clustering
   -long-form=[label1,label2,...,labeln]
                    output "long form" where each line is just the position
                    and one of the values and the first column is the name of
                    the file.

Examples

Suppose we have the following data in a bigWig (main.bigWig, blue graph) and a bed (agg1.bed, red regions):

The files (wig and bed) would look like:

variableStep chrom=chr span=1
1	1.0
2	2.0
3	5.0
4	6.0
5	5.0
6	3.0
7	3.0
8	5.0
9	5.0
10	5.0
11	6.0
12	6.0
13	0.0
14	2.0
15	3.0
16	3.0
17	10.0
18	4.0
19	4.0
20	2.0
21	2.0
22	2.0
23	1.0
variableStep chrom=chr span=1
28	2.0
29	3.0
30	4.0
31	6.0
32	6.0
33	4.0
34	4.0
35	4.0
36	2.0

and

chr	0	4	R1
chr	9	19	R2
chr	28	35	R3

respectively. Aggregate will align the regions on their centers by default (below left), but also possible is to align the data on the start coordinates (below center) using the the -starts option, or on the end coordinates using the -ends option (below right).

As you can see, the size of the region is unimportant. What is important is how they are aligned and the size of the left:right aggregate specification. In the case the size of the bed region is odd, the center of the region is taken as the floor() of the calculated midpoint coordinate. In this way, the alignment is always between bases. In a simple case, one could ask for the 3 bases left and right of each of these centerings as follows:

Region/Values Left Right Left (-starts) Right (-starts) Left (-ends) Right (-ends)
R1 bases -1,1,2 3,4,5 -3,-2,-1 1,2,3 2,3,4 5,6,7
R1 vals NA,1.0,2.0 5.0,6.0,5.0 NA,NA,NA 1.0,2.0,5.0 2.0,5.0,6.0 5.0,3.0,3.0
R2 bases 12,13,14 15,16,17 6,7,8 9,10,11 17,18,19 20,21,22
R2 vals 6.0,0.0,2.0 3.0,3.0,10.0 3.0,5.0,5.0 5.0,6.0,6.0 10.0,4.0,4.0 2.0,2.0,2.0
R3 bases 29,30,31 32,33,34 26,27,28 29,30,31 33,34,35 36,37,38
R3 vals 3.0,4.0,6.0 6.0,4.0,4.0 NA,NA,2.0 3.0,4.0,6.0 4.0,4.0,4.0 2.0,NA,NA

The aggregate calculation at its most basic, does an averaging. In this case, when running with each type of alignment:

Alignment Base from center Calculation Result
default -3 (6.0 + 3.0) ÷ 2 4.5
default -2 (1.0 + 0.0 + 4.0) ÷ 3 1.66667
default -1 (2.0 + 2.0 + 6.0) ÷ 3 3.33333
default 1 (5.0 + 3.0 + 6.0) ÷ 3 4.66667
default 2 (6.0 + 3.0 + 4.0) ÷ 3 4.33333
default 3 (5.0 + 10.0 + 4.0) ÷ 3 6.33333
-starts -3 3.0 ÷ 1 3.0
-starts -2 5.0 ÷ 1 5.0
-starts -1 (2.0 + 5.0) ÷ 2 3.5
-starts 1 (1.0 + 5.0 + 3.0) ÷ 3 3.0
-starts 2 (2.0 + 6.0 + 4.0) ÷ 3 4.0
-starts 3 (5.0 + 6.0 + 6.0) ÷ 3 5.66667
-ends -3 (2.0 + 10.0 + 4.0) ÷ 3 5.33333
-ends -2 (5.0 + 4.0 + 4.0) ÷ 3 4.33333
-ends -1 (6.0 + 4.0 + 4.0) ÷ 3 4.66667
-ends 1 (5.0 + 2.0 + 2.0) ÷ 3 3.0
-ends 2 (3.0 + 2.0) ÷ 2 2.5
-ends 3 (3.0 + 2.0) ÷ 2 2.5

After running the program, we get:

$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout
-3	4.500000
-2	1.666667
-1	3.333333
1	4.666667
2	4.333333
3	6.333333
bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -starts
-3	3.000000
-2	5.000000
-1	3.500000
1	3.000000
2	4.000000
3	5.666667
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -ends
-3	5.333333
-2	4.333333
-1	4.666667
1	3.000000
2	2.500000
3	2.500000

From this point, it may be worthwhile to examine a few other options. -firstbase, does nothing really substantial other than center on a base instead of a base border. The output is almost identical to adding an additional base to the right part of the left:right specification, except the base distance numbers are slightly different, e.g.:

$ bwtool agg 3:4 agg1.bed main.bigWig /dev/stdout -ends 
-3	5.333333
-2	4.333333
-1	4.666667
1	3.000000
2	2.500000
3	2.500000
4	3.000000
$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -ends -firstbase
-3	5.333333
-2	4.333333
-1	4.666667
0	3.000000
1	2.500000
2	2.500000
3	3.000000

-expanded prints three additional columns to the output. These are:

  1. Median.
  2. Standard deviation.
  3. Size (number of values used).

For example, for the -2 values in the default 3:3 running of the program are, in sorted order, 0.0, 1.0, and 4.0. The middle value in the sorted list is 1.0 becomes the median, and the standard deviation is calculated as sqrt(((1.666667-0.0)^2+(1.666667-1)^2+(1.666667-4)^2)/3) = 1.699673.

$ bwtool agg 3:3 agg1.bed main.bigWig /dev/stdout -expanded
-3	4.500000	4.500000	1.500000	2
-2	1.666667	1.000000	1.699673	3
-1	3.333333	2.000000	1.885618	3
1	4.666667	5.000000	1.247219	3
2	4.333333	4.000000	1.247219	3
3	6.333333	5.000000	2.624669	3

Before moving on to more advanced usage of the aggregate program, it's important to mention that when using BED6-formatted regions, i.e. stranded, in the regions that are minus strand, the data is reversed. In the example, we now have a new region R4 on the minus strand:

with an updated bed file looking like:

chr	0	4	R1	0	+
chr	9	19	R2	0	+
chr	28	35	R3	0	+
chr	13	22	R4	0	-

This means that the coordinates used for R4 are 21,20,19:18,17,16 corresponding to data values of 2.0,2.0,4.0:4.0,10.0,3.0. If you were expecting the coordinates to in fact be 20,19,18:17,16,15 instead, it's important to note that because the center of odd-sized regions is placed 1/2 a base off-center to the left in plus-strand regions, the symmetric thing to do is to put them 1/2 a base off-center in the minus-strand regions. Yes, coordinate calculus is frequently annoying in this way. Anyhow, the updated calculations are:

Base from center Calculation Result
-3 (6.0 + 3.0 + 2.0) ÷ 3 3.66667
-2 (1.0 + 0.0 + 4.0 + 2.0) ÷ 4 1.75
-1 (2.0 + 2.0 + 6.0 + 4.0) ÷ 4 3.5
1 (5.0 + 3.0 + 6.0 + 4.0) ÷ 4 4.5
2 (6.0 + 3.0 + 4.0 + 10.0) ÷ 4 5.75
3 (5.0 + 10.0 + 4.0 + 3.0) ÷ 4 5.5

which is what results when running bwtoool:

$ bwtool agg 3:3 agg2.bed main.bigWig /dev/stdout
-3	3.666667
-2	1.750000
-1	3.500000
1	4.500000
2	5.750000
3	5.500000

Clustering

If you look at the following example:

it can be seen that the bed regions fall into two groups: ones that roughly contain a peak (R1, R4, R6), and those that contain a valley (R2, R3, possibly R5). bwtool is able to cluster regions using a basic k-means algorithm, with the limitation that the data from the regions (the aggregated regions, not the orignal regions) not contain missing data. Regions with missing data will be thrown out of the algorithm, and not used for further aggregation. So in this example: R5. If two or more clusters are desired, as in this case, the -cluster option is used.

$ bwtool agg 3:3 agg_clust1.bed main.bigWig /dev/stdout -cluster=2
-3	3.000000	5.500000
-2	4.000000	5.500000
-1	7.333333	1.500000
1	5.000000	2.500000
2	3.666667	4.000000
3	2.666667	4.500000

It's plain to see that the first column of the output has a low-high-low profile and the second column has an high-low-high profile. To confirm the clusters are the ones we expect (also expecting R5 to be thrown out), we can just run aggregate twice with two separate beds:

$ egrep "R1|R4|R6" agg_clust1.bed > tmp.bed
$ bwtool agg 3:3 tmp.bed main.bigWig /dev/stdout
-3	3.000000
-2	4.000000
-1	7.333333
1	5.000000
2	3.666667
3	2.666667
$ egrep "R2|R3" agg_clust1.bed > tmp.bed
$ bwtool agg 3:3 tmp.bed main.bigWig /dev/stdout
-3	5.500000
-2	5.500000
-1	1.500000
1	2.500000
2	4.000000
3	4.500000

To further confirm the clusters are correct, the -cluster-sets option can be used. This will report which cluster each region is assigned to in a file, formatted in the following tripartite way: . This looks like:

$ bwtool agg 3:3 agg_clust1.bed main.bigWig /dev/null -cluster=2 -cluster-sets=/dev/stdout
chr	21	25	R5	0	+	-1	chr	20	26	R5	0	+
chr	27	35	R6	0	-	0	chr	28	34	R6	0	-
chr	2	6	R1	0	+	0	chr	1	7	R1	0	+
chr	14	21	R4	0	+	0	chr	14	20	R4	0	+
chr	4	8	R2	0	-	1	chr	3	9	R2	0	-
chr	11	15	R3	0	+	1	chr	10	16	R3	0	+

Perhaps not very intuitively, the clusters are numbered from 0 to n-1 and not from 1 to n. Regions in the -1 cluster are the ones thrown out.

Clone this wiki locally