diff --git a/Makefile b/Makefile index cb6805a12..57460b891 100644 --- a/Makefile +++ b/Makefile @@ -87,6 +87,7 @@ BOOK_SRC = \ # All source pages (including things not in the book). PAGES_SRC = \ contents.md \ + $(wildcard novice/capstones/*/*.md) \ $(wildcard intermediate/python/*.md) \ $(BOOK_SRC) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md new file mode 100644 index 000000000..3d53a645a --- /dev/null +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -0,0 +1,853 @@ +--- +layout: lesson +root: ../../.. +title: Capstone example Python and Bedtools +--- + +Synopsis +======== + +Our goal is to work through examples that demonstrate how to explore, process and manipulate bed files with the `bedtools` software package. + +Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. If you are working with published data from other groups (for example, transcription factor binding sites identified by ChIP-seq) you will likely find that the data is provided in bed format. One of the most commonly asked questions of bioinformaticians is whether two different sets of features overlap with each other, so today we'll be showing you how to answer that question. + +This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). + +During this lesson, you'll be working on some scripts and data which are in their own github repository. In our git lesson, we covered the situation where you have been given write access to a repository. Now we're going to cover what to do if you don't have write access to a repository - this is called forking. + +Forking the github repository +============================= + +The data and code that you need to work on are located at . You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface: + +The Fork Button + +If you visit the [bedtools-example repository](http://github.com/rbeagrie/bedtools-example) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example. + +Setup +===== + +From the Terminal, clone your fork of the repository into a new folder on your Desktop called "bedtools-example". + +~~~ +$ cd ~/Desktop +$ git clone git@github.com:YOUR_USERNAME_GOES_HERE/bedtools-example.git +~~~ +{:class="in"} + + +Navigate into that directory. + +~~~ +$ cd bedtools-example +~~~ +{:class="in"} + +Let's take a look at what files we now have. + +~~~ +$ ls +~~~ +{:class="in"} + +~~~ +data scripts +~~~ +{:class="out"} + +So we have two directories, one containing data and the other containing some scripts. We're going to start with the data folder, so lets look at what's in there. + +~~~ +$ ls data +~~~ +{:class="in"} + +~~~ +cpg.bed exons.bed genome.txt gwas.bed +~~~ +{:class="out"} + + +What are these files? +--------------------- + +The data directory should contain 3 BED files and 1 genome file. + +`data/cpg.bed` represents CpG islands in the human genome; `data/exons.bed` represents RefSeq exons from human genes; and `data/gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). `data/genome.txt` file contains the lengths of all the human chromosomes. + +These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). + +Introducing BedTools +==================== + +All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`: + +~~~ +head data/cpg.bed +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 +chr1 135124 135563 CpG:_30 +chr1 327790 328229 CpG:_29 +chr1 437151 438164 CpG:_84 +chr1 449273 450544 CpG:_99 +chr1 533219 534114 CpG:_94 +chr1 544738 546649 CpG:_171 +chr1 713984 714547 CpG:_60 +chr1 762416 763445 CpG:_115 +chr1 788863 789211 CpG:_28 +~~~ +{:class="out"} + +The fourth column here is the name of each CpG island - which in this case is just a numerical ID assigned by the folks at UCSC. + +The bedtools help +----------------- + +You should have installed bedtools as part of the setup for this bootcamp. In which case, you can bring up the help by typing: + +~~~ +$ bedtools +~~~ +{:class="in"} + +If that didn't work, you probably don't have bedtools installed properly. Have a look at for some help. + +If it did work, you should see that there are multiple "subcommands". For bedtools to work you must tell it which subcommand you want to use. Examples: + +~~~ +$ bedtools intersect +$ bedtools merge +$ bedtools subtract +~~~ +{:class="in"} + +This should bring up the specific help text associated with each tool. + +bedtools "intersect" +==================== + +The `intersect` command is the workhorse of the `bedtools` suite. It compares two bed files and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common). + +![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png) + +Default behavior +---------------- +By default, `intersect` reports the intervals that represent overlaps between your two files. This is represented by the top line of the image above (the one which reads "A intersect B"). As an example, you may know that CpG islands often mark transcription start sites. Imagine that we want to find all the exons in the genome which might contain a promotor. One way of doing this would be to identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons. + +~~~ +$ bedtools intersect -a data/cpg.bed -b data/exons.bed | head +~~~ +{:class="in"} +~~~ +chr1 29320 29370 CpG:_116 +chr1 135124 135563 CpG:_30 +chr1 327790 328229 CpG:_29 +chr1 327790 328229 CpG:_29 +chr1 327790 328229 CpG:_29 +chr1 713984 714068 CpG:_60 +chr1 762970 763155 CpG:_115 +chr1 763177 763229 CpG:_115 +chr1 762970 763155 CpG:_115 +chr1 762970 763155 CpG:_115 +~~~ +{:class="out"} + +Notice how the output lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get all the extra columns from the exons.bed file, including the name (fourth column). The fifth column is a score (not used in this context) and the sixth is the strand. + +~~~ +$ bedtools intersect -a data/exons.bed -b data/cpg.bed | head -5 +~~~ +{:class="in"} +~~~ +chr1 50489434 50489626 NM_032785_exon_13_0_chr1_50489435_r 0 - +chr1 16767166 16767348 NM_018090_exon_0_0_chr1_16767167_f 0 + +chr1 33546713 33546895 NM_052998_exon_0_0_chr1_33546714_f 0 + +chr1 33546988 33547109 NM_052998_exon_1_0_chr1_33546989_f 0 + +chr1 33547201 33547243 NM_052998_exon_2_0_chr1_33547202_f 0 + +chr1 16767166 16767270 NM_001145278_exon_0_0_chr1_16767167_f 0 + +chr1 16767166 16767348 NM_001145277_exon_0_0_chr1_16767167_f 0 + +chr1 8384389 8384719 NM_001080397_exon_0_0_chr1_8384390_f 0 + +chr1 8385895 8386102 NM_001080397_exon_2_0_chr1_8385878_f 0 + +chr1 8390268 8390800 NM_001080397_exon_3_0_chr1_8390269_f 0 + +~~~ +{:class="out"} + + +Reporting the original feature in each file. +-------------------------------------------- +The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected. This is represented by the second line of the image above (the one which reads "A intersect B (-wa)"). A + +~~~ +$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + +~~~ +{:class="out"} + +The first three columns of the output show us where there are overlaps between exons and cpg islands. The fourth column contains the name of the CpG island which is overlapping. The rest of the output is the full line from the exons.bed file. + +Counting the number of overlapping features. +-------------------------------------------- +We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. + +~~~ +$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c | head +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 1 +chr1 135124 135563 CpG:_30 1 +chr1 327790 328229 CpG:_29 3 +chr1 437151 438164 CpG:_84 0 +chr1 449273 450544 CpG:_99 0 +chr1 533219 534114 CpG:_94 0 +chr1 544738 546649 CpG:_171 0 +chr1 713984 714547 CpG:_60 1 +chr1 762416 763445 CpG:_115 10 +chr1 788863 789211 CpG:_28 9 +~~~ +{:class="out"} + +Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`. If we reverse the order of the two files again, we get a different output: + +~~~ +$ bedtools intersect -a data/exons.bed -b data/cpg.bed -c | head +~~~ +{:class="in"} +~~~ +chr1 66999824 67000051 NM_032291_exon_0_0_chr1_66999825_f 0 + 0 +chr1 67091529 67091593 NM_032291_exon_1_0_chr1_67091530_f 0 + 0 +chr1 67098752 67098777 NM_032291_exon_2_0_chr1_67098753_f 0 + 0 +chr1 67101626 67101698 NM_032291_exon_3_0_chr1_67101627_f 0 + 0 +chr1 67105459 67105516 NM_032291_exon_4_0_chr1_67105460_f 0 + 0 +chr1 67108492 67108547 NM_032291_exon_5_0_chr1_67108493_f 0 + 0 +chr1 67109226 67109402 NM_032291_exon_6_0_chr1_67109227_f 0 + 0 +chr1 67126195 67126207 NM_032291_exon_7_0_chr1_67126196_f 0 + 0 +chr1 67133212 67133224 NM_032291_exon_8_0_chr1_67133213_f 0 + 0 +chr1 67136677 67136702 NM_032291_exon_9_0_chr1_67136678_f 0 + 0 +~~~ +{:class="out"} + +Now we get one line for every exon. In this case, the first ten exons actually don't overlap with any CpG islands. + +Require a minimal fraction of overlap. +-------------------------------------------- +Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported. + +Let's be more strict and require 50% of overlap. + +~~~ +$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c -f 0.50 | head +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 0 +chr1 135124 135563 CpG:_30 1 +chr1 327790 328229 CpG:_29 3 +chr1 437151 438164 CpG:_84 0 +chr1 449273 450544 CpG:_99 0 +chr1 533219 534114 CpG:_94 0 +chr1 544738 546649 CpG:_171 0 +chr1 713984 714547 CpG:_60 0 +chr1 762416 763445 CpG:_115 0 +chr1 788863 789211 CpG:_28 7 +~~~ +{:class="out"} + +If you compare this to the result you see without the -f flag you will see that the very first CpG island is now reported as having 0 overlaps. This means it must have overlapped an exon by less than 50%. + +Find features that DO NOT overlap +-------------------------------------------- +Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. This is represented in the last line of the image above. + +~~~ +$ bedtools intersect -a cpg.bed -b exons.bed -v | head +~~~ +{:class="in"} +~~~ +chr1 437151 438164 CpG:_84 +chr1 449273 450544 CpG:_99 +chr1 533219 534114 CpG:_94 +chr1 544738 546649 CpG:_171 +chr1 801975 802338 CpG:_24 +chr1 805198 805628 CpG:_50 +chr1 839694 840619 CpG:_83 +chr1 844299 845883 CpG:_153 +chr1 912869 913153 CpG:_28 +chr1 919726 919927 CpG:_15 +~~~ +{:class="out"} + +How many base pairs of overlap were there? +------------------------------------------ +The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. + +~~~ +$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +~~~ +{:class="out"} + +Do exons and CpG islands overlap significantly? +=============================================== + +OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do they overlap more than we would expect by random chance? + +Going back to base pairs overlap +-------------------------------- + +In order to start answering this question, we need to get the base pairs of overlap between all of the elements. From the section above, we know we can get this using the -wo flag: + +~~~ +$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +~~~ +{:class="out"} + +Can you spot anything odd about this output? Lines 3-5 indicate that one of our CpG islands is overlapping with three alternative versions of the same exon. For the purposes of our analysis though, this should really count as a single overlap. What we need to do is merge overlapping exons in exons.bed together. + +bedtools "merge" +==================== + +Many datasets of genomic features have many individual features that overlap one another (e.g. alignments from a ChiP seq experiment). It is often useful to just combine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you. + +![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png) + + +Input must be sorted +-------------------- + +The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM. + +If you run the merge command on a file that is not sorted, you will get an error. For example: + +~~~ +bedtools merge -i exons.bed +~~~ +{:class="in"} +~~~ +chr1 66999824 67000051 +chr1 67091529 67091593 +chr1 67098752 67098777 +chr1 67101626 67101698 +chr1 67105459 67105516 +chr1 67108492 67108547 +chr1 67109226 67109402 +chr1 67126195 67126207 +chr1 67133212 67133224 +chr1 67136677 67136702 +chr1 67137626 67137678 +chr1 67138963 67139049 +chr1 67142686 67142779 +chr1 67145360 67145435 +chr1 67147551 67148052 +chr1 67154830 67154958 +chr1 67155872 67155999 +chr1 67161116 67161176 +chr1 67184976 67185088 +chr1 67194946 67195102 +chr1 67199430 67199563 +chr1 67205017 67205220 +chr1 67206340 67206405 +chr1 67206954 67207119 +~~~ +{:class="in"} +~~~ +ERROR: input file: (exons.bed) is not sorted by chrom then start. + The start coordinate at line 26 is less than the start at line 25 +~~~ +{:class="err"} + +To correct this, you need to sort your BED. We could use the UNIX `sort` utility that we already covered, which would look like this: + +~~~ +$ sort -k1,1 -k2,2n exons.bed | head +~~~ +{:class="in"} +~~~ +chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + +chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + +chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + +chr1 14361 14829 NR_024540_exon_0_0_chr1_14362_r 0 - +chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - +~~~ +{:class="out"} + +But if you can't remember the right arguments to use for sort, the command `bedtools sort` will also do what we want: + +~~~ +$ bedtools sort -i exons.bed | head +~~~ +{:class="in"} +~~~ +chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + +chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + +chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + +chr1 14361 14829 NR_024540_exon_0_0_chr1_14362_r 0 - +chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - +~~~ +{:class="out"} + +Now instead of saving the sorted results to a separate file, we can actually pipe the result into bedtools merge: + +~~~ +$ bedtools sort -i exons.bed | bedtools merge | head +~~~ +{:class="in"} +~~~ +chr1 11873 12227 +chr1 12612 12721 +chr1 13220 14829 +chr1 14969 15038 +chr1 15795 15947 +~~~ +{:class="out"} + +The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. Notice how the merged file has less columns than the original file - we've lost the name of the exon and the strand. + +Let's save the output of bedtools merge to a file and recalculate the overlap: + +~~~ +$ bedtools sort -i exons.bed | bedtools merge > exons.merged.bed +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 84 +chr1 762416 763445 CpG:_115 chr1 761585 762902 486 +chr1 762416 763445 CpG:_115 chr1 762970 763155 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 52 +chr1 788863 789211 CpG:_28 chr1 788770 794826 348 +chr1 854765 854973 CpG:_16 chr1 854714 854817 52 +chr1 858970 861632 CpG:_257 chr1 861120 861180 60 +~~~ +{:class="out"} + +This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. Notice how the repetition in line 3-5 is now gone. + +#Challenge: stranded merge + +All we actually need from the output of `bedtools intersect` is the overlap column. We can use the unix `cut` command to extract just the 8th field (or column) using the -f flag. + +~~~ +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | head +~~~ +{:class="in"} +~~~ +50 +439 +439 +84 +486 +185 +52 +348 +52 +60 +~~~ +{:class="out"} + +Now we need a program to read in this column of numbers and give us the sum. Hopefully this should ring some bells - The `readings.py` program we wrote earlier will do this with a few small changes. + +~~~ +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum +~~~ +{:class="in"} +~~~ +7235722.0 +~~~ +{:class="out"} + +Let's save this result to a file so that we remember it later: + +~~~ +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum > cpg_result.txt +~~~ +{:class="in"} + + +So we know what the overlap is between CpG islands and exons. How do we know whether this is significant? How do we know what we would expect by random chance? + +BedTools includes a command called shuffle which will randomize one of our input files. Let's try randomizing the CpG islands: + +~~~ +bedtools shuffle -i cpg.bed -g genome.txt | head +~~~ +{:class="in"} +~~~ +chr7 154365006 154366081 CpG:_116 +chr2 104893603 104894042 CpG:_30 +chr11 11786369 11786808 CpG:_29 +chr1 116263073 116264086 CpG:_84 +chr10 10674197 10675468 CpG:_99 +chrX 7680868 7681763 CpG:_94 +chr6 34093178 34095089 CpG:_171 +chr1 88263374 88263937 CpG:_60 +chr1 150416635 150417664 CpG:_115 +chr6 39650957 39651305 CpG:_28 +~~~ +{:class="out"} + +Notice how the CpG islands have been shuffled to random chromosomes. This means that each chromosome will have roughly the same number of CpG islands, regardless of length. A better random model would be to keep the distribution of CpG islands over chromosomes and just shuffle the positions of each CpG on each chromosome. We can do this using the -chrom flag. + +~~~ +$ bedtools shuffle -chrom -i cpg.bed -g genome.txt | head +~~~ +{:class="in"} +~~~ +chr1 2367858 2368933 CpG:_116 +chr1 225323760 225324199 CpG:_30 +chr1 37368198 37368637 CpG:_29 +chr1 13872987 13874000 CpG:_84 +chr1 91280571 91281842 CpG:_99 +chr1 74656135 74657030 CpG:_94 +chr1 64079046 64080957 CpG:_171 +chr1 54034535 54035098 CpG:_60 +chr1 237633659 237634688 CpG:_115 +chr1 51954093 51954441 CpG:_28 +~~~ +{:class="out"} + +Now let's save that result to a file, and count the overlap for the shuffled features. + +~~~ +$ bedtools shuffle -chrom -i cpg.bed -g genome.txt > cpg.shuffled.bed +$ bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum +~~~ +{:class="in"} +~~~ +681980.0 +~~~ +{:class="out"} + +Great, so our shuffled file overlaps much less with exons than real CpG islands do. How do we know if the difference is significant? One approach is to make lots of different randomly shuffled files, and count the overlap for each one. If we do this 100 times, and we never see an overlap that is as big as the one we see for the original `data/cpg.bed` file, then we can say that the overlap between exons and CpG islands is significant with a P-value (roughly) < 0.01. + +First, prepare a directory to hold our shuffled results: + +~~~ +$ mkdir shuffled_cpg +~~~ +{:class="in"} + +In the `scripts` folder, you should see a file called `scripts/do_shuffle.sh`. This is a bash script that contains a loop from 1 to 100. It runs bedtools shuffle 100 times, and saves the results to a folder. Have a look at the contents of the file: + +~~~ +$ cat scripts/do_shuffle.sh +~~~ +{:class="in"} +~~~ +#! /bin/bash + + +for i in {1..100} +do + bedtools shuffle -chrom -i data/cpg.bed -g data/genome.txt > shuffled_cpg/cpg.shuffled${i}.bed +done +~~~ +{:class="out"} + +Run the script, and then check to make sure you have some shiny new files in the `shuffled_cpg` folder: + +~~~ +$ scripts/do_suffle.sh +$ ls shuffled_cpg +~~~ +{:class="in"} +~~~ +cpg.shuffled100.bed cpg.shuffled21.bed cpg.shuffled33.bed cpg.shuffled45.bed cpg.shuffled57.bed cpg.shuffled69.bed cpg.shuffled80.bed cpg.shuffled92.bed +cpg.shuffled10.bed cpg.shuffled22.bed cpg.shuffled34.bed cpg.shuffled46.bed cpg.shuffled58.bed cpg.shuffled6.bed cpg.shuffled81.bed cpg.shuffled93.bed +cpg.shuffled11.bed cpg.shuffled23.bed cpg.shuffled35.bed cpg.shuffled47.bed cpg.shuffled59.bed cpg.shuffled70.bed cpg.shuffled82.bed cpg.shuffled94.bed +cpg.shuffled12.bed cpg.shuffled24.bed cpg.shuffled36.bed cpg.shuffled48.bed cpg.shuffled5.bed cpg.shuffled71.bed cpg.shuffled83.bed cpg.shuffled95.bed +cpg.shuffled13.bed cpg.shuffled25.bed cpg.shuffled37.bed cpg.shuffled49.bed cpg.shuffled60.bed cpg.shuffled72.bed cpg.shuffled84.bed cpg.shuffled96.bed +cpg.shuffled14.bed cpg.shuffled26.bed cpg.shuffled38.bed cpg.shuffled4.bed cpg.shuffled61.bed cpg.shuffled73.bed cpg.shuffled85.bed cpg.shuffled97.bed +cpg.shuffled15.bed cpg.shuffled27.bed cpg.shuffled39.bed cpg.shuffled50.bed cpg.shuffled62.bed cpg.shuffled74.bed cpg.shuffled86.bed cpg.shuffled98.bed +cpg.shuffled16.bed cpg.shuffled28.bed cpg.shuffled3.bed cpg.shuffled51.bed cpg.shuffled63.bed cpg.shuffled75.bed cpg.shuffled87.bed cpg.shuffled99.bed +cpg.shuffled17.bed cpg.shuffled29.bed cpg.shuffled40.bed cpg.shuffled52.bed cpg.shuffled64.bed cpg.shuffled76.bed cpg.shuffled88.bed cpg.shuffled9.bed +cpg.shuffled18.bed cpg.shuffled2.bed cpg.shuffled41.bed cpg.shuffled53.bed cpg.shuffled65.bed cpg.shuffled77.bed cpg.shuffled89.bed +cpg.shuffled19.bed cpg.shuffled30.bed cpg.shuffled42.bed cpg.shuffled54.bed cpg.shuffled66.bed cpg.shuffled78.bed cpg.shuffled8.bed +cpg.shuffled1.bed cpg.shuffled31.bed cpg.shuffled43.bed cpg.shuffled55.bed cpg.shuffled67.bed cpg.shuffled79.bed cpg.shuffled90.bed +cpg.shuffled20.bed cpg.shuffled32.bed cpg.shuffled44.bed cpg.shuffled56.bed cpg.shuffled68.bed cpg.shuffled7.bed cpg.shuffled91.bed +~~~ +{:class="out"} + +Now we need to calculate the overlap between `data/exons.bed` and each of our shuffled bed files. The command for the first file is: + +~~~ +$ bedtools intersect -a shuffled_cpg/cpg.shuffled1.bed -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +~~~ +{:class="in"} +~~~ +630845.0 +~~~ +{:class="out"} + +See if you can adapt this command and turn it into a for loop which runs over all of our shuffled results. + +~~~ +$ for filename in shuffled_cpg/* +> do +> bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +> done | head +~~~ +{:class="in"} +~~~ +651838.0 +655178.0 +633355.0 +636853.0 +668339.0 +651838.0 +655178.0 +633355.0 +636853.0 +668339.0 +~~~ +{:class="out"} + +Save that to a file: + +~~~ +$ for filename in shuffled_cpg/*; +> do +> bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +> done > shuffled_results.txt +~~~ +{:class="in"} + + +Plotting our results with python +-------------------------------- + +Also in your `scripts` directory, there should be a file called results.py. Open it and have a look. + +~~~ +$ cat scripts/results.py +~~~ +{:class="in"} +~~~ +import sys +import numpy as np + + +def main(): + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + +main() +~~~ +{:class="out"} + +Which prints: + +~~~ +python results.py shuffled_results.txt cpg_result.txt +~~~ +{:class="in"} +~~~ +Real data: +7235722.0 + +Shuffled data: +[ 640473. 638611. 644321. 642422. 634641. 660745. 645816. 635427. + 595615. 644916. 662412. 645489. 666141. 674597. 637099. 621515. + 647847. 653944. 641051. 685423. 658610. 686554. 618233. 655932. + 670053. 641412. 617865. 651189. 639806. 658123. 639381. 644652. + 667240. 689363. 627791. 625137. 635577. 643151. 616453. 633041. + 629223. 645209. 629201. 639179. 649602. 638849. 667827. 637550. + 652560. 647235. 710669. 626332. 689819. 646094. 631575. 633863. + 657661. 642538. 648691. 660292. 649780. 643179. 615128. 653863. + 610901. 613489. 624788. 680903. 617416. 654761. 663484. 672619. + 615038. 630960. 622431. 634951. 659257. 649931. 633901. 612363. + 639325. 683887. 656753. 690935. 661310. 692022. 633441. 644006. + 652187. 643428. 643711. 682621. 607918. 674996. 674909. 625800. + 642090. 662203. 667386. 660448.] +~~~ +{:class="out"} + +What happens if we call the script with no arguments? + +~~~ +$ python scripts/results.py +~~~ +{:class="in"} +~~~ +Traceback (most recent call last): + File "results.py", line 23, in + main() + File "results.py", line 7, in main + shuffled_results_file = sys.argv[1] +IndexError: list index out of range +~~~ +{:class="err"} + +Not very helpful. Let's add a usage description. Modify the python file so that it looks like this: + +~~~ +import sys +import numpy as np + +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. + +Usage: python results.py shuffled_results.txt real_results.txt +""" + +def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) +~~~ + +Let's try plotting a histogram of the shuffled data, first you'll need to add `from matplotlib import pyplot as plt` at the top of the file, then add this at the bottom of the process function. + +~~~ + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black') + plt.savefig('cpg_overlap_exons.png') +~~~ + +Have a look at the image you just made. At the moment we're only plotting the shuffled results, so let's add the real overlap as a vertical red line. Update your results.py file so that it looks like this: + +~~~ +import sys +import numpy as np +from matplotlib import pyplot as plt + +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. + +Usage: python results.py shuffled_results.txt real_results.txt +""" + +def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black') + plt.axvline(real_data, color='red', ls='--', lw=2) + plt.savefig('cpg_overlap_exons.png') + +main() +~~~ + +Finally, let's add some axis labels and a legend: + +~~~ +import sys +import numpy as np +from matplotlib import pyplot as plt + +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. + +Usage: python results.py shuffled_results.txt real_results.txt +""" + +def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black', label='Shuffled Data') + plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') + plt.xlabel('Nucleotides overlap') + plt.ylabel('Number of shuffled results') + plt.legend() + plt.savefig('cpg_overlap_exons.png') + +main() +~~~ + + diff --git a/novice/capstones/bedtools-python/img/git-fork-ui.png b/novice/capstones/bedtools-python/img/git-fork-ui.png new file mode 100644 index 000000000..fe9045047 Binary files /dev/null and b/novice/capstones/bedtools-python/img/git-fork-ui.png differ