Last year (2016) in Woods Hole, MA we used our lab's MinION to sequence a new bacterial species isolated by Rebecca Mickol in the Microbial Diversity Course at the Marine Biological Lab.
If you're interested, you can read a blog post about it.
The goals of this tutorial are to:
- Assess an Oxford Nanopore Technologies (ONT) sequencing run on the MinION
- Create an assembly from raw fastq files
- Evaluate the assembly
Start a blank Jetstream instance (m1.medium) with CPU: 6, Mem: 16, Disk 60 GB RAM:
Copy/paste to update and install software on your new instance:
sudo apt-get update && \
sudo apt-get -y install build-essential ruby screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core \
r-cran-gplots python-matplotlib sysstat python-virtualenv \
python-setuptools cmake cython libhdf5-serial-dev \
python-numpy python-scipy python-pandas python-pandas-lib \
python-biopython parallel python-h5py python-tornado \
bioperl libxml-simple-perl default-jre gdebi-core r-base gnuplot
To install some of the software, we will use Linux brew:
cd
sudo mkdir /home/linuxbrew
sudo chown $USER:$USER /home/linuxbrew
git clone https://github.com/Linuxbrew/brew.git /home/linuxbrew/.linuxbrew
echo 'export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
brew tap homebrew/science
Now install canu, samtools, bwa:
brew install jdk canu bwa samtools
Install poretools:
sudo pip install poretools
Install prokka:
cd
git clone https://github.com/tseemann/prokka.git
echo 'export PATH=$PWD/prokka/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
prokka --setupdb
prokka --version
Install assembly-stats:
cd
git clone https://github.com/sanger-pathogens/assembly-stats.git
cd assembly-stats/
mkdir build
cd build
cmake ..
make
make test
sudo make install
Install RStudio:
cd
wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
sudo gdebi -n rstudio-server-1.0.143-amd64.deb
Change your password for RStudio:
cd
sudo passwd <account name>
Install mummer and gnuplot v4:
mummer
cd
wget https://github.com/mummer4/mummer/releases/download/v3.9.4alpha/mummer-3.9.4alpha.tar.gz
tar xvzf mummer-3.9.4alpha.tar.gz
cd mummer-3.9.4alpha
./configure
make
sudo make install
echo 'export LD_LIBRARY_PATH="/usr/local/lib"' >> ~/.bashrc
source ~/.bashrc
Our data were collected from three R9.4 flowcells in 2016. Download a subset of these reads:
cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_subset.zip
unzip ectocooler_subset.zip
ls ectocooler_subset/
You should see a bunch of .fast5 files.
This is only a subset of the reads from the whole run. (Click here for stats from the full data set.)
The MinION instrument collects raw data in .fast5 format. The local basecalling software, Albacore (sorry, link requires ONT MAP login access), converts .fast5 files into .fastq or .fasta files. Poretools is another method for converting .fast5 files into .fastq files.
Convert your .fast5 to .fastq and .fasta files:
cd
directory="ectocooler_subset/"
poretools fastq $directory > ectocooler_subset.fastq
poretools fasta $directory > ectocooler_subset.fasta
Take a look at the reads:
head ectocooler_subset.fastq
Copy a few reads and use the web blastn to try to identify what species or closest taxa these data came from. What do you come up with?
Download the full dataset, fastq and fasta files:
cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fastq
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fasta
assembly-stats ectocooler_subset.fastq
- How many reads are there total?
- What is the longest read?
Assess the full data set:
assembly-stats ectocooler_all_2D.fastq
How does the full data set compare to the subset?
How does it compare to these results from three R9.5 flowcells of killifish (Fundulus olivaceus) data collected at Porecamp in 2017?
[ljcohen@globus-00 fastq2]$ /mnt/home/ljcohen/bin/assembly-stats/assembly-stats porecamp_killifish2.fastq
stats for porecamp_killifish2.fastq
sum = 4962626713, n = 740248, ave = 6704.01, largest = 973552
N50 = 12726, n = 117202
N60 = 10357, n = 160433
N70 = 8098, n = 214460
N80 = 5724, n = 286845
N90 = 3229, n = 400661
N100 = 5, n = 740248
N_count = 0
Gaps = 0
Run this to make a file with all read lengths:
cat ectocooler_all_2D.fastq | paste - - - - | awk -F"\t" '{print length($2)}' > lengths.txt
Grab lengths from the killifish runs to compare:
wget https://raw.githubusercontent.com/ngs-docs/angus/2017/killifish_lengths.txt
Start RStudio server:
echo My RStudio Web server is running at: http://$(hostname):8787/
Run these commands in RStudio:
setwd("~/")
lengths <- read.table("lengths.txt")[,1]
hist(lengths, xlim=c(0,30000), breaks=100, col="red")
killifish_lengths <- read.table("killifish_lengths.txt")[,1]
hist(killifish_lengths, xlim=c(0,90000), breaks=1000, col="blue")
We will use the program canu to assemble the reads. The full data set will take several hours. So, we will only assemble the subset. Which data are better to use, 2D or a mixture of template and compliment? Pick one, assemble, and compare with your neighbor.
canu \
-p ecto_subset -d ectocooler_assembly \
genomeSize=3.0m \
-nanopore-raw ectocooler_subset.fastq
This will take ~15 min. So, go ahead and grab some coffee!
After the assembly has finished, the output file you are interested is ecto_subset.contigs.fasta
. Let's copy that file to the home directory:
cd
cp ectocooler_assembly/ecto_subset.contigs.fasta .
Assess the assembly:
assembly-stats ecto_subset.contigs.fasta
How many contigs do you have?
Download the pre-assembled contigs from the full data set:
wget https://raw.githubusercontent.com/ljcohen/dib_ONP_MinION/master/Ectocooler/ecto.contigs.fasta
assembly-stats ecto.contigs.fasta
Compare this with your assembly. How are they different?
Before running mummerplot
, run this:
sudo awk -i inplace '/\$P_FORMAT .=/ { print "#" $0; next } { print }' $(which mummerplot)
because reasons
Quick assembly vs. assembly of full reads:
nucmer --maxmatch -c 100 -p ectocooler ecto.contigs.fasta ecto_subset.contigs.fasta
mummerplot --fat --filter --png --large -p ectocooler ectocooler.delta
Grab closely-related reference genome of Tenacibaculum dicentrarchi:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/483/385/GCF_001483385.1_ASM148338v1/GCF_001483385.1_ASM148338v1_genomic.fna.gz
gunzip GCF_001483385.1_ASM148338v1_genomic.fna.gz
Do all-by-all comparison of reference genome vs. ecocooler full reads assembly
nucmer --maxmatch -c 100 -p ectocooler_Tdicent ecto.contigs.fasta GCF_001483385.1_ASM148338v1_genomic.fna
mummerplot --fat --filter --png --large -p ectocooler_Tdicent ectocooler_Tdicent.delta
This week, you have used Torsten's program, prokka to annotate a bacterial genome. We will use this to annotate these new contigs we have assembled.
prokka --outdir anno_subset --prefix ecto_subset_prokka ecto_subset.contigs.fasta
Check the output:
cat ./anno_subset/ecto_subset_prokka.txt
- How many genes did Prokka find in the contigs?
- Does this meet your expectations?
Use this command to run prokka on the contigs assembled with the full data set:
prokka --outdir anno_full --prefix ecto_full_prokka ecto.contigs.fasta
Check the output:
cat ./anno_full/ecto_full_prokka.txt
- https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Evaluating-Assemblies
- http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html
- https://github.com/ljcohen/dib_ONP_MinION
- http://nbviewer.jupyter.org/github/arq5x/poretools/blob/master/poretools/ipynb/test_run_report.ipynb
- http://porecamp.github.io/2015/timetable.html
- http://porecamp.github.io/2016/
- http://porecamp.github.io/texas/
This is a modified lesson by Nick Loman from 2015, contributions by Torsten Seemann, Harriet Alexander, Mick Watson, Danny Miller, Jon Badalamenti, and Lisa Cohen.
If you're interested in read length distributions from R9.5 data:
[ljcohen@dev-intel14 killifish_assembly]$ cat killifish.report
[CORRECTION/READS]
--
-- In gatekeeper store 'correction/killifish.gkpStore':
-- Found 614587 reads.
-- Found 4883910982 bases (4.43 times coverage).
--
-- Read length histogram (one '*' equals 4297.12 reads):
-- 0 4999 300799 **********************************************************************
-- 5000 9999 145553 *********************************
-- 10000 14999 81269 ******************
-- 15000 19999 40302 *********
-- 20000 24999 20366 ****
-- 25000 29999 11161 **
-- 30000 34999 6305 *
-- 35000 39999 3560
-- 40000 44999 2043
-- 45000 49999 1217
-- 50000 54999 734
-- 55000 59999 463
-- 60000 64999 297
-- 65000 69999 165
-- 70000 74999 103
-- 75000 79999 61
-- 80000 84999 33
-- 85000 89999 30
-- 90000 94999 15
-- 95000 99999 17
-- 100000 104999 13
-- 105000 109999 5
-- 110000 114999 5
-- 115000 119999 5
-- 120000 124999 8
-- 125000 129999 2
-- 130000 134999 8
-- 135000 139999 6
-- 140000 144999 4
-- 145000 149999 0
-- 150000 154999 2
-- 155000 159999 3
-- 160000 164999 2
-- 165000 169999 2
-- 170000 174999 0
-- 175000 179999 1
-- 180000 184999 0
-- 185000 189999 4
-- 190000 194999 0
-- 195000 199999 2
-- 200000 204999 0
-- 205000 209999 1
-- 210000 214999 1
-- 215000 219999 1
-- 220000 224999 0
-- 225000 229999 0
-- 230000 234999 0
-- 235000 239999 1
-- 240000 244999 1
-- 245000 249999 1
-- 250000 254999 0
-- 255000 259999 1
-- 260000 264999 0
-- 265000 269999 1
-- 270000 274999 0
-- 275000 279999 0
-- 280000 284999 1
-- 285000 289999 1
-- 290000 294999 1
-- 295000 299999 0
-- 300000 304999 0
-- 305000 309999 0
-- 310000 314999 0
-- 315000 319999 0
-- 320000 324999 0
-- 325000 329999 0
-- 330000 334999 1
-- 335000 339999 1
-- 340000 344999 0
-- 345000 349999 0
-- 350000 354999 0
-- 355000 359999 0
-- 360000 364999 0
-- 365000 369999 0
-- 370000 374999 0
-- 375000 379999 2
-- 380000 384999 0
-- 385000 389999 0
-- 390000 394999 0
-- 395000 399999 0
-- 400000 404999 0
-- 405000 409999 0
-- 410000 414999 0
-- 415000 419999 1
-- 420000 424999 1
-- 425000 429999 0
-- 430000 434999 0
-- 435000 439999 0
-- 440000 444999 0
-- 445000 449999 0
-- 450000 454999 0
-- 455000 459999 0
-- 460000 464999 0
-- 465000 469999 0
-- 470000 474999 1
-- 475000 479999 0
-- 480000 484999 0
-- 485000 489999 0
-- 490000 494999 0
-- 495000 499999 0
-- 500000 504999 1
-- 505000 509999 0
-- 510000 514999 0
-- 515000 519999 0
-- 520000 524999 0
-- 525000 529999 0
-- 530000 534999 0
-- 535000 539999 0
-- 540000 544999 0
-- 545000 549999 0
-- 550000 554999 0
-- 555000 559999 0
-- 560000 564999 0
-- 565000 569999 0
-- 570000 574999 0
-- 575000 579999 0
-- 580000 584999 0
-- 585000 589999 0
-- 590000 594999 0
-- 595000 599999 0
-- 600000 604999 0
-- 605000 609999 0
-- 610000 614999 0
-- 615000 619999 0
-- 620000 624999 0
-- 625000 629999 0
-- 630000 634999 0
-- 635000 639999 0
-- 640000 644999 0
-- 645000 649999 0
-- 650000 654999 0
-- 655000 659999 0
-- 660000 664999 0
-- 665000 669999 0
-- 670000 674999 0
-- 675000 679999 0
-- 680000 684999 0
-- 685000 689999 0
-- 690000 694999 0
-- 695000 699999 0
-- 700000 704999 0
-- 705000 709999 0
-- 710000 714999 0
-- 715000 719999 0
-- 720000 724999 0
-- 725000 729999 0
-- 730000 734999 0
-- 735000 739999 0
-- 740000 744999 0
-- 745000 749999 0
-- 750000 754999 0
-- 755000 759999 0
-- 760000 764999 0
-- 765000 769999 0
-- 770000 774999 0
-- 775000 779999 0
-- 780000 784999 0
-- 785000 789999 0
-- 790000 794999 0
-- 795000 799999 0
-- 800000 804999 0
-- 805000 809999 0
-- 810000 814999 0
-- 815000 819999 0
-- 820000 824999 0
-- 825000 829999 0
-- 830000 834999 0
-- 835000 839999 0
-- 840000 844999 0
-- 845000 849999 0
-- 850000 854999 0
-- 855000 859999 0
-- 860000 864999 0
-- 865000 869999 0
-- 870000 874999 0
-- 875000 879999 0
-- 880000 884999 0
-- 885000 889999 0
-- 890000 894999 0
-- 895000 899999 1
-- 900000 904999 1
-- 905000 909999 0
-- 910000 914999 0
-- 915000 919999 0
-- 920000 924999 0
-- 925000 929999 0
-- 930000 934999 0
-- 935000 939999 0
-- 940000 944999 0
-- 945000 949999 0
-- 950000 954999 0
-- 955000 959999 0
-- 960000 964999 0
-- 965000 969999 0
-- 970000 974999 1
[CORRECTION/MERS]
--
-- 16-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 513587462 *******************************************************************--> 0.3659 0.1054
-- 2- 2 302853120 ********************************************************************** 0.5817 0.2296
-- 3- 4 297949496 ******************************************************************** 0.7116 0.3419
-- 5- 7 166204607 ************************************** 0.8485 0.5152
-- 8- 11 72366333 **************** 0.9314 0.6771
-- 12- 16 28375480 ****** 0.9700 0.7901
-- 17- 22 11291731 ** 0.9861 0.8576
-- 23- 29 4915132 * 0.9929 0.8966
-- 30- 37 2381975 0.9960 0.9200
-- 38- 46 1279754 0.9975 0.9351
-- 47- 56 750151 0.9983 0.9454
-- 57- 67 471014 0.9988 0.9529
-- 68- 79 316003 0.9992 0.9587
-- 80- 92 217116 0.9994 0.9633
-- 93- 106 152316 0.9995 0.9671
-- 107- 121 108860 0.9996 0.9701
-- 122- 137 79676 0.9997 0.9726
-- 138- 154 60884 0.9998 0.9747
-- 155- 172 47146 0.9998 0.9765
-- 173- 191 36929 0.9998 0.9780
-- 192- 211 29814 0.9999 0.9794
-- 212- 232 23866 0.9999 0.9806
-- 233- 254 19555 0.9999 0.9817
-- 255- 277 15830 0.9999 0.9826
-- 278- 301 13153 0.9999 0.9835
-- 302- 326 11139 0.9999 0.9843
-- 327- 352 9446 0.9999 0.9850
-- 353- 379 7738 0.9999 0.9856
-- 380- 407 6591 1.0000 0.9862
-- 408- 436 5681 1.0000 0.9867
-- 437- 466 5128 1.0000 0.9872
-- 467- 497 4612 1.0000 0.9877
-- 498- 529 3954 1.0000 0.9882
-- 530- 562 3413 1.0000 0.9886
-- 563- 596 3177 1.0000 0.9890
-- 597- 631 2742 1.0000 0.9893
-- 632- 667 2515 1.0000 0.9897
-- 668- 704 2268 1.0000 0.9900
-- 705- 742 1991 1.0000 0.9903
-- 743- 781 1873 1.0000 0.9906
-- 782- 821 1654 1.0000 0.9909
--
-- 620808 (max occurrences)
-- 4361104715 (total mers, non-unique)
-- 890053109 (distinct mers, non-unique)
-- 513587462 (unique mers)