The steps to build the de novo genome assembly include:
- Estimating the best k-mer length for the assembly
  - Program: KmerGenie v1.7051
- Performing the assembly
  - Program: Abyss v2.1.5
  - Program: Platanus v1.2.4
- Assembly clean-up and checking
  - Programs: xxx
Before assembly, we need to select an appropriate k-mer length. Rather than running multiple assemblies at different values of k, we will use KmerGenie v1.7051, which estimates the best k from k-mer abundance histograms of the reads.
The publication can be found here:
Chikhi R and Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1): 31–37. https://doi.org/10.1093/bioinformatics/btt310
Installation:
# Install KmerGenie
wget http://kmergenie.bx.psu.edu/kmergenie-1.7051.tar.gz
tar -zxvf kmergenie-1.7051.tar.gz
cd kmergenie-1.7051/
make
Run KmerGenie
Please see the script kmergenie.sh for job details. According to the KmerGenie documentation, only the reads used by the assembler to build contigs should be supplied, not reads used only for scaffolding (i.e. mate pairs).
# Make list of sequence files
ls /work/frr6/SHAD/MUSKET/PE500*.fq.gz > reads.list
# Run kmergenie
kmergenie \
reads.list \
--diploid \
-t 12 \
-o kmer
Parameters Explained:
- reads.list :: file listing the read files to include, one per line (gzipped files are accepted, as in the list built above)
- --diploid :: diploid organism
- -t :: number of cpus to use
- -o :: output file prefix
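If the glob passed to `ls` matches nothing, the shell still creates an empty reads.list, and the problem only surfaces once KmerGenie starts. A minimal sketch of a guard is shown here; `make_reads_list` is a hypothetical helper name, not part of KmerGenie, and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
# Sketch: build a read list and fail early if nothing matched.
# make_reads_list is a hypothetical helper, not a KmerGenie command.
make_reads_list() {
    dir=$1      # directory holding the read files
    pattern=$2  # shell glob, e.g. 'PE500*.fq.gz' (quote it when calling)
    out=$3      # list file to write, one path per line

    # find prints nothing (rather than an error) when no file matches
    find "$dir" -maxdepth 1 -type f -name "$pattern" | sort > "$out"

    # -s is true only if the list exists and is non-empty
    if [ ! -s "$out" ]; then
        echo "no files matching $pattern in $dir" >&2
        return 1
    fi
}
```

It could be used as `make_reads_list /work/frr6/SHAD/MUSKET 'PE500*.fq.gz' reads.list && kmergenie reads.list --diploid -t 12 -o kmer`, so that KmerGenie only runs when the list has at least one entry.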
See the Output HTML/PDF Files:
Summary of Results: KmerGenie reported:
From the website: "ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes." I have used it previously to assemble the dromedary and Florida panther genomes. The publications can be found here:
- Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, and Birol I (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Research 27: 768-777. https://doi.org/10.1101/gr.214346.116
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. (2009) ABySS: A parallel assembler for short read sequence data. Genome Research 19: 1117-1123. https://doi.org/10.1101/gr.089532.108
Installation:
# Install google sparsehash
git clone https://github.com/sparsehash/sparsehash.git
cd sparsehash/
./configure --prefix=/dscrhome/frr6/bin/
make
make install
# Install Abyss v2.1.5
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.1.5/abyss-2.1.5.tar.gz
tar -zxvf abyss-2.1.5.tar.gz
cd abyss-2.1.5/
./configure \
--prefix=/dscrhome/frr6/bin/ \
--with-sparsehash=/dscrhome/frr6/bin \
--with-mpi=/opt/apps/slurm/openmpi-2.0.0/
make
make install
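With a standard autotools layout, `make install` places executables under `PREFIX/bin`, so the prefix used above would likely put them in /dscrhome/frr6/bin/bin; this is an assumption worth checking on your own system. A small sketch for confirming that the tools are reachable before submitting a job follows; `require_tools` is a hypothetical helper name.

```shell
# Sketch: report any command that cannot be found on PATH.
# require_tools is a hypothetical helper; pass it any command names.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Returns 0 when every tool was found, 1 otherwise
    return "$missing"
}
```

For example, after `export PATH=/dscrhome/frr6/bin/bin:$PATH`, a job script could start with `require_tools abyss-pe ABYSS-P || exit 1`.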
Run Abyss
Please see the script abyss.sh for job details. ABySS was run several times, varying the parameters listed in the file parameters.
# Setup TMPDIR
export TMPDIR=/work/frr6
# This is a basic command for abyss:
abyss-pe \
k=${k} \
G=1300000000 \
S=${S} \
s=${s} \
np=12 \
v=-v \
name=Asap${n} \
lib='PE500' \
mp='MP5k MP10k' \
PE500='/work/frr6/SHAD/MUSKET/PE500_F.trimmed.uniq.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/PE500_R.trimmed.uniq.noMito.corrected.fq.gz' \
MP5k='/work/frr6/SHAD/MUSKET/MP5k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP5k_R.trimmed.uniq.unj.noMito.corrected.fq.gz' \
MP10k='/work/frr6/SHAD/MUSKET/MP10k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP10k_R.trimmed.uniq.unj.noMito.corrected.fq.gz'
Parameters Explained:
- k :: k-mer length for the assembly
- G :: genome size estimate for NG50, 1.3 pg (~1.3 Gb) for A. sapidissima
  - Taken from genomesize.com
  - Hinegardner R and Rosen DE (1972) Cellular DNA Content and the Evolution of Teleostean Fishes. American Naturalist 106(951): 621-644. https://www.jstor.org/stable/2459724
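- S :: minimum contig size required for building scaffolds
- s :: minimum unitig size required for building contigs
- c :: minimum mean k-mer coverage of a unitig (not set in the command above; default sqrt(median))
- np :: number of MPI processes
- -n :: print out the complete list of commands to run without executing them (dry run)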
- v=-v :: verbose output
- name :: name of assembly
- lib :: name of PE library
- mp :: name(s) of mate-pair libraries
- ... :: lists of the files in each library
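The command above takes k, S, s, and n from shell variables. As a rough sketch of how those could be filled from a parameter table, the functions below print one abyss-pe command per row without running anything. The table layout (`n k s S`, whitespace separated, `#` for comments) is an assumption made for illustration; the actual layout of the parameters file is not shown here. `abyss_cmd` and `abyss_sweep` are hypothetical names.

```shell
# Sketch: print one abyss-pe command per row of a parameter table.
# Assumed (hypothetical) row format: "n k s S", e.g. "1 97 200 1000-10000".
abyss_cmd() {
    # $1 = run label, $2 = k, $3 = s, $4 = S
    # The lib=, mp= and per-library file lists from the full command
    # are omitted here for brevity and would need to be appended.
    printf 'abyss-pe k=%s G=1300000000 S=%s s=%s np=12 v=-v name=Asap%s\n' \
        "$2" "$4" "$3" "$1"
}

abyss_sweep() {
    # Reads the table on stdin, skipping blank lines and '#' comments.
    while read -r n k s S; do
        case $n in ''|'#'*) continue ;; esac
        abyss_cmd "$n" "$k" "$s" "$S"
    done
}
```

Running `abyss_sweep < parameters` would then list the commands for review. Printing first, rather than executing, mirrors the dry-run idea behind `-n` and makes it easy to spot a malformed row before cluster time is spent.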
Summary of Abyss Assemblies at Various Parameters:
Parameter | AbyssA | AbyssB | Abyss1 | Abyss2 | Abyss3 | Abyss4 | Abyss5 | Abyss6 | Abyss7 | Abyss8 | Abyss9 | Abyss10 | Abyss11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k | 97 | 97 | 97 | 97 | 97 | 51 | 51 | 61 | 61 | 71 | 71 | 81 | 81 |
G | 1300000000 | 900000000 | 1300000000 | 1300000000 | 900000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 |
s | 1000 | 1000 | 200 | 500 | 1000 | 1000 | 200 | 1000 | 200 | 1000 | 200 | 1000 | 200 |
S | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 11000-15000 | 11000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 |
c | sqrt(median) | 2 | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) |
n:500 (sequences ≥ 500 bp) | 250,927 | 267,319 | 342,115 | 218,851 | 250,927 | 287,606 | 305,924 | 277,688 | 299,210 | 264,419 | 300,682 | 252,556 | 312,240 |
L50 (count) | 35,128 | 37,850 | 52,889 | 34,953 | 35,126 | 45,830 | 49,880 | 41,175 | 45,702 | 37,989 | 45,082 | 35,895 | 47,071 |
N50 (bp) | 5,839 | 5,190 | 5,156 | 6,085 | 5,839 | 2,709 | 2,776 | 3,456 | 3,588 | 4,322 | 4,348 | 5,185 | 4,882 |
Longest (bp) | 82,529 | 82,529 | 94,534 | 82,529 | 82,529 | 51,495 | 52,621 | 51,485 | 52,620 | 52,630 | 54,398 | 127,329 | 130,763 |
Total Size | 676.5 Mb | 665.7 Mb | 972.3 Mb | 696.3 Mb | 676.5 Mb | 508.2 Mb | 572.2 Mb | 559.8 Mb | 643.5 Mb | 601.4 Mb | 726.1 Mb | 635.2 Mb | 824.2 Mb |
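For readers less familiar with the statistics in this table: N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total assembly size, and L50 is the number of sequences needed to get there. The sketch below computes both from a plain list of lengths. It is an illustration of the definition, not the tool that produced the table, and `n50_l50` is a hypothetical name.

```shell
# Sketch: compute N50 and L50 from sequence lengths, one per line on stdin.
n50_l50() {
    # Sort longest first, then walk down until half the total is covered
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= half) {
                    printf "N50=%d L50=%d\n", len[i], i
                    exit
                }
            }
        }'
}
```

For the lengths 2, 2, 2, 3, 3, 4, 8, 8 (total 32), the two longest sequences already cover half the total, so `printf '%s\n' 2 2 2 3 3 4 8 8 | n50_l50` prints `N50=8 L50=2`. To reproduce figures like those above, the input would first need to be restricted to sequences of at least 500 bp, which I believe is the default cutoff used by ABySS's own statistics tool.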
Platanus is another assembler built specifically to assemble genomes from high coverage data. You can perform all the necessary read trimming/cleaning using platanus_trim, but we have already conservatively trimmed our dataset. From the website:
- Compared with other major assemblers, Platanus assembler was designed to provide good results when using higher coverage data. The optimal coverage depth for Platanus is approximately >80. In some procedures, Platanus attempts to assemble each haplotype sequence separately. In other words, Platanus requires twice as high coverage sequences as other assemblers. This is the main reason why Platanus requires high coverage. You can find more details on Supplemental Materials page 68–74 of our Genome Research publication.
- To get good statistical results, mate-pair library sequences are indispensable. We received many claims and questions of poor assembling results. However, in almost all cases, only paired-end sequences were inputted. Except in the case of assembling very simple and small size genomes, it is impossible to get good results without using a mate-pair library.
- The publication can be found here:
- Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, and Itoh T (2014) Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Research 24(8):1384-1395. https://doi.org/10.1101/gr.170720.113
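Since the guidance quoted above depends on coverage depth, it can be useful to estimate depth as total sequenced bases divided by genome size before choosing an assembler. The sketch below assumes uncompressed FASTQ with exactly four lines per record; both function names are hypothetical.

```shell
# Sketch: estimate sequencing depth as total bases / genome size.
fastq_bases() {
    # Sums the sequence-line lengths of a 4-line-per-record FASTQ on stdin.
    # printf "%.0f" keeps large totals out of scientific notation.
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

coverage_x() {
    # $1 = total bases, $2 = genome size in bp; prints depth to one decimal.
    awk -v b="$1" -v g="$2" 'BEGIN { printf "%.1f\n", b / g }'
}
```

Bases can be counted per file with `gzip -dc reads.fq.gz | fastq_bases` and summed across libraries. As an illustrative figure only (not a result from this project), `coverage_x 104000000000 1300000000` prints `80.0`, meaning 104 Gb of reads over a 1.3 Gb genome is about 80x.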
Installation:
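# Quote the URL (it contains '?') and name the output so the tar step finds it
wget -O Platanus_v1.2.4.tar.gz 'http://platanus.bio.titech.ac.jp/?ddownload=150'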
tar -zxvf Platanus_v1.2.4.tar.gz
cd Platanus_v1.2.4
make
# Or download pre-built binary directly from http://platanus.bio.titech.ac.jp/platanus-assembler/platanus-1-2-4
Run Platanus: Please see the script platanus.sh for job details.
# Make uncompressed reads files
zcat PE500_F.trimmed.uniq.noMito.corrected.fq.gz > F.fq
zcat PE500_R.trimmed.uniq.noMito.corrected.fq.gz > R.fq
# Run platanus
platanus \
assemble \
-o Asap1 \
-f [FR].fq \
-t 16 \
-m 200
# Remove read files
rm F.fq R.fq
Parameters Explained:
- -o :: name prefix for output files
- -f :: read files, uncompressed only (Platanus does not read gzipped files); the glob [FR].fq expands to F.fq R.fq
- -t :: number of threads
- -m :: memory limit in GB
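Because Platanus needs uncompressed input, the reads have to be unpacked first and removed afterwards. One possible way to keep that tidy is to unpack into a scratch directory that is cleaned up automatically. This is a sketch under that assumption; `gunzip_reads` is a hypothetical helper name, and `gzip -dc` is used in place of `zcat` because some systems ship a `zcat` that only handles `.Z` files.

```shell
# Sketch: decompress gzipped reads into a scratch directory for tools
# that cannot read .gz input. gunzip_reads is a hypothetical helper.
gunzip_reads() {
    outdir=$1; shift
    for gz in "$@"; do
        base=$(basename "$gz" .gz)
        # gzip -dc writes to stdout and leaves the original .gz untouched
        gzip -dc "$gz" > "$outdir/$base" || return 1
        # Print the path of each uncompressed file for the caller
        echo "$outdir/$base"
    done
}
```

A job script could then run `scratch=$(mktemp -d)` and `trap 'rm -rf "$scratch"' EXIT`, call `gunzip_reads "$scratch"` on the forward and reverse files, and pass `"$scratch"/*.fq` to `platanus assemble -f`. The trap removes the uncompressed copies even if the assembly fails partway through.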