Assembling the Genome

The steps to build the de novo genome assembly include:

Estimating the best k-mer length for the assembly
- Program: KmerGenie v1.7051
Performing the assembly
- Program: Abyss v2.1.5
- Program: Platanus v1.2.4
Assembly clean-up and checking
- Programs: xxx

Step 1: Estimating k-mer length

Prior to assembly, the first step is to select an appropriate k-mer length. Rather than running multiple assemblies at different values of k, we will estimate it with KmerGenie v1.7051. The publication can be found here:
Chikhi R and Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1): 31–37. https://doi.org/10.1093/bioinformatics/btt310

Installation:

# Install KmerGenie
wget http://kmergenie.bx.psu.edu/kmergenie-1.7051.tar.gz
tar -zxvf kmergenie-1.7051.tar.gz
cd kmergenie-1.7051/
make

Run KmerGenie
See the script kmergenie.sh for the job details. According to the KmerGenie documentation, only the reads the assembler uses to build contigs should be supplied, not those used only for scaffolding (i.e. mate pairs).

# Make list of sequence files
ls /work/frr6/SHAD/MUSKET/PE500*.fq.gz > reads.list

# Run kmergenie
kmergenie \
   reads.list \
   --diploid \
   -t 12 \
   -o kmer

Parameters Explained:

reads.list :: file listing the read files to include, one per line (gzipped files are accepted, as used here)
--diploid :: diploid organism
-t :: number of cpus to use
-o :: output file prefix
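Before launching KmerGenie, it can be worth confirming that every path in the list exists and is non-empty, since a bad path may only surface after the job has queued. The following is a minimal sketch of such a check. The temporary directory and the tiny placeholder FASTQ files are assumptions made so the example runs anywhere; in practice the loop would read the real reads.list.

```shell
# Sketch: validate a list of read files before passing it to kmergenie.
# The directory and file contents are placeholders, not the real data.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/PE500_F.fq.gz"
printf '@r1\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/PE500_R.fq.gz"

# One file per line, as in the reads.list step above
ls "$workdir"/PE500*.fq.gz > "$workdir/reads.list"

# Count entries that are missing or empty
missing=0
while IFS= read -r f; do
    if [ ! -s "$f" ]; then
        echo "missing or empty: $f" >&2
        missing=$((missing + 1))
    fi
done < "$workdir/reads.list"

n_files=$(wc -l < "$workdir/reads.list" | tr -d ' ')
echo "files listed: $n_files, missing: $missing"
```

If the missing count is not zero, fix the list before submitting the job.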

See the Output HTML/PDF Files:

kmergenie_report

Summary of Results: KmerGenie reported:

An estimated best k=97
An estimated genome size of 842,670,695 bp
See the plot below:
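The KmerGenie genome size estimate is noticeably smaller than the ~1.3 Gb literature value that is passed to Abyss as G in Step 2a. A short worked calculation makes the size of that gap explicit. This is only arithmetic on the two numbers quoted in this document, not an additional analysis.

```shell
# Sketch: compare the KmerGenie genome size estimate with the literature value.
kmergenie_bp=842670695     # reported by KmerGenie above
literature_bp=1300000000   # ~1.3 Gb, the value passed to abyss-pe as G

# Percentage of the literature estimate recovered by the k-mer based estimate
ratio=$(awk -v a="$kmergenie_bp" -v b="$literature_bp" \
    'BEGIN { printf "%.1f", 100 * a / b }')
echo "KmerGenie estimate is ${ratio}% of the literature estimate"
```

The gap is worth keeping in mind when choosing G, since G only affects how NG50 is calculated.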

Step 2a: Assembly with Abyss 2.1.5

From the website: "ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes." I have used it previously to assemble the dromedary and Florida panther genomes. The publications can be found here:

Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, and Birol I (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Research 27: 768-777. https://doi.org/10.1101/gr.214346.116
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. (2009) ABySS: A parallel assembler for short read sequence data. Genome Research 19: 1117-1123. https://doi.org/10.1101/gr.089532.108

Installation:

# Install google sparsehash
git clone https://github.com/sparsehash/sparsehash.git
cd sparsehash/
./configure --prefix=/dscrhome/frr6/bin/
make
make install

# Install Abyss v2.1.5
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.1.5/abyss-2.1.5.tar.gz
tar -zxvf abyss-2.1.5.tar.gz
cd abyss-2.1.5/
./configure \
   --prefix=/dscrhome/frr6/bin/ \
   --with-sparsehash=/dscrhome/frr6/bin \
   --with-mpi=/opt/apps/slurm/openmpi-2.0.0/
make
make install

Run Abyss
See the script abyss.sh for the job details. Abyss was run several times, varying the parameters listed in the file parameters.

# Setup TMPDIR
export TMPDIR=/work/frr6

# This is a basic command for abyss:
abyss-pe \
   k=${k} \
   G=1300000000 \
   S=${S} \
   s=${s} \
   np=12 \
   v=-v \
   name=Asap${n} \
   lib='PE500' \
   mp='MP5k MP10k' \
   PE500='/work/frr6/SHAD/MUSKET/PE500_F.trimmed.uniq.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/PE500_R.trimmed.uniq.noMito.corrected.fq.gz' \
   MP5k='/work/frr6/SHAD/MUSKET/MP5k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP5k_R.trimmed.uniq.unj.noMito.corrected.fq.gz' \
   MP10k='/work/frr6/SHAD/MUSKET/MP10k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP10k_R.trimmed.uniq.unj.noMito.corrected.fq.gz'

Parameters Explained:

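k :: k-mer length for the assembly
G :: genome size estimate used to calculate NG50; 1.3 pg (~1.3 Gb) for A. sapidissima
- Taken from genomesize.com
- Hinegardner R and Rosen DE (1972) Cellular DNA Content and the Evolution of Teleostean Fishes. American Naturalist 106(951): 621-644. https://www.jstor.org/stable/2459724
S :: minimum contig size required for building scaffolds (bp)
s :: minimum unitig size required for building contigs (bp)
c :: minimum mean k-mer coverage of a unitig (not set in the command above)
np :: number of MPI processes
-n :: print out the complete list of commands to run (dry run); not used in the command above
v=-v :: verbose output
name :: name of assembly
lib :: name of PE library
mp :: name(s) of mate-pair libraries
... :: lists of the files in each library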
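Because the same command is run with several values of k, s and S, it can help to generate the commands from a small table instead of editing them by hand. The sketch below shows one way to do that. The column layout (n k s S) and the three rows are assumptions for illustration and do not reflect the actual contents of the parameters file. The library arguments are omitted for brevity, and the commands are only printed, not run.

```shell
# Sketch: turn a parameter table into one abyss-pe command per row.
# The column layout (n k s S) is an assumption, not the real "parameters" file.
params=$(mktemp)
cat > "$params" <<'EOF'
1 97 200 1000-10000
2 97 500 1000-10000
4 51 1000 1000-10000
EOF

cmds=$(mktemp)
while read -r n k s S; do
    # echo rather than run, so the commands can be inspected first
    echo "abyss-pe k=${k} G=1300000000 S=${S} s=${s} np=12 v=-v name=Asap${n}"
done < "$params" > "$cmds"

cat "$cmds"
n_cmds=$(wc -l < "$cmds" | tr -d ' ')
```

Printing the commands first plays a similar role to the -n dry run: each line can be checked before any cluster time is spent.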

Summary of Abyss Assemblies at Various Parameters:

Parameter	AbyssA	AbyssB	Abyss1	Abyss2	Abyss3	Abyss4	Abyss5	Abyss6	Abyss7	Abyss8	Abyss9	Abyss10	Abyss11
k	97	97	97	97	97	51	51	61	61	71	71	81	81
G	1300000000	900000000	1300000000	1300000000	900000000	1300000000	1300000000	1300000000	1300000000	1300000000	1300000000	1300000000	1300000000
s	1000	1000	200	500	1000	1000	200	1000	200	1000	200	1000	200
S	1000-10000	1000-10000	1000-10000	1000-10000	11000-15000	11000-10000	1000-10000	1000-10000	1000-10000	1000-10000	1000-10000	1000-10000	1000-10000
c	sqrt(median)	2	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)	sqrt(median)
n:500	250,927	267,319	342,115	218,851	250,927	287,606	305,924	277,688	299,210	264,419	300,682	252,556	312,240
L50	35,128	37,850	52,889	34,953	35,126	45,830	49,880	41,175	45,702	37,989	45,082	35,895	47,071
N50	5,839	5,190	5,156	6,085	5,839	2,709	2,776	3,456	3,588	4,322	4,348	5,185	4,882
Longest	82,529	82,529	94,534	82,529	82,529	51,495	52,621	51,485	52,620	52,630	54,398	127,329	130,763
Total Size	676.5 Mb	665.7 Mb	972.3 Mb	696.3 Mb	676.5 Mb	508.2 Mb	572.2 Mb	559.8 Mb	643.5 Mb	601.4 Mb	726.1 Mb	635.2 Mb	824.2 Mb
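In the table above, N50 is a length and L50 is a count: N50 is the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. The sketch below computes both on a toy list of lengths chosen for illustration. It is not how the reported values were produced.

```shell
# Sketch: compute N50 and L50 from a list of contig lengths (toy values).
lengths="100 80 60 40 20"

stats=$(echo "$lengths" | tr ' ' '\n' | sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
        half = total / 2
        for (i = 1; i <= NR; i++) {
            running += len[i]
            # First contig at which the running sum reaches half the total
            if (running >= half) { print len[i], i; exit }
        }
    }')

n50=${stats% *}   # contig length at the halfway point
l50=${stats#* }   # number of contigs needed to reach it
echo "N50=$n50 L50=$l50"
```

Here the total is 300, so the halfway point of 150 is reached by the second contig (100 + 80), giving an N50 of 80 and an L50 of 2.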

Step 2b: Assembly with Platanus v1.2.4

Platanus is another assembler built specifically to assemble genomes from high coverage data. You can perform all the necessary read trimming/cleaning using platanus_trim, but we have already conservatively trimmed our dataset. From the website:

Compared with other major assemblers, Platanus assembler was designed to provide good results when using higher coverage data. The optimal coverage depth for Platanus is approximately >80. In some procedures, Platanus attempts to assemble each haplotype sequence separately. In other words, Platanus requires twice as high coverage sequences as other assemblers. This is the main reason why Platanus requires high coverage. You can find more details on Supplemental Materials page 68–74 of our Genome Research publication.
To get good statistical results, mate-pair library sequences are indispensable. We received many claims and questions of poor assembling results. However, in almost all cases, only paired-end sequences were inputted. Except in the case of assembling very simple and small size genomes, it is impossible to get good results without using a mate-pair library.
The publication can be found here:
- Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, and Itoh T (2014) Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Research 24(8):1384-1395. https://doi.org/10.1101/gr.170720.113
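Since Platanus is tuned for coverage above roughly 80x, it is useful to estimate the coverage depth of the paired-end library first: total read bases divided by the genome size. The sketch below does this on a toy FASTQ file. The reads and the genome size of 5 bp are made-up values so the arithmetic is easy to follow; with real data the input would be the uncompressed read files and one of the genome size estimates discussed earlier.

```shell
# Sketch: estimate coverage depth as total read bases / genome size.
# The reads and genome size are toy values, not the real libraries.
fq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nTTGACCGTAA\n+\nIIIIIIIIII\n' > "$fq"
genome_size=5

# In 4-line FASTQ records, the sequence is line 2 of each record
total_bases=$(awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }' "$fq")
coverage=$(awk -v b="$total_bases" -v g="$genome_size" \
    'BEGIN { printf "%.1f", b / g }')
echo "total bases: $total_bases, coverage: ${coverage}x"
```

The calculation assumes strict 4-line FASTQ records, with no sequences wrapped over several lines.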

Installation:

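# Quote the URL so the shell does not expand "?", and name the output to match the tar command
wget -O Platanus_v1.2.4.tar.gz 'http://platanus.bio.titech.ac.jp/?ddownload=150'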
tar -zxvf Platanus_v1.2.4.tar.gz
cd Platanus_v1.2.4
make

# Or download pre-built binary directly from http://platanus.bio.titech.ac.jp/platanus-assembler/platanus-1-2-4

Run Platanus: See the script platanus.sh for the job details.

# Make uncompressed reads files
zcat PE500_F.trimmed.uniq.noMito.corrected.fq.gz > F.fq
zcat PE500_R.trimmed.uniq.noMito.corrected.fq.gz > R.fq

# Run platanus
platanus \
   assemble \
   -o Asap1 \
   -f [FR].fq \
   -t 16 \
   -m 200

# Remove read files
rm F.fq R.fq

Parameters Explained:

-o :: name prefix for output files
-f :: read files (must be uncompressed; Platanus does not read gzipped files)
-t :: number of threads
-m :: memory limit (GB)
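Because the reads have to be decompressed first, a quick check that the forward and reverse files hold the same number of reads can catch a truncated decompression before a long assembly job starts. The sketch below shows one way to do this. The scratch directory and the one-read placeholder files are assumptions so the example runs anywhere, and the platanus call is left as a comment.

```shell
# Sketch: decompress paired reads, confirm the pair counts match, then clean up.
# File names and contents are placeholders, not the real libraries.
scratch=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$scratch/F.fq.gz"
printf '@r1\nTTGA\n+\nIIII\n' | gzip -c > "$scratch/R.fq.gz"

# gzip -dc does the same job as zcat
gzip -dc "$scratch/F.fq.gz" > "$scratch/F.fq"
gzip -dc "$scratch/R.fq.gz" > "$scratch/R.fq"

# Four lines per FASTQ record
reads_f=$(( $(wc -l < "$scratch/F.fq") / 4 ))
reads_r=$(( $(wc -l < "$scratch/R.fq") / 4 ))
if [ "$reads_f" -eq "$reads_r" ]; then
    paired=yes
else
    paired=no
    echo "read count mismatch: $reads_f vs $reads_r" >&2
fi
echo "forward: $reads_f, reverse: $reads_r, paired: $paired"

# platanus assemble -o Asap1 -f "$scratch"/[FR].fq -t 16 -m 200   # would run here

# Remove only the uncompressed copies; the .gz originals stay
rm "$scratch/F.fq" "$scratch/R.fq"
```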
