This repository has been archived by the owner on Jul 9, 2021. It is now read-only.

Assembling the Genome

The steps to build the de novo genome assembly include:

  1. Estimating the best k-mer length for the assembly
  2. Performing the assembly
  3. Assembly clean-up and checking
    • Programs: xxx

Step 1: Estimating k-mer length

Prior to assembly, the first step is to select an appropriate k-mer length. Rather than running multiple assemblies at different values of k, we will use the software KmerGenie v1.7051. The publication can be found here:
Chikhi R and Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1): 31–37. https://doi.org/10.1093/bioinformatics/btt310

Installation:

# Install KmerGenie
wget http://kmergenie.bx.psu.edu/kmergenie-1.7051.tar.gz
tar -zxvf kmergenie-1.7051.tar.gz
cd kmergenie-1.7051/
make

Run KmerGenie
Please see the script kmergenie.sh for more details on job information. According to KmerGenie, only the reads used by the assembler should be supplied, not those used for scaffolding (i.e. mate pairs).

# Make list of sequence files
ls /work/frr6/SHAD/MUSKET/PE500*.fq.gz > reads.list

# Run kmergenie
kmergenie \
   reads.list \
   --diploid \
   -t 12 \
   -o kmer

Parameters Explained:

  • reads.list :: file listing the read files to include, one per line (the gzipped FASTQ files used here are accepted)
  • --diploid :: diploid organism
  • -t :: number of cpus to use
  • -o :: output file prefix
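
Before launching a long KmerGenie job, it can help to confirm that every path in reads.list exists and is non-empty. The following is a minimal sketch of such a check; the function name `check_reads_list` is hypothetical and is not part of KmerGenie.

```shell
#!/usr/bin/env bash
# check_reads_list: hypothetical helper, not part of KmerGenie.
# Reports every listed path that is missing or empty and
# returns non-zero if at least one such path was found.
check_reads_list() {
    local list="$1" missing=0 path
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue        # skip blank lines
        if [ ! -s "$path" ]; then         # -s: file exists and has size > 0
            echo "missing or empty: $path" >&2
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

One way to use it is to gate the run on the check, for example `check_reads_list reads.list && kmergenie reads.list --diploid -t 12 -o kmer`.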

See the Output HTML/PDF Files:

Summary of Results: KmerGenie reported:

  1. An estimated best k=97
  2. An estimated genome size of 842,670,695 bp
  3. See the plot below: KmerGenie plot

Step 2a: Assembly with Abyss 2.1.5

From the website: "ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes." I have used it previously to assemble the dromedary and Florida panther genomes. The publications can be found here:

  • Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, and Birol I (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Research 27: 768-777. https://doi.org/10.1101/gr.214346.116
  • Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. (2009) ABySS: A parallel assembler for short read sequence data. Genome Research 19: 1117-1123. https://doi.org/10.1101/gr.089532.108

Installation:

# Install google sparsehash
git clone https://github.com/sparsehash/sparsehash.git
cd sparsehash/
./configure --prefix=/dscrhome/frr6/bin/
make
make install

# Install Abyss v2.1.5
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.1.5/abyss-2.1.5.tar.gz
tar -zxvf abyss-2.1.5.tar.gz
cd abyss-2.1.5/
./configure \
   --prefix=/dscrhome/frr6/bin/ \
   --with-sparsehash=/dscrhome/frr6/bin \
   --with-mpi=/opt/apps/slurm/openmpi-2.0.0/
make
make install

Run Abyss
Please see the script abyss.sh for more details on job information. Abyss was run with several parameter combinations, which are listed in the file parameters.

# Setup TMPDIR
export TMPDIR=/work/frr6

# This is a basic command for abyss:
abyss-pe \
   k=${k} \
   G=1300000000 \
   S=${S} \
   s=${s} \
   np=12 \
   v=-v \
   name=Asap${n} \
   lib='PE500' \
   mp='MP5k MP10k' \
   PE500='/work/frr6/SHAD/MUSKET/PE500_F.trimmed.uniq.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/PE500_R.trimmed.uniq.noMito.corrected.fq.gz' \
   MP5k='/work/frr6/SHAD/MUSKET/MP5k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP5k_R.trimmed.uniq.unj.noMito.corrected.fq.gz' \
   MP10k='/work/frr6/SHAD/MUSKET/MP10k_F.trimmed.uniq.unj.noMito.corrected.fq.gz /work/frr6/SHAD/MUSKET/MP10k_R.trimmed.uniq.unj.noMito.corrected.fq.gz'

Parameters Explained:

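  • k :: k-mer length for the assembly
  • G :: genome size estimate used to calculate NG50; 1.3 pg (~1.3 Gb) for A. sapidissima
  • S :: minimum contig size (bp) required for building scaffolds
  • s :: minimum unitig size (bp) required for building contigs
  • np :: number of MPI processes
  • v=-v :: verbose output
  • name :: name of assembly
  • lib :: name of PE library
  • mp :: name(s) of mate-pair libraries
  • PE500, MP5k, MP10k :: lists of the files in each library
  • c :: minimum mean k-mer coverage of a unitig (not set in the command above; see the table below)
  • -n :: print out the complete list of commands to run without executing them (dry run; not used in the command above)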
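
The parameter sweep could be scripted roughly as follows. This is only a sketch: it assumes a hypothetical table with four whitespace-separated columns (n, k, s, S), because the layout of the real parameters file is not shown here, and it prints each abyss-pe command rather than running it. The function names `pair` and `print_abyss_commands` are illustrative.

```shell
#!/usr/bin/env bash
# Print (but do not run) one abyss-pe command per row of a parameter table.
# ASSUMPTION: each row holds four whitespace-separated fields "n k s S";
# the real layout of the "parameters" file is not shown in this document.
READS=/work/frr6/SHAD/MUSKET
PE_SUFFIX=trimmed.uniq.noMito.corrected.fq.gz
MP_SUFFIX=trimmed.uniq.unj.noMito.corrected.fq.gz

# pair LIB SUFFIX -> LIB='<forward file> <reverse file>'
pair() {
    printf "%s='%s/%s_F.%s %s/%s_R.%s'" "$1" "$READS" "$1" "$2" "$READS" "$1" "$2"
}

print_abyss_commands() {
    local table="$1" n k s S
    while read -r n k s S || [ -n "$n" ]; do
        case "$n" in ''|'#'*) continue ;; esac   # skip blank lines and comments
        echo "abyss-pe k=$k G=1300000000 S=$S s=$s np=12 v=-v name=Asap$n" \
             "lib='PE500' mp='MP5k MP10k'" \
             "$(pair PE500 "$PE_SUFFIX") $(pair MP5k "$MP_SUFFIX") $(pair MP10k "$MP_SUFFIX")"
    done < "$table"
}
```

Printing the commands first makes it easy to review each combination before submitting it as a job.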

Summary of Abyss Assemblies at Various Parameters:

| Parameter | AbyssA | AbyssB | Abyss1 | Abyss2 | Abyss3 | Abyss4 | Abyss5 | Abyss6 | Abyss7 | Abyss8 | Abyss9 | Abyss10 | Abyss11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k | 97 | 97 | 97 | 97 | 97 | 51 | 51 | 61 | 61 | 71 | 71 | 81 | 81 |
| G (bp) | 1300000000 | 900000000 | 1300000000 | 1300000000 | 900000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 | 1300000000 |
| s | 1000 | 1000 | 200 | 500 | 1000 | 1000 | 200 | 1000 | 200 | 1000 | 200 | 1000 | 200 |
| S | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 11000-15000 | 11000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 | 1000-10000 |
| c | sqrt(median) | 2 | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) | sqrt(median) |
| n:500 | 250,927 | 267,319 | 342,115 | 218,851 | 250,927 | 287,606 | 305,924 | 277,688 | 299,210 | 264,419 | 300,682 | 252,556 | 312,240 |
| L50 | 35,128 | 37,850 | 52,889 | 34,953 | 35,126 | 45,830 | 49,880 | 41,175 | 45,702 | 37,989 | 45,082 | 35,895 | 47,071 |
| N50 (bp) | 5,839 | 5,190 | 5,156 | 6,085 | 5,839 | 2,709 | 2,776 | 3,456 | 3,588 | 4,322 | 4,348 | 5,185 | 4,882 |
| Longest (bp) | 82,529 | 82,529 | 94,534 | 82,529 | 82,529 | 51,495 | 52,621 | 51,485 | 52,620 | 52,630 | 54,398 | 127,329 | 130,763 |
| Total Size | 676.5 Mb | 665.7 Mb | 972.3 Mb | 696.3 Mb | 676.5 Mb | 508.2 Mb | 572.2 Mb | 559.8 Mb | 643.5 Mb | 601.4 Mb | 726.1 Mb | 635.2 Mb | 824.2 Mb |
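
For reference, N50 is the length of the contig at which the cumulative length of the contigs, sorted longest first, reaches half of the total assembly size. L50 is the number of contigs needed to get there. The sketch below computes both from a plain list of lengths with sort and awk. The function name `n50_l50` is hypothetical, and the sketch applies no minimum-length cutoff, so lengths would need to be filtered first to mirror the 500 bp threshold implied by the n:500 row.

```shell
#!/usr/bin/env bash
# n50_l50: hypothetical helper. Reads one contig length per line on stdin
# and prints "N50 L50" on a single line.
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }         # store lengths, longest first
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # stop at the first contig where the running sum
                # reaches at least half of the total length
                if (running * 2 >= total) { print len[i], i; exit }
            }
        }'
}
```

For example, lengths 2 through 10 sum to 54; the three longest contigs (10, 9, 8) sum to 27, so `printf '%s\n' 2 3 4 5 6 7 8 9 10 | n50_l50` prints `8 3`.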

Step 2b: Assembly with Platanus v1.2.4

Platanus is another assembler built specifically to assemble genomes from high coverage data. You can perform all the necessary read trimming/cleaning using platanus_trim, but we have already conservatively trimmed our dataset. From the website:

  • Compared with other major assemblers, Platanus assembler was designed to provide good results when using higher coverage data. The optimal coverage depth for Platanus is approximately >80. In some procedures, Platanus attempts to assemble each haplotype sequence separately. In other words, Platanus requires twice as high coverage sequences as other assemblers. This is the main reason why Platanus requires high coverage. You can find more details on Supplemental Materials page 68–74 of our Genome Research publication.
  • To get good statistical results, mate-pair library sequences are indispensable. We received many claims and questions of poor assembling results. However, in almost all cases, only paired-end sequences were inputted. Except in the case of assembling very simple and small size genomes, it is impossible to get good results without using a mate-pair library.
  • The publication can be found here:
    • Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, and Itoh T (2014) Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Research 24(8):1384-1395. https://doi.org/10.1101/gr.170720.113

Installation:

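# Quote the URL (it contains "?") and name the download so the tar step can find it
wget -O Platanus_v1.2.4.tar.gz 'http://platanus.bio.titech.ac.jp/?ddownload=150'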
tar -zxvf Platanus_v1.2.4.tar.gz
cd Platanus_v1.2.4
make

# Or download pre-built binary directly from http://platanus.bio.titech.ac.jp/platanus-assembler/platanus-1-2-4

Run Platanus: Please see the script platanus.sh for more details on Job information.

# Make uncompressed reads files
zcat PE500_F.trimmed.uniq.noMito.corrected.fq.gz > F.fq
zcat PE500_R.trimmed.uniq.noMito.corrected.fq.gz > R.fq

# Run platanus
platanus \
   assemble \
   -o Asap1 \
   -f [FR].fq \
   -t 16 \
   -m 200

# Remove read files
rm F.fq R.fq

Parameters Explained:

  • -o :: name prefix for output files
  • -f :: read files (must be uncompressed; Platanus does not read gzipped files)
  • -t :: number of threads
  • -m :: memory limit in GB
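
Because Platanus needs uncompressed input, the decompress, run, and clean-up steps can be wrapped so that the temporary files are removed even when the assembly fails. This is a sketch only; the wrapper name `with_unzipped_reads` is hypothetical and not part of Platanus. It uses `gzip -dc`, which is equivalent to zcat for gzipped files.

```shell
#!/usr/bin/env bash
# with_unzipped_reads: hypothetical wrapper, not part of Platanus.
# Usage: with_unzipped_reads FWD.fq.gz REV.fq.gz command [args...]
# Decompresses both files into a temporary directory, runs the command with
# the two uncompressed paths appended as its last arguments, then removes
# the temporary directory and returns the exit status of the failed step
# (or of the command itself).
with_unzipped_reads() {
    local fwd="$1" rev="$2" tmp status
    shift 2
    tmp=$(mktemp -d) || return 1
    gzip -dc "$fwd" > "$tmp/F.fq" &&
        gzip -dc "$rev" > "$tmp/R.fq" &&
        "$@" "$tmp/F.fq" "$tmp/R.fq"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Since the read paths are appended at the end, `-f` has to be the last option given, for example `with_unzipped_reads PE500_F.trimmed.uniq.noMito.corrected.fq.gz PE500_R.trimmed.uniq.noMito.corrected.fq.gz platanus assemble -o Asap1 -t 16 -m 200 -f`. The temporary directory is created under `$TMPDIR` when that variable is set, so it should point at a file system with enough space for the uncompressed reads.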