Whole-genome de-novo Assembly

In this practical we will perform the assembly of M. genitalium, a bacterium published in 1995 by Fraser et al in Science.

Getting the data

Go to ENA, and search for the run ERR486840.

Download the 2 fastq files associated with the run.

Quality control

Perform quality control of the data as you did in the QC tutorial

How many reads are in the fastq file? What is the read length? Does the data need trimming or other filtering? If so, do it.

Find the genome size of M. genitalium in the Fraser paper abstract. Based on the expected genome size, the read length and the number of reads – what average coverage do you expect to get from this fastq read files?

De-novo assembly

We will be using the SPAdes assembler to assemble our bacterium.

module load spades
spades.py -k 21,33,55,77,99 --careful --only-assembler -1 read_1.fastq -2 read_2.fastq -o output

This will produce a series of outputs. The scaffolds will be in fasta format.

How well does the assembly total consensus size and coverage correspond to your earlier estimation? What is the N50 of the assembly? What does this mean? How many contigs in total did the assembly produce? How many contigs longer than 500bp? What is the N50 of those contigs only?

Perform another assembly with the following options:

Use the raw reads (no trimming, but with adapters removed), wthout the --only-assembler option.

If you have time, try out the following genome assemblers:

MaSurCa
Ray
Velvet

Comparing assemblies

QUAST is a software evaluating the quality of genome assemblies by computing various metrics, including

N50, length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
NG50, where length of the reference genome is being covered
NA50 and NGA50, where aligned blocks instead of contigs are taken
misassemblies, misassembled and unaligned contigs or contigs bases
genes and operons covered

To run Quast:

module load quast
quast.py assembly1.fasta assembly2.fasta ... -R reference.fasta -G reference.gff

Quast will produce a pdf in the quast_results directory. Download it on your computer and take a look. Which assembly is better?

Fixing misassemblies

Pilon is a software tool which can be used to automatically improve draft assemblies. It attempts to make improvements to the input genome, including:

Single base differences
Small Indels
Larger Indels or block substitution events
Gap filling
Identification of local misassemblies, including optional opening of new gaps

Pilon then outputs a FASTA file containing an improved representation of the genome from the read data and an optional VCF file detailing variation seen between the read data and the input genome.

You can read how Pilon works in detail here

Before running Pilon itself, you have to map your reads back to the assembly!

bowtie2-build assembly.fasta index_prefix
(bowtie2 -x index_prefix -1 read_1.fastq -2 read_2.fastq | samtools view -bS -o output_to_sort.bam - ) 2> bowtie.err
samtools sort output_to_sort.bam alignment.bam
samtools index alignment.bam

Run Pilon with the following command:

module load pilon
pilon --genome assembly.fasta --frags alignment.bam --output pilon_output

Once Pilon is finished running, compare the new assembly with the old one using Quast!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assembly.md

assembly.md

Whole-genome de-novo Assembly

Getting the data

Quality control

De-novo assembly

Comparing assemblies

Fixing misassemblies

Files

assembly.md

Latest commit

History

assembly.md

File metadata and controls

Whole-genome de-novo Assembly

Getting the data

Quality control

De-novo assembly

Comparing assemblies

Fixing misassemblies