Skip to content

Latest commit

 

History

History
97 lines (63 loc) · 3.4 KB

assembly.md

File metadata and controls

97 lines (63 loc) · 3.4 KB

Whole-genome de-novo Assembly

In this practical we will perform the assembly of M. genitalium, a bacterium published in 1995 by Fraser et al in Science.

Getting the data

Go to ENA, and search for the run ERR486840.

Download the 2 fastq files associated with the run.

Quality control

Perform quality control of the data as you did in the QC tutorial

How many reads are in the fastq file? What is the read length? Does the data need trimming or other filtering? If so, do it.

Find the genome size of M. genitalium in the Fraser paper abstract. Based on the expected genome size, the read length and the number of reads – what average coverage do you expect to get from this fastq read files?

De-novo assembly

We will be using the SPAdes assembler to assemble our bacterium.

module load spades
spades.py -k 21,33,55,77,99 --careful --only-assembler -1 read_1.fastq -2 read_2.fastq -o output

This will produce a series of outputs. The scaffolds will be in fasta format.

How well does the assembly total consensus size and coverage correspond to your earlier estimation? What is the N50 of the assembly? What does this mean? How many contigs in total did the assembly produce? How many contigs longer than 500bp? What is the N50 of those contigs only?

Perform another assembly with the following options:

Use the raw reads (no trimming, but with adapters removed), wthout the --only-assembler option.

If you have time, try out the following genome assemblers:

  • MaSurCa
  • Ray
  • Velvet

Comparing assemblies

QUAST is a software evaluating the quality of genome assemblies by computing various metrics, including

  • N50, length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
  • NG50, where length of the reference genome is being covered
  • NA50 and NGA50, where aligned blocks instead of contigs are taken
  • misassemblies, misassembled and unaligned contigs or contigs bases
  • genes and operons covered

To run Quast:

module load quast
quast.py assembly1.fasta assembly2.fasta ... -R reference.fasta -G reference.gff

Quast will produce a pdf in the quast_results directory. Download it on your computer and take a look. Which assembly is better?

Fixing misassemblies

Pilon is a software tool which can be used to automatically improve draft assemblies. It attempts to make improvements to the input genome, including:

  • Single base differences
  • Small Indels
  • Larger Indels or block substitution events
  • Gap filling
  • Identification of local misassemblies, including optional opening of new gaps

Pilon then outputs a FASTA file containing an improved representation of the genome from the read data and an optional VCF file detailing variation seen between the read data and the input genome.

You can read how Pilon works in detail here

Before running Pilon itself, you have to map your reads back to the assembly!

bowtie2-build assembly.fasta index_prefix
(bowtie2 -x index_prefix -1 read_1.fastq -2 read_2.fastq | samtools view -bS -o output_to_sort.bam - ) 2> bowtie.err
samtools sort output_to_sort.bam alignment.bam
samtools index alignment.bam

Run Pilon with the following command:

module load pilon
pilon --genome assembly.fasta --frags alignment.bam --output pilon_output

Once Pilon is finished running, compare the new assembly with the old one using Quast!