Skip to content

How to run Polypolish

Ryan Wick edited this page Aug 3, 2021 · 27 revisions

Quick start

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam
polypolish draft.fasta alignments_1.sam alignments_2.sam > polished.fasta

Using paired-end short reads

Most short-read sets are paired-end, so this will be the most common way to run Polypolish.

In these commands, I will assume your files are named like this:

  • draft.fasta: the input assembly that you want to polish
  • reads_1.fastq.gz: short reads (first in pair)
  • reads_2.fastq.gz: short reads (second in pair)

The first step is to align the reads to your draft genome with BWA:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam

Important things to note here:

  • The -a option is key! This makes BWA align each read to all possible locations, not just the best location. Polypolish needs this to polish repeat sequences.
  • You have to align the two read files separately. I.e. don't align both read files with a single bwa mem command. This is because BWA's -a option doesn't work with paired-end alignment.
  • I used -t 16 in these commands to align using 16 threads. Adjust this value as is appropriate for your system. This will only affect the speed performance (the alignments themselves are unaffected by thread count).

Then give the draft genome and the alignments to Polypolish. It will output information to stderr and the polished assembly to stdout, so redirect its output to a file:

polypolish draft.fasta alignments_1.sam alignments_2.sam > polished.fasta

Using other types of short reads

Polypolish works with any number of SAM files, so unpaired reads are easy:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads.fastq.gz > alignments.sam
polypolish draft.fasta alignments.sam > polished.fasta

And multiple paired-end read sets work too:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_a_1.fastq.gz > alignments_a_1.sam
bwa mem -t 16 -a draft.fasta reads_a_2.fastq.gz > alignments_a_2.sam
bwa mem -t 16 -a draft.fasta reads_b_1.fastq.gz > alignments_b_1.sam
bwa mem -t 16 -a draft.fasta reads_b_2.fastq.gz > alignments_b_2.sam
polypolish draft.fasta alignments_*.sam > polished.fasta

Debug file

Polypolish has a --debug option which you can use to save per-base polishing information in a tab-delimited format. You can use it like this:

polypolish --debug polished.tsv draft.fasta alignments_1.sam alignments_2.sam > polished.fasta

The resulting file will have one line for each base of the assembly, so it will be hundreds of megabytes in size for a typical bacterial genome. See the Toy example page for an example of what this file looks like.

Help text

If you run polypolish --help, you will see this:

  _____        _                       _  _       _     
 |  __ \      | |                     | |(_)     | |    
 | |__) |___  | | _   _  _ __    ___  | | _  ___ | |__  
 |  ___// _ \ | || | | || '_ \  / _ \ | || |/ __|| '_ \ 
 | |   | (_) || || |_| || |_) || (_) || || |\__ \| | | |
 |_|    \___/ |_| \__, || .__/  \___/ |_||_||___/|_| |_|
                   __/ || |                             
                  |___/ |_|                             

Polypolish v0.4.2
short-read polishing of long-read assemblies
github.com/rrwick/Polypolish

USAGE:
    polypolish [OPTIONS] <assembly> <sam>...

ARGS:
    <assembly>    Assembly to polish (FASTA format, one file)
    <sam>...      Short read alignments (SAM format, one or more files)

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --debug <debug>                  Optional file to store per-base information for debugging
                                         purposes
    -m, --max_errors <max-errors>        Ignore alignments with more than this many mismatches and
                                         indels [default: 10]
    -d, --min_depth <min-depth>          A base must occur at least this many times in the pileup to be
                                         considered valid [default: 5]
    -f, --min_fraction <min-fraction>    A base must make up at least this fraction of the read depth to
                                         be considered valid [default: 0.5]
    -r, --min_ratio <min-ratio>          A base will only be changed if the best option occurs this many
                                         times more often than the second-best option [default: 2.0]