Skip to content

Latest commit

 

History

History
147 lines (120 loc) · 9.86 KB

README.textile

File metadata and controls

147 lines (120 loc) · 9.86 KB

Visualization and quality assessment of de novo genome assemblies

Citation

This software is fully described in the paper:
Riba-Grognuz, Keller, Falquet, Xenarios & Wurm (2011) Visualization and quality assessment of de novo genome assemblies.

In brief, our scripts create Cytoscape files to visualize transcript evidence that suggests adjacency between scaffolds and contigs.

Software requirements

  • BLAT (tested with Standalone BLAT v. 32×1). Source Binaries .
  • Cytoscape (tested with versions 2.7.0, 2.8.2)
  • a UNIX machine (tested on Mac OS X 10.6 and CentOS 4.6)

Usage

Input : assembly fasta (1), transcript fasta (2), AGP scaffolding file (3), blat alignment (instead fasta) (4)

  1. requires scaffolded contig fasta to proceed on contig level (-c flag)
    requires scaffold (and unscaffolded contig) fasta to proceed on scaffold level
  2. fasta with transcript sequences obtained independently from genomic assembly
  3. requires scaffolding information in AGP file
  4. blat alignment in psl format instead of fasta (transcripts to genome assembly)

Output : tab delimited files with nodes, edges and properties and summary of blat alignments

  1. On contig level (-c flag)
    out.contig.nw – defines nodes, edges and edge attributes
    out.contig.attr – defines node attributes
    out.contig.blat.stat – statistics of blat alignments
  2. On scaffold level (default)
    out.scaffold.nw – defines nodes, edges and edge attributes
    out.scaffold.attr – defines node attributes
    out.withinOneScaffold – self loop nodes
    out.scaffold.blat.stat – statistics of blat alignments

Processing On Scaffold Level
makeCytoscapeNetwork.sh -g genome.fasta -t transcript.fasta -a file.agp [-klinoqsw]
makeCytoscapeNetwork.sh -p blat.alignment.psl -a file.agp [-klinoqsw]

Processing On Contig Level
makeCytoscapeNetwork.sh -g genome.fasta -t transcript.fasta -a file.agp -c [-klinoqsw]
makeCytoscapeNetwork.sh -p blat.alignment.psl -a file.agp -c [-klinoqsw]

Use makeCytoscapeNetwork.sh -h for option description

Options
-a file specifying scaffolding way in agp format
-c flag to build network on contig level (required if contig.fasta is submitted)
-g fasta file with scaffolded genome contigs (requires -c flag) or
fasta file with genomic scaffolds (can include unscaffolded contigs)
-h print help
-k flag to keep blat alignment files
default: do not keep
-l match length (bp, minimum transcript match length for blat output filterting)
default=200
-n network.name (string to be used for out files)
default=out
-o overlap (bp, maximum overlap of split transcript alignments)
default=50
-p transcript to genome alignment file in psl format (can be submitted instead fasta files)
-q flag to keep folder with temporary files
-i intron size threshold (bp)
default=20000
-s maximum allowed size difference between a gap and a sequence to fill in the gap (bp, required only for scaffold level)
default=4000
-t fasta file with transcript sequences
-w mismatch (floating, percentage of allowed mismatch bases)
default=0.05

Example data

The example/input directory contains some sequence (FASTA) and scaffolding (AGP) data required to generate the equivalent of Figure 1 from Riba-Grognuz et al. These input files are subset from the fire ant genome project . These files can be processed as follows:

  • On scaffold level:
    ./makeCytoscapeNetwork.sh -g example/input/ExampleScaffold.fasta -t example/input/ExampleTranscript.fasta -a example/input/Example.agp -n Example
    or
    ./makeCytoscapeNetwork.sh -p example/input/ExampleScaffold.psl -a example/input/Example.agp -n Example
  • On contig level:
    ./makeCytoscapeNetwork.sh -c -g example/input/ExampleContig.fasta -t example/input/ExampleTranscript.fasta -a example/input/Example.agp -n Example
    or
    ./makeCytoscapeNetwork.sh -c -p example/input/ExampleContig.psl -a example/input/Example.agp -n Example

If all goes well, this creates files named Example.scaffold.attr, Example.scaffold.nw, Example.contig.attr, Example.contig.nw for scaffold and contig levels respectfully. They should be identical to the Cytoscape visualization output found in example/output/.

Example of Cytoscape session

  • Importing network
    • From cytoscape menu select File, then Import and choose Network from Table option in drop-down menu
    • Select file example/output/Example.scaffold.nw or example/output/Example.contig.nw
    • Specify columns 1,2,3 as target,interaction,source respectfully, select all other columns
    • Tick a check-box Show Text File Import Options
    • Tick a check-box Transfer first line as attribute names
    • Import
  • Importing node attributes
    • From cytoscape menu select File, then Import and choose Attribute from Table option in drop-down menu
    • Select file example/output/Example.scaffold.attr or example/output/Example.contig.attr
    • Tick a check-box Show Text File Import Options
    • Tick a check-box Transfer first line as attribute names
    • Import
  • Importing visual style
    • From cytoscape menu select File, then Import and choose Vizmap Property File
    • Import a pre-built file styles/TGNet.props
    • From VizMapper menu select a style called TGNet
  • Network Layout
    • From cytoscape menu select Layout, then Cytoscape Layouts, then choose Force-Directed Layout, unweighted
  • Saving Cytoscape Session
    • From cytoscape menu select File, then Save

If all works fine you should get the following session (files below can by opened with Cytoscape)
example/cytoscape/Example.scaffold.cys
example/cytoscape/Example.contig.cys

Network interpretation

Network can be interpreted based on topology and visual properties. In many cases topology highlights potential regions for improvements and potential problematic regions. In contig example one of the contigs has three alternative paths: 2 suggested by scaffolding and an extra one suggested by transcript alignment. This is an illustration of potentially problematic case. In scaffold example the same problematic case is not visible from topology, and therefore the inconsistency between scaffolding and transcript mapping information is shown with zigzag connection. In contig example double (full-line and vertical slash) connections between nodes indicate consistent scaffolding. In scaffold example small nodes linked to bigger ones with triangular arrohead lines indicate potential improvement cases.

  • Contig-level Visual Style Description
    • nodes represent scaffolded contigs,
    • node sizes are proportional to contig lengths,
    • line widths are proportional to transcript lengths,
    • vertical dashes indicate contig adjacency according to scaffolding information,
    • full lines connect nodes with common aligned transcripts,
    • zigzag lines indicate inconsistency between scaffolding and transcript mapping information,
    • triangle arrowheads show contig order within a scaffold,
    • line colors indicate mapping strands: magenta – plus/plus, cyan – minus/minus, violet – opposite strands.
  • Scaffold-level Visual Style Description
    • nodes represent scaffolds (white with gray border) and un-scaffolded contigs (gray),
    • node sizes are proportional to contig and scaffold lengths,
    • line widths are proportional to transcript lengths,
    • full lines connect nodes with common aligned transcripts,
    • zigzag lines indicate inconsistency between scaffolding and transcript mapping information,
    • arrowheads indicate merge (circle) and join cases (triangle, show order of adjacent scaffolds),
    • line colors indicate mapping strands: magenta – plus/plus, cyan – minus/minus, violet – opposite strands.

Frequently asked questions (FAQ)

  • How can I get an AGP file?
    Some assemblers generate AGP files, for example, newbler generates a file called 454Scaffolds.txt, which is an AGP-format file with scaffolding information. Other assemblers, like SOAPdenovo, do not produce an AGP file. In this case it is possible to create one based on scaffold sequences: the contiguous sequences separated by gaps should then be defined as contigs and the respective contig and gap coordinates in each scaffold should be specified according to AGP-format.
  • I do not have an AGP file. Can I assess the quality of my assembly?
    In the absence of AGP file unscaffolded contigs and scaffolds can be processed on scaffold level. All merge cases will be defined as inconsistent in the absence of scaffolding information. Processing on contig-level requires contig fasta and AGP file, both can be generated if scaffold fasta is available.
  • My node sizes are not scaled according to respective sequence lengths
    Check if the maximum and minimum values for node size scaling cover your data (in Vizual properties menu click on node size, then on graph, then on button “Min/Max” and adjust values). If this does not help, check whether the lengths of genomic sequences were improted as integer values.
  • Why visual style is not applied?
    Check whether you have selected a style called TGNet from VizMapper menu. Make sure that during import of network and attribute files the first row was transferred as column headers (check-box Transfer first line as attribute names).