This software is fully described in the paper:
Riba-Grognuz, Keller, Falquet, Xenarios & Wurm (2011) Visualization and quality assessment of de novo genome assemblies.
In brief, our scripts create Cytoscape files to visualize transcript evidence that suggests adjacency between scaffolds and contigs.
- BLAT (tested with Standalone BLAT v. 32×1). Source Binaries .
- Cytoscape (tested with versions 2.7.0, 2.8.2)
- a UNIX machine (tested on Mac OS X 10.6 and CentOS 4.6)
Input : assembly fasta (1), transcript fasta (2), AGP scaffolding file (3), blat alignment (instead fasta) (4)
- requires scaffolded contig fasta to proceed on contig level (
-c
flag)
requires scaffold (and unscaffolded contig) fasta to proceed on scaffold level - fasta with transcript sequences obtained independently from genomic assembly
- requires scaffolding information in AGP file
- blat alignment in psl format instead of fasta (transcripts to genome assembly)
Output : tab delimited files with nodes, edges and properties and summary of blat alignments
- On contig level (
-c
flag)
out.contig.nw – defines nodes, edges and edge attributes
out.contig.attr – defines node attributes
out.contig.blat.stat – statistics of blat alignments - On scaffold level (default)
out.scaffold.nw – defines nodes, edges and edge attributes
out.scaffold.attr – defines node attributes
out.withinOneScaffold – self loop nodes
out.scaffold.blat.stat – statistics of blat alignments
Processing On Scaffold Level
makeCytoscapeNetwork.sh -g genome.fasta -t transcript.fasta -a file.agp [-klinoqsw]
makeCytoscapeNetwork.sh -p blat.alignment.psl -a file.agp [-klinoqsw]
Processing On Contig Level
makeCytoscapeNetwork.sh -g genome.fasta -t transcript.fasta -a file.agp -c [-klinoqsw]
makeCytoscapeNetwork.sh -p blat.alignment.psl -a file.agp -c [-klinoqsw]
Use makeCytoscapeNetwork.sh -h
for option description
Options
-a
file specifying scaffolding way in agp format
-c
flag to build network on contig level (required if contig.fasta is submitted)
-g
fasta file with scaffolded genome contigs (requires -c flag) or
fasta file with genomic scaffolds (can include unscaffolded contigs)
-h
print help
-k
flag to keep blat alignment files
default: do not keep
-l
match length (bp, minimum transcript match length for blat output filterting)
default=200
-n
network.name (string to be used for out files)
default=out
-o
overlap (bp, maximum overlap of split transcript alignments)
default=50
-p
transcript to genome alignment file in psl format (can be submitted instead fasta files)
-q
flag to keep folder with temporary files
-i
intron size threshold (bp)
default=20000
-s
maximum allowed size difference between a gap and a sequence to fill in the gap (bp, required only for scaffold level)
default=4000
-t
fasta file with transcript sequences
-w
mismatch (floating, percentage of allowed mismatch bases)
default=0.05
The example/input
directory contains some sequence (FASTA) and scaffolding (AGP) data required to generate the equivalent of Figure 1 from Riba-Grognuz et al. These input files are subset from the fire ant genome project . These files can be processed as follows:
- On scaffold level:
./makeCytoscapeNetwork.sh -g example/input/ExampleScaffold.fasta -t example/input/ExampleTranscript.fasta -a example/input/Example.agp -n Example
or
./makeCytoscapeNetwork.sh -p example/input/ExampleScaffold.psl -a example/input/Example.agp -n Example
- On contig level:
./makeCytoscapeNetwork.sh -c -g example/input/ExampleContig.fasta -t example/input/ExampleTranscript.fasta -a example/input/Example.agp -n Example
or
./makeCytoscapeNetwork.sh -c -p example/input/ExampleContig.psl -a example/input/Example.agp -n Example
If all goes well, this creates files named Example.scaffold.attr
, Example.scaffold.nw
, Example.contig.attr
, Example.contig.nw
for scaffold and contig levels respectfully. They should be identical to the Cytoscape visualization output found in example/output/
.
- Importing network
- From cytoscape menu select File, then Import and choose Network from Table option in drop-down menu
- Select file example/output/Example.scaffold.nw or example/output/Example.contig.nw
- Specify columns 1,2,3 as target,interaction,source respectfully, select all other columns
- Tick a check-box Show Text File Import Options
- Tick a check-box Transfer first line as attribute names
- Import
- Importing node attributes
- From cytoscape menu select File, then Import and choose Attribute from Table option in drop-down menu
- Select file example/output/Example.scaffold.attr or example/output/Example.contig.attr
- Tick a check-box Show Text File Import Options
- Tick a check-box Transfer first line as attribute names
- Import
- Importing visual style
- From cytoscape menu select File, then Import and choose Vizmap Property File
- Import a pre-built file
styles/TGNet.props
- From VizMapper menu select a style called TGNet
- Network Layout
- From cytoscape menu select Layout, then Cytoscape Layouts, then choose Force-Directed Layout, unweighted
- Saving Cytoscape Session
- From cytoscape menu select File, then Save
If all works fine you should get the following session (files below can by opened with Cytoscape)
example/cytoscape/Example.scaffold.cys
example/cytoscape/Example.contig.cys
Network can be interpreted based on topology and visual properties. In many cases topology highlights potential regions for improvements and potential problematic regions. In contig example one of the contigs has three alternative paths: 2 suggested by scaffolding and an extra one suggested by transcript alignment. This is an illustration of potentially problematic case. In scaffold example the same problematic case is not visible from topology, and therefore the inconsistency between scaffolding and transcript mapping information is shown with zigzag connection. In contig example double (full-line and vertical slash) connections between nodes indicate consistent scaffolding. In scaffold example small nodes linked to bigger ones with triangular arrohead lines indicate potential improvement cases.
- Contig-level Visual Style Description
- nodes represent scaffolded contigs,
- node sizes are proportional to contig lengths,
- line widths are proportional to transcript lengths,
- vertical dashes indicate contig adjacency according to scaffolding information,
- full lines connect nodes with common aligned transcripts,
- zigzag lines indicate inconsistency between scaffolding and transcript mapping information,
- triangle arrowheads show contig order within a scaffold,
- line colors indicate mapping strands: magenta – plus/plus, cyan – minus/minus, violet – opposite strands.
- Scaffold-level Visual Style Description
- nodes represent scaffolds (white with gray border) and un-scaffolded contigs (gray),
- node sizes are proportional to contig and scaffold lengths,
- line widths are proportional to transcript lengths,
- full lines connect nodes with common aligned transcripts,
- zigzag lines indicate inconsistency between scaffolding and transcript mapping information,
- arrowheads indicate merge (circle) and join cases (triangle, show order of adjacent scaffolds),
- line colors indicate mapping strands: magenta – plus/plus, cyan – minus/minus, violet – opposite strands.
- How can I get an AGP file?
Some assemblers generate AGP files, for example, newbler generates a file called 454Scaffolds.txt, which is an AGP-format file with scaffolding information. Other assemblers, like SOAPdenovo, do not produce an AGP file. In this case it is possible to create one based on scaffold sequences: the contiguous sequences separated by gaps should then be defined as contigs and the respective contig and gap coordinates in each scaffold should be specified according to AGP-format.
- I do not have an AGP file. Can I assess the quality of my assembly?
In the absence of AGP file unscaffolded contigs and scaffolds can be processed on scaffold level. All merge cases will be defined as inconsistent in the absence of scaffolding information. Processing on contig-level requires contig fasta and AGP file, both can be generated if scaffold fasta is available.
- My node sizes are not scaled according to respective sequence lengths
Check if the maximum and minimum values for node size scaling cover your data (in Vizual properties menu click on node size, then on graph, then on button “Min/Max” and adjust values). If this does not help, check whether the lengths of genomic sequences were improted as integer values.
- Why visual style is not applied?
Check whether you have selected a style called TGNet from VizMapper menu. Make sure that during import of network and attribute files the first row was transferred as column headers (check-box Transfer first line as attribute names).