Files & formats

DISCLAIMER: This page details the base formats used by Panache. If you wish to preformat your files into JSON to set up your own version of Panache with preloaded files, check the installation guidelines.

This page lists the required and optional files that Panache takes into account, and explains how to format them.

Pangenome File - .tsv, .pav ...
Functional annotation file - .gff
Newick tree - .nwk, .tree ...

Pangenome File - .tsv, .pav ...

Syntax example

Panache's main input file is a TSV (Tab Separated Values) file that combines features of the BED format and of a presence/absence matrix. It lists pangenomic blocks positioned on a linear coordinate system (taken from one of the genomes, a pan-reference or a flattened graph). Below is an overview of the accepted syntaxes:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function	Geno1	gen_2	genomeThree
1	45	56	.	.	0	1	1	0
1	56	78	.	.	7	0	0	23
1	210	230	ATTCNNatTWCCAGgaGATT	.	4	16	0	17
chr2	30	43	CAGWggTGACNNT	chr2:30:43;C_Four:120:133	3	12	1	1
chr2	74	96	tTTAGAaANNNAATAAGgACTAC	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	0	25	6
Chromosome3	780	789	TATacGTGN	.	0	1	1	1
C_Four	120	133	CAGWggTGACNNT	chr2:30:43;C_Four:120:133	3	0	0	0
C_Four	133	155	tTTAGAaANNNTTTAAGgACTAC	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	aGeneID	0	anotherGeneID
chrom-5	0	23	tTTAGAaANNTTTAAGgACTACAA	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	1	1	0
chrom-5	345	351	ATTACA	.	7	0	1	1

For better results, please check that your file is sorted by the 'Chromosome' column, then by the 'FeatureStart' column, in ascending order. Blocks do not need to be consecutive. Overlapping blocks are allowed, but they may not be easy to see in the final representation. Each column is detailed below.
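The sorting advice above can be sketched in Python. This is a minimal illustration, not part of Panache: the function name `sort_blocks` and the sample rows are hypothetical, and chromosome names are assumed to be compared as plain strings (Panache's own ordering rule is not stated here).

```python
import csv
import io

def sort_blocks(tsv_text):
    """Return (header, rows) with rows sorted by chromosome name, then FeatureStart."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = [row for row in reader if row]
    # Assumption: chromosome names sort as plain strings, FeatureStart as an integer.
    rows.sort(key=lambda row: (row[0], int(row[1])))
    return header, rows

# Hypothetical, deliberately unsorted sample with a single genome column.
example = (
    "#Chromosome\tFeatureStart\tFeatureStop\tSequence_IUPAC_Plus\tSimilarBlocks\tFunction\tGeno1\n"
    "chr2\t74\t96\t.\t.\t5\t0\n"
    "1\t210\t230\t.\t.\t4\t16\n"
    "chr2\t30\t43\t.\t.\t3\t12\n"
    "1\t45\t56\t.\t.\t0\t1\n"
)

header, rows = sort_blocks(example)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Sorting FeatureStart as an integer matters: a plain string sort would place 210 before 45.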

Header

The header is the first line of the file, and the only one starting with a # character. It is very specific and must always start with these exact column names, which are case-sensitive:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function

Those six columns are mandatory, even when no information is available for them. In that case a single . character can be used as the value. Do not forget the # as the first character of the header!

The genome names used for comparison are added after these columns. There can be as many as needed, as long as they are placed after the mandatory columns. For instance, a header row with six genomes could be written as:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function	Geno1-Kenobi	Geno2	genome_3	g4	genFive	Basix

Column names should not start with a digit, and should not contain unusual characters (., é, ?, / and so on).
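The header rules above can be checked with a short Python sketch. It is an illustration only, not Panache code: `check_header` is a hypothetical helper, and the pattern for "safe" genome names (a leading letter followed by ASCII letters, digits, `_` or `-`) is an assumption derived from the examples on this page.

```python
import re

MANDATORY = ["#Chromosome", "FeatureStart", "FeatureStop",
             "Sequence_IUPAC_Plus", "SimilarBlocks", "Function"]

# Assumption: a safe genome name starts with a letter and only contains
# ASCII letters, digits, "_" or "-".
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")

def check_header(line):
    """Return a list of problems found in a header line (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    problems = []
    # The comparison is case-sensitive, as required for the mandatory names.
    if columns[:6] != MANDATORY:
        problems.append("the first six columns must be exactly: " + ", ".join(MANDATORY))
    for name in columns[6:]:
        if not SAFE_NAME.match(name):
            problems.append(f"genome name to avoid: {name!r}")
    return problems

good = "\t".join(MANDATORY + ["Geno1-Kenobi", "Geno2", "genome_3"])
bad = "\t".join(MANDATORY + ["3rdGenome", "gen.4"])
print(check_header(good))
print(check_header(bad))
```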

#Chromosome

MANDATORY VALUE - As in a BED file, a string with the name of the chromosome where the feature/pangenomic block was found. It is recommended to group all unmapped blocks within a single ChrUnknown instead of keeping them individually. Examples of possible syntaxes:

#Chromosome
1
2
chr_One
Chromosome2
chr42

Please note that different spellings are treated as different chromosomes: Chr1 and Chr01 will not be merged into a single chromosome within Panache.

FeatureStart

MANDATORY VALUE - Number giving the starting position of the feature on the chosen linear coordinate system. Origin is at 0.

FeatureStop

MANDATORY VALUE - Same as FeatureStart, but for the end position. It is the first position that no longer belongs to the feature, meaning that the FeatureStop of one block can have the same value as the FeatureStart of another block. Examples:

FeatureStart	FeatureStop
182	1030
1030	2250
80001	80503
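The start-inclusive, stop-exclusive convention described above can be made concrete with a small Python sketch. The helper names are hypothetical; the coordinates are the ones from the example.

```python
def block_length(start, stop):
    """Length of a block whose stop is the first position outside the feature."""
    return stop - start

def contains(start, stop, position):
    """True if position belongs to the block [start, stop)."""
    return start <= position < stop

def are_adjacent(first, second):
    """True if the second block starts exactly where the first one stops."""
    return first[1] == second[0]

blocks = [(182, 1030), (1030, 2250), (80001, 80503)]
for start, stop in blocks:
    print(start, stop, block_length(start, stop))

# Position 1030 belongs to the second block only, so the two blocks
# touch without overlapping.
print(contains(182, 1030, 1030), contains(1030, 2250, 1030))
print(are_adjacent(blocks[0], blocks[1]))
```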

Sequence_IUPAC_Plus

. by default. This column is not used yet; it was planned to store a block's sequence. It can be any string. Examples:

Sequence_IUPAC_Plus
.
GATTAcA
NNAGcgTTATT
ATGCCnAAAWGc

SimilarBlocks

. by default. This column can store similarity information (duplications...) by listing related blocks located elsewhere in the pangenome. The IDs and coordinates of all related blocks should be written, including the current one. These IDs follow this pattern: chromosomeNameA:startPositionA:endPositionA;chromosomeNameB:startPositionB:endPositionB. Data from the same block are separated with a : while related blocks are separated with a ;. Examples:

#Chromosome	FeatureStart	FeatureStop	...	SimilarBlocks
chr1	156	283	...	chr1:156:283;chr3:82:209	<--This sequence appears two times, once in chr1 and once in chr3
chr1	542	620	...	.	<--A feature with no similar sequence
...
chr3	82	209	...	chr1:156:283;chr3:82:209	<--This is the feature similar to the one in chr1:156:283
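A SimilarBlocks value following the pattern above can be split with a few lines of Python. This is a sketch under assumptions, not Panache's parser: `parse_similar_blocks` and `lists_itself` are hypothetical helpers, and splitting from the right is a choice made here so that a `:` inside a chromosome name would not break the parsing.

```python
def parse_similar_blocks(field):
    """Turn 'chrA:start:stop;chrB:start:stop' into a list of (chrom, start, stop)."""
    if field == ".":
        return []  # "." means that no similar block is recorded
    blocks = []
    for item in field.split(";"):
        # Split from the right: the last two parts are the coordinates.
        chrom, start, stop = item.rsplit(":", 2)
        blocks.append((chrom, int(start), int(stop)))
    return blocks

def lists_itself(chrom, start, stop, field):
    """Check that the current block is included in its own SimilarBlocks list."""
    return (chrom, start, stop) in parse_similar_blocks(field)

field = "chr1:156:283;chr3:82:209"
print(parse_similar_blocks(field))
print(lists_itself("chr3", 82, 209, field))
```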

Function

. by default. Numbers can be used to pair blocks with a similar function, coloring them accordingly in the presence/absence matrix. This column does not accept standard formats yet (for instance, GO terms cannot be used), only integers. When all blocks have the same value (all . or all the same number), a rainbow color scale based on position is applied instead. It is advised not to mix . and numbers; prefer using numbers only. Example:

Function
0
0
1	<-- These two have the same 'function'
1	<-- These two have the same 'function'
0
0
2	<-- This is marked with yet another 'function'
0
0

Genome columns

These columns give the presence/absence status of a given feature for every genome. An absence is always encoded as a 0 in the matrix. Presence can be encoded as 1 or any positive integer, or even a gene name. Proportions (float numbers between 0 and 1) are not recommended, as this field is directly linked to the opacity of shown blocks. Examples:

Geno1	Geno2	Geno3
0	1	1
1	1	1
0	0	0
0	0	1

will be displayed the same way as

Geno1	Geno2	Geno3
0	3	7
anAwesomeGene	1	aGeneToo
0	0	0
0	0	UK

CAUTION: In the current configuration, blocks marked as UK, NA or any other string are counted as a presence.
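The presence/absence rule can be sketched in Python as follows. The rule coded here (only the exact value 0 means absence) is an assumption drawn from the description above, not Panache's actual implementation, and the helper names are hypothetical.

```python
def is_present(cell):
    """Assumed rule: only the exact string "0" is an absence; anything else is a presence."""
    return cell.strip() != "0"

def to_binary(rows):
    """Convert a matrix of raw genome-column values to 0/1."""
    return [[1 if is_present(cell) else 0 for cell in row] for row in rows]

first = [["0", "1", "1"],
         ["1", "1", "1"],
         ["0", "0", "0"],
         ["0", "0", "1"]]

second = [["0", "3", "7"],
          ["anAwesomeGene", "1", "aGeneToo"],
          ["0", "0", "0"],
          ["0", "0", "UK"]]

# Both example matrices reduce to the same presence/absence pattern.
print(to_binary(first) == to_binary(second))  # prints True
```

Under this rule a placeholder such as NA becomes a presence, so unknown statuses are worth recoding before loading the file.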

Functional annotation file - .gff

Once a presence/absence file is loaded, a companion annotation file in GFF3 format can be added, provided that both files are based on the same linear coordinate system.

DISCLAIMER: In GFF files, the last column (attributes) can use different keywords. Panache uses:

Name (and not ID) for extracting the gene names
Note for the functional annotation

Panache only considers features labelled as ‘gene’ in the GFF3 file for parsing. The start and stop positions of the genes and exons are kept to build “Annotation Cards” (tooltips grouping strand, exon and function information, if any), which can be queried on the annotation track.
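To see which entries of a GFF3 file carry the expected attributes, a rough Python sketch such as the one below can help. It is not Panache's parser: `parse_gene_lines` is a hypothetical helper, the sample lines are made up, and exon handling is left out. It relies only on the standard nine-column GFF3 layout, with `key=value` attributes separated by `;`.

```python
from urllib.parse import unquote

def parse_gene_lines(gff_text):
    """Collect 'gene' features with their Name and Note attributes."""
    genes = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only 'gene' features are kept in this sketch
        attributes = {}
        for pair in fields[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 attribute values may be percent-encoded.
                attributes[key.strip()] = unquote(value)
        genes.append({
            "seqid": fields[0],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "strand": fields[6],
            "name": attributes.get("Name"),  # None if the Name keyword is missing
            "note": attributes.get("Note"),
        })
    return genes

# Made-up sample lines for illustration only.
sample = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene0001;Name=geneA;Note=putative%20kinase\n"
    "chr1\texample\tmRNA\t100\t500\t.\t+\t.\tID=mrna0001;Parent=gene0001\n"
    "chr1\texample\tgene\t900\t1200\t.\t-\t.\tID=gene0002\n"
)

for gene in parse_gene_lines(sample):
    print(gene)
```

Genes whose `name` comes back as `None` only have an ID, which is the case the disclaimer above warns about.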

Pangenome Graph

If your dataset is a graph in GFA format, the graph can be linearized into a compatible format with BioGraph.jl.

Newick tree - .nwk, .tree ...

Phylogeny information can be added to Panache by uploading a Newick file with the .nwk extension (the .tree and .txt extensions have also been accepted since v1.0.0). See more about this format on the dedicated Wikipedia page. The genome names used in this file must be exactly the same as those in the main file.
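Because the names must match exactly, it can be worth comparing the tree's leaf names with the genome columns before uploading. The Python sketch below is a simplified illustration with hypothetical helper names: it assumes unquoted leaf names and a tree with at least two leaves, and it is not a full Newick parser.

```python
import re

def newick_leaf_names(newick):
    """Extract leaf names: unquoted labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def compare_names(newick, genome_columns):
    """Return (genomes missing from the tree, tree leaves missing from the file)."""
    leaves = set(newick_leaf_names(newick))
    genomes = set(genome_columns)
    return sorted(genomes - leaves), sorted(leaves - genomes)

# Made-up tree using the genome names of the syntax example.
tree = "((Geno1:0.1,gen_2:0.2):0.3,genomeThree:0.4);"
print(newick_leaf_names(tree))
print(compare_names(tree, ["Geno1", "gen_2", "genomeThree"]))
print(compare_names(tree, ["Geno1", "gen_2", "genome3"]))
```

Two empty lists mean that every genome column has a matching leaf and vice versa.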
