
Files & formats

Eloi Durant edited this page Mar 9, 2022 · 13 revisions

DISCLAIMER: This page details the base formats used by Panache. If you wish to preformat your files into JSON to set up your own version of Panache with preloaded files, check the installation guidelines.

This page lists the required and optional files that Panache takes into account, along with information on how to format them.

Pangenome File - .tsv, .pav ...

Syntax example

Panache's main input file is a TSV (tab-separated values) file that combines features of the BED format and of a presence/absence matrix. It lists pangenomic blocks positioned on a linear coordinate system (from one of the genomes, a pan-reference or a flattened graph). Below is an overview of accepted syntaxes:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function	Geno1	gen_2	genomeThree
1	45	56	.	.	0	1	1	0
1	56	78	.	.	7	0	0	23
1	210	230	ATTCNNatTWCCAGgaGATT	.	4	16	0	17
chr2	30	43	CAGWggTGACNNT	chr2:30:43;C_Four:120:133	3	12	1	1
chr2	74	96	tTTAGAaANNNAATAAGgACTAC	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	0	25	6
Chromosome3	780	789	TATacGTGN	.	0	1	1	1
C_Four	120	133	CAGWggTGACNNT	chr2:30:43;C_Four:120:133	3	0	0	0
C_Four	133	155	tTTAGAaANNNTTTAAGgACTAC	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	aGeneID	0	anotherGeneID
chrom-5	0	23	tTTAGAaANNTTTAAGgACTACAA	chr2:74:96;C_Four:133:155;chrom-5:0:23	5	1	1	0
chrom-5	345	351	ATTACA	.	7	0	1	1

For better results, please check that your file is sorted by the 'Chromosome' column, then by the 'FeatureStart' column, in ascending order. Blocks do not have to be consecutive. Overlapping blocks are allowed, but they may not be easily visible in the final representation. For more details about each column, see below.
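The ordering advice above can be checked before loading a file. The following is a minimal Python sketch, not part of Panache: `read_blocks` and `order_problems` are hypothetical helper names, and the check assumes that "sorted" means each chromosome's rows are grouped together with non-decreasing FeatureStart values, without imposing any particular order between chromosomes.

```python
import csv
import io


def read_blocks(handle):
    """Read a tab-separated pangenome file; return (header, data_rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows


def order_problems(rows):
    """Return 0-based indices of data rows that break the assumed ordering."""
    problems = []
    finished = set()   # chromosomes whose group of rows has already ended
    current = None     # chromosome of the previous row
    last_start = None  # FeatureStart of the previous row
    for index, row in enumerate(rows):
        chrom, start = row[0], int(row[1])
        if chrom != current:
            if chrom in finished:
                problems.append(index)  # chromosome reappears after another one
            if current is not None:
                finished.add(current)
            current = chrom
        elif start < last_start:
            problems.append(index)      # FeatureStart goes backwards
        last_start = start
    return problems


# Illustrative data only: the second data row starts before the first one.
sample = io.StringIO(
    "#Chromosome\tFeatureStart\tFeatureStop\tSequence_IUPAC_Plus\tSimilarBlocks\tFunction\tGeno1\n"
    "1\t56\t78\t.\t.\t7\t0\n"
    "1\t45\t56\t.\t.\t0\t1\n"
    "chr2\t30\t43\t.\t.\t3\t12\n"
)
header, rows = read_blocks(sample)
print(order_problems(rows))  # [1]
```

An empty list means the file already follows the assumed ordering.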

Header


The first line of the file, and the only one starting with a # character. The header row is very strict and must always start with these exact, case-sensitive column names:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function

Those six columns are mandatory, even when no information is available for them; in that case a single . character will do. Do not forget the # as the first character!

Added to these columns are the genome names used for comparison. There can be as many as you want them to be, as long as they are placed after the mandatory columns. For instance, a header row with six genomes could be written as:

#Chromosome	FeatureStart	FeatureStop	Sequence_IUPAC_Plus	SimilarBlocks	Function	Geno1-Kenobi	Geno2	genome_3	g4	genFive	Basix

Avoid column names that start with a digit or contain unusual characters (., é, ?, / and so on).
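The header rules above can be verified with a short script. This is a sketch under assumptions rather than Panache's own validation: `check_header` and `risky_names` are hypothetical names, and "unusual characters" is interpreted here as anything other than ASCII letters, digits, underscores and hyphens.

```python
import re

MANDATORY = [
    "#Chromosome", "FeatureStart", "FeatureStop",
    "Sequence_IUPAC_Plus", "SimilarBlocks", "Function",
]

# Assumption: a "safe" genome name starts with a letter or underscore and
# only contains ASCII letters, digits, underscores and hyphens.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def check_header(line):
    """Return the genome column names, or raise ValueError on a bad header."""
    columns = line.rstrip("\n").split("\t")
    if columns[:6] != MANDATORY:
        raise ValueError("header must start with: " + "\t".join(MANDATORY))
    return columns[6:]


def risky_names(genomes):
    """Return genome names that start with a digit or use unusual characters."""
    return [name for name in genomes if not SAFE_NAME.fullmatch(name)]


header = "\t".join(MANDATORY + ["Geno1-Kenobi", "Geno2", "genome_3", "4g", "gén.5"])
genomes = check_header(header)
print(risky_names(genomes))  # ['4g', 'gén.5']
```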

#Chromosome


MANDATORY VALUE - As in a BED file, a string with the name of the chromosome where the feature/pangenomic block was found. It is recommended to put all unmapped blocks within a single ChrUnknown instead of keeping them individually. Examples of possible syntaxes:

#Chromosome
1
2
chr_One
Chromosome2
chr42

Please note that different spellings are treated as different chromosomes: Chr1 and Chr01 will not be merged into a single chromosome within Panache.

FeatureStart


MANDATORY VALUE - Number giving the starting position of the feature on the chosen linear coordinate system. Origin is at 0.

FeatureStop


MANDATORY VALUE - Same as FeatureStart, but for the end position. It is the first position that no longer belongs to the feature, meaning that the FeatureStop of one block can have the same value as the FeatureStart of another block. Examples:

FeatureStart FeatureStop
182 1030
1030 2250
80001 80503
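Because positions start at 0 and FeatureStop is the first position outside the block, coordinates behave as half-open intervals. The sketch below illustrates the arithmetic this implies; the function names are hypothetical and chosen only for this example.

```python
def block_length(start, stop):
    """Number of positions covered by a block [start, stop)."""
    return stop - start


def are_adjacent(first, second):
    """True when `second` starts exactly where `first` stops."""
    return first[1] == second[0]


def overlap(first, second):
    """True when two (start, stop) blocks share at least one position."""
    return first[0] < second[1] and second[0] < first[1]


a, b = (182, 1030), (1030, 2250)
print(block_length(*a))    # 848
print(are_adjacent(a, b))  # True
print(overlap(a, b))       # False: position 1030 belongs to b only
```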

Sequence_IUPAC_Plus


. by default. Not used yet; it was planned to store a block's sequence. Can be any string. Examples:

Sequence_IUPAC_Plus
.
GATTAcA
NNAGcgTTATT
ATGCCnAAAWGc

SimilarBlocks


. by default. Can store similarity information (duplications...) by listing related blocks found elsewhere in the pangenome. The IDs and coordinates of all related blocks should be written, including the current one. These IDs follow this pattern: chromosomeNameA:startPositionA:endPositionA;chromosomeNameB:startPositionB:endPositionB. Fields of a single block are separated with a : while related blocks are separated with a ;. Examples:

#Chromosome FeatureStart FeatureStop ... SimilarBlocks
chr1 156 283 ... chr1:156:283;chr3:82:209 <--This sequence appears two times, once in chr1 and once in chr3
chr1 542 620 ... . <--A feature with no similar sequence
...
chr3 82 209 ... chr1:156:283;chr3:82:209 <--This is the feature similar to the one in chr1:156:283
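The SimilarBlocks pattern can be split back into its components with a few lines of Python. This is an illustrative sketch, not Panache's parser; `parse_similar_blocks` is a hypothetical name.

```python
def parse_similar_blocks(field):
    """Turn a SimilarBlocks value into a list of (chromosome, start, stop)."""
    if field == ".":
        return []  # default value: no related block recorded
    blocks = []
    for item in field.split(";"):
        # Split from the right so that only the last two ':' are used.
        chrom, start, stop = item.rsplit(":", 2)
        blocks.append((chrom, int(start), int(stop)))
    return blocks


related = parse_similar_blocks("chr2:74:96;C_Four:133:155;chrom-5:0:23")
print(related)
# [('chr2', 74, 96), ('C_Four', 133, 155), ('chrom-5', 0, 23)]

# The current block is expected to be part of its own list.
print(("chr2", 74, 96) in related)  # True
```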

Function


. by default. Numbers can be used to pair blocks with similar functions, which are then colored accordingly in the presence/absence matrix. Standard annotation formats are not accepted yet (GO terms, for instance, cannot be used): only integers are. When all blocks have the same value (all . or all the same number), a rainbow color scale based on position is applied instead. Avoid mixing . and numbers; prefer numbers only. Example:

Function
0
0
1 <-- These two have the same 'function'
1 <-- These two have the same 'function'
0
0
2 <-- This is marked with yet another 'function'
0
0
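The recommendations for the Function column can be turned into a quick check. The sketch below is only an illustration of the rules stated above: `describe_function_column` and its return labels are hypothetical, and "integer" is assumed to mean a non-negative whole number written with digits only.

```python
def describe_function_column(values):
    """Classify the values of a Function column (given as strings)."""
    distinct = set(values)
    if any(v != "." and not v.isdigit() for v in distinct):
        return "non-integer"  # e.g. GO terms, which are not accepted
    if len(distinct) <= 1:
        return "uniform"      # position-based rainbow scale applies
    if "." in distinct:
        return "mixed"        # '.' and numbers together: discouraged
    return "ok"


print(describe_function_column(["0", "0", "1", "1", "0", "0", "2", "0", "0"]))  # ok
print(describe_function_column([".", ".", "."]))                               # uniform
print(describe_function_column([".", "1", "2"]))                               # mixed
```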

Genome columns


These columns give the presence/absence status of a given feature for every genome. An absence is always encoded as a 0 in the matrix. Presence can be encoded as 1 or any positive integer, or even a gene name. Proportions (float numbers between 0 and 1) are not recommended, as this field is directly linked to the opacity of shown blocks. Examples:

Geno1 Geno2 Geno3
0 1 1
1 1 1
0 0 0
0 0 1

is displayed the same way as

Geno1 Geno2 Geno3
0 3 7
anAwesomeGene 1 aGeneToo
0 0 0
0 0 UK

CAUTION: In the current configuration, blocks marked as UK, NA or any other string are counted as present.
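The encoding rule above (0 means absent, anything else means present) can be expressed as a small conversion. This is a sketch of the rule as described on this page, not code taken from Panache; `is_present` and `to_presence` are hypothetical names.

```python
def is_present(cell):
    """A genome cell counts as present unless it is exactly '0'."""
    return cell.strip() != "0"


def to_presence(matrix):
    """Convert rows of raw genome cells into rows of booleans."""
    return [[is_present(cell) for cell in row] for row in matrix]


binary = [["0", "1", "1"], ["1", "1", "1"], ["0", "0", "0"], ["0", "0", "1"]]
mixed = [["0", "3", "7"], ["anAwesomeGene", "1", "aGeneToo"], ["0", "0", "0"], ["0", "0", "UK"]]

# Both example matrices collapse to the same presence/absence pattern.
print(to_presence(binary) == to_presence(mixed))  # True
```

Note that under this rule a placeholder such as NA is indistinguishable from a true presence, so missing data should be written as 0 if it must not be shown.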

Functional annotation file - .gff

Once a presence/absence file is loaded, one may add a companion file of annotations in GFF3 format, provided that both files are based on the same linear coordinate system.

DISCLAIMER: In GFF files, the last column (attributes) can use different keywords. Panache uses:

  • Name for extracting the gene names, and not ID
  • Note for the functional annotation

Panache only parses features labelled as ‘gene’ in the GFF3 file. The start and stop positions of genes and exons are kept to build “Annotation Cards” (tooltips grouping strand, exon and function information, if any), which can be queried on the annotation track.
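To preview which information would be available for a gene, the Name and Note attributes can be extracted from a GFF3 line as sketched below. This is an assumption-laden illustration and not Panache's parser: `parse_gene_line` is a hypothetical name, exons are ignored, and coordinates are returned exactly as written in the file (GFF3 positions are 1-based and inclusive).

```python
from urllib.parse import unquote


def parse_gene_line(line):
    """Return gene information from one GFF3 line, or None for other lines."""
    if line.startswith("#") or not line.strip():
        return None  # comment, directive or blank line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None  # malformed line, or a feature other than 'gene'
    attributes = {}
    for pair in fields[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = unquote(value)  # GFF3 values are URL-encoded
    return {
        "seqid": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "name": attributes.get("Name"),  # gene name comes from Name, not ID
        "note": attributes.get("Note"),  # functional annotation
    }


# Illustrative line with made-up values.
example = "chr1\tsource\tgene\t100\t500\t.\t+\t.\tID=g1;Name=abc1;Note=kinase%20like"
print(parse_gene_line(example))
```

A gene with no Name attribute yields `None` for `name`, which is a quick way to spot files that only provide ID.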

Pangenome Graph

If your dataset is a graph in GFA format, the graph can be linearized into a compatible format with BioGraph.jl.

Newick tree - .nwk, .tree ...

Phylogenetic information can be added to Panache by uploading a Newick file with the .nwk extension (the .tree and .txt extensions have also been accepted since v1.0.0). See more about this format on the dedicated Wikipedia page. The genome names used in this file must be exactly the same as those in the main file.
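Since genome names must match exactly, it can help to compare the leaf names of the tree with the genome columns of the main file. The sketch below uses a simple regular expression rather than a full Newick parser: it assumes unquoted leaf names without spaces, parentheses, commas, colons or semicolons, and `newick_leaf_names` and `name_mismatches` are hypothetical names.

```python
import re

# A leaf label directly follows '(' or ',' and runs until ':', ',', ')' or ';'.
LEAF = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)")


def newick_leaf_names(newick):
    """Extract leaf names from a Newick string (unquoted labels only)."""
    return [name.strip() for name in LEAF.findall(newick)]


def name_mismatches(newick, genomes):
    """Return (names only in the tree, names only in the main file)."""
    leaves = set(newick_leaf_names(newick))
    columns = set(genomes)
    return sorted(leaves - columns), sorted(columns - leaves)


tree = "((Geno1:0.1,gen_2:0.2):0.3,GenomeThree:0.4);"
print(name_mismatches(tree, ["Geno1", "gen_2", "genomeThree"]))
# (['GenomeThree'], ['genomeThree'])
```

Two empty lists mean every genome column has a matching leaf; here the difference in capitalization is enough to produce a mismatch.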