Files & formats
DISCLAIMER: This page details the base formats used by panache. If you wish to preformat your files into JSONs to set your own version of Panache with preloaded files, check the installation guideline.
Here you will find which files are required or optional for Panache, as well as information on how to format them.
Panache's main input file is a TSV (Tab Separated Values) file that combines features of both the BED format and a presence/absence matrix. It lists pangenomic blocks positioned on a linear coordinate system (from one of the genomes, a pan-reference or a flattened graph). Below is an overview of accepted syntaxes:
#Chromosome | FeatureStart | FeatureStop | Sequence_IUPAC_Plus | SimilarBlocks | Function | Geno1 | gen_2 | genomeThree |
---|---|---|---|---|---|---|---|---|
1 | 45 | 56 | . | . | 0 | 1 | 1 | 0 |
1 | 56 | 78 | . | . | 7 | 0 | 0 | 23 |
1 | 210 | 230 | ATTCNNatTWCCAGgaGATT | . | 4 | 16 | 0 | 17 |
chr2 | 30 | 43 | CAGWggTGACNNT | chr2:30:43;C_Four:120:133 | 3 | 12 | 1 | 1 |
chr2 | 74 | 96 | tTTAGAaANNNAATAAGgACTAC | chr2:74:96;C_Four:133:155;chrom-5:0:23 | 5 | 0 | 25 | 6 |
Chromosome3 | 780 | 789 | TATacGTGN | . | 0 | 1 | 1 | 1 |
C_Four | 120 | 133 | CAGWggTGACNNT | chr2:30:43;C_Four:120:133 | 3 | 0 | 0 | 0 |
C_Four | 133 | 155 | tTTAGAaANNNTTTAAGgACTAC | chr2:74:96;C_Four:133:155;chrom-5:0:23 | 5 | aGeneID | 0 | anotherGeneID |
chrom-5 | 0 | 23 | tTTAGAaANNTTTAAGgACTACAA | chr2:74:96;C_Four:133:155;chrom-5:0:23 | 5 | 1 | 1 | 0 |
chrom-5 | 345 | 351 | ATTACA | . | 7 | 0 | 1 | 1 |
For better results, please check that your file is sorted by the 'Chromosome' column, then by the 'FeatureStart' column, in ascending order. Blocks do not have to be consecutive. Overlapping blocks are allowed, although the overlap may not be easy to see in the final representation. For more details about each column, see below.
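The sorting advice above can be applied with a few lines of Python. This is only a sketch: `sort_panache_rows` is a hypothetical helper name, and it assumes that plain string order is acceptable for chromosome names (so `10` would sort before `2`); adapt the key if you need a natural ordering.

```python
def sort_panache_rows(tsv_text: str) -> str:
    """Return the TSV text with data rows sorted by chromosome, then FeatureStart."""
    lines = tsv_text.strip("\n").split("\n")
    header = lines[0]  # the '#Chromosome ...' row stays first
    rows = [line.split("\t") for line in lines[1:] if line]
    # Chromosome names are compared as strings, FeatureStart as an integer.
    rows.sort(key=lambda row: (row[0], int(row[1])))
    return "\n".join([header] + ["\t".join(row) for row in rows]) + "\n"


example = (
    "#Chromosome\tFeatureStart\tFeatureStop\tSequence_IUPAC_Plus"
    "\tSimilarBlocks\tFunction\tGeno1\n"
    "chr2\t74\t96\t.\t.\t5\t0\n"
    "chr2\t30\t43\t.\t.\t3\t12\n"
    "1\t210\t230\t.\t.\t4\t16\n"
    "1\t45\t56\t.\t.\t0\t1\n"
)
sorted_text = sort_panache_rows(example)
```

The header row is left untouched and only the data rows are reordered.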
The header row is the first line of the file, and the only one starting with a `#` character.
It is very specific and must always start with these exact column names (case-sensitive):
#Chromosome FeatureStart FeatureStop Sequence_IUPAC_Plus SimilarBlocks Function
Those six columns are mandatory, even when no information is available for them; in that case, a single `.` character will do. Do not forget the `#` as the first character!
The names of the genomes used for comparison are added after these columns. There can be as many as you want, as long as they are placed after the mandatory columns. For instance, a header row with six genomes could be written as:
#Chromosome FeatureStart FeatureStop Sequence_IUPAC_Plus SimilarBlocks Function Geno1-Kenobi Geno2 genome_3 g4 genFive Basix
A digit as the first character of a column name should be avoided, as well as unusual characters (`.`, `é`, `?`, `/` and so on).
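The header rules above can be checked before loading a file. The sketch below is not part of Panache: `check_header` is a hypothetical helper, and the "safe name" pattern (a letter first, then letters, digits, `_` or `-`) is an assumption derived from the recommendations above, not a rule enforced by the tool.

```python
import re

MANDATORY = ["#Chromosome", "FeatureStart", "FeatureStop",
             "Sequence_IUPAC_Plus", "SimilarBlocks", "Function"]


def check_header(header_line: str) -> list[str]:
    """Return a list of problems found in the header row (empty if none)."""
    columns = header_line.rstrip("\n").split("\t")
    problems = []
    # The six mandatory names must come first, with this exact case.
    if columns[:6] != MANDATORY:
        problems.append("first six columns must be exactly: " + ", ".join(MANDATORY))
    genomes = columns[6:]
    if not genomes:
        problems.append("no genome column found after the mandatory columns")
    for name in genomes:
        # Assumed pattern: no leading digit, no unusual characters.
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_-]*", name):
            problems.append(f"risky genome name: {name!r}")
    return problems


good = "\t".join(MANDATORY + ["Geno1-Kenobi", "Geno2", "genome_3"])
bad = "\t".join(MANDATORY + ["2ndGenome", "gén.3"])
print(check_header(good))
print(check_header(bad))
```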
`MANDATORY VALUE` - As in a BED file, `#Chromosome` is a string with the name of the chromosome where the feature/pangenomic block was found. It is recommended to put all unmapped blocks within a single `ChrUnknown` instead of keeping them individually. Examples of possible syntaxes:
#Chromosome |
---|
1 |
2 |
chr_One |
Chromosome2 |
chr42 |
Please note that different syntaxes are considered to be different chromosomes: `Chr1` and `Chr01` will not be merged into a single chromosome within Panache.
`MANDATORY VALUE` - `FeatureStart` is a number giving the starting position of the feature on the chosen linear coordinate system. The origin is at 0.
`MANDATORY VALUE` - `FeatureStop` is the same as `FeatureStart`, but for the end position. It is the first position that no longer belongs to the feature, meaning that a FeatureStop can have the same value as the FeatureStart of another block. Examples:
FeatureStart | FeatureStop |
---|---|
182 | 1030 |
1030 | 2250 |
80001 | 80503 |
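Because FeatureStop is the first position outside the feature, blocks behave as half-open intervals `[FeatureStart, FeatureStop)`. The short sketch below illustrates the consequences for block length and adjacency; the function names are hypothetical and only serve the illustration.

```python
def block_length(start: int, stop: int) -> int:
    """Length of a block whose stop position is exclusive."""
    return stop - start


def relation(first: tuple[int, int], second: tuple[int, int]) -> str:
    """Relate two (start, stop) blocks of the same chromosome.

    `first` is expected to start no later than `second`.
    """
    if first[1] == second[0]:
        return "adjacent"      # e.g. 182-1030 followed by 1030-2250
    if first[1] > second[0]:
        return "overlapping"
    return "separated"


print(block_length(182, 1030))
print(relation((182, 1030), (1030, 2250)))
```

With this convention, two blocks sharing a boundary value touch each other without sharing any position.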
`Sequence_IUPAC_Plus` is `.` by default. It is not used yet; it was planned to store a block's sequence. It can be any string. Examples:
Sequence_IUPAC_Plus |
---|
. |
GATTAcA |
NNAGcgTTATT |
ATGCCnAAAWGc |
`SimilarBlocks` is `.` by default. It can store information about similarities (duplications...) by listing related blocks from elsewhere in the pangenome. The IDs and coordinates of all related blocks should be written, including the current one. These IDs follow this pattern: `chromosomeNameA:startPositionA:endPositionA;chromosomeNameB:startPositionB:endPositionB`. Data from the same block are separated with a `:` while related blocks are separated with a `;`. Examples:
#Chromosome | FeatureStart | FeatureStop | ... | SimilarBlocks | |
---|---|---|---|---|---|
chr1 | 156 | 283 | ... | chr1:156:283;chr3:82:209 | <--This sequence appears two times, once in chr1 and once in chr3 |
chr1 | 542 | 620 | ... | . | <--A feature with no similar sequence |
... | |||||
chr3 | 82 | 209 | ... | chr1:156:283;chr3:82:209 | <--This is the feature similar to the one in chr1:156:283 |
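The SimilarBlocks pattern described above can be parsed with two splits. This is a minimal sketch with hypothetical helper names; it assumes the start and end positions are always the last two `:`-separated items of each entry.

```python
def parse_similar_blocks(field: str) -> list[tuple[str, int, int]]:
    """Turn 'chrA:start:stop;chrB:start:stop' into a list of tuples."""
    if field == ".":
        return []  # default value: no similar block recorded
    blocks = []
    for entry in field.split(";"):       # ';' separates related blocks
        name, start, stop = entry.rsplit(":", 2)  # ':' separates the data of one block
        blocks.append((name, int(start), int(stop)))
    return blocks


def lists_itself(chromosome: str, start: int, stop: int, field: str) -> bool:
    """Check that a non-empty SimilarBlocks field includes the current block."""
    blocks = parse_similar_blocks(field)
    return not blocks or (chromosome, start, stop) in blocks


print(parse_similar_blocks("chr1:156:283;chr3:82:209"))
```

`lists_itself` reflects the recommendation that the current block be part of its own list.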
`Function` is `.` by default. Numbers can be used to pair blocks with a similar function, coloring them accordingly in the presence/absence matrix. It does not accept standard formats yet (for instance, GO terms cannot be used), only integers. When all blocks have the same value (all set to `.` or all with the same number), a rainbow color scale based on position is applied instead. It is advised not to mix `.` and numbers; prefer using numbers only. Example:
Function | |
---|---|
0 | |
0 | |
1 | <-- These two have the same 'function' |
1 | <-- These two have the same 'function' |
0 | |
0 | |
2 | <-- This is marked with yet another 'function' |
0 | |
0 |
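Since only integers are accepted in this column, functional labels from another source have to be converted first. The sketch below shows one possible way to do it; `labels_to_function_ids` is a hypothetical helper, and reserving `0` for unlabelled blocks is a choice made for this example, not a Panache convention.

```python
def labels_to_function_ids(labels, no_label="."):
    """Map arbitrary function labels to integers, in order of first appearance."""
    ids = {}
    result = []
    for label in labels:
        if label == no_label:
            result.append(0)            # assumption: 0 marks 'no label'
            continue
        if label not in ids:
            ids[label] = len(ids) + 1   # next unused integer, starting at 1
        result.append(ids[label])
    return result


# The labels below are made up for the illustration.
print(labels_to_function_ids([".", "kinase", "kinase", ".", "transporter"]))
```

The resulting column contains numbers only, which follows the advice not to mix `.` and numbers.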
The genome columns give the presence/absence status of a given feature for every genome. An absence is always encoded as a `0` in the matrix. A presence can be encoded as `1`, any other positive integer, or even a gene name. Proportions (float numbers between 0 and 1) are not recommended, as this field is directly linked to the opacity of the displayed blocks. Examples:
Geno1 | Geno2 | Geno3 |
---|---|---|
0 | 1 | 1 |
1 | 1 | 1 |
0 | 0 | 0 |
0 | 0 | 1 |
will be displayed the same as
Geno1 | Geno2 | Geno3 |
---|---|---|
0 | 3 | 7 |
anAwesomeGene | 1 | aGeneToo |
0 | 0 | 0 |
0 | 0 | UK |
CAUTION: In the current configuration, blocks marked as `UK`, `NA` or any other string will be counted as a presence.
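The encoding rule above can be summarized as "anything but `0` is a presence". The sketch below mirrors that rule and shows how unknown markers could be turned into absences before loading, if that is what you want. Both helper names are hypothetical, and how Panache treats edge cases such as empty cells or `0.0` is not covered here.

```python
def is_present(cell: str) -> bool:
    """In this sketch, every value other than '0' counts as a presence."""
    return cell.strip() != "0"


def mask_unknowns(cells, unknown=("UK", "NA")):
    """Replace unknown markers with '0' so they are not shown as presences."""
    return ["0" if cell in unknown else cell for cell in cells]


row = ["0", "3", "aGeneToo", "UK"]
print([is_present(cell) for cell in row])
print(mask_unknowns(row))
```

Whether an unknown status should be displayed as an absence is a decision that depends on your dataset.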
Once a presence/absence file is loaded, one may add a companion annotation file in GFF3 format, provided that both files are based on the same linear coordinate system.
DISCLAIMER: In GFF files, the last column (attributes) can use different keywords. Panache uses:
- `Name` for extracting the gene names, and not `ID`
- `Note` for the functional annotation
Panache only parses features labelled as ‘gene’ in the GFF3 file. The start and stop positions of the genes and exons are kept to build “Annotation Cards” (tooltips grouping information on strand, exons and functions, if any), which can be queried on the annotation track.
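To preview what the description above implies for a given GFF3 file, the following sketch extracts the `Name` and `Note` attributes from `gene` lines. It is not Panache's own parser: `parse_gene_line` is a hypothetical helper and the example line is made up.

```python
from urllib.parse import unquote


def parse_gene_line(line: str):
    """Return a dict for a GFF3 'gene' line, or None for any other line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9 or fields[2] != "gene":
        return None
    attributes = {}
    for pair in fields[8].split(";"):   # column 9: 'key=value' pairs
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": fields[0],
        "start": int(fields[3]),   # GFF3 coordinates are 1-based, inclusive
        "end": int(fields[4]),
        "strand": fields[6],
        "name": attributes.get("Name"),   # gene name (not ID)
        "note": attributes.get("Note"),   # functional annotation, if any
    }


example_line = "chr2\tsrc\tgene\t31\t43\t.\t+\t.\tID=g1;Name=geneA;Note=hypothetical%20protein"
print(parse_gene_line(example_line))
```

Genes without a `Name` or `Note` attribute simply get `None` for that key in this sketch.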
If your dataset is a graph in GFA format, the graph can be linearized into a compatible format with BioGraph.jl.
Phylogenetic information can be added to Panache by uploading a Newick file with the .nwk extension (the .tree and .txt extensions have also been accepted since v1.0.0). See more about this format on the dedicated Wikipedia page. The genome names used in this file must be exactly the same as those in the main file.
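Since the names must match exactly, it can help to compare the leaves of the tree with the genome columns before uploading. The sketch below uses a simple regular expression rather than a full Newick parser, so it assumes unquoted leaf names without spaces; the helper names and the example tree are hypothetical.

```python
import re


def newick_leaf_names(newick: str) -> set[str]:
    """Collect leaf names: labels that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def compare_names(newick: str, header_line: str) -> dict[str, set[str]]:
    """Report names found only in the tree or only in the main file header."""
    genomes = set(header_line.rstrip("\n").split("\t")[6:])  # after the 6 mandatory columns
    leaves = newick_leaf_names(newick)
    return {"only_in_tree": leaves - genomes, "only_in_header": genomes - leaves}


tree = "(Geno1:0.1,(gen_2:0.2,genomeThree:0.3):0.4);"
header = ("#Chromosome\tFeatureStart\tFeatureStop\tSequence_IUPAC_Plus"
          "\tSimilarBlocks\tFunction\tGeno1\tgen_2\tgenome3")
print(compare_names(tree, header))
```

Two empty sets mean that the tree and the main file use the same genome names.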