Task: aln2meta
This can be used if you have a multiple alignment of one or more sets of reference sequences and SNP information that you want to call using ARIBA. The variant grouping option of ARIBA can be used to track the "same" SNPs across all the sequences.
The procedure is explained using the following example on toy data.
We will use the following toy sequences. They are supposed to represent different alleles of the same (very short!) gene.
>seq1
ATGGCTAATTAG
>seq2
ATGTTTAATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG
They translate to the following amino acid sequences.
>seq1
MAN*
>seq2
MFN*
>seq3
MFCN*
>seq4
MFDN*
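The translations above can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes the standard genetic code, and the `translate` helper is a name made up for this example, not part of ARIBA.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an ungapped coding sequence, one codon at a time."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# The toy alleles from this page.
toy_sequences = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}

proteins = {name: translate(seq) for name, seq in toy_sequences.items()}
for name, protein in proteins.items():
    print(f">{name}\n{protein}")
```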
Here is a multiple alignment of the amino acid sequences:
>seq1
M-AN*
>seq2
MF-N*
>seq3
MFCN*
>seq4
MFDN*
and the corresponding nucleotide sequences:
>seq1
ATG---GCTAATTAG
>seq2
ATGTTT---AATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG
This final file is the one that must be used as input to ariba aln2meta. Every sequence must have the same length in this file (the length includes the gaps).
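One way to get from the amino acid alignment to the nucleotide alignment is to place each codon under its aligned residue and write three gap characters wherever the protein alignment has a gap. The sketch below does this for the toy data and then checks that all aligned sequences have the same length. The helper names `thread_codons` and `check_equal_lengths` are hypothetical and are not ARIBA functions.

```python
def thread_codons(aligned_protein, nucleotides):
    """Place codons under an aligned protein; each '-' becomes '---'."""
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = aligned_protein.replace("-", "")
    if len(codons) != len(residues):
        raise ValueError("protein and nucleotide sequences do not match in length")
    remaining = iter(codons)
    return "".join(
        "---" if residue == "-" else next(remaining)
        for residue in aligned_protein
    )


def check_equal_lengths(alignment):
    """Return the common aligned length, or raise if the lengths differ."""
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"aligned sequences differ in length: {lengths}")
    return next(iter(lengths.values()))


# Toy data from this page: aligned proteins and ungapped nucleotides.
protein_alignment = {
    "seq1": "M-AN*",
    "seq2": "MF-N*",
    "seq3": "MFCN*",
    "seq4": "MFDN*",
}
toy_sequences = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}

nucleotide_alignment = {
    name: thread_codons(protein_alignment[name], toy_sequences[name])
    for name in protein_alignment
}
aligned_length = check_equal_lengths(nucleotide_alignment)

for name, aligned in nucleotide_alignment.items():
    print(f">{name}\n{aligned}")
```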
In addition, a file of SNP information is needed. Suppose we know the following two SNPs confer antibiotic resistance:
- A2D in sequence seq1
- F2E in sequence seq4
ARIBA can be used to identify the corresponding SNPs in any of the sequences. The second required file is a TSV file containing information on these SNPs. It must have four columns:
- Sequence name. Must exactly match the name of a sequence in the multiple alignment FASTA file.
- The SNP, for example A2D.
- Group name. If you do not want to put the SNP into a group, use ".".
- A description of the SNP, for example "Causes resistance to antibiotic x".
In this example, we will use the file:
seq1 A2D group1 Description of A2D.group1
seq4 F2E group2 Description of F2E.group2
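Before running ARIBA, it can help to check the SNP file yourself. The sketch below builds the two rows as tab-separated text and confirms that the first letter of each SNP matches the amino acid at that position in the ungapped sequence (position 2 of seq1 is A, position 2 of seq4 is F). The SNP pattern is an assumption based on the A2D example, and `parse_snp` and `check_snp` are hypothetical helpers, not ARIBA's own checks.

```python
import csv
import io
import re

# Assumed SNP format: reference letter, 1-based position, variant letter.
SNP_PATTERN = re.compile(r"^([A-Z])([0-9]+)([A-Z])$")


def parse_snp(snp):
    """Split a SNP such as 'A2D' into ('A', 2, 'D')."""
    match = SNP_PATTERN.match(snp)
    if match is None:
        raise ValueError(f"cannot parse SNP: {snp!r}")
    reference, position, variant = match.groups()
    return reference, int(position), variant


def check_snp(snp, sequence):
    """True if the SNP's reference letter is found at its position."""
    reference, position, _ = parse_snp(snp)
    return 1 <= position <= len(sequence) and sequence[position - 1] == reference


# Ungapped amino acid sequences of the two alleles that carry the SNPs.
proteins = {"seq1": "MAN*", "seq4": "MFDN*"}

rows = [
    ("seq1", "A2D", "group1", "Description of A2D.group1"),
    ("seq4", "F2E", "group2", "Description of F2E.group2"),
]

for name, snp, _group, _description in rows:
    if not check_snp(snp, proteins[name]):
        raise ValueError(f"{snp} does not match sequence {name}")

# Write the rows as tab-separated text (in memory here, for illustration).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

To produce a real file, the same rows could be written to a file opened with `open("snps.tsv", "w", newline="")` instead of the in-memory buffer.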
Run aln2meta like this:
ariba aln2meta seqs.aln.fa snps.tsv coding out
where:
- seqs.aln.fa is the multifasta alignment file of nucleotide sequences
- snps.tsv is the TSV file of SNP information
- coding, because these are coding sequences. For non-coding sequences, use noncoding instead, and the SNPs should be nucleotide SNPs, as opposed to amino acids.
- out is the prefix of the names of the output files.
Note that ARIBA sanity checks the SNPs against the sequences. It outputs these two warnings:
Warning: position has a gap in sequence seq2 corresponding to variant A2D (group1) in sequence seq1 ... Ignoring for seq2
Warning: position has a gap in sequence seq1 corresponding to variant F2E (group2) in sequence seq4 ... Ignoring for seq1
which makes sense looking at the sequences. For example, the A2D variant in seq1 aligns to a gap in seq2, so it gets ignored for seq2 (but included for the other sequences).
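The idea behind these warnings can be illustrated by mapping a SNP position through the alignment. The sketch below uses the amino acid alignment for simplicity (aln2meta itself takes the nucleotide alignment) and is not ARIBA's implementation; `map_snp_position` and its helpers are hypothetical names. For each other sequence it reports the position and residue that line up with the SNP, or None where there is a gap.

```python
def column_of(aligned, position):
    """0-based alignment column of a 1-based ungapped position."""
    seen = 0
    for column, char in enumerate(aligned):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"position {position} is beyond the end of the sequence")


def position_at(aligned, column):
    """1-based ungapped position at a column, or None if it is a gap."""
    if aligned[column] == "-":
        return None
    return len(aligned[:column + 1].replace("-", ""))


def map_snp_position(alignment, source, position):
    """Map a position in one sequence onto every other aligned sequence."""
    column = column_of(alignment[source], position)
    mapped = {}
    for name, aligned in alignment.items():
        if name == source:
            continue
        target = position_at(aligned, column)
        mapped[name] = None if target is None else (target, aligned[column])
    return mapped


protein_alignment = {
    "seq1": "M-AN*",
    "seq2": "MF-N*",
    "seq3": "MFCN*",
    "seq4": "MFDN*",
}

a2d = map_snp_position(protein_alignment, "seq1", 2)  # A2D in seq1
f2e = map_snp_position(protein_alignment, "seq4", 2)  # F2E in seq4
print(a2d)
print(f2e)
```

In this toy alignment, position 2 of seq1 lines up with a gap in seq2, and position 2 of seq4 lines up with a gap in seq1, which matches the two warnings.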
The aln2meta command above outputs three files (out.fa, out.tsv and out.cluster), which can be used as input to ariba prepareref like this:
ariba prepareref -f out.fa -m out.tsv --cdhit_clusters out.cluster out.prepareref
and then ariba run can be run as normal.
It is possible to use more than one multiple alignment, e.g. if you have several genes, each of which has multiple alleles and SNPs of interest. Run aln2meta once for each gene/set of alleles. For example:
ariba aln2meta seqs.aln.1.fa snps.1.tsv coding out1
ariba aln2meta seqs.aln.2.fa snps.2.tsv coding out2
ariba aln2meta seqs.aln.3.fa snps.3.tsv coding out3
Then cat the relevant files together and run prepareref:
cat out*fa > all.fa
cat out*tsv > all.tsv
cat out*cluster > all.cluster
ariba prepareref -f all.fa -m all.tsv --cdhit_clusters all.cluster out.prepareref
(or, instead of concatenating the files, you could use -f and -m once for each file), and finally ariba run can be run as normal.
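If you have many genes, the same steps could be scripted. The following is a rough sketch only: it builds the command lines shown above and prints them without running ARIBA, and it provides a small `concatenate` helper as a stand-in for `cat`. All function names are hypothetical, and the file names follow the example on this page.

```python
import shlex
from pathlib import Path


def aln2meta_command(alignment, snps, seq_type, prefix):
    """Argument list for one 'ariba aln2meta' call, as used on this page."""
    return ["ariba", "aln2meta", alignment, snps, seq_type, prefix]


def prepareref_command(fasta, metadata, clusters, outdir):
    """Argument list for 'ariba prepareref' with the options shown above."""
    return [
        "ariba", "prepareref",
        "-f", fasta,
        "-m", metadata,
        "--cdhit_clusters", clusters,
        outdir,
    ]


def concatenate(sources, destination):
    """Join files in the given order, like 'cat a b c > destination'."""
    with open(destination, "w") as out:
        for source in sources:
            out.write(Path(source).read_text())


genes = (1, 2, 3)
commands = [
    aln2meta_command(f"seqs.aln.{i}.fa", f"snps.{i}.tsv", "coding", f"out{i}")
    for i in genes
]
commands.append(
    prepareref_command("all.fa", "all.tsv", "all.cluster", "out.prepareref")
)

# Print the commands only; nothing is executed here.
for command in commands:
    print(shlex.join(command))
```

Between the aln2meta calls and prepareref, `concatenate` would be called once per file type, for example `concatenate([f"out{i}.fa" for i in genes], "all.fa")`.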