-
Notifications
You must be signed in to change notification settings - Fork 55
Home
ARIBA is a tool that identifies anitbiotic resistance genes by running local assemblies.
The input is a FASTA file of reference genes and paired sequencing reads. ARIBA reports which of the reference genes were found, plus detailed information on the quality of the assemblies and any variants between the sequencing reads and the reference genes.
Please see the readme from the ARIBA github repository for installation instructions.
The installation installs a single script called ariba
, which can be used to run several tasks.
A FASTA file of reference genes is required. This can be sanity checked using
ariba refcheck genes.fasta
and a new, fixed version can be made with
ariba refcheck -o genes.fixed genes.fa
To see all the options, run
ariba refcheck --help
First, sanity check the input genes FASTA file using ariba refcheck
(see above). Next, with your reference genes in the FASTA file genes.fasta
, and forwards and reverse paired reads in the FASTQ files reads_1.fq
, reads_2.fq
, run:
ariba run genes.fasta reads_1.fq reads_2.fq out_dir
Important: ARIBA assumes that read N in the file reads_1.fq
is the mate of read N in the file reads_2.fq
.
All output files will be put in a new directory called out_dir
.
To see all the options, use --help
:
ariba run --help
A summary of ARIBA's findings can be found in the file report.tsv
(or the same information is written to the Excel file report.xls
, if you prefer). There is at least one row per gene that had reads mapped to it. The meaning of the columns is as follows:
Column | description |
---|---|
1. gene | Name of gene from reference input FASTA file |
2. flag | Number encoding assembly outcome (see below for explanation) |
3. reads | Number of reads in the cluster |
4. cluster | Cluster number that gene came from |
5. gene_len | Length of reference gene in nucleotides |
6. assembled | Number of nucleotides of gene assembled |
7. pc_ident | Percent identity of gene and scaffold |
8. var_type | Type of variant between gene and assembly |
9. var_effect | Effect of variant |
10. new_aa | New amino acid(s) in the assembled gene |
11. gene_start | Position of nucleotide in reference gene where variant starts |
12. gene_end | Position of nucleotide in reference gene where variant ends |
13. gene_nt | Nucleotide in reference gene |
14. scaffold | Name of assembly scaffold |
15. scaff_len | Length of scaffold |
16. scaff_start | Position of nucleotide in scaffold where variant starts |
17. scaff_end | Position of nucleotide in scaffold where variant ends |
18. scaff_nt | Nucleotide in scaffold |
19. read_depth | Read depth on scaffold, as called by samtools mpileup |
20. alt_bases | ALT bases reported by samtools/bcftools on the scaffold |
21. ref_alt_depth | Read depth of alternative alleles from column 20, reported by samtools/bcftools |
If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1, 2 and 3. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:
cut -f1,2 report.tsv | uniq
ARIBA currently keeps all files relating to assembly in the output directory. There is one directory per gene that had any reads mapped to it. This may change in future and most files will be deleted unless the user requests for them to be left on disk. The files likely to be of most use are:
-
assembly.fa
- a FASTA file of the assembly of the gene -
assembly.reads_mapped.bam
- a sorted indexed BAM file of reads mapped to the assembly -
reads_1.fq
,reads_2.fq
- FASTA files of the reads and their mates that mapped to the reference gene -
gene.reads_mapped.bam
- sorted indexed BAM file of reads mapped to the reference gene.
Report files can be summarised using
ariba summary out.tsv in.report.1.tsv in.report.2.tsv ...
The output file format is tab-delimited unless the filename ends with '.xls', in which case an Excel file is made. All the input files must be in TSV format, not Excel. Each gene for each run is summarised into a single number, with the meaning as follows.
- 0 - gene not present (assembly failure or % identity between scaffold and gene too low)
- 1 - it's there, but doesn't have a complete ORF or is not unique
- 2 - it's there, assembled into a unique contig, but not with a complete ORF
- 3 - it's there in a unique contig with a complete ORF, but has at least one non-synonymous variant
- 4 - it's there in a unique contig with a complete ORF, and the only variants (if any at all) are synonymous.
To get all the options, run
ariba summary --help
During assembly, various things can happen. The possibilities are encoded into a flag (column 2 of the report). To get the meaning of a flag, for example 27, run:
ariba flag 27
The output is:
Meaning of flag 27
[X] gene_assembled
[X] gene_assembled_into_one_contig
[ ] gene_region_assembled_twice
[X] complete_orf
[X] unique_contig
[ ] scaffold_graph_bad
[ ] assembly_fail
[ ] variants_suggest_collapsed_repeat
[ ] hit_both_strands
[ ] has_nonsynonymous_variants
Flag 27 is an ideal result - the gene was assembled into one unique contig with a complete open reading frame.