Skip to content
martinghunt edited this page Nov 25, 2015 · 38 revisions

ARIBA: Antibiotic Resistance Identification By Assembly

ARIBA is a tool that identifies anitbiotic resistance genes by running local assemblies.

The input is a FASTA file of reference genes and paired sequencing reads. ARIBA reports which of the reference genes were found, plus detailed information on the quality of the assemblies and any variants between the sequencing reads and the reference genes.

Installation

Please see the readme from the ARIBA github repository for installation instructions.

Usage

The installation installs a single script called ariba, which can be used to run several tasks.

Reference genes

A FASTA file of reference genes is required. This can be sanity checked using

ariba refcheck genes.fasta

and a new, fixed version can be made with

ariba refcheck -o genes.fixed genes.fa

To see all the options, run

ariba refcheck --help

Run local assemblies

First, sanity check the input genes FASTA file using ariba refcheck (see above). Next, with your reference genes in the FASTA file genes.fasta, and forwards and reverse paired reads in the FASTQ files reads_1.fq, reads_2.fq, run:

ariba run genes.fasta reads_1.fq reads_2.fq out_dir

Important: ARIBA assumes that read N in the file reads_1.fq is the mate of read N in the file reads_2.fq. All output files will be put in a new directory called out_dir.

To see all the options, use --help:

ariba run --help

Output

A summary of ARIBA's findings can be found in the file report.tsv (or the same information is written to the Excel file report.xls, if you prefer). There is at least one row per gene that had reads mapped to it. The meaning of the columns is as follows:

Column description
1. gene Name of gene from reference input FASTA file
2. flag Number encoding assembly outcome (see below for explanation)
3. reads Number of reads in the cluster
4. cluster Cluster number that gene came from
5. gene_len Length of reference gene in nucleotides
6. assembled Number of nucleotides of gene assembled
7. pc_ident Percent identity of gene and scaffold
8. var_type Type of variant between gene and assembly
9. var_effect Effect of variant
10. new_aa New amino acid(s) in the assembled gene
11. gene_start Position of nucleotide in reference gene where variant starts
12. gene_end Position of nucleotide in reference gene where variant ends
13. gene_nt Nucleotide in reference gene
14. scaffold Name of assembly scaffold
15. scaff_len Length of scaffold
16. scaff_start Position of nucleotide in scaffold where variant starts
17. scaff_end Position of nucleotide in scaffold where variant ends
18. scaff_nt Nucleotide in scaffold
19. read_depth Read depth on scaffold, as called by samtools mpileup
20. alt_bases ALT bases reported by samtools/bcftools on the scaffold
21. ref_alt_depth Read depth of alternative alleles from column 20, reported by samtools/bcftools

If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1, 2 and 3. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:

cut -f1,2 report.tsv | uniq

ARIBA currently keeps all files relating to assembly in the output directory. There is one directory per gene that had any reads mapped to it. This may change in future and most files will be deleted unless the user requests for them to be left on disk. The files likely to be of most use are:

  • assembly.fa - a FASTA file of the assembly of the gene
  • assembly.reads_mapped.bam - a sorted indexed BAM file of reads mapped to the assembly
  • reads_1.fq, reads_2.fq - FASTA files of the reads and their mates that mapped to the reference gene
  • gene.reads_mapped.bam - sorted indexed BAM file of reads mapped to the reference gene.

Summarising output from more than one ARIBA run

Report files can be summarised using

ariba summary out.tsv in.report.1.tsv in.report.2.tsv ...

The output file format is tab-delimited unless the filename ends with '.xls', in which case an Excel file is made. All the input files must be in TSV format, not Excel. Each gene for each run is summarised into a single number, with the meaning as follows.

  • 0 - gene not present (assembly failure or % identity between scaffold and gene too low)
  • 1 - it's there, but doesn't have a complete ORF or is not unique
  • 2 - it's there, assembled into a unique contig, but not with a complete ORF
  • 3 - it's there in a unique contig with a complete ORF, but has at least one non-synonymous variant
  • 4 - it's there in a unique contig with a complete ORF, and the only variants (if any at all) are synonymous.

To get all the options, run

ariba summary --help

Flag

During assembly, various things can happen. The possibilities are encoded into a flag (column 2 of the report). To get the meaning of a flag, for example 27, run:

ariba flag 27

The output is:

Meaning of flag 27
[X] gene_assembled
[X] gene_assembled_into_one_contig
[ ] gene_region_assembled_twice
[X] complete_orf
[X] unique_contig
[ ] scaffold_graph_bad
[ ] assembly_fail
[ ] variants_suggest_collapsed_repeat
[ ] hit_both_strands
[ ] has_nonsynonymous_variants

Flag 27 is an ideal result - the gene was assembled into one unique contig with a complete open reading frame.

Clone this wiki locally