AmpliCI, Amplicon Clustering Inference, denoises Illumina amplicon data by approximate model-based clustering.
AmpliCI v2.0 now incorporates our new Unique Molecular Identifier (UMI)-aware software DAUMI to denoise UMI-tagged Illumina amplicon sequences. This README focuses on installation of AmpliCI and how to use it to estimate haplotypes and abundance for amplicons without UMIs. See the DAUMI instructions for information on how to estimate haplotypes and abundance for amplicons with UMIs. The development version of DAUMI is at this page.
AmpliCI is now available on Bioconda. You can install it with conda install:
conda install bioconda::amplici
Then you can run AmpliCI under your current conda environment.
run_AmpliCI -h
- Prerequisites
- Installation
- Preparing input
- Usage
- Output
- Downstream analysis
- Troubleshooting
- Detailed options
- C library
- Acknowledgements
- Citation
- Contact
AmpliCI has been tested under Linux and MacOS.
- Clone the repository.
git clone https://github.com/DormanLab/AmpliCI.git
- Configure the project.
cd AmpliCI/src
cmake .
- Compile AmpliCI. The executable is called run_AmpliCI. It will appear in the src directory you are currently in.
make
The input of AmpliCI is a FASTQ file, but there is some necessary preprocessing.
Like all other denoising methods, the starting point of the analysis is FASTQ sequence data after demultiplexing. If you start with separate barcode and read FASTQ files, you can use the QIIME script split_libraries_fastq.py for demultiplexing. Use the option --store_demultiplexed_fastq to keep the demultiplexed FASTQ files.
AmpliCI requires that all input reads have the same length and contain no ambiguous nucleotides (only A, C, G, T base calls are allowed). You can truncate reads or filter out reads with ambiguous nucleotides with the R package ShortRead. To remove reads containing N, you can also use seqkit with the following command:
seqkit grep -srv -p 'N' reads1.fastq > reads1_noN.fastq
AmpliCI takes a single demultiplexed FASTQ file (one per sample) generated from the Illumina sequencing platform, with reads trimmed to the same length and containing no ambiguous nucleotides (see above steps). If you have paired end data, AmpliCI can analyze the forward reads, the reverse reads, or the merged reads, but not both forward and reverse reads simultaneously.
You can find example input FASTQ files in the test directory.
One read in the input FASTQ file should fit in exactly four lines, as in the format below.
@SRR2990088.351
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTTTTAAGTCAGCGGTGAAAGTCTGTGGCTCAACCATAGAATTGCCGTTGAAACTGGGAGGCTTGAGTATGTTTGAGGCAGGCGGAATGCGTGGTGTAGCGGTGAAATGCGTAGATATCAAGCAGAACACCGATTGCGAAGGCAGCTTGCTAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATGAAACA
+
CCCCCGGGG8CFCFGGEGGGGGGGGGB@FFEEFFGFCFFFGGGGGGGEFGGG9@@F@FF9EFFG<EEGD@EFFGGGG,ECBCEFGCAFEFEEF<E?FEFFG<F@FFFGGG9FG@FGGG8DEGGGD,A=4,AEDF+F3BCCEEE7DFCGEEDEFEGFEGEGE<@@F>*:?BB7@;,>,5,*;CC:,4C957*:AB5<=DF6:>/*5*121/(/*500.<<;52(444+164-83::>:B021;91-(.<6).
First line: @sequence name
Second line: DNA sequence (A, T, C, G)
Third line: +any content on a single line
Fourth line: quality score sequence (ASCII [!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ])
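As a rough illustration of the fourth line, the sketch below decodes a quality string into integer quality scores. It assumes the Phred+33 encoding implied by the character range above (the character ! maps to score 0); the function name decode_quality is hypothetical and not part of AmpliCI.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, so '!' (ASCII 33) is score 0


def decode_quality(quality_line):
    """Convert one FASTQ quality string into a list of integer quality scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]


def quality_range(quality_line):
    """Return the (lowest, highest) quality score seen in one quality string."""
    scores = decode_quality(quality_line)
    return min(scores), max(scores)
```

Running quality_range over every read is one way to check whether the scores in a file stay within the range AmpliCI expects.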
If your reads or quality scores are split over multiple lines, AmpliCI will not work. One possible script for fixing your FASTQ-formatted files is given by Damian Kao on BioStars. You can also use seqkit with the following command:
seqkit seq reads_1.fq -w 0
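Before running AmpliCI, it may help to confirm that a file meets the requirements described above. The following is a minimal sketch of such a check, not part of AmpliCI: the function check_fastq and its messages are hypothetical, and it simply tests for four lines per read, A/C/G/T-only base calls, and a single common read length.

```python
import io

VALID_BASES = set("ACGT")


def check_fastq(handle):
    """Return a list of problems found in a four-line-per-read FASTQ stream."""
    problems = []
    lines = [line.rstrip("\n") for line in handle]
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4")
    read_length = None
    # Walk over complete four-line records only.
    for i in range(0, len(lines) - len(lines) % 4, 4):
        name, seq, plus, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not name.startswith("@"):
            problems.append(f"record {record}: first line does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"record {record}: third line does not start with '+'")
        if set(seq) - VALID_BASES:
            problems.append(f"record {record}: base call other than A, C, G, T")
        if len(seq) != len(qual):
            problems.append(f"record {record}: sequence and quality lengths differ")
        if read_length is None:
            read_length = len(seq)
        elif len(seq) != read_length:
            problems.append(f"record {record}: length differs from the first read")
    return problems


if __name__ == "__main__":
    example = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
    print(check_fastq(io.StringIO(example)))
```

For a real file, pass an open file object, for example check_fastq(open("reads1_noN.fastq")). An empty list means none of these particular problems were found.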
AmpliCI runs in two major steps:
- Use AmpliCI to estimate the error profile directly from the data (the executable is called run_AmpliCI):
./run_AmpliCI --fastq <input_fastq_file> --outfile <output_error_profile_file> --error
An example (from the src directory):
./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/error.out --error
- Use AmpliCI to estimate the haplotypes and their abundance using the estimated error profile:
./run_AmpliCI --fastq <input_fastq_file> --outfile <output_base_filename> --abundance 2 --profile <input_error_profile_file>
An example (from the src directory):
./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test --abundance 2 --profile ../test/error.out
If you provide no input error profile with the --profile option, AmpliCI will assume the error rates dictated by Phred quality scores. This is not recommended: using Phred quality scores tends to generate many false positives and runs very slowly.
- You can also use AmpliCI to reassign reads given input haplotypes. Provide your haplotype set with the --haplotypes option:
./run_AmpliCI --fastq <input_fastq_file> --outfile <output_assignment_filename> --profile <input_error_profile_file> --haplotypes <input_haplotypes_fasta_file>
An example (from the src directory):
./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test.id --profile ../test/error.out --haplotypes ../test/test.fa
- Detailed help can be obtained with:
./run_AmpliCI --help
- If you apply AmpliCI to longer reads with length > 300 (like merged reads), you may want to decrease the default lower bound for screening reads during cluster assignment with --log_likelihood [DEFAULT: -100.000000]. For example, you can set the lower bound at -200.
./run_AmpliCI --fastq ../test/sim3.8.1.fastq --outfile ../test/test.id --profile ../test/error.out --haplotypes ../test/test.fa --log_likelihood -200
When run to estimate the error profile, AmpliCI will output an error profile <output_error_profile_file> in text format. This is a comma-separated list of probabilities (times 1000) that haplotype nucleotide n is read as nucleotide m with quality score q. They are ordered as (n,m,q), with the last index varying the fastest. Both haplotype nucleotide n and read nucleotide m are in the order (A,C,T,G), and q ranges from 0 to 40 (41 values in total). For example, the first 41 entries are the estimated transition probabilities for A->A when the observed quality score q is in [0:40]; the 42nd - 82nd entries are the estimated transition probabilities for A->C; the 165th - 205th entries are the estimated transition probabilities for C->A; and so on.
In our recommended workflow the error profile should only be used with the dataset from which it was estimated. If you apply AmpliCI estimates to other datasets or use other estimates with AmpliCI, you should consider the following:
- AmpliCI encodes nucleotides in the order of (A, C, T, G), which is different from the commonly used alphabetic order (A, C, G, T).
- Not all quality scores will be observed, especially for quality scores < 3. To avoid extrapolation, error rates for quality scores outside the range of observed quality scores will not be estimated by LOESS regression. Instead, we just assume the error rates dictated by Phred quality scores with equal probability of each possible nucleotide substitution when there is an error.
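The ordering described above can be sketched as a small lookup helper. This is an illustration only, not AmpliCI code: the function names are hypothetical, load_profile assumes the file holds nothing but comma-separated numbers (possibly wrapped over several lines), and phred_fallback applies the standard Phred relation (error probability 10^(-q/10)) with the error split equally over the three substitutions.

```python
NUCLEOTIDES = "ACTG"  # AmpliCI order as described above, not alphabetical
N_QUALITY = 41        # quality scores 0..40


def profile_index(hap_nuc, read_nuc, q):
    """0-based position of entry (n, m, q); q varies fastest, then m, then n."""
    n = NUCLEOTIDES.index(hap_nuc)
    m = NUCLEOTIDES.index(read_nuc)
    return (n * len(NUCLEOTIDES) + m) * N_QUALITY + q


def load_profile(text):
    """Parse comma-separated entries and undo the times-1000 scaling."""
    tokens = text.replace("\n", ",").split(",")
    return [float(tok) / 1000.0 for tok in tokens if tok.strip()]


def phred_fallback(hap_nuc, read_nuc, q):
    """Phred-dictated probability, with errors split equally over 3 substitutions."""
    error = 10 ** (-q / 10)
    return 1 - error if hap_nuc == read_nuc else error / 3
```

With this indexing, profile_index("C", "A", 0) is 164, which is the 165th entry when counting from 1, matching the example in the text. A full profile has 4 x 4 x 41 = 656 entries.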
When run to estimate haplotypes and their abundances with argument --outfile <output_base_filename> or --outfile <fasta_output_file> <information_output_file>, there will be two output files:
1. output_base_filename.fa or fasta_output_file
FASTA-formatted file (will be used in the downstream analysis) containing denoised sequences (or haplotypes). For each sequence, we also provide size (scaled true abundance), DiagP (diagnostic probability), and ee (mean expected number of errors in reads), which are useful for chimera detection and post hoc filtering. For example, for the first haplotype, the FASTA header might look like:
>H0;size=516.000;DiagP=0.00e+00;ee=0.405;
- size: scaled true abundance (expected number of error-free reads) estimated for each selected haplotype, required for the subsequent chimera detection with UCHIME3.
- DiagP: diagnostic probability, which can be used as a criterion to check for false positives. We suggest post hoc removal of haplotypes with DiagP > 1e-40 when applying AmpliCI to real datasets with more than 1 million reads to reduce false positives. The diagnostic probability may contain an allowance for contaminating sequences (see option --contaminants). For further information on the diagnostic probability and contamination screening, please see our paper.
- ee: mean expected number of errors per read. Edgar and Flyvbjerg (2015) suggested a strategy to filter reads according to their expected number of errors. For example, you could remove haplotypes with ee > 1. Though this strategy works for some mock datasets, we have observed ee > 1 for several true haplotypes with very low abundance when analyzing a specific mock dataset (stag1, see our paper). You can read more about ee.
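The header fields above can be read with a few lines of code for post hoc filtering. The sketch below is an illustration under the assumption that every header follows the semicolon-separated key=value layout of the example; parse_header and keep_haplotype are hypothetical names, and the 1e-40 default is the DiagP cutoff suggested above for large datasets.

```python
def parse_header(header):
    """Split '>H0;size=516.000;DiagP=0.00e+00;ee=0.405;' into a name and float fields."""
    parts = [p for p in header.strip().lstrip(">").split(";") if p]
    name = parts[0]
    fields = {}
    for part in parts[1:]:
        key, value = part.split("=", 1)
        fields[key] = float(value)
    return name, fields


def keep_haplotype(fields, max_diagp=1e-40):
    """Keep a haplotype unless its diagnostic probability exceeds the cutoff."""
    return fields["DiagP"] <= max_diagp


if __name__ == "__main__":
    name, fields = parse_header(">H0;size=516.000;DiagP=0.00e+00;ee=0.405;")
    print(name, fields, keep_haplotype(fields))
```

The same fields dictionary could be used for an ee-based filter, keeping in mind the caveat above about low-abundance true haplotypes.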
2. output_base_filename.out or information_output_file
A text file with the following information provided as key: value pairs, one per line. The keys are:
- K: Number of haplotypes selected by AmpliCI.
- assignments: AmpliCI-assigned haplotype by posterior probability for each read in FASTQ-determined input order. Haplotypes are numbered 0, 1, ..., and match the sequences H0, H1, ... in the output FASTA file of haplotypes. NA is output if the read's maximum conditional log likelihood (given the source haplotype) does not exceed a user-defined threshold (option -ll; default -100). These assignments are not based on alignment of reads to the haplotypes, so some reads, particularly those with indel errors, may not be assigned (NA). See option --haplotypes for more careful read assignment.
- cluster sizes: Number of reads assigned to each haplotype.
- pi: Estimated $\boldsymbol{\pi}$ from AmpliCI. Each read is assigned to a haplotype by maximum transition probability $\Pr(r_i|h_k)$ (distinct from the posterior probability used for assignments) and $\pi_k$ is the proportion of reads assigned to haplotype $k$.
- reads ll: For each read, the maximum conditional log likelihood (given the source haplotype), $\ln \pi_k + \ln \Pr(r_i|h_k)$.
- There is also a FASTA listing of the haplotypes reported in this file.
- ee: For each read, the mean expected number of errors. See the discussion of ee in output_base_filename.fa above.
- uniq seq id: The index of each selected haplotype in the unique sequence list, ordered from highest abundance to lowest. If the haplotypes were selected in observed abundance order, then these will be increasing integers from 0. If any unique sequence was discarded, some integers will be skipped. For example, this line is 0 1 2 3 4 5 6 7 10 42 45 for test file test/sim3.8.1.fastq, indicating that the 8 most observed sequences were selected as haplotypes, but the 9th and 10th most observed sequences were discarded, and so on.
- scaled true abun: The estimated scaled true abundances of each selected haplotype (expected number of error-free reads).
- obser abun: The observed abundance of each selected haplotype.
- Estimated common ancestor (value on next two lines in FASTA format): The final, estimated common ancestor of all the haplotypes used in the BIC calculation.
- Evolution_rate: The estimated evolutionary time separating each haplotype from the ancestor.
- log likelihood from JC69 model: The log likelihood of the JC69 hierarchical model computed on the final, fitted model.
- Diagnostic Probability threshold: The threshold used to reject candidate haplotypes in the contamination test. This is the value input through option --diagnostic divided by the number of possible candidate haplotypes.
- aic: The estimated Akaike Information Criterion value from the final fitted model.
- bic: The estimated Bayesian Information Criterion value from the final fitted model.
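To make the uniq seq id entry above concrete, the sketch below lists which unique sequences were passed over, given the values from that line. It is an illustration only: discarded_indices is a hypothetical helper, and it assumes the values have already been separated from the key and are whitespace-delimited integers as in the example.

```python
def discarded_indices(uniq_seq_id_values):
    """Return abundance-rank indices skipped before the last selected haplotype."""
    ids = [int(tok) for tok in uniq_seq_id_values.split()]
    selected = set(ids)
    return [i for i in range(max(ids) + 1) if i not in selected]


if __name__ == "__main__":
    skipped = discarded_indices("0 1 2 3 4 5 6 7 10 42 45")
    # Indices are 0-based, so 8 and 9 are the 9th and 10th most observed sequences.
    print(skipped[:2])
```

Only sequences ranked above the last selected haplotype can be reported this way; the line carries no information about unique sequences with lower observed abundance.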
When run with option --haplotypes to reassign reads to the user-provided haplotype set (a FASTA-formatted file), AmpliCI will output a read assignment file <output_assignment_filename> in text format. The keys are:
- assignments: See the description above for outfile output_base_filename.out. There should be fewer NA assignments because reads with low log likelihood are aligned to the haplotypes to detect indel sequencing errors.
- cluster sizes: See the description above for outfile output_base_filename.out. The sizes should be higher if more reads are successfully assigned to haplotypes.
The output FASTA file contains denoised raw haplotype sequences, which may include chimeric sequences. The first step of any downstream analysis should be to remove chimeric sequences.
The FASTA file output by AmpliCI is in a format accepted by the uchime3_denovo method implemented in usearch.
Haplotype sorting by abundance
./usearch -sortbysize <input_fasta_file> -fastaout <output_sorted_fasta_file>
Chimera detection
./usearch -uchime3_denovo <input_sorted_fasta_file> -uchimeout <uchime_outfile> -chimeras <chimera_fasta_outfile> -nonchimeras <nonchimera_fasta_outfile>
You may also use other chimera detection algorithms to remove chimeras.
We have provided an R script to help generate the ASV (sOTU) table, which reports the scaled true abundance (see size) of each ASV/sOTU in each sample.
You may want to identify the non-chimeric haplotypes detected in your sample. There are multiple methods.
- DECIPHER contains IDTAXA, a novel approach for taxonomic classification.
- RDP classifier, a Naive Bayesian Classifier.
mothur, qiime2, LEfSe, phyloseq, ....
The algorithm may stop if:
- your quality scores are not in the typical range for Illumina datasets [33,73]
- your reads contain ambiguous nucleotides
- your reads are not in the right FASTQ input format, for example reads and quality scores cannot contain newline characters
- there are too few reads, or the reads are so noisy that no sequence is observed more than the lower bound number of times (option --abundance, default 2.0)
Main options:
- --fastq: The fastq input file. [REQUIRED]
- --outfile: Output file(s) for haplotype discovery, estimated error profile (when used with --error), or cluster assignments (when used with --haplotypes). [REQUIRED]
- --profile: The input error profile. If none, convert quality score to Phred error probability. [DEFAULT: none]
- --error: Estimate the error profile. [Used in error estimation only]
- --haplotypes: FASTA file with haplotypes. [Used in reads assignment only]
Options for sensitivity:
- --abundance: Lower bound for scaled true abundance during haplotype reconstruction (should be >= 2.0). [DEFAULT: 2.0]
- --contaminants: Baseline count abundance of contaminating or noise sequences. [DEFAULT: 1]
- --indel: Indel sequencing error rate. Cannot also use options --insertion or --deletion. [DEFAULT: 0.00006]
- --diagnostic: Threshold of diagnostic probability in the diagnostic/contamination test. [DEFAULT: 0.001 / number_candidates]
Other important options:
- --align: Align all reads to haplotypes (slow). [DEFAULT: none]
- --log_likelihood: Lower bound for screening reads during cluster assignment. This is the minimum log assignment likelihood, $\ln \pi_k + \ln \Pr(r_i|h_k)$. [DEFAULT: -100.000000]
- --nJC69: Disable JC69 model. By default, AmpliCI assumes all sequences are generated from an ancestral sequence, which slightly increases the sensitivity for detecting close haplotypes. [Use the option when biological sequences are unrelated]
AmpliCI provides both a shared and a static C library so that users can call the function amplici_wfile() to cluster amplicon sequences from another program. The library libamplici.* (libamplici.so for Linux or libamplici.dylib for MacOS, and libamplici.a) will appear in the src directory when you compile AmpliCI.
Input
- fastq_file: The fastq input file. [REQUIRED]
- error_profile_name: The input error profile. If NULL, convert quality score to Phred error probability.
- low_bound: Allowed lowest abundance. See the description of option --abundance. [REQUIRED]
Output
- seeds: Estimated haplotypes.
- seeds_length: Lengths of the estimated haplotypes.
- cluster_id: See the description of assignments above for outfile output_base_filename.out. Note that amplici_wfile() does not filter reads with maximal conditional log likelihood under the given threshold. Instead, it assigns each read to its closest haplotype by maximum likelihood.
- cluster sizes: Number of reads assigned to each haplotype.
- K: Number of estimated haplotypes.
- sample_size: Number of reads in the fastq input file.
- max_read_length: Maximum read length l. The kth (k in [0,1,2,...,K-1]) haplotype starts at seeds[k*l].
- abun: See the description of scaled true abun above for outfile output_base_filename.out.
- ll: See the description of reads ll above for outfile output_base_filename.out.
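The flat layout of seeds described above (haplotype k starts at offset k*l, with its own length in seeds_length) can be pictured with the following sketch. It only illustrates the indexing: split_seeds is a hypothetical helper, a Python string stands in for the C buffer, and how nucleotides are actually encoded inside that buffer is not specified here.

```python
def split_seeds(seeds, seeds_length, K, max_read_length):
    """Slice a flat haplotype buffer into K haplotypes using stride max_read_length."""
    haplotypes = []
    for k in range(K):
        start = k * max_read_length  # haplotype k starts at seeds[k*l]
        haplotypes.append(seeds[start:start + seeds_length[k]])
    return haplotypes


if __name__ == "__main__":
    # Two haplotypes stored with stride 4; the second uses only 3 positions.
    print(split_seeds("ACGTTTGA", [4, 3], 2, 4))
```

In C the same arithmetic applies to a pointer into the buffer, reading seeds_length[k] elements starting at seeds + k*l.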
An example of calling the function amplici_wfile() is provided in example_wfile.c. You can compile the source file with the C library libamplici.a (in the src directory):
gcc -o myprog example_wfile.c -lamplici -lRmath -lm -I ./src/ -L ./src/
Use -I to provide the path to the library header file libamplici.h and -L to provide the path to the library libamplici.a. You may also need to add paths to the Rmath library and header files. Note that example_wfile.c needs two more header files in the src directory, which are not required by the library libamplici.a.
If you use the shared library (libamplici.so or libamplici.dylib), you need to add the path to the shared library to your library search path before running your executable file.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/your/full/path/to/library
- AmpliCI uses LOESS regression for error estimation. The original code is available at https://www.netlib.org/a/dloess. However, we modified and used the related code from R, which derives from the above.
- AmpliCI uses some C and FORTRAN libraries provided by R. The relevant code has been incorporated into the software.
- We used the hash implemented in uthash.h.
- For amplicons without UMIs: Peng, X. and Dorman, K. (2020) ‘AmpliCI: A High-resolution Model-Based Approach for Denoising Illumina Amplicon Data’, Bioinformatics. doi: 10.1093/bioinformatics/btaa648.
- For amplicons with UMIs (also see DAUMI instructions): Peng, X. and Dorman, K. S. (2023) 'Accurate estimation of molecular counts from amplicon sequence data with unique molecular identifiers', Bioinformatics. Advanced access.
If you have any problems with AmpliCI, please contact:
Xiyu Peng ([email protected])
Karin Dorman ([email protected])