If you build a comprehensive database, you may want to use a read-based approach to functionally profile a large set of samples. This tutorial will show you how to build a custom HUMAnN database from your annotations and how to profile your samples where there is full accounting of reads and your genomes.
What you'll end up with at the end of this is a merged taxonomy table, a custom HUMAnN annotation table, and HUMAnN profiles.
Please refer to the end-to-end metagenomics or recovering viruses from metatranscriptomics workflows for details on binning, clustering, and annotation.
- Merge taxonomy from all domains
- Compile custom
HUMAnN
database from de novo genomes - Functional profiling using
HUMAnN
of custom database - Merge the tables
Conda Environment: conda activate VEBA
. Use this for intermediate scripts.
At this point, it's assumed you have the following:
- A concatenated protein fasta (preferably with proteins ≥ 100k removed
seqkit seq -M 99999
) calledveba_output/misc/all_genomes.all_proteins.lt100k.faa
- Taxonomy classifications for all domains
- Annotations results from the
annotate.py
module - Clustering results from the
cluster.py
module - A directory of preprocessed reads:
veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
andveba_output/preprocess/${ID}/output/cleaned_2.fastq.gz
where${ID}
represents the identifiers inidentifiers.list
.
merge_taxonomy_classifications.py -i veba_output/classify/ -o veba_output/classify/ --no_header --no_domain
Here we are going to compile all the UniRef
annotations and taxonomy identifiers to build a custom HUMAnN
database. The script supports piping to stdin so we are going to cut some columns from identifier_mapping.proteins.tsv.gz
zcat veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz | cut -f1,3 | tail -n +2 | compile_custom_humann_database_from_annotations.py -a veba_output/annotation/output/annotations.proteins.tsv.gz -s veba_output/misc/all_genomes.all_proteins.lt100k.faa -o veba_output/misc/humann_uniref_annotations.tsv -t veba_output/classify/taxonomy_classifications.tsv
This generates a table that has the following columns (no header):
[id_protein]<tab>[id_uniref]<tab>[length]<tab>[lineage]
- e.g.,
S1__NODE_1648_length_2909_cov_4.311843_4 UniRef50_A0A1C7D7V0 214 d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Sphingomonadales;f__Sphingomonadaceae;g__Erythrobacter;s__Erythrobacter sp003149575;t__S1__CONCOCT__P.1__24
Now it's time to align reads. Since HUMAnN
takes in single reads, we join the reads using bbmerge.sh
in the backend. You could also use the aligned reads in the form of a bam
file from the mapping.py
module.
N_JOBS=4
OUT_DIR=veba_output/profiling/pathways
UNIREF_ANNOTATIONS=veba_output/misc/humann_uniref_annotations.tsv
FASTA=veba_output/misc/all_genomes.all_proteins.lt100k.faa
mkdir -p logs
for ID in $(cat identifiers.list);
do
N="profile-pathway__${ID}";
rm -f logs/${N}.*
R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz
CMD="source activate VEBA && veba --module profile-pathway --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -i ${UNIREF_ANNOTATIONS} -f ${FASTA}\""
# Either run this command or use SunGridEnginge/SLURM
done
The following output files will produced for each sample:
- reads.seqkit_stats.tsv - Sequence stats for input reads
- humann_pathabundance.tsv - Stratified abundance of taxa-specific metabolic pathways
- humann_pathcoverage.tsv - Stratified pathway completion ratio of taxa-specific metabolic pathways
- humann_genefamilies.tsv - Stratified abundance of taxa-specific gene families
- humann_diamond_unaligned.fa.gz - Joined reads that did not align to database
- humann_diamond_aligned.tsv.gz - Aligned reads from translated blast search to database (blast6 format)
merge_generalized_mapping.py -o veba_output/profiling/pathways/merged.humann_pathcoverage.tsv veba_output/profiling/pathways/*/output/humann_pathcoverage.tsv
merge_generalized_mapping.py -o veba_output/profiling/pathways/merged.humann_pathabundance.tsv veba_output/profiling/pathways/*/output/humann_pathabundance.tsv
merge_generalized_mapping.py -o veba_output/profiling/pathways/merged.humann_genefamilies.tsv veba_output/profiling/pathways/*/output/humann_genefamilies.tsv
The following output files will produced for each sample:
- merged.humann_pathabundance.tsv - Stratified abundance of taxa-specific metabolic pathways of all samples
- meged.humann_pathcoverage.tsv - Stratified pathway completion ratio of taxa-specific metabolic of all samplespathways
- meged.humann_genefamilies.tsv - Stratified abundance of taxa-specific gene families of all samples
Subset stratified tables by their respective levels.