This walkthrough goes through converting your counts table (with or without metadata) to anndata or biom format.
What you'll end up with at the end of this is either a .h5ad
or .biom
format file for anndata
and biom
, respecitvely.
It is assumed you've completed either the end-to-end metagenomics or the recovering viruses from metatranscriptomics and the read mapping/counts tables walkthroughs.
anndata
is developed for single-cell genomics but I've been using it for my metagenomics workflow. I recommend trying it out.
- Provide just the counts table
- Provide a counts table and sample metadata
- Provide a counts table, sample metadata, and
Conda Environment: conda activate VEBA
. Use this for intermediate scripts.
convert_counts_table.py -i veba_output/counts/X_slcs.tsv.gz -f pickle -o veba_output/counts/X_slcs.no_metadata.pkl
Now in Python we can load it with Pandas
import pandas as pd
X_slcs = pd.read_pickle("veba_output/counts/X_slcs.no_metadata.pkl")
You probably have sample metadata so use that instead. Make sure that all the sample identifiers in your counts table are accounted for in your metadata table.
Here we don't have any so I'm going to just get the reads manifest for a toy sample metadata table.
# Generate a sample metadata for example but use your own instead
# Fastq/
# Fastq/S1_1.fastq.gz
# Fastq/S1_2.fastq.gz
# Fastq/etc.
compile_reads_table.py -f Fastq/ -n [ID]_[DIRECTION].fastq.gz --header > veba_output/misc/reads_table.tsv
Convert to biom format using the sample metadata
# Convert table
convert_counts_table.py -i veba_output/counts/X_slcs.tsv.gz -f biom -o veba_output/counts/X_slcs.with_sample_metadata.biom --sample_metadata veba_output/misc/reads_table.tsv
Now in Python we can load it with biom
import biom
table = biom.load_table("veba_output/counts/X_slcs.with_sample_metadata.biom")
# 16 x 4 <class 'biom.table.Table'> with 59 nonzero entries (92% dense)
We are going to use the sample metadata we generated in previous step. We are going to use the taxonomy classifications from the classify.py
module for the feature metadata.
convert_counts_table.py -i veba_output/counts/X_slcs.tsv.gz -f anndata -o veba_output/counts/X_slcs.with_sample_and_feature.metadata.h5ad --sample_metadata veba_output/misc/reads_table.tsv --feature_metadata veba_output/classify/taxonomy_classifications.clusters.tsv
Now in Python we can load it with anndata
import anndata as ad
adata = ad.read_h5ad("veba_output/counts/X_slcs.with_sample_and_feature.metadata.h5ad")
#AnnData object with n_obs × n_vars = 4 × 16
# obs: 'forward-absolute-filepath', 'reverse-absolute-filepath'
# var: 'domain', 'consensus_classification', 'homogeneity', 'number_of_unique_classifications', #'number_of_genomes', 'genomes', 'classifications', 'weights', 'score', #'number_of_unique_classification', 'consensus_taxon_id'
Now it's time to analyze the data (recommend using compositional data analysis (CoDA) approach).
If you have a pickle
format, then you probably know what to do already.
If you have a biom
format look into the QIIME2 ecosystem or BIRDMAn for CoDA-aware bayesian differential abundance.
If you have an anndata
format, look into the Scanpy ecosystem. Though Scanpy does not nativly support CoDA-aware methodologies, they are easy to adapt.
- Thomas P Quinn, Ionas Erb, Mark F Richardson, Tamsyn M Crowley, Understanding sequencing data as compositions: an outlook and review, Bioinformatics, Volume 34, Issue 16, 15 August 2018, Pages 2870–2878, https://doi.org/10.1093/bioinformatics/bty175
- Thomas P Quinn, Ionas Erb, Greg Gloor, Cedric Notredame, Mark F Richardson, Tamsyn M Crowley, A field guide for the compositional analysis of any-omics data, GigaScience, Volume 8, Issue 9, September 2019, giz107, https://doi.org/10.1093/gigascience/giz107
- Espinoza, J.L., Shah, N., Singh, S., Nelson, K.E. and Dupont, C.L. (2020), Applications of weighted association networks applied to compositional data in biology. Environ Microbiol, 22: 3020-3038. https://doi.org/10.1111/1462-2920.15091