sylph-tax - incorporating taxonomy into sylph

Note

This repo replaces the old sylph-utils scripts. sylph-tax is easier to download/install and use than sylph-utils.

Sylph is an efficient and accurate metagenome profiler. However, its output does not have taxonomic information. sylph-tax can turn sylph's TSV output into a taxonomic profile like Kraken or MetaPhlAn. sylph-tax does this by using custom taxonomy files to annotate sylph's output.

Taxonomy integration - available databases with taxonomy files

The following pre-built sylph databases have available taxonomic annotations. Custom taxonomies can also be incorporated.

sylph-tax identifier    Database description                                         Clades
GTDB_r220               GTDB-r220 (April 2024)                                       Prokaryote
GTDB_r214               GTDB-r214 (April 2023)                                       Prokaryote
OceanDNA                OceanDNA - ocean MAGs from Nishimura & Yoshizawa             Prokaryote
SoilSMAG                Soil MAGs (SMAG) from Ma et al.                              Prokaryote
FungiRefSeq-2024-07-25  RefSeq fungi representative genomes collected on 2024-07-25  Eukaryote
TaraEukaryoticSMAG      TARA eukaryotic SMAGs from Delmont et al.                    Eukaryote
IMGVR_4.1               IMG/VR 4.1 high-confidence viral OTU genomes                 Virus

Install option 1 - Conda

conda install -c bioconda sylph-tax

Install option 2 - Python

git clone https://github.com/bluenote-1577/sylph-tax
cd sylph-tax
pip install .

Quick start

Important

Please see this manual for more information on:

  1. output format information
  2. how to create taxonomy metadata for customized genome databases

# download all taxonomy files (~50 MB)
sylph-tax download --download-to /any/folder

# incorporate GTDB-r220 and IMGVR-4.1 taxonomies into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220 IMGVR_4.1 -o output_prefix-

ls output_prefix-sample1.sylphmpa
ls output_prefix-sample2.sylphmpa
...

# merge multiple results
sylph-tax merge *.sylphmpa --column relative_abundance -o merged_abundance_file.tsv
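
For scripted pipelines, the quick-start commands can also be assembled from Python. The sketch below only builds and prints the argument lists, using the subcommands and flags shown above. The helper names `taxprof_command` and `merge_command` and the input file names are hypothetical.

```python
import shlex


def taxprof_command(tsv_files, taxonomies, output_prefix):
    """Build the argument list for `sylph-tax taxprof` (flags as in the quick start)."""
    return ["sylph-tax", "taxprof", *tsv_files, "-t", *taxonomies, "-o", output_prefix]


def merge_command(profiles, column, output):
    """Build the argument list for `sylph-tax merge`."""
    return ["sylph-tax", "merge", *profiles, "--column", column, "-o", output]


# Hypothetical input files, for illustration only.
tsv_files = ["sylph_results/sample1.tsv", "sylph_results/sample2.tsv"]
cmd = taxprof_command(tsv_files, ["GTDB_r220", "IMGVR_4.1"], "output_prefix-")
print(shlex.join(cmd))
```

Once sylph-tax is installed, passing such a list to `subprocess.run(cmd, check=True)` would run the command without going through a shell.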

sylph-tax subcommands

download - download taxonomy metadata

sylph-tax download --download-to /my/folder/sylph_taxonomy_files/
  • Downloads taxonomic annotation files (~50 MB; see here) to --download-to.
  • The folder must already exist, but it can be located anywhere. Its location is written to ~/.config/sylph-tax/config.json.
  • If you don't have access to $HOME, you can specify a custom location in the SYLPH_TAXONOMY_CONFIG environment variable. E.g. export SYLPH_TAXONOMY_CONFIG=/write_access_folder/sylph-tax-config.json.
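
The config lookup described above can be sketched in Python as follows. This is an illustration of the documented behaviour, not sylph-tax's own code: the function name `resolve_config_path` is hypothetical, and the precedence (environment variable first, default path otherwise) is an assumption.

```python
import os
from pathlib import Path

ENV_VAR = "SYLPH_TAXONOMY_CONFIG"
# Default location named in the documentation above.
DEFAULT_CONFIG = Path(os.path.expanduser("~/.config/sylph-tax/config.json"))


def resolve_config_path(environ=None):
    """Return the config path: the environment override if set, else the default.

    Assumed precedence; the real tool may resolve this differently.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override:
        return Path(override)
    return DEFAULT_CONFIG


print(resolve_config_path())
```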

taxprof - taxonomic profiles from sylph's output

sylph-tax taxprof sylph_results/*.tsv  -o prefix_or_folder/ -t {sylph-tax identifier}
  • sylph_results/*.tsv: outputs from sylph. The databases used to run sylph must match the taxonomies passed to -t.
  • -t/--taxonomy-metadata: A list of sylph-tax identifiers in the above table (e.g. GTDB_r220 or IMGVR_4.1). Multiple taxonomy metadata files can be input. Custom taxonomy files are also possible; see below.
  • -o: prepends this prefix to all of the output files. One file is written per sample in sylph_results/*.tsv.
  • -a/--annotate-virus-hosts: annotates found viral genomes with host information metadata (only available for IMGVR_4.1 right now)
  • Output suffix is .sylphmpa.

Tip

In python/pandas, pd.read_csv('output.sylphmpa',sep='\t', comment='#') works.
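
The `comment='#'` argument in the tip suggests that metadata lines in a .sylphmpa file start with `#`. If pandas is not available, a similar read can be sketched with the standard library. The file contents and column names below are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical file contents; real files may carry different columns.
EXAMPLE = (
    "#SampleID\tsample1.fastq.gz\n"
    "clade_name\trelative_abundance\n"
    "d__Archaea\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0965\n"
)


def read_profile(handle):
    """Return rows as dicts, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_profile(io.StringIO(EXAMPLE))
for row in rows:
    print(row["clade_name"], float(row["relative_abundance"]))
```

Unlike the pandas `comment` option, this sketch only skips whole lines that begin with `#`; it does not strip `#` appearing later in a line.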

merge - merge multiple taxonomic profiles

Merge multiple taxonomic profiles from sylph-tax taxprof into a single TSV table.

sylph-tax merge *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv
  • *.sylphmpa files are outputs from sylph-tax taxprof.
  • --column can be ANI, relative_abundance, or sequence_abundance (see the paper for the difference between the two abundances).
  • -o output file in TSV format.
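
Conceptually, merging amounts to joining one chosen column across samples on the clade name. The sketch below shows that idea on in-memory data. It is not the tool's implementation: the function name `merge_profiles` and the sample values are hypothetical, and filling clades missing from a sample with 0.0 is an assumption.

```python
def merge_profiles(profiles):
    """Merge {sample: {clade: value}} into (samples, {clade: [value per sample]})."""
    samples = list(profiles)
    clades = sorted({clade for values in profiles.values() for clade in values})
    table = {
        # Assumption: a clade absent from a sample is reported as 0.0.
        clade: [profiles[sample].get(clade, 0.0) for sample in samples]
        for clade in clades
    }
    return samples, table


# Hypothetical per-sample values for one chosen column.
profiles = {
    "sample1.fastq.gz": {"d__Bacteria": 100.0},
    "sample2.fastq.gz": {"d__Bacteria": 98.9, "d__Archaea": 1.1},
}
samples, table = merge_profiles(profiles)
print("\t".join(["clade_name", *samples]))
for clade, values in table.items():
    print("\t".join([clade, *map(str, values)]))
```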

Output format for merge (TSV)

clade_name                       sample1.fastq.gz  sample2.fastq.gz
d__Archaea                       0.0               1.1
d__Archaea|p__Methanobacteriota  0.0               0.0965
...
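
Because each clade_name lists the lineage separated by `|`, rows at a single rank can be selected from the merged table by looking at the prefix of the last level. The following is a minimal sketch using the example rows above, written out as tab-separated text. The helper name `rows_at_rank` is hypothetical.

```python
import csv
import io

# The example rows from above, as tab-separated text.
MERGED = (
    "clade_name\tsample1.fastq.gz\tsample2.fastq.gz\n"
    "d__Archaea\t0.0\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0\t0.0965\n"
)


def rows_at_rank(handle, prefix):
    """Yield (clade_name, {sample: value}) for rows whose deepest level starts with prefix."""
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row.pop("clade_name")
        if name.split("|")[-1].startswith(prefix):
            yield name, {sample: float(value) for sample, value in row.items()}


phyla = list(rows_at_rank(io.StringIO(MERGED), "p__"))
print(phyla)
```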