FCS GX quickstart
FCS-GX detects contamination from foreign organisms in genome sequences. This tool is one module within the NCBI Foreign Contamination Screening (FCS) program suite.
We recommend running FCS-GX after the initial contig assembly and on the final assembly prior to GenBank submission. If additional valid contaminants are identified in the final assembly, we recommend re-screening after contaminant removal.
FCS-GX operates in six main steps:
- Repeat and low-complexity sequence masking
- Alignment to reference database using GX aligner
- Alignment refinement with high-scoring taxa matches
- Classifying sequences to assign taxonomic divisions
- Generating contaminant cleaning actions
- Cleaning the genome
- Prerequisites
- Download FCS-GX
- Download the FCS-GX database
- Screen the genome
- Clean the genome
- Usage examples
- Input
- Output
- Troubleshooting
- Privacy and collection of usage data
- Docker or Singularity. The current Singularity image is built with version 3.4.0.
- Python 3.7+.
- 470 GiB of disk space to save a local copy of the database files.
- A host with 512 GiB of shared memory to hold the database and accessory files. Execution can use up to 48 CPU cores. Running on a host without enough RAM results in extremely long run times (as much as a 10000x difference in performance).
- A genome assembly in FASTA format.
- The tax-id of the organism.
Note: FCS-GX can be run in AWS or GCP. Please see Amazon Web Services wiki or Google Cloud wiki to get started on creating a VM with Docker. Visit ncbi/fcs-gx repo for source code.
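The hardware and software prerequisites listed above can be checked before downloading anything. The following is a minimal Python sketch and is not part of FCS-GX: the function names `shortfalls` and `probe_host` are hypothetical, the thresholds are copied from the list above, and the RAM probe assumes a Linux host.

```python
import os
import shutil
import sys

GIB = 1024 ** 3
REQUIRED_DISK_GIB = 470  # local copy of the database files
REQUIRED_RAM_GIB = 512   # memory to hold the database and accessory files


def shortfalls(free_disk_bytes, total_ram_bytes, python_version, runtimes_found):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if python_version < (3, 7):
        problems.append("Python 3.7+ is required")
    if free_disk_bytes < REQUIRED_DISK_GIB * GIB:
        problems.append("less than 470 GiB of free disk space")
    if total_ram_bytes < REQUIRED_RAM_GIB * GIB:
        problems.append("less than 512 GiB of memory")
    if not runtimes_found:
        problems.append("neither docker nor singularity found on PATH")
    return problems


def probe_host(db_dir="."):
    """Collect the inputs for shortfalls() from the current host (Linux assumed)."""
    free_disk = shutil.disk_usage(db_dir).free
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    runtimes = [name for name in ("docker", "singularity") if shutil.which(name)]
    return free_disk, total_ram, tuple(sys.version_info[:2]), runtimes


if __name__ == "__main__":
    for problem in shortfalls(*probe_host()):
        print("WARNING:", problem)
```

The check is advisory only: it reports total installed memory, not the memory actually available to a container or tmpfs mount.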
-
Retrieve the fcs.py runner script:
curl -LO https://github.com/ncbi/fcs/raw/main/dist/fcs.py
The Docker image is the default; the runner script downloads and uses it automatically.
-
For Singularity users:
Retrieve the Singularity image file fcs-gx.sif and set the environment variable to use the image with the runner:
curl https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/releases/latest/fcs-gx.sif -Lo fcs-gx.sif
export FCS_DEFAULT_IMAGE=fcs-gx.sif
To see the version of your Singularity image, you can run:
singularity inspect fcs-gx.sif
-
Download the db:
- Using s5cmd (fastest option for cloud to cloud):
curl -LO https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
tar -xvf s5cmd_2.0.0_Linux-64bit.tar.gz
LOCAL_DB="/path/to/db/folder"
./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 s3://ncbi-fcs-gx/gxdb/latest/all.* $LOCAL_DB
- Using fcs.py db get:
SOURCE_DB_MANIFEST="https://ncbi-fcs-gx.s3.amazonaws.com/gxdb/latest/all.manifest"
LOCAL_DB="/path/to/db/folder"
python3 fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
- Using
-
Check if the database is downloaded successfully to $LOCAL_DB:
ls "$LOCAL_DB/gxdb"
all.README.txt all.assemblies.tsv all.blast_div.tsv.gz all.gxi all.gxs all.manifest all.meta.jsonl all.seq_info.tsv.gz all.taxa.tsv
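The same presence check can be scripted. This is a minimal Python sketch, not part of FCS-GX: the function name `missing_db_files` is hypothetical, and the file names are copied from the listing above. It only confirms that the files exist and does not validate their contents.

```python
import os

# File names copied from the directory listing above.
EXPECTED_DB_FILES = (
    "all.README.txt",
    "all.assemblies.tsv",
    "all.blast_div.tsv.gz",
    "all.gxi",
    "all.gxs",
    "all.manifest",
    "all.meta.jsonl",
    "all.seq_info.tsv.gz",
    "all.taxa.tsv",
)


def missing_db_files(db_dir):
    """Return the expected database files that are absent from db_dir."""
    present = set(os.listdir(db_dir)) if os.path.isdir(db_dir) else set()
    return [name for name in EXPECTED_DB_FILES if name not in present]


if __name__ == "__main__":
    # LOCAL_DB is read from the environment, mirroring the shell variable above.
    db_dir = os.path.join(os.environ.get("LOCAL_DB", "."), "gxdb")
    missing = missing_db_files(db_dir)
    print("missing files:", ", ".join(missing) if missing else "none")
```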
-
If you have access to a tmpfs- or ramfs-backed filesystem, e.g., /dev/shm, you can copy the downloaded database to RAM so that it remains available for successive runs on the same server:
sudo mkdir /my_tmpfs
sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
python3 fcs.py db get --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
-
Check if there are any differences between the source 'all' db and the downloaded 'all' db. If you have access to a tmpfs- or ramfs-backed filesystem, you can also check if there are any differences between the downloaded 'all' db and the cached 'all' db:
python3 fcs.py db check --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
python3 fcs.py db check --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
- Assign the path to the --gx-db folder to GXDB_LOC.
- If you are using the db stored on local disk:
GXDB_LOC=/path/to/db/folder
- If you are using the db stored in RAM:
GXDB_LOC=/my_tmpfs
- Retrieve the organism tax-id from NCBI Taxonomy.
- Screen the genome:
python3 ./fcs.py screen genome --fasta h_sapiens.fa.gz --out-dir ./gx_out/ --gx-db "$GXDB_LOC/gxdb" --tax-id 9606
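When screening several assemblies, it can help to build the command programmatically. The sketch below is a hedged illustration and not part of FCS-GX: `screen_command` and `screen` are hypothetical helper names, only the flags shown in the command above are used, and `check=True` assumes the runner exits with a non-zero status on failure.

```python
import subprocess


def screen_command(fasta, out_dir, gx_db, tax_id, runner="./fcs.py"):
    """Build the argv list for `fcs.py screen genome` with the flags shown above."""
    return [
        "python3", runner, "screen", "genome",
        "--fasta", str(fasta),
        "--out-dir", str(out_dir),
        "--gx-db", str(gx_db),
        "--tax-id", str(tax_id),
    ]


def screen(fasta, out_dir, gx_db, tax_id):
    """Run the screen; raises CalledProcessError if the runner exits non-zero."""
    subprocess.run(screen_command(fasta, out_dir, gx_db, tax_id), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(screen_command("h_sapiens.fa.gz", "./gx_out/", "/path/to/db/folder/gxdb", 9606)))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.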
-
Perform cleaning actions on input genome:
zcat h_sapiens.fa.gz | python3 ./fcs.py clean genome --action-report ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
-
Split on internal contaminants instead of masking:
sed -i 's/FIX/SPLIT/g' ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt
zcat h_sapiens.fa.gz | python3 ./fcs.py clean genome --action-report ./gx_out/h_sapiens.fa.9606.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
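The sed command above edits the action report in place. If you would rather keep the original report, the substitution can be written to a copy instead. This is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and unlike the sed expression it replaces FIX only when it appears as a standalone token, so a sequence identifier that happens to contain the letters FIX is left untouched.

```python
import re


def fix_to_split(report_text):
    """Replace each standalone FIX token with SPLIT; return (new_text, count)."""
    return re.subn(r"\bFIX\b", "SPLIT", report_text)


def write_split_report(src_path, dst_path):
    """Write a copy of the action report with FIX changed to SPLIT."""
    with open(src_path, encoding="utf-8") as src:
        new_text, count = fix_to_split(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(new_text)
    return count  # number of actions that were changed
```

The copy written by `write_split_report` can then be passed to --action-report in place of the original file.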
Test that FCS-GX is operating normally on a small FASTA file.
-
Download the test FASTA:
curl -LO https://zenodo.org/records/10932013/files/FCS_combo_test.fa
-
Screen the genome:
python3 ./fcs.py screen genome --fasta FCS_combo_test.fa --out-dir ./gx_out/ --gx-db /my_tmpfs/gxdb --tax-id 4932
A successful FCS-GX run will print the parameters of the run, sequence masking progress, and a contamination summary report:
-----------------------------------------------------------------------------
tax-id    : 4932
fasta     : /sample-volume/FCS_combo_test.fa
size      : 12.18 MiB
split-fa  : True
BLAST-div : budding yeasts
gx-div    : fung:budding yeasts
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /app/db/gxdb/gxdb/all.gxi
gx-ver    : Nov 27 2023 12:29:26; git:v0.5.0
output    : /output-volume//FCS_combo_test.4932.taxonomy.rpt
-----------------------------------------------------------------------------
Collecting masking statistics...
Collected masking stats: 0.0125624 Gbp; 3.36762s; 3.73035 Mbp/s. Baseline: 1.04906
Processed 420 queries, 12.5732Mbp in 4.35433s. (2.88751Mbp/s); num-jobs:120
Species                   : None
Asserted div              : fung:budding yeasts
Inferred primary-divs     : ['fung:budding yeasts', 'fung:ascomycetes']
Corrected primary-divs    : ['fung:budding yeasts', 'fung:ascomycetes']
Putative contaminant divs : ['prok:g-proteobacteria', 'anml:primates']
Aggregate coverage        : 100%
Minimum contam. coverage  : 20%
-----------------------------------------------------------------------------
fcs_gx_report.txt contamination summary:
----------------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                            405     404339
-----                          ----- ----------
prok:g-proteobacteria            202     201923
anml:primates                    201     200894
virs:eukaryotic viruses            1       1000
anml:nematodes                     1        522
-----------------------------------------------------------------------------
fcs_gx_report.txt action summary:
---------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                            405     404339
-----                          ----- ----------
EXCLUDE                          401     400522
FIX                                2       1922
TRIM                               2       1895
-----------------------------------------------------------------------------
The output directory will contain the following files:
FCS_combo_test.4932.taxonomy.rpt
FCS_combo_test.4932.fcs_gx_report.txt
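A script that chains screening and cleaning needs to know these file names in advance. The sketch below derives them from the input FASTA name, the output directory, and the tax-id. The naming rule (drop only the last extension of the FASTA name, then append the tax-id) is an inference from the two examples on this page, FCS_combo_test.fa and h_sapiens.fa.gz, not documented behavior, and `expected_outputs` is a hypothetical helper.

```python
import os


def expected_outputs(fasta, out_dir, tax_id):
    """Return (taxonomy_report_path, action_report_path) for one screen run.

    Assumes the naming pattern seen in the examples on this page.
    """
    base = os.path.basename(fasta)
    stem, _last_ext = os.path.splitext(base)  # removes only the final extension
    prefix = os.path.join(out_dir, "{}.{}".format(stem, tax_id))
    return prefix + ".taxonomy.rpt", prefix + ".fcs_gx_report.txt"


if __name__ == "__main__":
    for path in expected_outputs("FCS_combo_test.fa", "./gx_out", 4932):
        print(path, "exists" if os.path.exists(path) else "missing")
```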
The output should be similar to the examples for the taxonomy report taxonomy.rpt and the action report fcs_gx_report.txt.
Note: Minor differences in output content are expected with code and database changes. See the FCS-GX Output page for additional information regarding interpreting outputs.
-
Clean the genome:
cat FCS_combo_test.fa | python3 ./fcs.py clean genome --action-report ./gx_out/FCS_combo_test.4932.fcs_gx_report.txt --output clean.fasta --contam-fasta-out contam.fasta
By default this will exclude 401 sequences (EXCLUDE in action report), trim 2 sequences (TRIM), and hardmask 2 sequences at internal contaminants (FIX):
Applied 405 actions; 402417 bps dropped; 1922 bps hardmasked.
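The numbers in this message can be cross-checked against the action summary printed by the screening step. The short Python sketch below does the arithmetic. It assumes that dropped bases are the EXCLUDE and TRIM bases combined and that hardmasked bases are the FIX bases, an interpretation that matches the figures on this page but is not stated by the tool. The helper name is hypothetical.

```python
# (sequences, bases) per action, copied from the action summary of the test run.
ACTION_SUMMARY = {
    "EXCLUDE": (401, 400522),
    "FIX": (2, 1922),
    "TRIM": (2, 1895),
}


def expected_cleaning_totals(summary):
    """Return (actions, bases_dropped, bases_hardmasked) implied by the summary."""
    actions = sum(seqs for seqs, _bases in summary.values())
    dropped = summary["EXCLUDE"][1] + summary["TRIM"][1]
    hardmasked = summary["FIX"][1]
    return actions, dropped, hardmasked


print(expected_cleaning_totals(ACTION_SUMMARY))  # (405, 402417, 1922)
```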
Please create an Issue if you encounter any problems.
For all other questions or comments, please contact us at [email protected]