Amir Mohseni edited this page Nov 12, 2024 · 5 revisions

Prologue

This tutorial was tested on Ubuntu 20.04.6 LTS. If you get stuck, or it becomes too confusing, it's because it is. Please go ahead and create an issue and we will help you out 👍

  1. First, download Miniconda from https://docs.conda.io/en/main/miniconda.html
  2. Get this repository, either by clicking the green "Code" button at the top right of the repository page and choosing "Download ZIP," or by running $ git clone https://github.com/ucrbioinfo/Fugue.git in your desired directory.
  3. Create a conda environment and activate it:
cd Fugue
conda create -n fugue python=3.10 -y
conda activate fugue
  4. Install the Python packages FUGUE requires:
pip install pandas
pip install biopython
pip install PyYAML
pip install requests
  5. Create a separate conda environment for DIAMOND and install DIAMOND into it. FUGUE and DIAMOND run on different Python versions, so they cannot share an environment.
conda create -n diamond python=2.7 bioconda::diamond -y

Do not switch to this environment. FUGUE will use it on its own at a later stage.

If you already have an environment called diamond, give the new environment any other name, then edit the file FUGUE/src/utils/diamond/0_just_run_this.sh and replace the name diamond on lines 4, 5, and 6 with the name you chose.
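
To confirm that the four packages from the pip step are visible inside the fugue environment, a quick check like the one below can help. This is only a sketch and not part of FUGUE: the function name find_missing is made up for illustration. The one detail worth knowing is that biopython is imported as Bio and PyYAML as yaml.

```python
import importlib.util

# pip package name -> module name used in "import ...".
# biopython is imported as "Bio" and PyYAML as "yaml".
REQUIRED_MODULES = {
    "pandas": "pandas",
    "biopython": "Bio",
    "PyYAML": "yaml",
    "requests": "requests",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the fugue environment active. If it reports a missing package, repeat the matching pip install command.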

Authentication

FUGUE downloads fungi data from the following databases:

  1. NCBI
  2. FungiDB
  3. EnsemblFungi
  4. MycoCosm

For databases 1 and 4 (NCBI and MycoCosm), visit this Google Drive shared folder and complete the mandatory steps in the documents "[Data Source] NCBI Datasets" and "[Data Source] MycoCosm." These two databases require authentication with your own credentials. The other two data-source documents are for your reference only; no further action is needed from you.

Downloading the Data

Running FUGUE in the Background

We recommend opening a tmux session at this point, as downloading and processing the data may take a couple of hours. Start one by typing tmux. Inside the new session you may have to activate the fugue conda environment again. Then run $ python src/main.py and continue with the instructions below.

While FUGUE is running, you can detach from the tmux session by pressing CTRL + B, then D. Detaching does not stop the program; it keeps running in the background. To reattach and continue working with FUGUE, type tmux attach -t 0, where 0 is the name of your tmux session. To see a list of all sessions, type tmux list-sessions. To kill session 0, type tmux kill-session -t 0.

Running FUGUE

  1. To get the data used in ALLEGRO, first complete steps 6 and 8 of this list. Then run $ python src/main.py and enter 13. You may now leave the processes to run in the background and skip everything below. When they finish, read the last item in this list. If you are using FUGUE for other purposes, continue reading.

  2. Run $ python src/main.py and select download options 1 through 4 (or simply choose option 5 instead).

  3. If you are preparing data for ALLEGRO, once all files are downloaded you may create CDS files from GFF using option 7 (otherwise, skip options 6 and 7). In these CDS files, a pipe | character marks each intron/exon boundary within a gene. ALLEGRO ignores gRNAs with pipes in their sequence.

  4. Select option 8 and then option 9 to merge the downloaded files and remove duplicates. Run option 8 whether you downloaded from a single database or from several databases in options 1-5.

  5. In ALLEGRO, we run DIAMOND to find orthogroups of S. cerevisiae genes of interest across all other species. Navigate to src/utils/diamond. Make sure DIAMOND is installed in a conda environment called diamond (or, if your environment has a different name, change the environment name in the shell script 0_just_run_this.sh). Place the .faa amino acid file for S. cerevisiae (or your reference species of interest) in inputs/reference. You can find saccharomyces_cerevisiae.faa at data/fourdbs_concat/proteomes/saccharomyces_cerevisiae.faa; copy this file to src/utils/diamond/inputs/reference.

  6. Edit src/utils/diamond/config.yaml so that it gives the FULL (absolute) paths of the CDS and proteome files for your reference species, following the pattern of the provided example configuration file:

    cds_path: '/point/to/your/own/path/FUGUE/my_data/saccharomyces_cerevisiae_cds.fna'
    proteome_path: '/point/to/your/own/path/FUGUE/my_data/proteomes/saccharomyces_cerevisiae.faa'
    
    query_proteins_dir: '/point/to/your/own/path/FUGUE/data/fourdbs_concat/proteomes/' # Do not change the path after /FUGUE/data/...
    
  7. Navigate to the root directory and run FUGUE again with python src/main.py. Select option 10 to run DIAMOND. This generates Orthogroups.tsv and removes orthogroups with < 30% protein identity. It also generates a species list called fourdbs_hi_input_species.csv under data/fourdbs_concat/.

  8. Navigate to src/utils/ortholog_finder and set the reference_species name and gene_names in the config.yaml file.

  9. Navigate to the root directory and run FUGUE again with python src/main.py. Select option 11 to generate, under orthogroups/, a FASTA file for each species containing only the genes of interest and their orthologs.

  10. Run FUGUE one last time and select option 12 if you previously selected option 7. This will create orthologous gene files with pipe delimiters derived from GFF files.

  11. The files we used for all ALLEGRO experiments are under data/fourdbs_concat/ortho_from_gff and data/fourdbs_concat/fourdbs_input_species.csv. We copied both the directory data/fourdbs_concat/ortho_from_gff and the CSV file into ALLEGRO/input.
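
Step 6 asks for full paths in src/utils/diamond/config.yaml, and a relative or mistyped path is an easy mistake to make. The sketch below shows one way to check the three entries before running option 10. It is not part of FUGUE: the function name check_config_paths is hypothetical, and only the three key names come from the example in step 6.

```python
from pathlib import Path

# Key names taken from the example in step 6; the rest is illustrative.
PATH_KEYS = ("cds_path", "proteome_path", "query_proteins_dir")


def check_config_paths(config):
    """Return a list of human-readable problems found in the path entries."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: missing from config")
            continue
        path = Path(value)
        if not path.is_absolute():
            problems.append(f"{key}: not a full (absolute) path: {value}")
        elif not path.exists():
            problems.append(f"{key}: path does not exist: {value}")
    return problems


if __name__ == "__main__":
    # In practice the dict would be loaded from config.yaml, for example
    # with PyYAML's yaml.safe_load. A hand-written dict keeps this sketch
    # free of third-party imports.
    example = {
        "cds_path": "my_data/saccharomyces_cerevisiae_cds.fna",  # relative
        "proteome_path": "/point/to/your/own/path/FUGUE/my_data/proteomes/saccharomyces_cerevisiae.faa",
    }
    for problem in check_config_paths(example):
        print(problem)
```

An empty list means all three entries are absolute paths that exist on disk. It says nothing about whether the files themselves are valid.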

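Step 3 says the pipe-delimited CDS files mark intron/exon boundaries and that ALLEGRO ignores gRNAs containing a pipe. The toy sketch below shows what that means for a sequence: any fixed-length window that overlaps a pipe is skipped, so only windows lying entirely inside one exon remain. This is not ALLEGRO's code. The function name candidate_windows and the default length of 20 are assumptions for illustration, and a real gRNA search applies further criteria that are not modeled here.

```python
def candidate_windows(cds, length=20):
    """Yield (start, window) for every window of `length` characters that
    contains no pipe, i.e. that lies entirely within one exon.
    """
    for start in range(len(cds) - length + 1):
        window = cds[start:start + length]
        if "|" in window:
            continue  # window crosses an intron/exon boundary marker
        yield start, window


if __name__ == "__main__":
    # Two short made-up exons joined by a pipe, scanned with a window of 4.
    toy_cds = "ATGGCC|TTTAAAGGG"
    for start, window in candidate_windows(toy_cds, length=4):
        print(start, window)
```

In the toy sequence the pipe sits at index 6, so the windows starting at positions 3 through 6 are dropped and the rest are kept.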