Phylotranscriptomic analysis pipeline for phycologists

Citation: Cheon, S., Zhang, J. and Park, C., 2020. Is phylotranscriptomics as reliable as phylogenomics?. Molecular biology and evolution, 37(12), pp.3672-3683.

##Notice: Consistent text formatting helps readers interpret information. <text> marks a parameter for Linux shell commands and Python scripts, such as an input file name or an integer value.

Requirements

Hardware
32-core processors (recommend > 8-core processors)
256 GB (gigabytes) of RAM (recommend > 120 GB of RAM)
> 1 TB (terabyte) of disk space required for raw sequencing data and processed data
Ubuntu 16.04 LTS (recommend an LTS version)
Internet-accessible environment

Software
Python 2.7 with Biopython
Java 1.8 (or higher)
cmake
SRA Toolkit (v 2.10.8 or higher)
Trimmomatic (v 0.36 or higher)
Samtools
Trinity (v 2.2.0 or higher)
Jellyfish (v 2.3.0 or higher)
Bowtie2 (v 2.3.5.1 or higher)
TransDecoder (v 3.0.0 or higher)
CD-hit (v 4.6.6 or higher)
BLAST+ (v 2.9.0 or higher)
OrthoFinder (v 2.4.0 or higher)
DIAMOND (v 0.9.24 or higher)
MCL
Prank (v 150803)
Phyutility (v 2.7.1)
IQ-Tree (v 1.6.11 or higher)

Installation on Linux with conda

Install Anaconda

wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh

sh Anaconda3-2020.11-Linux-x86_64.sh

rm Anaconda3-2020.11-Linux-x86_64.sh

Create the conda environment and install the dependencies

conda update conda

conda env create -n phylo --file 2021.May.PhycoPhylo_env.yml

conda activate phylo

Install the Phyutility Linux package

sudo apt-get install phyutility

Download Trimmomatic

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

unzip Trimmomatic-0.39.zip
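
After installation, it can help to confirm that the command-line tools used in the tutorial are reachable. The following is a minimal sketch, not part of this repository: the tool names are taken from the commands in this README, and the helper name missing_tools is hypothetical. It uses shutil.which, so it assumes a Python 3 interpreter rather than the Python 2.7 used by the pipeline scripts.

```python
import shutil

# Executable names as they appear in the commands of this README.
REQUIRED_TOOLS = [
    "fastq-dump", "Trinity", "TransDecoder.LongOrfs", "TransDecoder.Predict",
    "makeblastdb", "blastp", "cdhit", "orthofinder", "iqtree",
]


def missing_tools(tools, lookup=shutil.which):
    """Return the tools for which lookup() finds no executable on PATH."""
    return [tool for tool in tools if lookup(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print("not found on PATH: " + tool)
```

Run it inside the activated phylo environment; an empty output means every listed executable was found.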

Quick start.

1. Quick start guideline for example raw data download
2. Quick start guideline from translated de novo assembled transcripts with example dataset

Tutorial

##Notice: If you have proteome sequences or translated de novo assembled reference transcripts, you can start from Part 2.

Part 1. RNA-seq raw data download from NCBI SRA database

For a phylogeny based on RNA-seq data, we download the raw RNA-seq reads from the NCBI SRA database.

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files <SRA Accession ID>
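
When several taxa are involved, the same command has to be repeated for each accession. A minimal sketch of that loop, assuming fastq-dump is on PATH; the helper names fastq_dump_command and download_all are hypothetical and not part of this repository.

```python
import subprocess

# Defline template copied from the command above; $ri is the read number,
# so each read name ends in "/<read number>".
DEFLINE = "@$sn[_$rn]/$ri"


def fastq_dump_command(accession):
    """Build the fastq-dump argument list for one SRA accession."""
    return ["fastq-dump", "--defline-seq", DEFLINE, "--split-files", accession]


def download_all(accessions, run=subprocess.check_call):
    """Run fastq-dump once per accession, stopping at the first failure."""
    for accession in accessions:
        run(fastq_dump_command(accession))
```

Passing the arguments as a list avoids shell quoting problems with the "$" and "[" characters in the defline template.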

Part 2. De novo transcriptome assembly and translation

1. De novo transcriptome assembly with Trinity
Short-read RNA sequencing data are assembled with the Trinity assembler, which calls the Trimmomatic read trimming tool for Illumina NGS data.

The commands below are for data sets with known adaptor sequences and phred scores for base quality.

If you have single-end sequencing data

Trinity --seqType fq --trimmomatic --quality_trimming_params "ILLUMINACLIP:</home/your/path/trinity-plugins/Trimmomatic-0.36/adapters/TruSeq3-PE.fa>:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36" --max_memory <200G> --CPU <32> --full_cleanup --single <single-end_reads.fastq> --output <trinity_output_Name>

or paired-end sequencing data

Trinity --seqType fq --trimmomatic --quality_trimming_params "ILLUMINACLIP:</home/your/path/trinity-plugins/Trimmomatic-0.36/adapters/TruSeq3-PE.fa>:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36" --max_memory <200G> --CPU <32> --full_cleanup --left <forward_reads.fastq> --right <reverse_reads.fastq> --output <trinity_output_Name>
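
The trimming settings must reach Trinity as one quoted argument, with only the adapter file path changing between machines. The sketch below builds the paired-end command as an argument list to make that explicit. The function names are hypothetical, and the default values simply mirror the command above.

```python
def trimmomatic_params(adapter_fasta, min_len=36):
    """Return the Trimmomatic step string passed to --quality_trimming_params."""
    steps = [
        "ILLUMINACLIP:{0}:2:30:10".format(adapter_fasta),
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:15",
        "MINLEN:{0}".format(min_len),
    ]
    return " ".join(steps)


def trinity_paired_command(left, right, adapter_fasta, output_dir,
                           max_memory="200G", cpu=32):
    """Build the paired-end Trinity argument list shown in this README."""
    return [
        "Trinity", "--seqType", "fq", "--trimmomatic",
        # The whole step string is a single list element, i.e. one argument.
        "--quality_trimming_params", trimmomatic_params(adapter_fasta),
        "--max_memory", max_memory, "--CPU", str(cpu), "--full_cleanup",
        "--left", left, "--right", right, "--output", output_dir,
    ]
```

The resulting list can be handed to subprocess.check_call without any extra quoting.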

2. Find open reading frames (ORFs) and translate using TransDecoder with blastp for ORF selection

Assembled transcripts are translated with the TransDecoder programs, and ORFs are chosen using the blastp results.
For blastp, download UniProt/Swiss-Prot and build a database from it.

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -dbtype prot
  
TransDecoder.LongOrfs -t <transcripts> -S
blastp -query <transcripts>.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 10 -num_threads 32 -out Genus_Species.outfmt6
TransDecoder.Predict -t <transcripts> --retain_blastp_hits Genus_Species.outfmt6 --single_best_only
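
Before running TransDecoder.Predict, it can be useful to see how many ORFs have a convincing Swiss-Prot hit. The sketch below reads the tabular blastp output (-outfmt 6, where column 1 is the query, column 2 the subject, and column 11 the e-value) and keeps the best hit per query. The function name best_hits and the 1e-5 threshold are assumptions for illustration; the pipeline itself runs blastp with -evalue 10 and does not apply this filter.

```python
def best_hits(lines, max_evalue=1e-5):
    """Map each query ID to its lowest-e-value hit at or below max_evalue."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

For example, with open("Genus_Species.outfmt6") as handle: print(len(best_hits(handle))) reports the number of ORFs with at least one hit under the chosen threshold.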

3. Clustering with CD-hit

Reduce translated sequence redundancy with CD-hit

cdhit -i <transcripts>.transdecoder.pep -o <Genus_Species>.fa.cdhit -c 0.99 -n 5 -T 32

4. Fixing sequence IDs

The sequence IDs in the CD-hit output file "Genus_Species.fa.cdhit" are shortened to Genus_Species@seqID. The special character "@" separates the taxon name from the sequence ID. Any "-" (hyphen) in a sequence name will be replaced by phyutility and cause problems in downstream processing.

python2 fix_names_from_CDhit.py <CDhit output file.cdhit> <Genus name> <Species name>
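
To illustrate the naming convention, here is a minimal sketch of the kind of header rewrite described above. It is not the code of fix_names_from_CDhit.py: the function names are hypothetical, and taking the first whitespace-delimited token as seqID and replacing hyphens with underscores are assumptions.

```python
def fix_header(header, genus, species):
    """Turn a FASTA header line into '>Genus_Species@seqID'."""
    # Keep only the first token of the header and remove hyphens from it.
    seq_id = header.lstrip(">").split()[0].replace("-", "_")
    return ">{0}_{1}@{2}".format(genus, species, seq_id)


def fix_fasta_lines(lines, genus, species):
    """Yield FASTA lines with every header rewritten; sequences are unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield fix_header(line, genus, species)
        else:
            yield line
```

Because the taxon name sits before "@", downstream steps can recover it with a single split on that character.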

Part 3. Orthology inference and single-copy ortholog extraction

1. Running OrthoFinder

Copy all the Genus_Species.fix.fa files (or any proteome sequences) into a new directory such as <OrthoFinder_running_dir>, then run orthology inference.

orthofinder -f <OrthoFinder_running_dir> -S diamond -t <number of threads>

2. Single-copy ortholog prediction

Choose the minimal number of taxa required for single-copy ortholog inference (half of the taxa is recommended).

python2 singlecopy_from_OrthoFinder.py <OrthoFinder_running_dir> SingleCopy <Min number of taxa>
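
The filtering idea can be sketched as follows. This is an illustration under assumptions, not the logic of singlecopy_from_OrthoFinder.py: it assumes sequence IDs follow the Genus_Species@seqID convention, treats an orthogroup as single copy when no taxon contributes more than one sequence, and uses hypothetical function names.

```python
from collections import Counter


def taxon_of(seq_id):
    """Return the taxon name, i.e. the part of the ID before '@'."""
    return seq_id.split("@", 1)[0]


def single_copy_orthogroups(orthogroups, min_taxa):
    """Keep orthogroups with one sequence per taxon and at least min_taxa taxa.

    orthogroups maps an orthogroup name to a list of sequence IDs.
    """
    selected = {}
    for name, seq_ids in orthogroups.items():
        counts = Counter(taxon_of(seq_id) for seq_id in seq_ids)
        if len(counts) >= min_taxa and all(n == 1 for n in counts.values()):
            selected[name] = list(seq_ids)
    return selected
```

With 10 taxa and the recommended setting of half of the taxa, min_taxa would be 5.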

3. Multiple sequence alignment with Prank

python2 prank_Wrapper.py SingleCopy

4. Alignment trimming with Phyutility

I usually use 0.3 for <min_align_column>, the minimal alignment column occupancy.

python2 phyutility_Wrapper.py SingleCopy <min_align_column>
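
To show what an occupancy threshold such as 0.3 does, the sketch below drops every alignment column in which fewer than 30% of the sequences carry a residue. It is an illustration only and does not reproduce phyutility or phyutility_Wrapper.py; treating "-" and "?" as missing data and the function name trim_columns are assumptions.

```python
def trim_columns(alignment, min_occupancy=0.3, missing="-?"):
    """Remove columns whose fraction of non-missing characters is too low.

    alignment maps a sequence name to an aligned sequence of equal length.
    """
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    n_seqs = float(len(seqs))
    keep = []
    for col in range(len(seqs[0])):
        present = sum(1 for seq in seqs if seq[col] not in missing)
        if present / n_seqs >= min_occupancy:
            keep.append(col)
    return dict(
        (name, "".join(seq[col] for col in keep))
        for name, seq in zip(names, seqs)
    )
```

A lower threshold keeps more sparsely populated columns, and a higher one produces a shorter but denser alignment.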

5. Concatenate supermatrix

You can choose the minimal cleaned alignment length per ortholog and the minimal number of taxa (recommended: 150 and half of the taxa for an amino acid tree).
The selected cleaned orthologs are then concatenated into a supermatrix.

python2 supermatrix_concatenate.py SingleCopy <min_align_length> <min_taxa> <output_name>
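
The two filters and the concatenation can be sketched as below. This is a simplified illustration, not the code of supermatrix_concatenate.py: the function name, the input layout (one dictionary of taxon to aligned sequence per ortholog), and padding absent taxa with "-" are assumptions.

```python
def concatenate(alignments, min_align_length=150, min_taxa=2, gap="-"):
    """Concatenate alignments that pass the length and taxon filters.

    alignments is a list of (name, {taxon: aligned_sequence}) pairs.
    Returns (supermatrix, partitions): supermatrix maps taxon to the joined
    sequence, and partitions lists (name, start, end) with 1-based
    inclusive coordinates.
    """
    kept = []
    for name, aln in alignments:
        if not aln:
            continue
        length = len(next(iter(aln.values())))
        if length >= min_align_length and len(aln) >= min_taxa:
            kept.append((name, aln, length))

    taxa = sorted(set(taxon for _, aln, _ in kept for taxon in aln))
    pieces = dict((taxon, []) for taxon in taxa)
    partitions = []
    start = 1
    for name, aln, length in kept:
        for taxon in taxa:
            # Taxa missing from this ortholog are filled with gap characters.
            pieces[taxon].append(aln.get(taxon, gap * length))
        partitions.append((name, start, start + length - 1))
        start += length
    supermatrix = dict((taxon, "".join(parts)) for taxon, parts in pieces.items())
    return supermatrix, partitions
```

The partition coordinates record where each ortholog sits in the supermatrix, which is the kind of information a partition file for tree inference needs.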

Part 4. Phylotranscriptomic tree reconstruction

Run IQ-Tree with 1000 UFBoot replicates to search for the best-fit tree, using the LG+C60+R+F model.

iqtree -s <Concatenate_matrix>.phy -spp <Concatenate_matrix>.model -m LG+C60+R+F -bb 1000 -nt 32
