"Alleleome" is a specialized package designed to explore and analyze natural sequence variations within the Open Reading Frames (ORFs) of alleles of core genes in a species' pan-genome, both at the amino acid and nucleotide levels. This first version focuses on analyzing only the core genes of a pan-genome. It identifies variants such as substitutions, insertions, and deletions through a series of steps:
- Initial quality control and quality assurance (QC/QA) of sequences.
- Building consensus for each gene's allele set.
- Pairwise alignment of consensus sequences with individual alleles.
- Identification and generation of amino acid variant datasets.
- Analysis of synonymous and non-synonymous substitutions from codons and corresponding amino acid data.
The Alleleome workflow is specifically tailored to study the natural sequence variations in core genes of the pan-genome of a species, with an emphasis on variations at the amino acid and nucleotide level.
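The consensus and variant-identification steps listed above can be illustrated with a small standard-library sketch. This is not Alleleome's actual implementation: the function names `build_consensus` and `call_variants` are hypothetical, the alleles are assumed to be already aligned and gap-padded with `-`, and `difflib.SequenceMatcher` stands in for a real pairwise aligner.

```python
from collections import Counter
from difflib import SequenceMatcher


def build_consensus(aligned_alleles):
    """Majority-vote consensus over equal-length, gap-padded sequences (hypothetical helper)."""
    if len({len(seq) for seq in aligned_alleles}) != 1:
        raise ValueError("aligned alleles must all have the same length")
    consensus = []
    for column in zip(*aligned_alleles):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != "-":  # skip columns where a gap is the most common symbol
            consensus.append(residue)
    return "".join(consensus)


def call_variants(consensus, allele):
    """Return (kind, position, consensus_residues, allele_residues) tuples.

    position is the 1-based consensus coordinate where the change starts;
    for an insertion it is the position the inserted residues precede.
    """
    kinds = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}
    variants = []
    # SequenceMatcher is a stand-in for pairwise alignment, not a biological aligner.
    # A "replace" block of unequal lengths may mix substitutions and indels.
    matcher = SequenceMatcher(None, consensus, allele, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        variants.append((kinds[tag], i1 + 1, consensus[i1:i2], allele[j1:j2]))
    return variants


# Made-up amino acid alleles of one gene, already aligned.
alleles = ["MKTAYIAK", "MKTAYIAK", "MKSAYIAK", "MKTAY-AK"]
consensus = build_consensus(alleles)
substitution = call_variants(consensus, "MKSAYIAK")
deletion = call_variants(consensus, "MKTAYAK")
```

In a real analysis a dedicated aligner with a substitution matrix and gap penalties would replace `SequenceMatcher`; the sketch only shows how each allele is compared against the consensus to produce a variant table.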
For more detailed information, refer to our publication: Early Release on bioRxiv
- Introduction
- Getting Started
- Usage
- Features
- Built With
- Versioning
- Authors and Acknowledgment
- License
- Alleleome is tested and confirmed for Linux systems with the Conda package manager.
- Requires Python version 3.10 or higher.
- For optimal performance, especially when processing a large dataset, such as 1400 core genes and their respective alleles across 3400 strains, a high RAM capacity is strongly recommended.
- Git Large File Storage (Git LFS) must be installed for handling large files in the repository. See the Git LFS Installation section below for instructions.
This repository uses Git Large File Storage (Git LFS) to manage large files. To properly clone and use this repository, please ensure you have Git LFS installed.
- Install Git LFS:
  - You can install Git LFS from git-lfs.github.com or use a package manager. For example, on Ubuntu, you can use:
    sudo apt-get install git-lfs
- Initialize Git LFS:
  - After installing, set up Git LFS in your repository:
    git lfs install
- Clone the Repository with LFS:
  - To clone the repository and download the LFS files, use:
    git clone https://github.com/anpanche/Core-Alleleome.git
    cd Core-Alleleome
    git lfs pull
- Update Existing Clone:
  - If you have already cloned the repository without LFS, run:
    git lfs pull
  - This command will download the actual content of the large files.
Before installing Alleleome, it's recommended to create a virtual environment. This helps to manage dependencies and isolate the project.
- Create the Virtual Environment:
  - Run the following command to create a virtual environment named env (you can choose any name):
    python3 -m venv env
- Activate the Virtual Environment:
  - On Linux or macOS, activate the virtual environment by running:
    source env/bin/activate
- Deactivate the Virtual Environment:
  - Once the job has completed, you can deactivate the virtual environment by running:
    deactivate
- Clone the repository (ensure Git LFS is set up as described above).
- Navigate to the Alleleome directory:
  cd Core-Alleleome
- Activate the virtual environment as instructed above.
- Install the package:
  pip install .
Run Alleleome with sample data using:
Alleleome
Refer to the sample_data directory for output organization and details.
To analyze your species data:
Alleleome --path1 path/to/pangenome_alignments --path2 path/to/alleleome
You can find the full usage and parameters of Alleleome by using the --help option:
$ Alleleome --help
usage: Alleleome [-h] [--path1 PATH1] [--path2 PATH2] [--table TABLE] [--log_to_terminal] [--version]
Alleleome - Explore and analyze natural sequence variations within the Open Reading Frames (ORFs) of alleles of core genes in a species pan-genome.
options:
  -h, --help         show this help message and exit
  --path1 PATH1      Path to pangenome_alignments directory
  --path2 PATH2      Path to alleleome output directory containing pangene_summary_v2.csv file generated by Roary
  --table TABLE      Path to a custom CSV pangene summary table. If not provided, pangene_summary_v2.csv file (generated by Roary) in the given Alleleome output directory will be used.
  --log_to_terminal  Log message will be printed to the terminal instead of a file.
  --version          show program's version number and exit
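For readers who want to wrap or script the command line, the options shown above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the help text only, not the package's source: the function name `build_parser`, the `None` defaults, and the `0.0.0` version string are placeholders.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the --help output."""
    parser = argparse.ArgumentParser(
        prog="Alleleome",
        description=(
            "Explore and analyze natural sequence variations within the "
            "Open Reading Frames (ORFs) of alleles of core genes in a "
            "species pan-genome."
        ),
    )
    # Defaults are assumptions; the real tool falls back to bundled sample data.
    parser.add_argument("--path1", default=None,
                        help="Path to pangenome_alignments directory")
    parser.add_argument("--path2", default=None,
                        help="Path to alleleome output directory")
    parser.add_argument("--table", default=None,
                        help="Path to a custom CSV pangene summary table")
    parser.add_argument("--log_to_terminal", action="store_true",
                        help="Print log messages to the terminal instead of a file")
    # Placeholder version string, not the real release number.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


args = build_parser().parse_args(
    ["--path1", "path/to/pangenome_alignments", "--path2", "path/to/alleleome"]
)
```

With these arguments, `args.table` stays `None`, which corresponds to the documented fallback to the pangene_summary_v2.csv file in the output directory.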
Alleleome introduces the concept of "ORF alleleome," encapsulating the gene alleles found across all strains of a species, thus providing a comprehensive view of genome-scale sequence variations. This analysis can be instrumental in understanding sequence diversity characteristics and natural selection processes across different species within a family. The study of the alleleome offers insights into the genetic basis of natural selection in a species.
Key features include:
- Analysis of sequence variants using the consensus sequence of ORFs.
- Identification of dominant amino acids and their variants at specific positions.
- Revealing natural sequence and structural variations compared with the consensus sequence and structural attributes.
- Identification of genome-scale synonymous and non-synonymous mutations through the analysis of codon changes and their corresponding amino acid changes.
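The last feature, separating synonymous from non-synonymous substitutions, can be illustrated by comparing codons position by position. The sketch below is an assumption-laden illustration, not the package's code: `classify_codon_changes` is a hypothetical name, both ORFs are assumed to be in frame and of equal length (no indels), and the standard genetic code is built by hand so the example needs no third-party library.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(consensus_orf, allele_orf):
    """Return (codon_position, ref_codon, alt_codon, ref_aa, alt_aa, kind) tuples."""
    if len(consensus_orf) != len(allele_orf) or len(consensus_orf) % 3 != 0:
        raise ValueError("ORFs must be in frame and of equal length")
    changes = []
    for start in range(0, len(consensus_orf), 3):
        ref = consensus_orf[start:start + 3].upper()
        alt = allele_orf[start:start + 3].upper()
        if ref == alt:
            continue
        ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
        # Same amino acid despite a different codon means a synonymous change.
        kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
        changes.append((start // 3 + 1, ref, alt, ref_aa, alt_aa, kind))
    return changes


# Made-up example: codon 2 changes AAA to AAG, codon 3 changes GCT to GTT.
changes = classify_codon_changes("ATGAAAGCT", "ATGAAGGTT")
```

Here the second codon still encodes lysine and is reported as synonymous, while the third changes alanine to valine and is reported as non-synonymous.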
- Python and Biopython.
- Integrated with the BGCflow workflow using Snakemake.
- [Versioning details here]
- [List of authors and contributors]
- Special thanks to [acknowledgments].
This project is licensed under the MIT License - see the LICENSE file for details.