Lumbridge is a project for annotating plant DNA and preparing it as input for neural network applications. It unifies the diverse annotations generated by various bioinformatics tools into a single structured format, then converts them into a vector file that neural network models can consume directly. In this way, Lumbridge connects genomic annotation data with machine learning workflows.
- Project Goal
- Requirements and Dependencies
- Execution Methods
- Installation and Setup
- Usage Instructions
- Project Workflow
- Structure of Model File
- License Information
- Contact Information
- Acknowledgments
The primary objective of this project is to meticulously annotate and prepare plant DNA sequences for subsequent computational modeling. This involves a comprehensive analysis and labeling of various genetic elements within the DNA sequences of plants, followed by a systematic preparation of these annotated sequences to be utilized effectively in predictive modeling and simulation studies.
Key aspects of this process include:
- Using gff3 files and cds files (which in this case contain the ORFs) as primary inputs for foundational genomic information.
- Employing bioinformatics tools to identify and annotate additional key genomic features that are not explicitly detailed in the gff3 and cds files.
- This includes the identification of promoter motifs, polyadenylation sequences, and other regulatory regions that play a crucial role in gene expression and regulation.
- Integrating comprehensive data from various sources and formats to build a complete and accurate representation of the plant DNA for modeling purposes.
- Ensuring the integrity and accuracy of the annotated data, setting a strong foundation for predictive modeling and simulation studies.
This meticulous preparation is crucial for the success of modeling efforts, as it lays the foundational groundwork for accurate and reliable biological simulations and analyses.
Before beginning the setup, ensure your system meets the following requirements:
- Operating System: Linux
- Python Version: Python 3.10 or higher
- Additional Tools: Homer2, Bedtools
The project can be executed using two primary methods:
- Docker: Uses a Docker container to encapsulate the environment and dependencies, ensuring consistent and isolated execution across different systems.
- Linux System: Direct execution on a Linux-based system, where dependencies and environment configuration are managed locally.
- Build image:
  docker build -t lumbridge-app .
- Run container:
  docker run -p 4000:80 lumbridge-app
A Python virtual environment is recommended for managing the project's dependencies. Follow these steps to create and activate your virtual environment:
- Create a Virtual Environment:
  python3.10 -m venv venv
- Activate the Virtual Environment:
  source venv/bin/activate
- Install Required Python Packages:
  pip3 install -Ur requirements.txt
Homer2 is a key tool required for this project. Follow the detailed installation guide to set up Homer2 on your system:
- Download Installation Script:
  wget http://homer.ucsd.edu/homer/configureHomer.pl
- Run Installation Script:
  perl configureHomer.pl -install
- Add to PATH in ~/.bash_profile:
  # Add the following line to your ~/.bash_profile
  PATH=$PATH:/home/matej/homer2/bin
  # Verify installation
  which findMotifs.pl
  # Should output the path, in this case `/home/matej/homer2/bin/findMotifs.pl`
- Update Configuration File:
  # Update HOMER2_BIN_PATH in the configuration file
  HOMER2_BIN_PATH="/home/matej/homer2/bin"
Refer to the Homer2 Installation Guide for more details.
- Install Bedtools:
  sudo apt-get install bedtools
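Once both tools are installed, a quick check that they are reachable can save a failed run later. The following is a minimal sketch and not part of Lumbridge itself: the helper name `missing_tools` is hypothetical, and the executable names are assumptions (`findMotifs.pl` from the Homer2 step above, `bedtools` as the command provided by the bedtools package).

```python
import shutil

# Assumed executable names: findMotifs.pl ships with Homer2 (see the PATH
# step above); "bedtools" is the command installed by the bedtools package.
REQUIRED_TOOLS = ("findMotifs.pl", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If Homer2 is reported as missing, re-check the PATH line in ~/.bash_profile and open a new shell before retrying.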
To customize the application's behavior, you have two options for configuration:
- Create a Custom Configuration File: You can create your own config.py file at /etc/lumbridge/config.py. This custom configuration file allows you to specify settings tailored to your needs.
- Modify the Local Configuration: Alternatively, you can directly modify the local config.py file in the project's root directory. This is a straightforward way to change settings if you don't require a separate configuration file.
By default, the application uses test data from Arabidopsis thaliana. This dataset serves as a standard reference for initial runs and testing purposes. You can replace it with your own data in the configuration settings.
The tests check the output of the pipeline. Provide your input and output paths in tests/test_config.py and run run_tests.py.
The workflow is predicated on the following assumptions and requirements for the input files:
- Fasta Files:
  - The fasta files are assumed to be from the forward (positive) strand.
  - They contain the nucleotide sequences that will be used for further analysis and annotation.
- GFF3 and ORF Files:
  - These files may contain annotations from both the forward and reverse strands of DNA.
  - However, for the purpose of this project, only annotations from the forward strand are considered.
  - The GFF3 (General Feature Format version 3) files provide detailed annotations about genomic features.
  - ORF (Open Reading Frame) files detail regions of the genome that potentially code for proteins.
- File Naming Conventions:
  - It's crucial that related files (fasta, gff3, and orf) have the same base name in their filenames to be recognized as belonging together.
  - For example, files for one chromosome of Arabidopsis thaliana should be named consistently, like arabidopsis_chr1.fasta, arabidopsis_chr1.gff3, and arabidopsis_chr1.cds.
This structured approach ensures that the data is correctly aligned and processed in subsequent stages of the project, allowing for accurate analysis and modeling based on the genomic information provided in these files.
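The two input rules above (matching base names, forward strand only) can be illustrated with a short sketch. This is not Lumbridge's own code: the function names `group_by_base_name` and `forward_strand_records` are hypothetical, the file extensions are taken from the naming example, and the strand filter relies only on the GFF3 convention that the seventh tab-separated column holds the strand. The sample GFF3 records are made up for the demo.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Extensions taken from the naming example; adjust them if your files differ.
EXTENSIONS = {"fasta": ".fasta", "gff3": ".gff3", "orf": ".cds"}


def group_by_base_name(fasta_dir, gff3_dir, orf_dir):
    """Map each base name to its fasta/gff3/orf paths, keeping complete sets only."""
    folders = {"fasta": fasta_dir, "gff3": gff3_dir, "orf": orf_dir}
    groups = defaultdict(dict)
    for kind, folder in folders.items():
        for path in sorted(Path(folder).glob("*" + EXTENSIONS[kind])):
            groups[path.stem][kind] = path
    return {base: files for base, files in groups.items()
            if len(files) == len(EXTENSIONS)}


def forward_strand_records(gff3_lines):
    """Yield the nine GFF3 columns of every record on the '+' strand."""
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[6] == "+":
            yield columns


# Demo with empty placeholder files; one folder stands in for all three inputs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("arabidopsis_chr1.fasta", "arabidopsis_chr1.gff3",
                 "arabidopsis_chr1.cds", "arabidopsis_chr2.fasta"):
        (root / name).touch()
    demo_groups = group_by_base_name(root, root, root)
    print(sorted(demo_groups))  # chr2 is dropped: its gff3 and cds are missing

# Made-up records, used only to show the strand filter.
demo_records = list(forward_strand_records([
    "##gff-version 3",
    "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=geneA",
    "chr1\texample\tgene\t300\t400\t.\t-\t.\tID=geneB",
]))
print([record[8] for record in demo_records])
```

Dropping incomplete sets is a design choice of this sketch; a real pipeline might prefer to raise an error so that a misnamed file does not go unnoticed.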
fasta_folder gff3_folder orf_folder
| | |
v v v
---------------------------------
Lumbridge
---------------------------------
| |
v v
model_data others
The data file is structured into two distinct sections: the header and the data content.
The header section encapsulates critical metadata about the file, including the date of creation and the schema of the base pair (bp) vector. Each element in the bp vector represents a specific genomic feature, encoded as follows:
#HEADER#
#DATE=2024-01-14T16:35:29.982692
#pre_processing_version=[0, 1, 0]
#bp_vector_schema=['A', 'C', 'G', 'T', 'PROMOTOR_MOTIF', 'ORF', 'POLY_ADENYL', 'miRNA', 'rRNA', 'gene']
#description of nucleotide:A=[1, 0, 0, 0], C=[0, 1, 0, 0], G=[0, 0, 1, 0], T=[0, 0, 0, 1]
#description of feature:0=no_present, 1=start, 2=continuation/ongoing, 3=end
#max_feature_overlap=0
####END####
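A header in this form can be read back with a few lines of Python. The sketch below is an illustration based only on the sample header shown here, not a parser shipped with Lumbridge; the function name `parse_header` is hypothetical, and the handling of the two "description of ..." lines (kept as raw strings) is an assumption.

```python
import ast


def parse_header(lines):
    """Collect the key/value pairs found between #HEADER# and ####END####."""
    header = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "#HEADER#":
            in_header = True
            continue
        if line == "####END####":
            break
        if not in_header or not line.startswith("#"):
            continue
        body = line[1:]
        if body.startswith("description of"):
            # Free-text legend lines: keep the text after the colon as-is.
            key, _, value = body.partition(":")
            header[key] = value
            continue
        key, sep, value = body.partition("=")
        if not sep:
            continue
        if value.startswith("[") or value.isdigit():
            # Lists and integers are written as Python literals in the sample.
            header[key] = ast.literal_eval(value)
        else:
            header[key] = value  # e.g. the ISO timestamp in DATE
    return header
```

For the sample header, `parse_header(...)["bp_vector_schema"]` would be the ten-element list of column names and `["max_feature_overlap"]` the integer 0.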
In the data section, each base pair (bp) from the FASTA file is represented by a vector.
The first four positions of the vector represent the nucleotides (A, C, G, T).
The parameter max_feature_overlap sets the maximum overlap allowed for a given feature. In the examples below, each feature occupies max_feature_overlap + 1 positions, so a higher value of max_feature_overlap also increases the size of the vector.
For example, with max_feature_overlap=0:
- [1, 0, 0, 0, 0, 0, 0, 0, 0, 3]
  This vector indicates Adenine and the end of a gene.
For example, with max_feature_overlap=1:
- [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3]
  This vector indicates Cytosine; the last two elements (2, 3) show that one gene is ongoing while another ends.
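To make the layout concrete, here is a minimal decoding sketch. It assumes, based on the two example vectors (lengths 10 and 16), that each feature occupies max_feature_overlap + 1 consecutive positions after the four nucleotide positions; the function name `decode_bp_vector` is hypothetical, and how Lumbridge encodes bases other than A, C, G, and T is not covered by the source, so the sketch returns None for them.

```python
NUCLEOTIDES = ["A", "C", "G", "T"]
FEATURES = ["PROMOTOR_MOTIF", "ORF", "POLY_ADENYL", "miRNA", "rRNA", "gene"]
STATES = {1: "start", 2: "continuation/ongoing", 3: "end"}


def decode_bp_vector(vector, max_feature_overlap=0):
    """Return (nucleotide, {feature: [states]}) for one base-pair vector."""
    slots = max_feature_overlap + 1  # assumed number of positions per feature
    expected = len(NUCLEOTIDES) + len(FEATURES) * slots
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")

    one_hot = list(vector[:len(NUCLEOTIDES)])
    # Exactly one 1 identifies the base; anything else is left undecoded.
    nucleotide = NUCLEOTIDES[one_hot.index(1)] if one_hot.count(1) == 1 else None

    features = {}
    for index, name in enumerate(FEATURES):
        start = len(NUCLEOTIDES) + index * slots
        # 0 means the feature is absent in that slot, so it is skipped.
        states = [STATES[v] for v in vector[start:start + slots] if v != 0]
        if states:
            features[name] = states
    return nucleotide, features


print(decode_bp_vector([1, 0, 0, 0, 0, 0, 0, 0, 0, 3]))
# ('A', {'gene': ['end']})
print(decode_bp_vector([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3], 1))
# ('C', {'gene': ['continuation/ongoing', 'end']})
```

In practice the feature order and the overlap value should be read from the file header rather than hard-coded as they are here.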
MIT
This project uses two additional programs: Homer2 and Bedtools2. Thanks to everyone who has contributed to these projects.
Please cite the following article if you use BEDTools in your research:
Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.
Dale RK, Pedersen BS, and Quinlan AR. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics (2011). doi:10.1093/bioinformatics/btr539