Managing a GLUE Project
Developing and managing a GLUE project involves a sequence of well-defined steps to import, curate, analyze, and organize virus sequence data. How you carry out each step determines how well the project fits your research or operational needs.
This section outlines those core activities.
Sequence data forms the foundation of a GLUE project. You can import sequences from public databases (e.g., GenBank) or other sources such as private collections or FASTA files.
- Supported Formats: GenBank XML, FASTA.
- Command Example:
GLUE> import source path/to/sequence/data/
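Before importing a FASTA file, it can help to check that every record has a unique ID and that no sequence data precedes the first header. The following is a minimal sketch in plain Python, run outside GLUE. The function name `parse_fasta` and the example records are hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: nucleotides}.

    Assumption: the ID is the first whitespace-delimited token of the header.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header with no ID")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data before first header")
        else:
            # Sequence lines may be wrapped; join them and normalize case.
            records[current_id] += line.upper()
    return records


# Hypothetical example data, for illustration only.
example = """>seq1 hypothetical isolate A
acgtacgt
ACGT
>seq2 hypothetical isolate B
GGCCTTAA
"""
records = parse_fasta(example.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

Catching duplicate IDs at this stage is a design choice of the sketch: it surfaces problems in the input file before they reach the project database.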
To enrich the project, define custom metadata fields and populate them with corrected or normalized values. These fields might include collection dates, geographic origins, or host species.
- Schema Extensions: Extend the GLUE database schema to add new fields.
- Normalization: Use text file populators or scripts to standardize field values (e.g., country names based on the UN M.49 standard).
- Command Example:
GLUE> add field sequence collection_date "DATE"
- Tips:
  - Use the genbankXmlPopulator module to extract metadata from GenBank files automatically.
  - Consider creating separate sources for different datasets (e.g., reference sequences, newly acquired sequences).
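The normalization step can be pictured as a lookup from free-text values to a standard code and name. The sketch below is plain Python, not a GLUE populator: the table `M49_LOOKUP`, the function `normalize_country`, and the list of spelling variants are illustrative assumptions, and only a handful of UN M.49 entries are included. It also assumes GenBank-style values of the form "Country: region", keeping only the part before the colon.

```python
from typing import Optional, Tuple

# Illustrative subset only: free-text variant -> (M.49 code, standard name).
# The variants listed here are assumptions, not an exhaustive mapping.
M49_LOOKUP = {
    "usa": ("840", "United States of America"),
    "united states": ("840", "United States of America"),
    "uk": ("826", "United Kingdom of Great Britain and Northern Ireland"),
    "united kingdom": ("826", "United Kingdom of Great Britain and Northern Ireland"),
    "vietnam": ("704", "Viet Nam"),
    "viet nam": ("704", "Viet Nam"),
}


def normalize_country(raw: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (m49_code, standard_name), or None if the value is unknown."""
    if raw is None:
        return None
    # Keep the country part of values such as "Viet Nam: Hanoi".
    key = raw.split(":")[0].strip().lower()
    return M49_LOOKUP.get(key)


print(normalize_country("Vietnam: Hanoi"))
print(normalize_country("Atlantis"))
```

Returning `None` for unrecognized values, rather than guessing, leaves those records visible for manual curation.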
Filtering sequences is essential for maintaining data quality and relevance. This step might involve removing incomplete, low-quality, or redundant sequences.
- Filtering Criteria: Based on sequence length, host species, collection date, or other metadata.
- Command Example:
GLUE> delete sequence where "collection_date < '2000-01-01'"
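To make the filtering logic concrete, here is a minimal sketch in plain Python that mirrors the date criterion of the command above on an in-memory list. It does not call GLUE: `SequenceRecord`, `should_delete`, `filter_records`, and the example records are hypothetical. One assumption worth noting is that records with no collection date do not match the "earlier than the cutoff" condition and are therefore kept.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class SequenceRecord(NamedTuple):
    sequence_id: str                  # hypothetical ID
    collection_date: Optional[date]   # None when the date is unknown


def should_delete(record: SequenceRecord, cutoff: date) -> bool:
    """True when the record was collected strictly before the cutoff.

    Assumption: records with an unknown date never match, so they are kept.
    """
    return record.collection_date is not None and record.collection_date < cutoff


def filter_records(records: List[SequenceRecord], cutoff: date) -> List[SequenceRecord]:
    """Return the records that survive the date filter, in original order."""
    return [r for r in records if not should_delete(r, cutoff)]


# Hypothetical example data, for illustration only.
records = [
    SequenceRecord("seqA", date(1998, 6, 1)),
    SequenceRecord("seqB", date(2005, 3, 15)),
    SequenceRecord("seqC", None),
]
kept = filter_records(records, date(2000, 1, 1))
print([r.sequence_id for r in kept])
```

Running a dry check like this before deleting makes it easier to see how many sequences a criterion would remove, and whether undated records need separate handling.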
Reference sequences are key to defining genome annotations and alignments. After selecting references, annotate them with genomic features like coding regions, regulatory elements, or structural domains.
- Feature Definition: Use GLUE's inherit feature-location command to transfer feature annotations to non-reference sequences via alignments.
- Command Example:
GLUE> set reference sequence L08816
GLUE> inherit feature-location "ORF1"
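The idea of transferring an annotation through an alignment can be sketched as coordinate mapping. The Python below is an illustration of the concept, not GLUE's implementation: it assumes the alignment is represented as ungapped segments pairing a reference range with a member range, with 1-based inclusive coordinates. `AlignedSegment`, `map_feature`, and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class AlignedSegment(NamedTuple):
    """An ungapped block pairing reference and member coordinates (1-based, inclusive)."""
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int


def map_feature(
    feature_start: int, feature_end: int, segments: List[AlignedSegment]
) -> List[Tuple[int, int]]:
    """Map a reference feature range onto member coordinates.

    Returns one (start, end) pair per segment that overlaps the feature.
    Parts of the feature not covered by any segment are omitted.
    """
    mapped = []
    for seg in segments:
        # Clip the feature to the reference range of this segment.
        start = max(feature_start, seg.ref_start)
        end = min(feature_end, seg.ref_end)
        if start > end:
            continue  # no overlap with this segment
        # Within an ungapped segment, member and reference differ by a fixed offset.
        offset = seg.member_start - seg.ref_start
        mapped.append((start + offset, end + offset))
    return mapped


# Hypothetical alignment: the member carries a 3-base insertion between the segments.
segments = [
    AlignedSegment(1, 100, 11, 110),
    AlignedSegment(101, 200, 114, 213),
]
print(map_feature(90, 120, segments))
```

In this example the feature spans both segments, so it maps to two member ranges separated by the inserted bases.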
Aligning sequences is critical for capturing homologies and understanding evolutionary relationships.
- Alignment Methods: GLUE integrates with tools such as MAFFT and BLAST for generating nucleotide or protein alignments.
- Reference-Constrained Alignments: Define alignments constrained by reference sequences to maintain coding feature homology.
- Command Example:
GLUE> align sequences mafft constrained alignment1
Group sequences into evolutionary clades or genotypes based on phylogenetic relationships or other criteria.
- Phylogenetic Analysis: Use RAxML or other tree-building modules to generate a reference phylogeny.
- Clade Assignment: Define clades and assign sequences using maximum-likelihood clade assignment (MLCA).
- Command Example:
GLUE> run file mlca_config.glue
Extend GLUE's functionality by defining custom modules tailored to your analysis needs.
- Available Modules: Module types cover phylogenetic tools, sequence variation analysis, database population, and external program integration.
- Examples:
  - genbankXmlPopulator for metadata extraction.
  - raxmlPhylogenyGenerator for phylogenetic tree construction.
  - Custom JavaScript programs for sequence processing.
- Command Example:
GLUE> add module textFilePopulator my_populator
- Learn how to query your GLUE project to retrieve insights.
- Explore advanced analysis workflows using custom scripts and modules.
- Share your project by exporting and publishing curated datasets.
GLUE by Robert J. Gifford Lab.
For questions, issues, or feedback, please open an issue on the GitHub repository.