
Managing a GLUE Project

Robert J. Gifford edited this page Nov 25, 2024 · 6 revisions

Developing and managing a GLUE project involves a sequence of steps to import, curate, analyze, and organize virus sequence data. Which steps you need, and how you configure each one, depends on your research or operational goals.


Typical Activities in GLUE Project Development

This section outlines the core activities involved in creating and managing a GLUE project.

1. Import Sequences

Sequence data forms the foundation of a GLUE project. You can import sequences from public databases (e.g., GenBank) or other sources such as private collections or FASTA files.

  • Supported Formats: GenBank XML, FASTA.
  • Command Example:
GLUE> import source path/to/sequence/data/
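
Before importing FASTA data, it can help to sanity-check the files outside GLUE. The following is a minimal Python sketch, not part of GLUE itself; the function names `parse_fasta` and `check_records` are hypothetical and chosen only for illustration. It parses FASTA text and reports duplicate IDs and empty sequences.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (sequence_id, sequence) tuples.

    The sequence ID is taken as the first whitespace-delimited token of the header.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            header = fields[0]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of human-readable problems found in the parsed records."""
    problems = []
    seen = set()
    for seq_id, seq in records:
        if seq_id in seen:
            problems.append(f"duplicate ID: {seq_id}")
        seen.add(seq_id)
        if not seq:
            problems.append(f"empty sequence: {seq_id}")
    return problems


# Illustrative input only; real data would be read from the files to be imported.
example = ">seqA sample one\nACGT\nacgt\n>seqB\nTTGA\n"
for seq_id, seq in parse_fasta(example):
    print(seq_id, len(seq))
```

Running a check like this first keeps malformed records out of the project, where they would be harder to trace.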

2. Define Additional Sequence Data Fields

To enrich the project, define custom metadata fields and populate them with corrected or normalized values. These fields might include collection dates, geographic origins, or host species.

  • Schema Extensions: Extend the GLUE database schema to add new fields.
  • Normalization: Use text file populators or scripts to standardize field values (e.g., country names based on the UN M.49 standard).
  • Command Example:
GLUE> add field sequence collection_date "DATE"
  • Tips:
    • Use the genbankXmlPopulator module to extract metadata from GenBank files automatically.
    • Consider creating separate sources for different datasets (e.g., reference sequences, newly acquired sequences).
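
The normalization step described above can be sketched in Python as a simple lookup. This is an illustration of the idea rather than GLUE's own populator logic: the `CANONICAL` table is a small hand-written subset, the `normalize_country` helper is hypothetical, and the input format ("Country: region", as commonly seen in GenBank records) is an assumption about your data.

```python
# Illustrative subset only: maps common spellings to UN M.49 country names.
CANONICAL = {
    "uk": "United Kingdom of Great Britain and Northern Ireland",
    "united kingdom": "United Kingdom of Great Britain and Northern Ireland",
    "usa": "United States of America",
    "united states": "United States of America",
    "vietnam": "Viet Nam",
    "viet nam": "Viet Nam",
}


def normalize_country(raw):
    """Map a free-text country value to a canonical name, or None if unknown."""
    if raw is None:
        return None
    # Keep only the part before any "Country: region" separator.
    country = raw.split(":", 1)[0]
    # Drop periods, collapse whitespace, and lowercase to build the lookup key.
    key = " ".join(country.replace(".", "").lower().split())
    return CANONICAL.get(key)


for value in ["U.K.", "Vietnam: Hanoi", "  United   States ", "Atlantis"]:
    print(repr(value), "->", normalize_country(value))
```

Returning `None` for unrecognized values, rather than passing the raw text through, makes it easy to list the entries that still need manual curation.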

3. Filter Sequences

Filtering sequences is essential for maintaining data quality and relevance. This step might involve removing incomplete, low-quality, or redundant sequences.

  • Filtering Criteria: Based on sequence length, host species, collection date, or other metadata.
  • Command Example:
GLUE> delete sequence where "collection_date < '2000-01-01'"
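
Because deletion is hard to undo, it can be useful to preview which records a date cutoff would remove. The sketch below does this in plain Python on exported metadata. It is not a GLUE command: the record layout (dictionaries with `id` and `collection_date` keys) and the choice to keep records with a missing date are assumptions made for the example.

```python
from datetime import date


def split_by_cutoff(records, cutoff):
    """Split records into (keep, drop) lists using a collection-date cutoff.

    Records dated strictly before the cutoff are dropped. Records with no
    collection date are kept, so they can be reviewed separately.
    """
    keep, drop = [], []
    for rec in records:
        collected = rec.get("collection_date")
        if collected is not None and collected < cutoff:
            drop.append(rec)
        else:
            keep.append(rec)
    return keep, drop


# Hypothetical metadata for illustration.
records = [
    {"id": "seq1", "collection_date": date(1998, 5, 1)},
    {"id": "seq2", "collection_date": date(2000, 1, 1)},
    {"id": "seq3", "collection_date": None},
]
keep, drop = split_by_cutoff(records, date(2000, 1, 1))
print("would delete:", [rec["id"] for rec in drop])
print("would keep:", [rec["id"] for rec in keep])
```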

4. Select Reference Sequences and Define Genomic Features

Reference sequences are key to defining genome annotations and alignments. After selecting references, annotate them with genomic features like coding regions, regulatory elements, or structural domains.

  • Feature Definition: Use GLUE's inherit feature-location command to transfer feature annotations to non-reference sequences via alignments.
  • Command Example:
GLUE> set reference sequence L08816
GLUE> inherit feature-location "ORF1"

5. Apply Sequence Alignment Techniques

Aligning sequences is critical for capturing homologies and understanding evolutionary relationships.

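  • Alignment Methods: GLUE integrates with tools such as MAFFT and BLAST for generating nucleotide or protein alignments. RAxML, which GLUE also integrates with, builds phylogenetic trees from alignments rather than producing the alignments themselves.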
  • Reference-Constrained Alignments: Define alignments constrained by reference sequences to maintain coding feature homology.
  • Command Example:
GLUE> align sequences mafft constrained alignment1

6. Arrange Data into Evolutionary Clades

Group sequences into evolutionary clades or genotypes based on phylogenetic relationships or other criteria.

  • Phylogenetic Analysis: Use RAxML or other tree-building modules to generate a reference phylogeny.
  • Clade Assignment: Define clades and assign sequences using maximum-likelihood clade assignment (MLCA).
  • Command Example:
GLUE> run file mlca_config.glue

7. Define Project-Specific Analysis Modules

Extend GLUE's functionality by defining custom modules tailored to your analysis needs.

  • Available Modules: Module types cover phylogenetic tools, sequence variation analysis, database population, and external program integration.
  • Examples:
    • genbankXmlPopulator for metadata extraction.
    • raxmlPhylogenyGenerator for phylogenetic tree construction.
    • Custom JavaScript programs for sequence processing.
  • Command Example:
GLUE> add module textFilePopulator my_populator
