Skip to content

Yggrasil is middleware platform designed to aggregate sample-related metadata, manage the execution of various pipelines for processing sequencing data, and ultimately handle the delivery of analyzed data.

License

Notifications You must be signed in to change notification settings

NationalGenomicsInfrastructure/Yggdrasil

Repository files navigation

Yggdrasil 🌳

Yggdrasil is a data processing framework designed to manage and automate workflows for various genomic sequencing projects (currently including TenX and SmartSeq3 modules). It provides a unified interface to handle data ingestion, processing, result generation, and ultimately project packing and delivery, streamlining the analysis pipeline for sequencing data.

Table of Contents

Prerequisites

  • Python 3.11 or higher
  • Conda for environment management
  • Git for version control
  • VSCode (recommended) for development

Installation

To get started with the Yggdrasil Project, you need to set up the necessary dependencies. Follow the instructions below:

  1. Clone the Repository:
git clone https://github.com/NationalGenomicsInfrastructure/Yggdrasil.git
cd Yggdrasil
  1. Create and Activate a Conda Environment:

It is recommended to use a conda environment to manage dependencies. You can set up the environment using conda:

conda create --name yggdrasil-env python=3.11
conda activate yggdrasil-env
  1. Install Required Packages:
pip install -r requirements.txt

Project Structure

Brief overview of the main components and directories:

Yggdrasil/
β”œβ”€β”€ lib/
β”‚   β”œβ”€β”€ base/
β”‚   β”œβ”€β”€ core_utils/
β”‚   β”œβ”€β”€ couchdb/
β”‚   β”œβ”€β”€ module_utils/
β”‚   β”œβ”€β”€ realms/
β”‚   β”‚   β”œβ”€β”€ tenx/
β”‚   β”‚   └── smartseq3/
β”œβ”€β”€ tests/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”œβ”€β”€ ygg_trunk.py
β”œβ”€β”€ ygg-mule.py
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ LICENSE
└── README.md
  • lib/: Core library containing base classes and utilities.
    • base/: Abstract base classes and interfaces.
    • core_utils/: Utility modules for the Yggdrasil core functionalities.
    • couchdb/: Classes specific for Yggdrasil - CouchDB interactions.
    • module_utils/: Utility modules for various Yggdrasil module functionalities.
    • realms/: Modules specific to different sequencing technologies (e.g. TenX, SmartSeq3, etc.)
  • tests/: Test cases for the application.
  • .github/workflows/: GitHub Actions workflows for CI/CD.

Usage

Ygg-Mule

To run Yggdrasil manually, use the manual core script ygg-mule.py. It is used for processing documents manually by providing a CouchDB document ID.

Usage:

python ygg-mule.py <doc_id>

Replace <doc_id> with the actual CouchDB document ID you wish to process.

Configuration

Yggdrasil uses a configuration loader to manage settings. Configuration files should be placed in the yggdrasil_workspace/common/configurations directory. This directory path can be adjusted in the lib/core_utils/common.py script if needed.

Configuration Files

config.json: This file contains global settings for Yggdrasil.

Fields:

- yggdrasil_log_dir: Directory where logs will be stored.
- couchdb_url: URL of the CouchDB server. Example: "http://localhost:5984"
- couchdb_database: Name of the CouchDB project database.
- couchdb_status_tracking: Name of the CouchDB yggdrasil database for project status tracking.
- couchdb_poll_interval: Interval (in seconds) for polling CouchDB for changes.
- job_monitor_poll_interval: Interval (in seconds) for polling the job monitor.
- activate_ngi_cmd: Command to activate NGI environment

Example Configuration File (config.json)

{
    "yggdrasil_log_dir": "yggdrasil_workspace/logs",
    "couchdb_url": "http://localhost:5984",
    "couchdb_database": "my_project_db",
    "couchdb_status_tracking": "my_ygg_status_db",
    "couchdb_poll_interval": 3,
    "job_monitor_poll_interval": 60,
    "activate_ngi_cmd": "source sourceme_sthlm.sh && source activate NGI"
}

module_registry.json: This file maps different library construction methods to their respective processing modules. The modules specified here will be dynamically loaded and executed based on the entire name of a library_prep_method specified in the CouchDB document, or a designated prefix of them.

This file maps different library construction methods to their respective processing modules. The modules specified here will be dynamically loaded and executed based on the library construction method specified in the CouchDB document.

Example:

{
    "SmartSeq 3": {
        "module": "lib.realms.smartseq3.smartseq3.SmartSeq3"
    },
    "10X": {
        "module": "lib.realms.tenx.tenx_project.TenXProject",
        "prefix": true
    }
}
  • SmartSeq 3:
    • module: The path to the module handling SmartSeq 3 library data.
  • 10X:
    • module: The path to the module handling 10X-prefixed library data.

Environment Variables

Ensure the following environment variables are set:

- COUCH_USER: Your CouchDB username.
- COUCH_PASS: Your CouchDB password.

Logging

Yggdrasil uses a custom logging utility to manage logs. Logs are stored in the directory specified by the yggdrasil_log_dir configuration.

Enabling Debug Logging: To enable debug logging, modify the configure_logging call in your script:

from lib.utils.logging_utils import configure_logging
configure_logging(debug=True)

Development Guidelines

Setting Up the Development Environment

Ensure you have activated the Conda environment and installed all required packages as per the Installation section.

Pre-Commit Hooks

Use pre-commit to automate code formatting and linting on each commit.

  • Install pre-commit hooks:
pre-commit install
  • Run pre-commit hooks manually:
pre-commit run --all-files

Code Formatting, Linting and Type Checking

Use black for code formatting, ruff for linting and mypy for static type checking. It is recommended to have these tools set as extensions on your editor (e.g. VSCode) too, for a more seamless, automated experience. But if you preffer running them manually in cmd:

  • Format code with Black:
black .
  • Lint code with Ruff:
ruff check .
  • Run type checks:
mypy .

VSCode Integration

For an optimal development experience, we recommend using VSCode with the following extensions:

  • Python (by Microsoft)
  • Ruff (by Astral Software)
  • Black Formatter (by Microsoft)
  • Mypy Type Checker (by Microsoft)

VSCode Settings

Make sure your (user)settings.json contains the following settings to integrate the tools:

{
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true,
    "ruff.lint.args": [ "--config=pyproject.toml" ],
    "mypy-type-checker.args": [ "--config-file=pyproject.toml" ]
}

Git Blame Configuration

To ensure git blame ignores bulk formatting commits.

  • Configure Git:
git config blame.ignoreRevsFile .git-blame-ignore-revs
  • Add Formatting Commits to .git-blame-ignore-revs:

Add the commit (full) hashes of your formatting commits to the .git-blame-ignore-revs file, one per line, e.g.:

a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0
b1c2d3e4f5g6h7i8j9k0l1m2n3o4p5q6r7s8t9u0

Continuous Integration

GitHub Actions are set up to automatically run ruff, black, and mypy on pushes and pull requests.

  • Workflow File: .github/workflows/lint.yml
  • Separate Jobs: Each tool runs in its own job for clear feedback.

Contributing

Contributions are very welcome! To have as smooth of an experience as possible, the following guidelines are recommended:

  • Forking: Fork the main repository to your personal GitHub account. Develop your changes and submit pull requests to the main repository for review.
  • Code Style: Format with black and lint with ruff.
  • Type Annotations: If you use type annotations make sure to set (and pass) mypy checks.
  • Documentation: Documented contributions are easier to understand and review.

Suggested contributions: Tests, Bug Fixes, Code Optimization, New Modules (reach out to Anastasios if you don't know where to start with developing a new module).

License

Yggdrasil is licensed under the MIT License - see the LICENSE file for details.

About

Yggrasil is middleware platform designed to aggregate sample-related metadata, manage the execution of various pipelines for processing sequencing data, and ultimately handle the delivery of analyzed data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages