This repository has been archived by the owner on May 26, 2022. It is now read-only.
Per Unneberg edited this page Nov 12, 2014 · 9 revisions

snakemakelib

snakemake library for various applications, with a focus on bioinformatics and next-generation sequencing.

Comparison with biomake

snakemakelib is basically a port of the rules in biomake to snakemake. The design principles are similar in that my aim is to compile a library of rules that can be reused and configured via a simple configuration interface.

Disclaimer

Use the rules at your own risk, and make sure you understand them before running any commands. I take no responsibility if you happen to run a snakemake clean in an inappropriate location, removing precious data in the process. You have been warned!

Introduction

The snakemake rules contain general recipes for commonly used applications and bioinformatics programs. The use cases reflect the needs I've had and are by no means comprehensive. Nevertheless, many commands are so commonly used that the recipes may be of general interest.

Installation

Clone the repository https://github.com/percyfal/snakemakelib to an appropriate location:

git clone https://github.com/percyfal/snakemakelib /path/to/snakemakelib

Requirements

snakemake version 3.1 or later, which supports the global config variable.

Usage

The intended usage is that the user first creates a Snakefile for use with a particular dataset/problem. Thereafter, include statements are used to include rules of interest. Here, we include the rules for the aligner bwa:

#-*- snakemake -*-
# Snakefile example
import sys

# Add path to snakemakelib, unless installed in a virtualenv or similar
sys.path.append('/path/to/snakemakelib')
# Include settings and utilities
include: "/path/to/snakemakelib/rules/settings.rules"
include: "/path/to/snakemakelib/rules/utils.rules"
# Include rules for bwa
include: "/path/to/snakemakelib/rules/bio/ngs/align/bwa.rules"

Snakemake has an option for listing the available rules:

snakemake -l

The listing includes the bwa_mem rule, which comes from bwa.rules. In addition, the rules file utils.rules included above defines convenience rules for viewing more detailed rule information. One of these is the rule rule_ll:

snakemake rule_ll

This rule prints the docstring of each rule, together with the definitions of its input, output, and shell parameters. For bwa_mem, the output is {prefix}.bam, where {prefix} is a wildcard that is matched against the requested target and then substituted into the input file names. To see how the wildcard is resolved in this example, run

snakemake test.bam

As its name implies, Snakemake works like GNU Make in that one seeks to build a target output, in this case test.bam. Had the files test_R1_001.fastq and test_R2_001.fastq been present, the rule bwa_mem would have run the command defined in its shell section.
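
The wildcard resolution described above can be illustrated with a small standalone sketch. This is not Snakemake's implementation; the function name match_wildcards is hypothetical, and the two input patterns are assumptions inferred from the fastq file names mentioned in the text.

```python
import re


def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target matches pattern, else None."""
    # Translate e.g. "{prefix}.bam" into the regex "(?P<prefix>.+)\.bam"
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += "(?P<{}>.+)".format(m.group(1))
        pos = m.end()
    regex += re.escape(pattern[pos:])
    hit = re.fullmatch(regex, target)
    return hit.groupdict() if hit else None


output_pattern = "{prefix}.bam"
# Assumed input patterns, inferred from the file names in the text
input_patterns = ["{prefix}_R1_001.fastq", "{prefix}_R2_001.fastq"]

# Match the requested target against the output pattern ...
wildcards = match_wildcards(output_pattern, "test.bam")
# ... and substitute the resulting values into the input patterns
inputs = [p.format(**wildcards) for p in input_patterns]
print(wildcards)
print(inputs)
```

The point of the sketch is the direction of the lookup: the target name determines the wildcard values, and the wildcard values determine which input files the rule requires.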

Configuration

The implementation of the configuration interface is still a work in progress and is likely to undergo substantial changes!

The purpose of snakemakelib is to build a library of rules that can be reused without having to write them anew. The motivation is that only the parameters of a rule (e.g. program options, inputs, and outputs) change from time to time, whereas the rule execution stays identical. Therefore, my aim is to provide a simple configuration interface in which the rule parameters can be modified with simple strings.

The default configuration

To begin with, each rules file consists of rules and an accompanying default configuration. The latter ensures that all rules have sensible defaults set, regardless of whether the user decides to modify them or not. In principle, a rules file has two parts:

  1. default configuration
  2. rules

The configuration is a modified dict object, with at most three levels:

namespace
    section/parameter
        parameter

The namespace is an identifier for the rules file, and should be named path.to.rules, where path and to are directory names relative to the rules root path. The section/parameter level holds either a parameter related to the program, or a subprogram which in turn can have parameters assigned to it. The default configuration in bwa.rules is

config_default = BaseConfig({ 
    'bio.ngs.align.bwa' : BaseConfig({
        'cmd' : "bwa",
        'ref' : sml_config['bio.ngs.settings']['db']['ref'],
        'threads' : sml_config['bio.ngs.settings']['threads'],
        'options' : "-M",
        'mem' : BaseConfig({
            'options' : "",
        }),
    }),
})

The modified dict is a snakemakelib.config.BaseConfig object that does simple type checking and ensures that the user can only modify keys that have already been defined. The namespace is bio.ngs.align.bwa, reflecting the fact that the rules file is located in the folder rules/bio/ngs/align and is named bwa.rules. sml_config is a global BaseConfig object that stores all loaded rule configurations. Incidentally, this example shows another key idea of the configuration, namely that some options are inherited from rules files higher up in the file hierarchy. The rules file rules/bio/ngs/settings.rules contains a generic configuration that is common to all ngs rules. This implementation makes it possible to override settings for specific programs, such as the ref parameter above.
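
To illustrate the idea of a dict that only accepts changes to predefined keys, here is a minimal sketch. RestrictedConfig is a hypothetical stand-in, not the actual BaseConfig implementation, and the type check shown (the new value must have the same type as the default) is an assumption about what "simple type checking" could mean.

```python
class RestrictedConfig(dict):
    """Hypothetical stand-in for BaseConfig: the set of keys is fixed at creation."""

    def __setitem__(self, key, value):
        # Reject keys that were not part of the initial configuration
        if key not in self:
            raise KeyError("unknown configuration key: {!r}".format(key))
        # Assumed type check: the new value must match the type of the default
        if not isinstance(value, type(self[key])):
            raise TypeError(
                "expected {} for key {!r}".format(type(self[key]).__name__, key)
            )
        super().__setitem__(key, value)
        # Note: dict.update() bypasses __setitem__, so this sketch only
        # guards item assignment.


bwa = RestrictedConfig({
    "cmd": "bwa",
    "options": "-M",
    "mem": RestrictedConfig({"options": ""}),
})

bwa["mem"]["options"] = "-k 10 -w 50"  # allowed: the key already exists

try:
    bwa["optoins"] = "-M"  # misspelled key is rejected
except KeyError as err:
    print("rejected:", err)
```

Restricting assignment to known keys means that a misspelled option name fails immediately instead of being silently ignored by the rules.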

Viewing the default configuration

utils.rules defines a rule conf that can be used to view the current configuration of included files:

snakemake conf

The output is divided into sections by namespace, i.e. one section per rules file. Furthermore, Snakemake defines its own global configuration variable, config, that can be accessed via the command line. At the end of the file rules/bio/ngs/settings.rules, three Snakemake config options that are useful in the context of ngs have been added:

# Add configuration variable to snakemake global config object
config['lanes'] = []
config['samples'] = []
config['flowcells'] = []

User-defined configuration

A user can modify the configuration by defining a BaseConfig object and updating the sml_config object mentioned in the previous section. This is done in the Snakefile that uses include statements to include rules files, and must be done before any include statement. The reason is that when a rules file is included, the default configuration values are compared to the existing sml_config. If the user has defined custom configurations, these will take precedence over the default values. If no custom configuration exists, the default values are applied.

As an example, imagine we want to change the options to -k 10 -w 50 for bwa mem in the example Snakefile above. The modified Snakefile would then look as follows:

#-*- snakemake -*-
# Snakefile example
import sys

# Add path to snakemakelib, unless installed in a virtualenv or similar
sys.path.append('/path/to/snakemakelib')

# Import config-related stuff
from snakemakelib.config import init_sml_config, update_sml_config, BaseConfig
my_config = BaseConfig({
    'bio.ngs.align.bwa' : BaseConfig({
        'mem' : BaseConfig({
            'options' : "-k 10 -w 50",
        }),
    })
})

# Initialize configuration; this must happen before any include statement
init_sml_config(my_config)

# Include settings and utilities
include: "/path/to/snakemakelib/rules/settings.rules"
include: "/path/to/snakemakelib/rules/utils.rules"
# Include rules for bwa
include: "/path/to/snakemakelib/rules/bio/ngs/align/bwa.rules"

Currently, it is necessary to use exactly the same structure as that of the default configuration for the relevant sections.
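
The precedence rule described in this section (user values win, defaults fill in the rest) can be sketched with plain dicts. The function apply_user_config is hypothetical and is not part of snakemakelib; the defaults are a simplified copy of the bwa configuration, without the ref and threads entries that depend on sml_config.

```python
def apply_user_config(defaults, user):
    """Return a new dict in which user values take precedence over defaults."""
    # Only keys that exist in the defaults may be set by the user
    unknown = set(user) - set(defaults)
    if unknown:
        raise KeyError("unknown configuration keys: {}".format(sorted(unknown)))
    merged = {}
    for key, default in defaults.items():
        if isinstance(default, dict):
            # Recurse into sections, e.g. the 'mem' subprogram
            merged[key] = apply_user_config(default, user.get(key, {}))
        else:
            # Use the user value if given, otherwise fall back to the default
            merged[key] = user.get(key, default)
    return merged


# Simplified defaults (assumption: ref and threads left out for brevity)
defaults = {
    "bio.ngs.align.bwa": {
        "cmd": "bwa",
        "options": "-M",
        "mem": {"options": ""},
    },
}
user = {
    "bio.ngs.align.bwa": {
        "mem": {"options": "-k 10 -w 50"},
    },
}

merged = apply_user_config(defaults, user)
print(merged["bio.ngs.align.bwa"]["mem"]["options"])
print(merged["bio.ngs.align.bwa"]["options"])
```

The sketch also shows why the user configuration has to mirror the structure of the defaults: values are looked up level by level, so an option placed at the wrong level would not be found.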

Loading configuration from a yaml file

TO BE IMPLEMENTED

Alternatively, once this is implemented, the user configuration will be loadable from a YAML file. The user configuration in the previous section would then simply look like:

bio.ngs.align.bwa:
  mem:
    options: "-k 10 -w 50"