Various bug fixes and extensions:
* Added --tmp-path option for sfm.
* Fixed /dev/stdin and /dev/stdout for sfm.
* Added a command for merging intermediate metrics files.
* Fixed detection of directory path names in sfm.
caherzee committed May 23, 2019
1 parent f6a70a4 commit 1fc34c4
Showing 12 changed files with 262 additions and 41 deletions.
64 changes: 62 additions & 2 deletions README.md
@@ -289,6 +289,33 @@ This option is used to specify the number of levels for quantizing quality score

This option tells elPrep to use static quantized quality scores with the given levels during base quality score recalibration (--bqsr). The list of levels should be of the form "[nr, nr, nr]". The default value is [].

### --mark-optical-duplicates-intermediate file

This option is used in the context of filtering files created using the elprep split command. It is used internally by
the elprep sfm command, but can be used when writing your own split/filter/merge scripts.

This option tells elPrep to perform optical duplicate marking and to write the result to an intermediate metrics file.
The intermediate metrics file generated this way can later be merged with other intermediate metrics files, see the
merge-optical-duplicates-metrics command.

### --bqsr-tables-only table-file

This option is used in the context of filtering files created using the elprep split command. It is used internally by
the elprep sfm command, but can be used when writing your own split/filter/merge scripts.

This option tells elPrep to perform base quality score recalibration and to write the result of the recalibration to an
intermediate table file. This table file will need to be merged with other intermediate recalibration results during the
application of the base quality score recalibration. See the --bqsr-apply option.

### --bqsr-apply path

This option is used when filtering files created by the elprep split command. It is used internally by the elprep sfm
command, and can be used when writing your own split/filter/merge scripts.

This option is used for applying base quality score recalibration on an input file. It expects a path parameter that
refers to a directory that contains intermediate recalibration results for multiple files created using the
--bqsr-tables-only option.

## Sorting Command Options

### --sorting-order [keep | unknown | unsorted | queryname | coordinate]
@@ -384,6 +411,7 @@ The elprep split command can be used to split up .sam files into smaller files t
Splitting the .sam file into smaller files for processing "per chromosome" is useful for reducing the memory pressure as these split files are typically significantly smaller than the input file as a whole. Splitting also makes it possible to parallelize the processing of a single .sam file by distributing the different split files across different processing nodes.

We provide an sfm command that executes a pipeline while silently using the elprep filter and split/merge tools. It is of course possible to write scripts to combine the filter and split/merge tools yourself.
We provide a recipe for writing your own split/filter/merge scripts on our GitHub wiki.

## Name

@@ -395,8 +423,6 @@ We provide an sfm command that executes a pipeline while silently using the elpr

elprep sfm input.bam output.bam --mark-duplicates --mark-optical-duplicates output.metrics --sorting-order coordinate --bqsr output.recal --bqsr-reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites

elprep sfm --mark-duplicates --mark-optical-duplicates output.metrics --sorting-order coordinate --bqsr output.recal --bqsr-reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites

## Description

The elprep sfm command is a drop-in replacement for the elprep filter command that minimises the use of RAM. For this, it silently calls the elprep split and merge tools to split up the data "per chromosome" for processing, which requires less RAM than processing a .sam/.bam file as a whole (see Split and Merge tools).
@@ -409,6 +435,10 @@ The elprep sfm command has the same options as the elprep filter command, with t

This command option sets the format of the split files. By default, elprep uses the same format as the input file for the split files. Changing the intermediate file output type may either improve runtime (.sam) or reduce peak disk usage (.bam).

### --tmp-path

This command option is used to specify a path where elPrep can store temporary files that are created (and deleted) by the split and merge commands that are silently called by the elprep sfm command. The default path is the folder from where you call elprep sfm.

### --single-end

Use this command option to indicate the sfm command is processing single-end data. This information is important for the split/merge tools to operate correctly. For more details, see the description of the elprep split and elprep merge commands.
@@ -439,6 +469,8 @@ Choosing the value 1 for the --contig-group-size tells elprep split to split the

The elprep split command requires two arguments: 1) the input file or a path to multiple input files and 2) a path to a directory where elPrep can store the split files. The input file(s) can be .sam or .bam. It is also possible to use /dev/stdin as the input for using Unix pipes. There are no structural requirements on the input file(s) for using elprep split. For example, it is not necessary to sort the input file, nor is it necessary to convert to .bam or index the input file.

Warning: If you pass a path to multiple input files to the elprep split command, elprep assumes that they all have the same (or compatible) headers, and just picks the first header it finds as the header for all input files. elprep currently does not make an attempt to resolve potential conflicts between headers, especially with regard to the @SQ, @RG, or @PG header records. We will include proper merging of different SAM/BAM files in a future version of elprep. In the meantime, if you need proper merging of SAM/BAM files, please use samtools merge, Picard MergeSamFiles, or a similar tool. (If such a tool produces a SAM file as output, it can be piped into elprep using Unix pipes.)
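
Because elprep does not check header compatibility itself, a pre-check before splitting multiple files can catch obvious mismatches. The following Go sketch is one possible check and not something elprep provides: the functions `sqRecords` and `sameSequenceDictionary` are hypothetical, they operate on header text that has already been read, and they apply a deliberately strict rule (identical @SQ lines in identical order) that is narrower than "compatible".

```go
package main

import (
	"fmt"
	"strings"
)

// sqRecords returns the @SQ lines of a SAM header, in their original order.
func sqRecords(header string) []string {
	var records []string
	for _, line := range strings.Split(header, "\n") {
		if strings.HasPrefix(line, "@SQ\t") {
			records = append(records, strings.TrimRight(line, "\r"))
		}
	}
	return records
}

// sameSequenceDictionary reports whether two headers list identical @SQ
// records in identical order. Other record types (@RG, @PG, ...) are ignored.
func sameSequenceDictionary(a, b string) bool {
	ra, rb := sqRecords(a), sqRecords(b)
	if len(ra) != len(rb) {
		return false
	}
	for i := range ra {
		if ra[i] != rb[i] {
			return false
		}
	}
	return true
}

func main() {
	h1 := "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@RG\tID:a\n"
	h2 := "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@RG\tID:b\n"
	h3 := "@HD\tVN:1.6\n@SQ\tSN:chr2\tLN:242193529\n"
	fmt.Println("h1 vs h2:", sameSequenceDictionary(h1, h2))
	fmt.Println("h1 vs h3:", sameSequenceDictionary(h1, h3))
}
```

Files that fail such a check are candidates for merging with samtools merge or Picard MergeSamFiles first, as recommended above.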

elPrep creates the output directory denoted by the output path, unless the directory already exists, in which case elPrep may overwrite the existing files in that directory. Please make sure elPrep has the correct permissions for writing to that directory.

By default, the elprep split command assumes it is processing paired-end data. The flag --single-end can be used for processing single-end data. The output will look different for paired-end and single-end data.
@@ -524,6 +556,34 @@ Sets the path for writing a log file.

The --contig-group-size parameter for the elprep merge command is deprecated since version 4.1.1. The elprep merge command now correctly processes the split files without that information.

## Name

### elprep merge-optical-duplicates-metrics - a commandline tool for merging intermediate metrics files created by the --mark-optical-duplicates-intermediate option

## Synopsis

elprep merge-optical-duplicates-metrics input-file output-file metrics-file /path/to/intermediate/metrics --remove-duplicates

## Description

The elprep merge-optical-duplicates-metrics command requires four arguments:
the names of the original input and output .sam/.bam files for which the metrics are calculated,
the metrics file to which the merged metrics should be written, and a path to the intermediate metrics files that need
to be merged (and were generated using --mark-optical-duplicates-intermediate).
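
Conceptually, merging intermediate metrics amounts to summing per-library counters from each partial result. The Go sketch below illustrates that idea only: the `libraryCounts` type, its fields, and the `combine` function are hypothetical, and the actual on-disk format and the `filters.DuplicatesCtrMap` representation are not shown in this commit.

```go
package main

import "fmt"

// libraryCounts is a hypothetical stand-in for the duplicate counters kept per library.
type libraryCounts struct {
	ReadPairsExamined         int
	ReadPairDuplicates        int
	ReadPairOpticalDuplicates int
}

// combine sums the counters of all partial results, keyed by library name.
// Libraries that appear in only some partial results are carried over unchanged.
func combine(partials ...map[string]libraryCounts) map[string]libraryCounts {
	merged := make(map[string]libraryCounts)
	for _, partial := range partials {
		for library, c := range partial {
			m := merged[library]
			m.ReadPairsExamined += c.ReadPairsExamined
			m.ReadPairDuplicates += c.ReadPairDuplicates
			m.ReadPairOpticalDuplicates += c.ReadPairOpticalDuplicates
			merged[library] = m
		}
	}
	return merged
}

func main() {
	chr1 := map[string]libraryCounts{"libA": {ReadPairsExamined: 100, ReadPairDuplicates: 7, ReadPairOpticalDuplicates: 2}}
	chr2 := map[string]libraryCounts{
		"libA": {ReadPairsExamined: 50, ReadPairDuplicates: 3, ReadPairOpticalDuplicates: 1},
		"libB": {ReadPairsExamined: 20},
	}
	for library, c := range combine(chr1, chr2) {
		fmt.Printf("%s: %+v\n", library, c)
	}
}
```

Summation is order-independent, which is why the intermediate files for different split files can be produced in parallel and merged afterwards.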

## Options

### --nr-of-threads number

This command option sets the number of threads that elPrep uses during execution for parsing/outputting .sam/.bam data. The default number of threads is equal to the number of cpu threads.

It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.

### --remove-duplicates

Pass this option if the metrics were generated for a file for which the duplicates were removed. This information will
be included in the merged metrics file.

# Extending elPrep

If you wish to extend elPrep, for example by adding your own filters, please consult our [API documentation](https://godoc.org/github.com/ExaScience/elprep).
132 changes: 132 additions & 0 deletions cmd/merge-optical-duplicates-metrics.go
@@ -0,0 +1,132 @@
// elPrep: a high-performance tool for preparing SAM/BAM files.
// Copyright (c) 2017-2019 imec vzw.

// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version, and Additional Terms
// (see below).

// This program is distributed in the hope that it will be useful, but
// WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Affero General Public License for more details.

// You should have received a copy of the GNU Affero General Public
// License and Additional Terms along with this program. If not, see
// <https://github.com/ExaScience/elprep/blob/master/LICENSE.txt>.

package cmd

import (
	"bytes"
	"flag"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"runtime"

	"github.com/exascience/elprep/v4/filters"
)

// MergeOpticalDuplicatesMetricsHelp is the help string for this command.
const MergeOpticalDuplicatesMetricsHelp = "\nmerge-optical-duplicates-metrics parameters:\n" +
	"elprep merge-optical-duplicates-metrics sam-input-file sam-output-file metrics-file /path/to/intermediate/metrics\n" +
	"[--remove-duplicates]\n" +
	"[--nr-of-threads nr]\n" +
	"[--timed]\n" +
	"[--log-path path]\n"

// MergeOpticalDuplicatesMetrics implements the elprep merge-optical-duplicates-metrics command.
func MergeOpticalDuplicatesMetrics() error {
	var (
		profile, logPath        string
		nrOfThreads             int
		timed, removeDuplicates bool
	)

	var flags flag.FlagSet

	flags.IntVar(&nrOfThreads, "nr-of-threads", 0, "number of worker threads")
	flags.BoolVar(&timed, "timed", false, "measure the runtime")
	flags.BoolVar(&removeDuplicates, "remove-duplicates", false, "use when duplicates were removed during duplicate marking")
	flags.StringVar(&profile, "profile", "", "write a runtime profile to the specified file(s)")
	flags.StringVar(&logPath, "log-path", "", "write log files to the specified directory")

	parseFlags(flags, 6, MergeOpticalDuplicatesMetricsHelp)

	input := getFilename(os.Args[2], MergeOpticalDuplicatesMetricsHelp)
	output := getFilename(os.Args[3], MergeOpticalDuplicatesMetricsHelp)
	metrics := getFilename(os.Args[4], MergeOpticalDuplicatesMetricsHelp)
	intermediateMetrics := getFilename(os.Args[5], MergeOpticalDuplicatesMetricsHelp)

	setLogOutput(logPath)

	// sanity checks

	var sanityChecksFailed bool

	if !checkExist("", input) {
		log.Println("Warning: Input file does not exist: ", input)
	}

	if !checkExist("", intermediateMetrics) {
		sanityChecksFailed = true
	}

	if profile != "" && !checkCreate("--profile", profile) {
		sanityChecksFailed = true
	}

	metricsDir, err := filepath.Abs(intermediateMetrics)
	if err != nil {
		return err
	}

	if nrOfThreads < 0 {
		sanityChecksFailed = true
		log.Println("Error: Invalid nr-of-threads: ", nrOfThreads)
	}

	if sanityChecksFailed {
		fmt.Fprint(os.Stderr, MergeOpticalDuplicatesMetricsHelp)
		os.Exit(1)
	}

	// building output command line

	var command bytes.Buffer
	fmt.Fprint(&command, os.Args[0], " merge-optical-duplicates-metrics ", input, " ", output, " ", metrics, " ", intermediateMetrics)
	if nrOfThreads > 0 {
		runtime.GOMAXPROCS(nrOfThreads)
		fmt.Fprint(&command, " --nr-of-threads ", nrOfThreads)
	}
	if timed {
		fmt.Fprint(&command, " --timed")
	}
	if logPath != "" {
		fmt.Fprint(&command, " --log-path ", logPath)
	}
	if removeDuplicates {
		fmt.Fprint(&command, " --remove-duplicates")
	}

	// executing command

	log.Println("Executing command:\n", command.String())

	var ctr filters.DuplicatesCtrMap

	// merge intermediate metrics files
	err = timedRun(timed, profile, "Loading and combining duplicate metrics.", 1, func() error {
		ctr = filters.LoadAndCombineDuplicateMetrics(metricsDir)
		return ctr.Err()
	})
	if err != nil {
		return err
	}
	return timedRun(timed, profile, "Printing combined duplicate metrics.", 2, func() error {
		return filters.PrintDuplicatesMetrics(input, output, metrics, removeDuplicates, ctr)
	})
}
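
A note on the argument handling above: Go's standard `flag` package stops parsing at the first non-flag argument, so a command that takes positional arguments before its flags has to skip them explicitly. The `parseFlags` helper is not shown in this diff; the sketch below only illustrates the general technique, presumably similar to what the `6` passed to `parseFlags` is for. The `options` type and `parseAfterPositionals` function are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// options collects the parsed command line (hypothetical type for illustration).
type options struct {
	positional       []string
	removeDuplicates bool
	nrOfThreads      int
}

// parseAfterPositionals treats args[2:2+n] as positional arguments
// (args[0] is the program, args[1] the subcommand) and parses the rest as flags.
func parseAfterPositionals(args []string, n int) (options, error) {
	var opts options
	if len(args) < 2+n {
		return opts, fmt.Errorf("expected %d positional arguments", n)
	}
	opts.positional = args[2 : 2+n]

	flags := flag.NewFlagSet(args[1], flag.ContinueOnError)
	flags.BoolVar(&opts.removeDuplicates, "remove-duplicates", false, "duplicates were removed")
	flags.IntVar(&opts.nrOfThreads, "nr-of-threads", 0, "number of worker threads")

	// Parsing starts after the positional arguments; flag.Parse would
	// otherwise stop at the first one and never see the flags.
	if err := flags.Parse(args[2+n:]); err != nil {
		return opts, err
	}
	return opts, nil
}

func main() {
	args := []string{
		"elprep", "merge-optical-duplicates-metrics",
		"input.bam", "output.bam", "output.metrics", "/path/to/intermediate/metrics",
		"--remove-duplicates", "--nr-of-threads", "4",
	}
	opts, err := parseAfterPositionals(args, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```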
4 changes: 2 additions & 2 deletions cmd/merge.go
@@ -1,5 +1,5 @@
// elPrep: a high-performance tool for preparing SAM/BAM files.
-// Copyright (c) 2017, 2018 imec vzw.
+// Copyright (c) 2017-2019 imec vzw.

// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
@@ -84,7 +84,7 @@ func Merge() error {
	if err != nil {
		return err
	}
-	filesToMerge, err := internal.Directory(fullInputPath)
+	fullInputPath, filesToMerge, err := internal.Directory(fullInputPath)
	if err != nil {
		log.Printf("Given directory %v causes error %v.\n", input, err)
		sanityChecksFailed = true