Skip to content

Day1 Task2

genomewalker edited this page May 26, 2024 · 5 revisions

A BASH script to map reads using Bowtie2

Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day1/task2
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/data/mapping
Environment: day1

This is a Bash script that takes command-line arguments, runs Bowtie2 to map reads, sorts the resulting BAM file with Samtools, and cleans up intermediate files.

Click to get the code used to generate the image! Paste this code into the Mermaid Editor
    graph TD
    A[Input files] --> B{Command line arguments}
    B -->|index| C(index)
    B -->|reads| D(reads)
    B -->|threads| E(threads)
    C --> F[Reference index]
    D --> G[Input reads]
    E -->|Default: 2| H[Number of threads]
    F -->|Required| I{Map reads with Bowtie2}
    G --> I
    I --> J{Sort with Samtools}
    J --> K{Clean up intermediate files}
    K --> L[Output BAM file]

Let's go through the script step by step to understand what it does and how it works.

  1. Input validation and argument parsing. This step checks the user input to make sure all required arguments are present, and that the specified files and options are valid. It also sets default values for any optional arguments that were not specified.

    #!/bin/bash
    set -euo pipefail # Enable strict mode
    trap 'echo "Error: ${BASH_SOURCE}:${LINENO}: ${BASH_COMMAND}" >&2' ERR
    trap 'echo "Interrupted" >&2 ; exit 1' INT
    
    # Define usage message
    usage() {
        echo "Usage: $(basename "$0") -i <index> -r <reads> [-t <threads>]"
    }
    
    # Parse command-line arguments
    while getopts ":i:r:t:" opt; do
        case ${opt} in
        i) index="${OPTARG}" ;;
        r) reads="${OPTARG}" ;;
        t) threads="${OPTARG}" ;;
        \?)
            echo "Invalid option: -${OPTARG}" >&2
            usage
            exit 1
            ;;
        :)
            echo "Option -${OPTARG} requires an argument" >&2
            usage
            exit 1
            ;;
        esac
    done
    
    # Check that required options were provided
    if [[ -z "${index:-}" || -z "${reads:-}" ]]; then
        usage
        exit 1
    fi
    
    # Set default values
    threads="${threads:-2}"
    
  2. Read mapping with Bowtie2 and Converting SAM to BAM and filtering mapped reads. This step uses Bowtie2, a short-read aligner, to map reads from the input FASTQ file(s) to a reference genome. Then, pipes the results from Bowtie2 into SAMtools to convert the SAM file output from the previous step to a binary BAM format. It then filters out any reads that were not successfully mapped to the reference genome, resulting in a BAM file containing only the mapped reads.

    # Construct output file paths
    file_name=$(basename ${reads})
    bam_mapped_file="${file_name%%[ .]*}".mapped.bam
    bam_file="${file_name%%[ .]*}".sorted.bam
    
    # Run Bowtie2 to map the reads
    bowtie2 --threads "${threads}" -x "${index}" -U "${reads}" \
        | samtools view -@ "${threads}" -bS -F 4 > "${bam_mapped_file}"
  3. Sorting the BAM file. This step uses SAMtools to sort the mapped BAM file by genomic coordinates, which is necessary for downstream analysis. Are you comfortable with the concept of sorting a BAM file? Do you have any questions about the sorting process?

    # Sort the resulting BAM file with Samtools
    samtools sort -@ "${threads}" -o "${bam_file}" "${bam_mapped_file}"
  4. Cleaning up intermediate files. This step removes any intermediate files generated during the previous steps, leaving only the final sorted BAM file.

    # Clean up intermediate files
    rm  -rf "${bam_mapped_file}"

We can call this script as follows:

cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day1/task2
./mapping-and-sorting.sh -r ../../data/mapping/sample1.fq.gz -i ../../data/mapping/SRS121011 -t 2
Clone this wiki locally