revise lessons 1 to 10
tavareshugo committed Dec 7, 2023
1 parent 0606e1e commit 31be15d
Showing 9 changed files with 216 additions and 130 deletions.
12 changes: 11 additions & 1 deletion materials/02-know_your_bug.md
@@ -5,7 +5,9 @@ title: "Know your bug"
::: {.callout-tip}
## Learning Objectives

- Understand that there are different ways to approach your analyses depending on which species you're analysing
- Recognise that there are different ways to approach your analyses depending on which species you're analysing.
- List the three main questions that should be asked about the organism you are working with, which will determine downstream analyses steps.
- Describe, at a high-level, the main analyses steps involved in each case.

:::

@@ -24,4 +26,12 @@ This week, we're going to work with three different bacterial species that each
::: {.callout-tip}
## Key Points

- Three main things should be considered when choosing an analysis workflow for bacterial sequencing data:
- How diverse is the species? Monoclonal species usually have lower diversity in the population.
- Do your isolates likely come from multiple lineages?
- Is bacterial recombination common in your species (transformation, transduction, conjugation)?
- Depending on the answer to these questions, your analyses workflow may involve:
- Mapping to a reference genome or using a pan-genome approach.
- Including a recombination removal step.
- The end goal of most bacterial genomics projects is the generation of a phylogenetic tree representing the relationships and diversity of your isolates.
:::
21 changes: 14 additions & 7 deletions materials/03-nextflow.md
@@ -5,8 +5,8 @@ title: "Workflow management"
::: {.callout-tip}
## Learning Objectives

- Understand what a workflow management system is.
- Understand the benefits of using a workflow management system.
- Define what a workflow management system is.
- List the benefits of using a workflow management system.
- Explain the benefits of using Nextflow as part of your bioinformatics workflow.

:::
@@ -15,7 +15,7 @@ title: "Workflow management"

Analysing data involves a sequence of tasks, including gathering, cleaning, and processing data. This sequence of tasks is called a workflow or a pipeline. These workflows typically require executing multiple software packages, sometimes running on different computing environments, such as a desktop or a compute cluster. Traditionally, these workflows have been joined together in scripts using general purpose programming languages such as Bash or Python.
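As a minimal illustration of such a traditional script, the commands below chain the gather, clean, and process steps together; they are simple stand-ins for real bioinformatics tools, and the file names are invented for the example:

```bash
#!/bin/bash
# Sketch of a traditional linear pipeline script; each command is a
# stand-in for a real bioinformatics tool.
set -euo pipefail

echo "sample1,ok"   >  raw.csv     # gather data
echo "sample2,fail" >> raw.csv
grep ",ok$" raw.csv > clean.csv    # clean data: keep only passing samples
wc -l < clean.csv   > counts.txt   # process data: count what remains
cat counts.txt
```

Each step runs strictly after the previous one finishes, and any change to the logic means editing the script by hand: this is exactly the management burden that grows as workflows become larger.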

![Example bioinformatics variant calling workflow/pipeline diagram from nf-core (https://nf-co.re/bactmap).](images/Bactmap_pipeline.png)
![Example bioinformatics variant calling workflow/pipeline diagram from nf-core ([bactmap](https://nf-co.re/bactmap)).](images/Bactmap_pipeline.png)

However, as workflows become larger and more complex, the management of the programming logic and software becomes difficult.

@@ -31,8 +31,7 @@ Key features include:
- **Software management**: Use of technology like containers, such as [Docker](https://www.docker.com) or [Singularity](https://sylabs.io/singularity), that packages up code and all its dependencies so the application runs reliably from one computing environment to another.
- **Portability & Interoperability**: Workflows written on one system can be run on another computing infrastructure e.g., local computer, compute cluster, or cloud infrastructure.
- **Reproducibility**: The use of software management systems and a pipeline specification means that the workflow will produce the same results when re-run, including on different computing platforms.
- **Reentrancy**: Continuous checkpoints allow workflows to resume
from the last successfully executed steps.
- **Reentrancy**: Continuous checkpoints allow workflows to resume from the last successfully executed steps.

## Nextflow basic concepts

Expand All @@ -42,7 +41,7 @@ Nextflow is built around the idea that Linux is the lingua franca of data scienc

Nextflow extends this approach, adding the ability to define complex program interactions and an accessible (high-level) parallel computational environment based on the [dataflow programming model](https://devopedia.org/dataflow-programming), whereby `processes` are connected via their `outputs` and `inputs` to other `processes`, and run as soon as they receive an input. The diagram below illustrates the differences between a dataflow model and a simple linear program.

![A simple program (a) and its dataflow equivalent (b) https://doi.org/10.1145/1013208.1013209.](images/dataflow.png)
![A simple program (a) and its dataflow equivalent (b). Adapted from [Johnston, Hanna and Millar 2004](https://doi.org/10.1145/1013208.1013209).](images/dataflow.png)

In a simple program **(a)**, these statements would be executed sequentially. Thus, the program would execute in three units of time. In the dataflow programming model **(b)**, this program takes only two units of time. This is because the read quantitation and QC steps have no dependencies on each other and therefore can execute simultaneously in parallel.
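The same contrast between (a) and (b) can be sketched in plain Bash: the two independent steps below run in parallel, and the dependent step starts only once both have finished. The step names are invented stand-ins for the read quantitation and QC steps:

```bash
# Sketch of the dataflow idea in plain Bash: two independent steps run
# at the same time; the dependent step waits for both before it starts.
quantify() { sleep 1; echo "quantification done"; }
qc()       { sleep 1; echo "QC done"; }

quantify > quant.txt &   # start both independent steps in parallel
qc       > qc.txt &
wait                     # barrier: the next step needs both outputs

cat quant.txt qc.txt > report.txt   # dependent step
cat report.txt
```

Because the two `sleep 1` steps overlap, the whole sketch takes roughly one time unit instead of two. A workflow manager like Nextflow generalises this scheduling automatically from the declared inputs and outputs, rather than requiring hand-written `&` and `wait` logic.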

@@ -102,13 +101,21 @@ Nextflow provides out-of-the-box support for major batch schedulers and cloud pl

## Snakemake

In this tutorial we've focused on Nextflow but many people in the bioinformatics community use Snakemake. Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. The main difference is that workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
In this section we've focused on Nextflow but many people in the bioinformatics community use [Snakemake](https://snakemake.readthedocs.io/en/stable/). Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses and it supports all the same features mentioned above. Perhaps the most noticeable difference for users is that Snakemake is based on the Python programming language. This makes it more approachable for those already familiar with this language.


## Summary

::: {.callout-tip}
## Key Points

- Workflow management software is designed to simplify the process of orchestrating complex computational pipelines that involve various tasks, inputs and outputs, and parallel processing.
- Using workflow management software to manage complex pipelines has several advantages: reproducibility, parallel task execution, automatic software management, scalability (from a local computer to cloud and HPC cluster servers) and "checkpoint and resume" ability.
- Nextflow and Snakemake are two of the most popular workflow managers used in bioinformatics, with an active community of developers and several useful features:
- Flexible syntax that can be adapted to any task.
- The ability to reuse and share modules written by the community.
- Integration with code sharing platforms such as GitHub and GitLab.
- Use of containerisation solutions (Docker and Singularity) and software package managers such as Conda.
:::

## Credit
77 changes: 46 additions & 31 deletions materials/04-nf_core.md
@@ -5,22 +5,22 @@ title: "The nf-core project"
::: {.callout-tip}
## Learning Objectives

- Understand what nf-core is and how it relates to Nextflow.
- Use the nf-core helper tool to find nf-core pipelines.
- Describe what nf-core is and how it relates to Nextflow.
- Search for public nf-core pipelines and access their help documentation.
- Understand how to configure nf-core pipelines.

:::

## What is nf-core?

nf-core is a community-led project to develop a set of best-practice pipelines built using Nextflow workflow management system.
[nf-core](https://nf-co.re/) is a community-led project to develop a set of best-practice pipelines built using the Nextflow workflow management system.
Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic code testing.

![nf-core](images/nf-core.png)

## What are nf-core pipelines?

nf-core pipelines are an organised collection of Nextflow scripts, other non-nextflow scripts (written in any language), configuration files, software specifications, and documentation hosted on [GitHub](https://github.com/nf-core). There is generally a single pipeline for a given data and analysis type e.g. There is a single pipeline for bulk RNA-Seq. All nf-core pipelines are distributed under the, permissive free software, [MIT licences](https://en.wikipedia.org/wiki/MIT_License).
nf-core pipelines are an organised collection of Nextflow scripts, other non-nextflow scripts (written in any language), configuration files, software specifications, and documentation hosted on [GitHub](https://github.com/nf-core). There is generally a single pipeline for a given data and analysis type, e.g. there is a single pipeline for bulk RNA-Seq. All nf-core pipelines are distributed under the open [MIT licence](https://en.wikipedia.org/wiki/MIT_License).

## Running nf-core pipelines

@@ -35,7 +35,11 @@ Pipeline-specific documentation is bundled with each pipeline in the /docs folder

Each pipeline has its own webpage e.g. [nf-co.re/rnaseq](https://nf-co.re/rnaseq/usage).

In addition to this documentation, each pipeline comes with basic command line reference. This can be seen by running the pipeline with the parameter `--help` , for example:
In addition to this documentation, each pipeline comes with a basic command line reference. This can be seen by running the pipeline with the parameter `--help`.
It is also recommended to explicitly specify the version of the pipeline you want to run, to ensure reproducibility if you run it again in the future.
This can be done with the `-r` option.

For example, the following command prints the help documentation for version 3.4 of the `nf-core/rnaseq` pipeline:

```bash
nextflow run -r 3.4 nf-core/rnaseq --help
```

@@ -77,52 +81,55 @@ Nextflow can load pipeline configurations from multiple locations. nf-core pipe
![Nextflow config loading order](images/nfcore_config.png)

1. Pipeline: Default 'base' config
* Always loaded. Contains pipeline-specific parameters and "sensible defaults" for things like computational requirements
* Does not specify any method for software packaging. If nothing else is specified, Nextflow will expect all software to be available on the command line.
* Always loaded. Contains pipeline-specific parameters and "sensible defaults" for things like computational requirements.
* Does not specify any method for software packaging. If nothing else is specified, Nextflow will expect all software to be available on the command line.
2. Core config profiles
* All nf-core pipelines come with some generic config profiles. The most commonly used ones are for software packaging: docker, singularity and conda
* Other core profiles are debug and two test profiles. There two test profile, a small test profile (nf-core/test-datasets) for quick test and a full test profile which provides the path to full sized data from public repositories.
* All nf-core pipelines come with some generic config profiles. The most commonly used ones are for software packaging: docker, singularity and conda.
* Other core profiles are 'debug' and two 'test' profiles. The two test profiles include: a small version for quick testing, which pulls data from a public repository at [nf-core/test-datasets](https://github.com/nf-core/test-datasets); and a full test profile which provides the path to full-sized data from public repositories.
3. Server profiles
* At run time, nf-core pipelines fetch configuration profiles from the [configs remote repository](https://github.com/nf-core/configs). The profiles here are specific to clusters at different institutions.
* Because this is loaded at run time, anyone can add a profile here for their system and it will be immediately available for all nf-core pipelines.
* At run time, nf-core pipelines fetch configuration profiles from the [configs remote repository](https://github.com/nf-core/configs). The profiles here are specific to clusters at different institutions.
* Because this is loaded at run time, anyone can add a profile here for their system and it will be immediately available for all nf-core pipelines.
4. Local config files given to Nextflow with the `-c` flag
```bash
nextflow run nf-core/rnaseq -r 3.0 -c mylocal.config
```
5. Command line configuration: pipeline parameters can be passed on the command line using the `--<parameter>` syntax.

```bash
nextflow run nf-core/rnaseq -r 3.0 --email "[email protected]"
```
```bash
nextflow run nf-core/rnaseq -r 3.4 -c mylocal.config
```

5. Command line configuration: pipeline parameters can be passed on the command line using the `--<parameter>` syntax. We will see several examples of this use throughout the course.
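To make the idea of a local config file more concrete, here is a sketch of what a `mylocal.config` passed with `-c` might contain. All parameter names and values here are hypothetical; check your pipeline's documentation for the options it actually supports:

```groovy
// mylocal.config — a hypothetical run-specific configuration file
params {
    email  = "[email protected]"  // same effect as --email on the command line
    outdir = "results"
}

process {
    // cap computational resources to fit the machine in use
    cpus   = 2
    memory = '8.GB'
}
```

Keeping run-specific settings in a file like this, under version control alongside your analysis, makes the run easier to reproduce than a long one-off command line.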

### Config Profiles

### Config profiles

nf-core makes use of Nextflow configuration `profiles` to make it easy to apply a group of options on the command line.

Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the `-profile` command line option. Common profiles are `conda`, `singularity` and `docker` that specify which software manager to use.

Multiple profiles are comma-separated. When there are differing configuration settings provided by different profiles, the right-most profile takes priority.

For example, the following command runs the `test` profile (i.e. run the test pipeline) and use Singularity containers as a way to manage the software required by the pipeline:

```bash
nextflow run nf-core/rnaseq -r 3.0 -profile test,conda
nextflow run nf-core/rnaseq -r 3.0 -profile <institutional_config_profile>, test, conda
nextflow run nf-core/rnaseq -r 3.4 -profile test,singularity
```

**Note** The order in which config profiles are specified matters. Their priority increases from left to right.

### Multiple Nextflow configuration locations
Be clever with multiple Nextflow configuration locations. For example, use `-profile` for your cluster configuration, the file `$HOME/.nextflow/config` for your personal config such as `params.email` and a working directory >`nextflow.config` file for reproducible run-specific configuration.
:::{.callout-note}
#### Multiple Nextflow configuration locations

Be clever with multiple Nextflow configuration locations. For example, use `-profile` for your cluster configuration, the file `$HOME/.nextflow/config` for your personal config such as `params.email` and a working directory `nextflow.config` file for reproducible run-specific configuration.
:::

### Running pipelines with test data

The nf-core config profile `test` is special profile, which defines a minimal data set and configuration, that runs quickly and tests the workflow from beginning to end. Since the data is minimal, the output is often nonsense. Real world example output are instead linked on the nf-core pipeline web page, where the workflow has been run with a full size data set:
The nf-core config profile `test` is a special profile, which defines a minimal data set and configuration that runs quickly and tests the workflow from beginning to end. Since the data is minimal, the output is often nonsense. Real-world example outputs are instead linked on the nf-core pipeline web page, where the workflow has been run with a full-size data set:

```bash
$ nextflow run nf-core/<pipeline_name-profile test
$ nextflow run nf-core/<pipeline_name> -profile test
```

:::{.callout-tip}
### Software configuration profile
#### Software configuration profile
Note that you will typically still need to combine this with a software configuration profile for your system - e.g.
`-profile test,conda`.
Running with the test profile is a great way to confirm that you have Nextflow configured properly for your system before attempting to run with real data.
@@ -135,8 +142,8 @@ If you run into issues running your pipeline you can use the nf-core website t
### Extra resources and getting help

If you still have an issue with running the pipeline then feel free to contact the nf-core community via the Slack channel .
The nf-core Slack organisation has channels dedicated for each pipeline, as well as specific topics (eg. `#help`, `#pipelines`, `#tools`, `#configs` and much more).
The nf-core Slack can be found at https://nfcore.slack.com (NB: no hyphen in nfcore!). To join you will need an invite, which you can get at https://nf-co.re/join/slack.
The [nf-core Slack workspace](https://nfcore.slack.com) has channels dedicated to each pipeline, as well as to specific topics (e.g. `#help`, `#pipelines`, `#tools`, `#configs` and many more).
To join this workspace you will need an invite, which you can get at https://nf-co.re/join/slack.

You can also get help by opening an issue in the respective pipeline repository on GitHub asking for help.

@@ -150,9 +157,9 @@ If you have problems that are directly related to Nextflow and not our pipelines
If you use an nf-core pipeline in your work you should cite the main publication for the main nf-core paper, describing the community and framework,
as follows:

**The nf-core framework for community-curated bioinformatics pipelines.**
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). ReadCube: [Full Access Link](https://rdcu.be/b1GjZ)
> **The nf-core framework for community-curated bioinformatics pipelines.**
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
> Nat Biotechnol. 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). ReadCube: [Full Access Link](https://rdcu.be/b1GjZ)

Many of the pipelines have a publication listed on the nf-core website that can be found [here](https://nf-co.re/publications).

@@ -161,11 +168,19 @@ Many of the pipelines have a publication listed on the nf-core website that can
In addition, each release of an nf-core pipeline has a digital object identifier (DOI) for easy referencing in the literature.
The DOIs are generated by Zenodo from the pipeline's GitHub repository.
## Summary
::: {.callout-tip}
## Key Points
- nf-core is a community-led project to develop and curate a set of high-quality bioinformatic pipelines.
- All pipelines are listed on the [nf-co.re](https://nf-co.re/pipelines) website.
- Each pipeline contains both general usage documentation, as well as detailed help for its input parameters and the output files generated by the pipeline.
- Several aspects of the pipelines can be configured, using configuration files.
- Default profiles can be used to load a set of default configurations. Common profiles in nf-core pipelines include:
- `test` to run a quick test, useful to see if the software is correctly setup on the computer/server being used.
- `docker` and `singularity` to indicate we want to use either Docker or Singularity to automatically manage the software installation during the pipeline run.
:::
## Credit
