From 4e9b5c07678d31a26dea66d90e0b0f36ef7beb4e Mon Sep 17 00:00:00 2001
From: Alan O'Cais
Date: Mon, 29 Jan 2024 12:01:15 +0100
Subject: [PATCH 1/8] Add new episodes

---
 config.yaml                             |  12 +-
 episodes/01-introduction.md             | 173 +++++++++++++
 episodes/02-snakemake_on_the_cluster.md | 246 +++++++++++++++++++
 episodes/03-placeholders.md             |  78 ++++++
 episodes/04-snakemake_and_mpi.md        | 311 ++++++++++++++++++++++++
 5 files changed, 816 insertions(+), 4 deletions(-)
 create mode 100644 episodes/01-introduction.md
 create mode 100644 episodes/02-snakemake_on_the_cluster.md
 create mode 100644 episodes/03-placeholders.md
 create mode 100644 episodes/04-snakemake_and_mpi.md

diff --git a/config.yaml b/config.yaml
index ec8758f..9ea6eac 100644
--- a/config.yaml
+++ b/config.yaml
@@ -58,7 +58,11 @@ contact: 'maintainers-hpc@lists.carpentries.org'
 #  - another-learner.md

 # Order of episodes in your lesson
-episodes:
+episodes: 
+- 01-introduction.md
+- 02-snakemake_on_the_cluster.md
+- 03-placeholders.md
+- 04-snakemake_and_mpi.md
 - amdahl_foundation.md
 - snakemake_single.md
 - snakemake_multiple.md
 - snakemake_cluster.md
 - snakemake_profiles.md
 - amdahl_snakemake.md

 # Information for Learners
-learners:
+learners: 

 # Information for Instructors
-instructors:
+instructors: 

 # Learner Profiles
-profiles:
+profiles: 

 # Customisation ---------------------------------------------
 #

diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md
new file mode 100644
index 0000000..5f69b66
--- /dev/null
+++ b/episodes/01-introduction.md
@@ -0,0 +1,173 @@
---
title: "Running commands with Snakemake"
teaching: 30
exercises: 30
---

::: questions
- "How do I run a simple command with Snakemake?"
:::

:::objectives
- "Create a Snakemake recipe (a Snakefile)"
:::


## What is the workflow I'm interested in?

In this lesson we will run an experiment that takes an application which runs
in parallel and investigates its scalability. To do that we will need to gather
data, in this case that means running the application multiple times with
different numbers of CPU cores and recording the execution time. Once we've
done that we need to create a visualisation of the data to see how it compares
against the ideal case.

From the visualisation we can then decide at what scale it
makes most sense to run the application in production to maximise the use of
our CPU allocation on the system.

We could do all of this manually, but there are useful tools to help us manage
data analysis pipelines like we have in our experiment. Today we'll learn about
one of those: Snakemake.

In order to get started with Snakemake, let's begin by taking a simple command
and see how we can run that via Snakemake. Let's choose the command `hostname`
which prints out the name of the host where the command is executed:

```bash
[ocaisa@node1 ~]$ hostname
```
```output
node1.int.jetstream2.hpc-carpentry.org
```

That prints out the result but Snakemake relies on files to know the status of
your workflow, so let's redirect the output to a file:

```bash
[ocaisa@node1 ~]$ hostname > hostname_login.txt
```

## Making a Snakefile

Edit a new text file named `Snakefile`.

Contents of `Snakefile`:

```python
rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"
```

::: callout

## Key points about this file

1. The file is named `Snakefile` - with a capital `S` and no file extension.
1. Some lines are indented. Indents must be with space characters, not tabs.
See the
   setup section for how to make your text editor do this.
1. The rule definition starts with the keyword `rule` followed by the rule name, then a colon.
1. We named the rule `hostname_login`. You may use letters, numbers or underscores, but the rule name
   must begin with a letter and may not be a keyword.
1. The keywords `input`, `output`, `shell` are all followed by a colon.
1. The file names and the shell command are all in `"quotes"`.
1. The output filename is given before the input filename. In fact, Snakemake doesn't care what
   order they appear in but we give the output first throughout this course. We'll see why soon.
1. In this use case there is no input file for the command so we leave this blank.

:::

Back in the shell we'll run our new rule. At this point, if there were any missing quotes, bad
indents, etc. we may see an error.

```bash
$ snakemake -j1 -p hostname_login.txt
```

::: callout

## `bash: snakemake: command not found...`

If your shell tells you that it cannot find the command `snakemake` then we need to make the
software available somehow. In our case, this means searching for the module that we need to
load:

```bash
module spider snakemake
```

```output
[ocaisa@node1 ~]$ module spider snakemake

--------------------------------------------------------------------------------------------------------
  snakemake:
--------------------------------------------------------------------------------------------------------
     Versions:
        snakemake/8.2.1-foss-2023a
        snakemake/8.2.1 (E)

Names marked by a trailing (E) are extensions provided by another module.


--------------------------------------------------------------------------------------------------------
  For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider snakemake/8.2.1
--------------------------------------------------------------------------------------------------------

```

Now we want the module, so let's load that to make the package available

```bash
[ocaisa@node1 ~]$ module load snakemake
```

and then make sure we have the `snakemake` command available

```bash
[ocaisa@node1 ~]$ which snakemake
```
```output
/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake
```
:::

::: challenge
## Running Snakemake

Run `snakemake --help | less` to see the help for all available options.
What does the `-p` option in the `snakemake` command above do?

1. Protects existing output files
1. Prints the shell commands that are being run to the terminal
1. Tells Snakemake to only run one process at a time
1. Prompts the user for the correct input file

*Hint: you can search in the text by pressing `/`, and quit back to the shell with `q`*

:::::: Solution

(2) Prints the shell commands that are being run to the terminal

This is such a useful thing we don't know why it isn't the default! The `-j1` option is what
tells Snakemake to only run one process at a time, and we'll stick with this for now as it
makes things simpler. The `-F` option tells Snakemake to always overwrite output files, and
we'll learn about protected outputs much later in the course. Answer 4 is a total red herring,
as Snakemake never prompts interactively for user input.
+:::::: +::: + +::: keypoints + +- "Before running Snakemake you need to write a Snakefile" +- "A Snakefile is a text file which defines a list of rules" +- "Rules have inputs, outputs, and shell commands to be run" +- "You tell Snakemake what file to make and it will run the shell command defined in the + appropriate rule" + +::: diff --git a/episodes/02-snakemake_on_the_cluster.md b/episodes/02-snakemake_on_the_cluster.md new file mode 100644 index 0000000..8f481b5 --- /dev/null +++ b/episodes/02-snakemake_on_the_cluster.md @@ -0,0 +1,246 @@ +--- +title: "Running Snakemake on the cluster" +teaching: 30 +exercises: 20 +--- + +::: objectives + +- "Define rules to run locally and on the cluster" + +::: + +::: questions + +- "How do I run my Snakemake rule on the cluster?" + +::: + +What happens when we want to make our rule run on the cluster rather than the +login node? The cluster we are using uses Slurm, and it happens that Snakemake +has built in support for Slurm, we just need to tell it that we want to use it. + +Snakemake uses the `executor` option to allow you to select the plugin that you +wish to execute the rule. The quickest way to apply this to your Snakefile is to +define this on the command line. Let's try it out + +```bash +[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_login +``` + +```output +Building DAG of jobs... +Retrieving input from storage. +Nothing to be done (all requested files are present and up to date). +``` + +Nothing happened! Why not? When it is asked to build a target, Snakemake checks +the 'last modification +time' of both the target and its dependencies. If any dependency has been +updated since the target, then the actions are re-run to update the target. +Using this approach, Snakemake knows to only rebuild the files that, either +directly or indirectly, depend on the file that changed. This is called an +_incremental build]_. + +::: callout +## Incremental Builds Improve Efficiency + +By only rebuilding files when required, Snakemake makes your processing +more efficient. +::: + + +::: challenge +## Running on the cluster + +We need another rule now that executes the `hostname` on the cluster. Create the +rule in your Snakefile and try to execute it on cluster with the options +`--executor slurm` to `snakemake` + +:::::: solution + +The rule is almost identical to the previous rule save for the rule name and +output file: + +```python +rule hostname_remote: + output: "hostname_remote.txt" + input: + shell: + "hostname > hostname_remote.txt" + +``` +You can then execute the rule with +```bash +[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_remote +``` +```output +Building DAG of jobs... +Retrieving input from storage. +Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash +Provided remote nodes: 1 +Job stats: +job count +--------------- ------- +hostname_remote 1 +total 1 + +Select jobs to execute... +Execute 1 jobs... + +[Mon Jan 29 18:03:46 2024] +rule hostname_remote: + output: hostname_remote.txt + jobid: 0 + reason: Missing output files: hostname_remote.txt + resources: tmpdir= + +hostname > hostname_remote.txt +No SLURM account given, trying to guess. +Guessed SLURM account: def-users +No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources. +No job memory information ('mem_mb' or 'mem_mb_per_cpu') is given - submitting without. 
This might or might not work on your cluster.
Job 0 has been submitted with SLURM jobid 326 (log: /home/ocaisa/.snakemake/slurm_logs/rule_hostname_remote/326.log).
[Mon Jan 29 18:04:26 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log
```
Note all the warnings that Snakemake is giving us about the fact that the rule
may not be able to execute on our cluster as we may not have given enough
information. Luckily for us, this actually works on our cluster and we can take
a look in the output file we asked for, `hostname_remote.txt`:
```bash
[ocaisa@node1 ~]$ cat hostname_remote.txt
```
```output
tmpnode1.int.jetstream2.hpc-carpentry.org
```

::::::

:::

## Snakemake profile

Adapting Snakemake to a particular environment can entail many flags and
options. Therefore, it is possible to specify a configuration profile to be used
to obtain default options. This looks like
```bash
snakemake --profile myprofileFolder ...
```
The profile folder must contain a file called `config.yaml` which is what will
store our options. The folder may also contain other files necessary for the
profile. Let's create the file `cluster_profile/config.yaml` and insert some of
our existing options:

```yaml
printshellcmds: True
jobs: 3
executor: slurm
```

We should now be able to rerun our workflow by pointing to the profile rather
than listing out the options. To force our workflow to rerun, we first need to
remove the output file `hostname_remote.txt`, and then we can try out our new
profile
```bash
[ocaisa@node1 ~]$ rm hostname_remote.txt
[ocaisa@node1 ~]$ snakemake --profile cluster_profile hostname_remote
```

The profile is extremely useful in the context of our cluster, as the Slurm
executor has lots of options, and sometimes you need to use them to be able to
submit jobs to the cluster you have access to. Unfortunately, the names of the
options in Snakemake are not _exactly_ the same as those of Slurm, so we need
the help of a translation table:

| SLURM             | Snakemake         | Description                                                                  |
|-------------------|-------------------|------------------------------------------------------------------------------|
| `--partition`     | `slurm_partition` | the partition a rule/job is to use                                           |
| `--time`          | `runtime`         | the walltime per job in minutes                                              |
| `--constraint`    | `constraint`      | may hold features on some clusters                                           |
| `--mem`           | `mem`, `mem_mb`   | memory a cluster node must provide (`mem`: string with unit, `mem_mb`: int)  |
| `--mem-per-cpu`   | `mem_mb_per_cpu`  | memory per reserved CPU                                                      |
| `--ntasks`        | `tasks`           | number of concurrent tasks / ranks                                           |
| `--cpus-per-task` | `cpus_per_task`   | number of cpus per task (in case of SMP, rather use `threads`)               |
| `--nodes`         | `nodes`           | number of nodes                                                              |

The warnings given by Snakemake hinted that we need to provide these options.
One way to do that is to provide them as part of the Snakemake rule, e.g.,
```python
rule:
    input: ...
    output: ...
    resources:
        partition: 
        runtime: 
```
and we can also use the profile to define default values for these options to
use with our project. For example, the available memory on our cluster is about
4GB per core, so we can add that to our profile:
```yaml
printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
```

:::challenge
We know that our problem runs in a very short time. Make the default length of
our jobs to two minutes for Slurm.
+::::::solution +```yaml +printshellcmds: True +jobs: 3 +executor: slurm +default-resources: + - mem_mb_per_cpu=3600 + - runtime=2 +``` +:::::: +::: + +There are various `sbatch` options not directly supported by the resource +definitions in the table above. You may use the `slurm_extra` resource to +specify any of these additional flags to `sbatch`: + +```python +rule myrule: + input: ... + output: ... + resources: + slurm_extra="--mail-type=ALL --mail-user=" +``` + +## Local rule execution + +Our initial rule was to +get the hostname of the login node. We always want to run that rule on the login +node for that to make sense. If we tell Snakemake to run all rules via the +Slurm executor (which is what we are doing via our new profile) this +won't happen any more. So how do we force the rule to run on +the login node? + +Well, it's no surprise that some Snakemake rules perform trivial tasks where job +submission might be +overkill (e.g., less than 1 minute worth of compute time). Similar to our case, +it would be a better +idea to have these rules execute locally (i.e. where the `snakemake` command is +run) instead of as a job. Snakemake lets you indicate which rules should always +run locally with the `localrules` keyword. Let's define `hostname_login` as a +local rule near the top of our `Snakefile`. + +```python +localrules: hostname_login +``` + +::: keypoints + +- "Snakemake generates and submits its own batch scripts for your scheduler." +- "You can store default configuration settings in a Snakemake profile" +- "`localrules` defines rules that are executed locally, and never submitted to a cluster." + +::: diff --git a/episodes/03-placeholders.md b/episodes/03-placeholders.md new file mode 100644 index 0000000..9dde975 --- /dev/null +++ b/episodes/03-placeholders.md @@ -0,0 +1,78 @@ +--- +title: "Placeholders" +teaching: 40 +exercises: 30 +--- + +::: questions +- "How do I make a generic rule?" +- "How does Snakemake decide what rule to run?" +::: + +::: objectives +- "Understand the basic steps Snakemake goes through when running a workflow" +- "See how Snakemake deals with some errors" +::: + +Our Snakefile has some duplication. For example, the names of text +files are repeated in places throughout the Snakefile rules. Snakefiles are +a form of code and, in any code, repetition can lead to problems (e.g. we rename +a data file in one part of the Snakefile but forget to rename it elsewhere). + +::: callout +## D.R.Y. (Don't Repeat Yourself) + +In many programming languages, the bulk of the language features are +there to allow the programmer to describe long-winded computational +routines as short, expressive, beautiful code. Features in Python, +R, or Java, such as user-defined variables and functions are useful in +part because they mean we don't have to write out (or think about) +all of the details over and over again. This good habit of writing +things out only once is known as the "Don't Repeat Yourself" +principle or D.R.Y. +::: + +Let us set about removing some of the repetition from our Snakefile. + +## Placeholders + +To make a more general-purpose rule we need **placeholders**. 
Let's take a look
at what a placeholder looks like:

```python
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > {output}"

```

As a reminder, here's the previous version from the last episode:

```python
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"

```

The new rule has replaced explicit file names with things in `{curly brackets}`,
specifically `{output}` (but it could also have been `{input}`...if that had
a value and were useful).


### `{input}` and `{output}` are **placeholders**

Placeholders are used in the `shell` section of a rule, and Snakemake will
replace them with appropriate values - `{input}` with the full name of the input
file, and
`{output}` with the full name of the output file -- before running the command.

:::keypoints
- "Snakemake rules are made more generic with placeholders"
- "Placeholders in the shell part of the rule are replaced with values based on the chosen
  wildcards"
:::
diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md
new file mode 100644
index 0000000..63e58ba
--- /dev/null
+++ b/episodes/04-snakemake_and_mpi.md
@@ -0,0 +1,311 @@
---
title: "MPI applications and Snakemake"
teaching: 30
exercises: 20
---

::: objectives

- "Run an MPI application on the cluster via Snakemake"

:::

::: questions

- "How do I run an MPI application via Snakemake on the cluster?"

:::

Now it's time to start getting back to our real workflow. We can execute a
command on the cluster, but what about executing the MPI application we are
interested in? Our application is called `amdahl` and is available as an
environment module.

::: challenge

Locate and load the `amdahl` module and then replace our `hostname_remote` rule
with a version that runs `amdahl`. (Don't worry about parallel MPI just yet, run
it with a single CPU, `mpirun -n 1 amdahl`).

Does your rule execute correctly? If not look through the log files to find out
why?

::::::solution

```bash
module spider amdahl
module load amdahl
```
will locate and then load the `amdahl` module. We can then update/replace our
rule to run the `amdahl` application:
```python
rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    shell:
        "amdahl > amdahl_run.txt"
```
However, when we try to execute the rule we get an error (unless you already
have a different version of `amdahl` already available in your path). Snakemake
reports the
location of the logs and if we look inside we can (eventually) find
```output
...
amdahl > amdahl_run.txt
/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash: line 1: amdahl: command not found
...
```
So, even though we loaded the module before running the workflow, our
Snakemake rule didn't find the executable. That's because the Snakemake rule
is running in a clean runtime environment, and we need to somehow tell it to
load the necessary environment module before trying to execute the rule.

::::::
:::

## Snakemake and environment modules

Our application is called `amdahl` and is available on the system via an
environment module, so we need to
tell Snakemake to load the module before it tries to execute the rule.
Snakemake
is aware of environment modules, and these can be specified via (yet another)
option:
```python
rule amdahl_run:
    output: "amdahl_run.txt"
    envmodules:
        "mpi4py",
        "amdahl"
    input:
    shell:
        "mpiexec -n 1 {executable} > {output}"
```

Adding these lines is not enough to make the rule execute however. Not only do
you have to tell Snakemake what modules to load, but you also have to tell it to
use environment modules in general (since the use of environment modules is
considered to make your runtime environment less reproducible as the available
modules may differ from cluster to cluster). This requires you to give Snakemake
an additional option
```bash
snakemake --profile cluster_profile --use-envmodules amdahl_run
```

::: challenge

We'll be using environment modules throughout the rest of the tutorial, so make
that a default option of our profile (by setting its value to `True`)

::::::solution

Update our cluster profile to
```yaml
printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
  - runtime=2
use-envmodules: True
```
If you want to test it, you need to erase the output file of the rule and rerun
Snakemake.

::::::

:::

## Snakemake and MPI

We didn't really run an MPI application in the last section as we only ran on
one core. How do we request to run on multiple cores for a single rule?

Snakemake has general support for MPI, but the only executor that currently
explicitly supports MPI is the Slurm executor (lucky for us!). If we look back
at our Slurm to Snakemake translation table we notice the relevant options
appear near the bottom:

| SLURM             | Snakemake         | Description                                                      |
|-------------------|-------------------|------------------------------------------------------------------|
| ...               | ...               | ...                                                              |
| `--ntasks`        | `tasks`           | number of concurrent tasks / ranks                               |
| `--cpus-per-task` | `cpus_per_task`   | number of cpus per task (in case of SMP, rather use `threads`)   |
| `--nodes`         | `nodes`           | number of nodes                                                  |

The one we are interested in is `tasks` as we are only going to increase the
number of ranks. We can define these in a `resources` section of our rule and
refer to them using placeholders:
```python
rule amdahl_run:
    output: "amdahl_run.txt"
    envmodules:
        "amdahl"
    resources:
        mpi='mpiexec',
        tasks=2
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
```

That worked but now we have a bit of an issue. We want to do this for a few
different values of `tasks` that would mean we would need a different output
file for every run. It would be great if we can somehow indicate in the `output`
the value that we want to use for `tasks`...and have Snakemake pick that up.

We could use a _wildcard_ in the `output` to allow us to
define `tasks` we wish to use. The syntax for such a wildcard looks like
```python
output: "amdahl_run_{parallel_tasks}.txt"
```
where `parallel_tasks` is our wildcard.

::: callout
## Wildcards

Wildcards are used in the `input` and `output` lines of the rule to represent
parts of filenames.
Much like the `*` pattern in the shell, the wildcard can stand in for any text
in order to make up
the desired filename. As with naming your rules, you may choose any name you
like for your
wildcards, so here we used `parallel_tasks`. Using the same wildcards in the
input and output is what tells Snakemake how to match input files to output
files.
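For example, a hypothetical rule (not part of our workflow) might use the same
wildcard in both its `input` and `output`:

```python
# Hypothetical illustration: requesting somefile.upper.txt makes Snakemake
# look for somefile.txt as the input, because {name} appears in both lines
rule make_uppercase:
    output: "{name}.upper.txt"
    input:  "{name}.txt"
    shell:
        "tr a-z A-Z < {input} > {output}"
```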
If two rules use a wildcard with the same name then Snakemake will treat them
as different entities - rules in Snakemake are self-contained in this way.

In the `shell` line you can reference the wildcard with
`{wildcards.parallel_tasks}`
:::

We could use a wildcard in the `output` to allow us to
define `tasks` we wish to use. This could look like
```python
rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        tasks="{parallel_tasks}"
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
```
but there are two problems with this:

* The only way for Snakemake to know the value of the wildcard is for the user
  to explicitly request a concrete output file:
  ```bash
  snakemake --profile cluster_profile amdahl_run_2.txt
  ```
* The bigger problem is that even doing that does not work, it seems we cannot
  use a wildcard for `tasks`:
  ```output
  WorkflowError:
  SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks.
  ```

Unfortunately there is no direct way for us to access the wildcards in this
scenario. The only way to do it is to _indirectly_ access the wildcards by
using a function. The solution for this is to write a one-time use function that
has no name. Such functions are called either anonymous functions or lamdba
functions (both mean the same thing).

To define a lambda function in python, the general syntax is as follows:
```python
lambda x: x + 54
```
Since a function _can_ see the wildcards, we can use that to set the value for
`tasks`:
```python
rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
```

Now we have a rule that can be used to generate output from runs of an
arbitrary number of parallel tasks.

::: callout

## Comments in Snakefiles

In the above code, the line beginning `#` is a comment line. Hopefully you are already in the
habit of adding comments to your own scripts. Good comments make any script more readable, and
this is just as true with Snakefiles.

:::

::: challenge

Create an output file for the case where we have 6 parallel tasks

:::::: solution

```bash
snakemake --profile cluster_profile amdahl_run_6.txt
```

::::::

:::

## Snakemake order of operations

We're only just getting started with some simple rules, but it's worth thinking about exactly what
Snakemake is doing when you run it. There are three distinct phases:

1. Prepares to run:
    1. Reads in all the rule definitions from the Snakefile
1. Plans what to do:
    1. Sees what file(s) you are asking it to make
    1. Looks for a matching rule by looking at the `output`s of all the rules it knows
    1. Fills in the wildcards to work out the `input` for this rule
    1. Checks that this input file (if required) is actually available
1. Runs the steps:
    1. Creates the directory for the output file, if needed
    1. Removes the old output file if it is already there
    1. Only then, runs the shell command with the placeholders replaced
    1. 
Checks that the command ran without errors *and* made the new output file as expected + +::: callout +## Dry-run (-n) mode + +It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to +run, and print them to the screen, but never actually run them. This is done with the `-n` +flag, eg: + +```bash +> $ snakemake -n ... +``` +::: + +The amount of checking may seem pedantic right now, but as the workflow gains more steps this will +become very useful to us indeed. + + +::: keypoints + +- "Snakemake chooses the appropriate rule by replacing wildcards such that the output matches + the target" +- "Snakemake checks for various error conditions and will stop if it sees a problem" + +::: From 8ec2256d376b53a081dfd8dbcff594d3318342a3 Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Mon, 29 Jan 2024 15:32:25 +0100 Subject: [PATCH 2/8] Make the main rule generic enough to prepare for chaining --- config.yaml | 7 +- episodes/02-snakemake_on_the_cluster.md | 5 +- episodes/04-snakemake_and_mpi.md | 152 +++++++++++++++++++++--- 3 files changed, 140 insertions(+), 24 deletions(-) diff --git a/config.yaml b/config.yaml index 9ea6eac..a6eef1e 100644 --- a/config.yaml +++ b/config.yaml @@ -63,12 +63,7 @@ episodes: - 02-snakemake_on_the_cluster.md - 03-placeholders.md - 04-snakemake_and_mpi.md -- amdahl_foundation.md -- snakemake_single.md -- snakemake_multiple.md -- snakemake_cluster.md -- snakemake_profiles.md -- amdahl_snakemake.md + # Information for Learners learners: diff --git a/episodes/02-snakemake_on_the_cluster.md b/episodes/02-snakemake_on_the_cluster.md index 8f481b5..c7dd3ae 100644 --- a/episodes/02-snakemake_on_the_cluster.md +++ b/episodes/02-snakemake_on_the_cluster.md @@ -40,7 +40,7 @@ time' of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an -_incremental build]_. +_incremental build_. ::: callout ## Incremental Builds Improve Efficiency @@ -191,7 +191,9 @@ default-resources: :::challenge We know that our problem runs in a very short time. Make the default length of our jobs to two minutes for Slurm. + ::::::solution + ```yaml printshellcmds: True jobs: 3 @@ -201,6 +203,7 @@ default-resources: - runtime=2 ``` :::::: + ::: There are various `sbatch` options not directly supported by the resource diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md index 63e58ba..c9b2f4a 100644 --- a/episodes/04-snakemake_and_mpi.md +++ b/episodes/04-snakemake_and_mpi.md @@ -25,7 +25,7 @@ environment module. Locate and load the `amdahl` module and then replace our `hostname_remote` rule with a version that runs `amdahl`. (Don't worry about parallel MPI just yet, run -it with a single CPU, `mpirun -n 1 amdahl`). +it with a single CPU, `mpiexec -n 1 amdahl`). Does your rule execute correctly? If not look through the log files to find out why? @@ -43,16 +43,27 @@ rule amdahl_run: output: "amdahl_run.txt" input: shell: - "amdahl > amdahl_run.txt" + "mpiexec -n 1 amdahl > amdahl_run.txt" ``` However, when we try to execute the rule we get an error (unless you already -have a different version of `amdahl` already available in your path). Snakemake +have a different version of `amdahl` available in your path). Snakemake reports the location of the logs and if we look inside we can (eventually) find ```output ... 
-amdahl > amdahl_run.txt -/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash: line 1: amdahl: command not found +mpiexec -n 1 amdahl > amdahl_run.txt +-------------------------------------------------------------------------- +mpiexec was unable to find the specified executable file, and therefore +did not launch the job. This error was first reported for process +rank 0; it may have occurred for other processes as well. + +NOTE: A common cause for this error is misspelling a mpiexec command + line parameter option (remember that mpiexec interprets the first + unrecognized command line token as the executable). + +Node: tmpnode1 +Executable: amdahl +-------------------------------------------------------------------------- ... ``` So, even though we loaded the module before running the workflow, our @@ -79,7 +90,7 @@ rule amdahl_run: "amdahl" input: shell: - "mpiexec -n 1 {executable} > {output}" + "mpiexec -n 1 amdahl > {output}" ``` Adding these lines are not enough to make the rule execute however. Not only do @@ -212,18 +223,22 @@ but there are two problems with this: SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks. ``` -Unfortunately there is no direct way for us to access the wildcards in this -scenario. The only way to do it is to _indirectly_ access the wildcards by -using a function. The solution for this is to write a one-time use function that -has no name. Such functions are called either anonymous functions or lamdba -functions (both mean the same thing). +Unfortunately for us, there is no direct way for us to access the wildcards. The +reason for this is that Snakemake tries to use the value of `tasks` during it's +initialisation stage, which is before we know the value of the wildcard. We need +to defer the determination of `tasks` to later on. This can be achieved by +specifying an input function instead of a value for this +scenario. The solution then is to write a one-time use function that +has no name to manipulate Snakmake. These kinds of functions are called either +anonymous functions or lamdba functions (both mean the same thing), and are a +feature of Python (and other programming languages). To define a lambda function in python, the general syntax is as follows: ```python lambda x: x + 54 ``` -Since a function _can_ see the wildcards, we can use that to set the value for -`tasks`: +Since a function _can_ take the wildcards as arguments, we can use that to set +the value for `tasks`: ```python rule amdahl_run: output: "amdahl_run_{parallel_tasks}.txt" @@ -254,14 +269,118 @@ this is just as true with Snakefiles. ::: +Since our rule is now capable of generating an arbitrary number of output files +things could get very crowded in our current directory. It's probably best then +to put the runs into a separate folder. 
We can just add the folder directly to
our `output`:

```python
rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.txt"
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
```

::: challenge

Create an output file (under the `runs` folder) for the case where we have 6
parallel tasks

:::::: solution

```bash
snakemake --profile cluster_profile runs/amdahl_run_6.txt
```

::::::

:::

Another thing about our application `amdahl` is that we ultimately want to
process the output to generate our scaling plot. The output right now is useful
for reading but makes processing harder. `amdahl` has an option that actually
makes this easier for us. To see the `amdahl` options we can use
```bash
[ocaisa@node1 ~]$ module load amdahl
[ocaisa@node1 ~]$ amdahl --help
```
```output
usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]

options:
  -h, --help            show this help message and exit
  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
                        Parallel proportion should be a float between 0 and 1
  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
                        Total seconds of workload, should be an integer greater than 0
  -t, --terse           Enable terse output
  -e, --exact           Disable random jitter
```
The option we are looking for is `--terse`, and that will make `amdahl` print
output in a format that is much easier to process, JSON. JSON format in a file
typically uses the file extension so let's add that option to our shell command
and change the file format of the output:

```python
rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.json"
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse > {output}"
```

There was another parameter for `amdahl` that caught my eye. `amdahl` has an
option `--parallel-proportion` (or `-p`) which we might be interested in
changing. This has an impact on the values we get in our results so let's add
another directory layer to our output format to reflect a particular choice for
this value. We can use a wildcard so we don't have to choose the value right
away:

```python
rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    input:
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
```

::: challenge

Create an output file for a value of `-p` of 0.999 (the default value is 0.8)
for the case where we have 6 parallel tasks.
:::::: solution ```bash -snakemake --profile cluster_profile amdahl_run_6.txt +snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json ``` :::::: @@ -270,8 +389,7 @@ snakemake --profile cluster_profile amdahl_run_6.txt ## Snakemake order of operations -We're only just getting started with some simple rules, but it's worth thinking about exactly what -Snakemake is doing when you run it. There are three distinct phases: +We're only just getting started with some simple rules, but it's worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases: 1. Prepares to run: 1. Reads in all the rule definitions from the Snakefile From 24199b6ee0ee21d6bfdc9f2d0f10d51ad3f781dd Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Mon, 29 Jan 2024 15:38:35 +0100 Subject: [PATCH 3/8] Missed the file extension --- episodes/04-snakemake_and_mpi.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md index c9b2f4a..944ea4f 100644 --- a/episodes/04-snakemake_and_mpi.md +++ b/episodes/04-snakemake_and_mpi.md @@ -328,7 +328,7 @@ options: ``` The option we are looking for is `--terse`, and that will make `amdahl` print output in a format that is much easier to process, JSON. JSON format in a file -typically uses the file extension so let's add that option to our shell command +typically uses the file extension `.json` so let's add that option to our shell command and change the file format of the output: ```python From c3cf390c6bce8e5cccb27390cb83d60cd4f00166 Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Mon, 29 Jan 2024 15:41:43 +0100 Subject: [PATCH 4/8] update --- episodes/04-snakemake_and_mpi.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md index 944ea4f..6fb3aad 100644 --- a/episodes/04-snakemake_and_mpi.md +++ b/episodes/04-snakemake_and_mpi.md @@ -405,7 +405,7 @@ We're only just getting started with some simple rules, but it's worth thinking 1. Checks that the command ran without errors *and* made the new output file as expected ::: callout -## Dry-run (-n) mode +## Dry-run (`-n`) mode It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to run, and print them to the screen, but never actually run them. 
This is done with the `-n` From e54d77904b39cecc7f8fe544948a7757006fa69b Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Mon, 29 Jan 2024 23:47:43 +0100 Subject: [PATCH 5/8] Add final episodes --- config.yaml | 3 +- episodes/01-introduction.md | 2 +- episodes/05-chaining_rules.md | 190 ++++++++++++++++++ episodes/06-expansion.md | 181 +++++++++++++++++ episodes/amdahl_foundation.md | 126 ------------ episodes/amdahl_snakemake.md | 61 ------ episodes/files/Snakefile_amdahl_cluster | 8 - episodes/files/Snakefile_cluster | 4 - episodes/files/Snakefile_cluster_iteration | 24 --- episodes/files/Snakefile_hello | 4 - episodes/files/Snakefile_iterative | 13 -- episodes/files/Snakefile_tworules | 9 - episodes/files/plot_terse_amdahl_results.py | 49 +++++ episodes/files/queuing_config.yaml | 6 - episodes/files/snakefiles/Snakefile_ep01 | 5 + episodes/files/snakefiles/Snakefile_ep02 | 13 ++ episodes/files/snakefiles/Snakefile_ep04 | 22 ++ episodes/files/snakefiles/Snakefile_ep05 | 28 +++ episodes/files/snakefiles/Snakefile_ep06 | 32 +++ .../cluster_profile_ep02/config.yaml | 6 + .../cluster_profile_ep04/config.yaml | 7 + episodes/snakemake_cluster.md | 63 ------ episodes/snakemake_multiple.md | 77 ------- episodes/snakemake_profiles.md | 67 ------ episodes/snakemake_single.md | 69 ------- 25 files changed, 536 insertions(+), 533 deletions(-) create mode 100644 episodes/05-chaining_rules.md create mode 100644 episodes/06-expansion.md delete mode 100644 episodes/amdahl_foundation.md delete mode 100644 episodes/amdahl_snakemake.md delete mode 100644 episodes/files/Snakefile_amdahl_cluster delete mode 100644 episodes/files/Snakefile_cluster delete mode 100644 episodes/files/Snakefile_cluster_iteration delete mode 100644 episodes/files/Snakefile_hello delete mode 100644 episodes/files/Snakefile_iterative delete mode 100644 episodes/files/Snakefile_tworules create mode 100644 episodes/files/plot_terse_amdahl_results.py delete mode 100644 episodes/files/queuing_config.yaml create mode 100644 episodes/files/snakefiles/Snakefile_ep01 create mode 100644 episodes/files/snakefiles/Snakefile_ep02 create mode 100644 episodes/files/snakefiles/Snakefile_ep04 create mode 100644 episodes/files/snakefiles/Snakefile_ep05 create mode 100644 episodes/files/snakefiles/Snakefile_ep06 create mode 100644 episodes/files/snakefiles/cluster_profile_ep02/config.yaml create mode 100644 episodes/files/snakefiles/cluster_profile_ep04/config.yaml delete mode 100644 episodes/snakemake_cluster.md delete mode 100644 episodes/snakemake_multiple.md delete mode 100644 episodes/snakemake_profiles.md delete mode 100644 episodes/snakemake_single.md diff --git a/config.yaml b/config.yaml index a6eef1e..13b4362 100644 --- a/config.yaml +++ b/config.yaml @@ -63,7 +63,8 @@ episodes: - 02-snakemake_on_the_cluster.md - 03-placeholders.md - 04-snakemake_and_mpi.md - +- 05-chaining_rules.md +- 06-expansion.md # Information for Learners learners: diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index 5f69b66..d4940ba 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -150,7 +150,7 @@ What does the `-p` option in the `snakemake` command above do? 
 *Hint: you can search in the text by pressing `/`, and quit back to the shell with `q`*
 
-:::::: Solution
+:::::: solution
 
 (2) Prints the shell commands that are being run to the terminal
 
diff --git a/episodes/05-chaining_rules.md b/episodes/05-chaining_rules.md
new file mode 100644
index 0000000..878ab13
--- /dev/null
+++ b/episodes/05-chaining_rules.md
@@ -0,0 +1,190 @@
---
title: "Chaining rules"
teaching: 40
exercises: 30
---

::: questions
- "How do I combine rules into a workflow?"
- "How do I make a rule with multiple inputs and outputs?"
:::

::: objectives
- "Use Snakemake to chain rules together into a workflow"
- "See how Snakemake deals with missing outputs"
:::

## A pipeline of multiple rules

We now have a rule that can generate output for any value of `p` and any number
of tasks; we just need to call Snakemake with the parameters that we want:
```bash
snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json
```

That's not exactly convenient though: to generate a full dataset we have to run
Snakemake lots of times with different output file targets. Rather than that,
let's create a rule that can generate those files for us.

Chaining rules in Snakemake is a matter of choosing filename patterns that
connect the rules.
There's something of an art to it - most times there are several options that
will work:

```python
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"
```

::: challenge

The new rule is doing no work; it's just making sure we create the file we want.
It's not worth executing on the cluster. How do we ensure it runs on the login
node only?

:::::: solution

We need to add the new rule to our `localrules`:
```python
localrules: hostname_login, generate_run_files
```

::::::

:::

Now let's run the new rule:
```bash
[ocaisa@node1 ~]$ snakemake --profile cluster_profile/ p_0.999_runs.txt
```
```output
Using profile cluster_profile/ for setting default command line arguments.
Building DAG of jobs...
Retrieving input from storage.
Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
Provided remote nodes: 3
Job stats:
job                   count
------------------  -------
amdahl_run                1
generate_run_files        1
total                     2

Select jobs to execute...
Execute 1 jobs...

[Tue Jan 30 17:39:29 2024]
rule amdahl_run:
    output: p_0.999/runs/amdahl_run_6.json
    jobid: 1
    reason: Missing output files: p_0.999/runs/amdahl_run_6.json
    wildcards: parallel_proportion=0.999, parallel_tasks=6
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, mem_mb_per_cpu=3600, runtime=2, mpi=mpiexec, tasks=6

mpiexec -n 6 amdahl --terse -p 0.999 > p_0.999/runs/amdahl_run_6.json
No SLURM account given, trying to guess.
Guessed SLURM account: def-users
Job 1 has been submitted with SLURM jobid 342 (log: /home/ocaisa/.snakemake/slurm_logs/rule_amdahl_run/342.log).
[Tue Jan 30 17:47:31 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
+ +[Tue Jan 30 17:47:31 2024] +localrule generate_run_files: + input: p_0.999/runs/amdahl_run_6.json + output: p_0.999_runs.txt + jobid: 0 + reason: Missing output files: p_0.999_runs.txt; Input files updated by another job: p_0.999/runs/amdahl_run_6.json + wildcards: parallel_proportion=0.999 + resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, mem_mb_per_cpu=3600, runtime=2 + +echo p_0.999/runs/amdahl_run_6.json done > p_0.999_runs.txt +[Tue Jan 30 17:47:31 2024] +Finished job 0. +2 of 2 steps (100%) done +Complete log: .snakemake/log/2024-01-30T173929.781106.snakemake.log +``` + +Look at the logging messages that Snakemake prints in the terminal. What has happened here? + +1. Snakemake looks for a rule to make `p_0.999_runs.txt` +1. It determines that "generate_run_files" can make this if + `parallel_proportion=0.999` +1. It sees that the input needed is therefore `p_0.999/runs/amdahl_run_6.json` +

+1. Snakemake looks for a rule to make `p_0.999/runs/amdahl_run_6.json` +1. It determines that "amdahl_run" can make this if `parallel_proportion=0.999` + and `parallel_tasks=6` +

1. Once Snakemake has reached an available input file (in this case, no input
   file is actually required), it runs both steps to get the final output

This, in a nutshell, is how we build workflows in Snakemake.

1. Define rules for all the processing steps
1. Choose `input` and `output` naming patterns that allow Snakemake to link the rules
1. Tell Snakemake to generate the final output file(s)

If you are used to writing regular scripts this takes a little
getting used to. Rather than listing steps in order of execution, you are always
**working backwards** from the final desired result. The order of operations is
determined by applying the pattern matching rules to the filenames, not by the
order of the rules in the Snakefile.

::: callout

## Outputs first?

The Snakemake approach of working backwards from the desired output to determine
the workflow is why we're putting the `output` lines first in all our rules - to
remind us that these are what Snakemake looks at first!

Many users of Snakemake, and indeed the official documentation, prefer to have
the `input` first, so in practice you should use whatever order makes sense to
you.

:::

::: callout

## `log` outputs in Snakemake

Snakemake has a dedicated rule field for outputs that are
[log files](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files),
and these are mostly treated as regular outputs except that log files are not removed if the job
produces an error. This means you can look at the log to help diagnose the error. In a real
workflow this can be very useful, but in terms of learning the fundamentals of Snakemake we'll
stick with regular `input` and `output` fields here.

:::

::: callout

## Errors are normal

Don't be disheartened if you see errors when first testing your new Snakemake
pipelines. There is a lot that can go wrong when writing a new workflow, and you'll normally need
several iterations to get things just right. One advantage of the Snakemake approach compared to
regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on
and potentially running junk calculations on partial or corrupted data. Another advantage is that
when a step fails we can safely resume from where we left off, as we'll see in the next episode.

:::

::: keypoints
- "Snakemake links rules by iteratively looking for rules that make missing inputs"
- "Rules may have multiple named inputs and/or outputs"
- "If a shell command does not yield an expected output then Snakemake will regard that as a
  failure"
:::

diff --git a/episodes/06-expansion.md b/episodes/06-expansion.md
new file mode 100644
index 0000000..80eaa54
--- /dev/null
+++ b/episodes/06-expansion.md
@@ -0,0 +1,181 @@
---
title: "Processing lists of inputs"
teaching: 50
exercises: 30
---

::: questions
- "How do I process multiple files at once?"
- "How do I combine multiple files together?"
:::

::: objectives
- "Use Snakemake to process all our runs at once"
- "Make a scalability plot that brings our results together"
:::

We created a rule that can generate a single output file, but we're not going to
create multiple rules for every output file.
We want to generate all of the run
files with a single rule if we can, and luckily Snakemake can indeed take a list
of input files:

```python
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  "p_{parallel_proportion}/runs/amdahl_run_2.json", "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"
```

That's great, but we don't want to have to list all of the files we're
interested in individually. How can we do this?

## Defining a list of task sizes to process

To do this, we can define some lists as Snakemake **global variables**.

Global variables should be added before the rules in the Snakefile.

```python
# Task sizes we wish to run
NTASK_SIZES = [1, 2, 3, 4, 5]
```

* Unlike with variables in shell scripts, we can put spaces around the `=` sign, but they are
  not mandatory.
* The lists are enclosed in square brackets and comma-separated. If you know any
  Python you'll recognise this as Python list syntax.
* A good convention is to use capitalized names for these variables, but this is not mandatory.
* Although these are referred to as variables, you can't actually change the values once the
  workflow is running, so lists defined this way are more like constants.

## Using a Snakemake rule to define a batch of outputs

Now let's update our Snakefile to leverage the new global variable to create a
list of files:
```python
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    shell:
        "echo {input} done > {output}"
```

The `expand(...)` function in this rule generates a list of filenames, by taking
the first thing in its parentheses as a template and replacing `{count}`
with all the `NTASK_SIZES`. Since there are 5 elements in the list, this will
yield 5 files we want to make. Note that we had to protect our wildcard in a
second set of curly brackets so it wouldn't be interpreted as something that
needed to be expanded.

In our current case we still rely on the file name to define the value of the
wildcard `parallel_proportion` so we can't call the rule directly; we still need
to request a specific file:

```bash
snakemake --profile cluster_profile/ p_0.999_runs.txt
```

If you don't specify a target rule name or any file names on the command line when running
Snakemake, the default is to use **the first rule** in the Snakefile as the target.

::: callout
## Rules as targets

Giving the name of a rule to Snakemake on the command line only works when that rule has
*no wildcards* in the outputs, because Snakemake has no way to know what the desired wildcards
might be. You will see the error "Target rules may not contain wildcards." This can also happen
when you don't supply any explicit targets on the command line at all, and Snakemake tries to run
the first rule defined in the Snakefile.

:::

## Rules that combine multiple inputs

Our *`generate_run_files`* rule is a rule which takes a list of input files. The
length of that list is not fixed by the rule, but can change based on
`NTASK_SIZES`.

In our workflow the final step is to take all the generated files and combine
them into a plot. To do that, we will use the Python plotting library
`matplotlib`.
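To get a flavour of what such a script involves, here is a minimal sketch of the
plotting logic. This is illustrative only: the JSON field names `nproc` and
`execution_time` are assumptions about the terse output, and the script we
actually use in this lesson is the one provided below.

```python
# Illustrative sketch, not the lesson's script: read terse amdahl JSON
# results and plot execution time against the number of ranks.
import json
import sys

import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed on a cluster
import matplotlib.pyplot as plt

output_image = sys.argv[1]   # first argument: the image file to create
result_files = sys.argv[2:]  # remaining arguments: the JSON run files

# Load every result (field names are assumptions, check the real
# output of "amdahl --terse") and sort the runs by core count
results = []
for filename in result_files:
    with open(filename) as handle:
        results.append(json.load(handle))
results.sort(key=lambda result: result["nproc"])

cores = [result["nproc"] for result in results]
times = [result["execution_time"] for result in results]

plt.plot(cores, times, marker="o")
plt.xlabel("Number of cores")
plt.ylabel("Execution time (s)")
plt.savefig(output_image)
```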
It's beyond the scope of this tutorial to write +the python script to create a final plot, so we provide you with the script as +part of this lesson (at ). You can download it with +```bash +curl -O +``` + +The script `plot_terse_amdahl_results.py` needs a command line that looks like: +```bash +python plot_terse_amdahl_results.py <1st input file> <2nd input file> ... +``` +Let's introduce that into our `generate_run_files` rule: + + +```python +rule generate_run_files: + output: "p_{parallel_proportion}_runs.txt" + input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES) + shell: + "python plot_terse_amdahl_results.py {output} {input}" +``` + +::: challenge + +This script relies on `matplotlib`, is it available as an environment module? +Add this requirement to our rule. + +:::::: solution + +```python +rule generate_run_files: + output: "p_{parallel_proportion}_scalability.jpg" + input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES) + envmodules: + "matplotlib" + shell: + "python plot_terse_amdahl_results.py {output} {input}" +``` + +:::::: + +::: + +Now we finally get to generate a scaling plot! Run the final Snakemake command +```bash +snakemake --profile cluster_profile/ p_0.999_scalability.jpg +``` + +::: challenge + +Generate the scalability plot for all values from 1 to 10 cores. + +:::::: solution + +```python +NTASK_SIZES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +``` + +:::::: + +::: + +::: challenge + +Rerun the workflow for a `p` value of 0.8 + +:::::: solution + +```bash +snakemake --profile cluster_profile/ p_0.8_scalability.jpg +``` + +:::::: + +::: + +::: keypoints +- "Use the `expand()` function to generate lists of filenames you want to combine" +- "Any `{input}` to a rule can be a variable-length list" +::: + diff --git a/episodes/amdahl_foundation.md b/episodes/amdahl_foundation.md deleted file mode 100644 index fb0a532..0000000 --- a/episodes/amdahl_foundation.md +++ /dev/null @@ -1,126 +0,0 @@ ---- -title: "Running a Parallel Application on the Cluster" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What output does the Amdahl code generate? -- Why does parallelizing the amdahl code make it faster? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Run the amdahl parallel code on the cluster -- Note what output is generated, and where it goes -- Predict the trend of execution time vs parallelism - -:::::::::::::::::::::::::::::::::::::::: - -## Introduction - -A high-performance computing cluster offers powerful -computational resources to its users, but taking advantage -of these resources is not always straightforward. The -cluster system does not work in the same way as systems -you may be more familiar with. - -The software we will use in this lesson is a model of -the kind of parallel task that is well-adapted to -high-performance computing resources. It's called "amdahl", -named for Eugene Amdahl, a famous computer scientist who -coined "Amdahl's Law", which is about the advantages and -limitations of parallelism in code execution. - -:::::::::::::::::::::::::::::::: callout - -[Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law) is -a statement about how much benefit you can expect to get by -parallelizing a computer program. 
- -The limitation arises from the fact that, in any application, -there is some fraction of the work to be done which is inherently -serial, and some fraction which is amenable to parallelization. -The law is a quantitative expression of the fact that, by -parallelizing the code, you can only ever make the parallel -part faster, you cannot reduce the execution time of the -serial part. - -As a practical matter, this means that developer effort spent -on parallelization has diminishing returns on the overall -reduction in execution time. - -:::::::::::::::::::::::::::::::::::::::: - -## The Amdahl Code - -Download it and install it, via pip. -Note that `amdahl` depends on MPI, -so make sure that's also available. - -On the HPC Carpentry cluster: - -``` shell -[user@login1 ~]$ module load OpenMPI -[user@login1 ~]$ module load Python -[user@login1 ~]$ pip install amdahl -``` - -## Running It on the Cluster - -Use the `sacct` command to see the run-time. -The run-time is also recorded in the output itself. - -``` shell -[user@login1 ~]$ nano amdahl_1.sh -``` - -``` bash -#!/bin/bash -#SBATCH -t 00:01 # max 1 minute -#SBATCH -p smnodes # max 4 cores -#SBATCH -n 1 # use 1 core -#SBATCH -o amdahl-np1.out # record result - -module load OpenMPI -module load Python - -mpirun amdahl -``` - -``` shell -[user@login1 ~]$ sbatch amdahl_1.sh -``` - -:::::::::::::::::::::::::::::: challenge - -Run the amdhal code with a few (small!) levels -of parallelism. Make a quantitative estimate of -how much faster the code will run with 3 processors -than 2. The naive estimate would be that it would run -1.5× the speed, or equivalently, that it would -complete in 2/3 the time. - -:::::::::::::::: solution - -``` shell -[user@login1 ~]$ sbatch amdahl_1.sh # serial job ~ 25 sec -[user@login1 ~]$ sbatch amdahl_2.sh # 2-way parallel ~ 20 sec -[user@login1 ~]$ sbatch amdahl_3.sh # 3-way parallel ~ 16 sec -``` - -The amdahl code runs faster with 3 processors than with -2, but the speed-up is less than 1.5×. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- The amdahl code is a model of a parallel application -- The execution speed depends on the degree of parallelism - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/amdahl_snakemake.md b/episodes/amdahl_snakemake.md deleted file mode 100644 index 4686339..0000000 --- a/episodes/amdahl_snakemake.md +++ /dev/null @@ -1,61 +0,0 @@ ---- -title: "Amdahl Parallel Runs" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- How can we collect data on Amdahl run times? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Collect systematic data on the runtime of the amdahl code - -:::::::::::::::::::::::::::::::::::::::: - -## Systematic Data Collection - -Using what we have learned so far, including Snakemake -profiles and rules, we will now compose a Snakefile -that runs the Amdahl example code over a range of -parallel widths. This workflow will generate the -data we will use in the next module to demonstrate -the diminishing returns of increasing parallelism. - -## Write a File - -Compose the Snakemake file that does what we want. - -We can put the widths in a list and iterate over -them. We will use the profile generated previously -to ensure that the jobs run on the cluster. - -## Run Snakemake - -Throw the switch! 
- -:::::::::::::::::::::::::::::: challenge - -Our example has a single paramter, the parallelism, -that we vary. How would you generalize this to arbitrary -parameters? - -:::::::::::::::: solution - -Arbitrary parameters are still finite, so you could -just generate a flat list of all the combinations, and iterate -over that. Or you could generate two lists and do a nested -loop. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- A relatively compact snakemake file collects interesting data. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/files/Snakefile_amdahl_cluster b/episodes/files/Snakefile_amdahl_cluster deleted file mode 100644 index eca4d3e..0000000 --- a/episodes/files/Snakefile_amdahl_cluster +++ /dev/null @@ -1,8 +0,0 @@ -rule one: - input: - output: 'amdahl_cluster.txt' - resources: - mpi="mpirun", - tasks=3 - shell: - "module load OpenMPI; mpirun -np {resources.tasks} amdahl > amdahl_cluster.txt" diff --git a/episodes/files/Snakefile_cluster b/episodes/files/Snakefile_cluster deleted file mode 100644 index ac60d86..0000000 --- a/episodes/files/Snakefile_cluster +++ /dev/null @@ -1,4 +0,0 @@ -rule: - input: - output: 'host.txt' - shell: 'hostname > host.txt' diff --git a/episodes/files/Snakefile_cluster_iteration b/episodes/files/Snakefile_cluster_iteration deleted file mode 100644 index 41a94a2..0000000 --- a/episodes/files/Snakefile_cluster_iteration +++ /dev/null @@ -1,24 +0,0 @@ -# -# Run a bunch of Amdahl jobs and aggregate the output. -# -WIDTHS=[1,2] -# -def getwidth(wildcards): - return wildcards.sample - -rule plot: - input: expand('{size}.out',size=WIDTHS) - output: 'done.out' - resources: - mpi="mpirun", - tasks=1 - shell: 'echo "{WIDTHS}, Done!" > done.out' -rule iterate: - input: - output: '{sample}.out' - resources: - mpi="mpirun", - tasks=getwidth - shell: - "module load OpenMPI; mpirun -np {resources.tasks} amdahl > {wildcards.sample}.out" - diff --git a/episodes/files/Snakefile_hello b/episodes/files/Snakefile_hello deleted file mode 100644 index 0b94a00..0000000 --- a/episodes/files/Snakefile_hello +++ /dev/null @@ -1,4 +0,0 @@ -rule: - input: - output: 'hello.txt' - shell: 'echo "Hello there, world!" >> hello.txt' diff --git a/episodes/files/Snakefile_iterative b/episodes/files/Snakefile_iterative deleted file mode 100644 index 8fe13f8..0000000 --- a/episodes/files/Snakefile_iterative +++ /dev/null @@ -1,13 +0,0 @@ -# -# Iterative example. -# -NAMES=['one','two','three'] -# -rule done: - input: expand('{name}.out',name=NAMES) - output: 'done.out' - shell: 'echo "Done!" > done.out' -rule iterate: - input: - output: '{sample}.out' - shell: 'echo {output} > {output}' diff --git a/episodes/files/Snakefile_tworules b/episodes/files/Snakefile_tworules deleted file mode 100644 index 66558a6..0000000 --- a/episodes/files/Snakefile_tworules +++ /dev/null @@ -1,9 +0,0 @@ -rule last: - input: 'lower.txt' - output: 'upper.txt' - shell: 'cat lower.txt | tr a-z A-Z > upper.txt' - -rule first: - input: - output: 'lower.txt' - shell: 'echo "Hello, world!" 
> lower.txt'

diff --git a/episodes/files/plot_terse_amdahl_results.py b/episodes/files/plot_terse_amdahl_results.py
new file mode 100644
index 0000000..a85425f
--- /dev/null
+++ b/episodes/files/plot_terse_amdahl_results.py
@@ -0,0 +1,49 @@
import sys
import json
import matplotlib.pyplot as plt
import numpy as np

def process_files(file_list, output="plot.jpg"):
    value_tuples = []
    for filename in file_list:
        # Open the JSON file and load the data
        with open(filename, 'r') as file:
            data = json.load(file)
        value_tuples.append((data['nproc'], data['execution_time']))

    # Sort the tuples by core count
    sorted_list = sorted(value_tuples)

    # Unzip the sorted list into two lists
    x, y = zip(*sorted_list)

    # Create a line plot
    plt.plot(x, y, marker='o')

    # Add the ideal-scaling reference line (runtime proportional to 1/cores,
    # scaled to pass through the first data point)
    x_line = np.linspace(1, max(x), 100)  # Create x values for the line
    y_line = (y[0]/x[0]) / x_line  # Calculate corresponding (scaled) y values

    plt.plot(x_line, y_line, linestyle='--', color='red', label='Perfect scaling')

    # Adding title and labels
    plt.title("Scaling plot")
    plt.xlabel("Number of cores")
    plt.ylabel("Wallclock time (seconds)")

    # Show the legend
    plt.legend()

    # Save the plot to a JPEG file
    plt.savefig(output, format='jpeg')

if __name__ == "__main__":
    # sys.argv[0] is the script name itself, so the output file is the first
    # real argument and the input files are the rest
    output = sys.argv[1]
    filenames = sys.argv[2:]

    if filenames:
        process_files(filenames, output=output)
    else:
        print("No files provided.")

diff --git a/episodes/files/queuing_config.yaml b/episodes/files/queuing_config.yaml
deleted file mode 100644
index 7db5043..0000000
--- a/episodes/files/queuing_config.yaml
+++ /dev/null
@@ -1,6 +0,0 @@
-# snakemake -j 3 --cluster "sbatch -N 1 -n {resources.tasks} -p node"
-cluster:
-  sbatch
-  --partition=node
-  --nodes=1
-  --tasks={resources.tasks}
diff --git a/episodes/files/snakefiles/Snakefile_ep01 b/episodes/files/snakefiles/Snakefile_ep01
new file mode 100644
index 0000000..32de8e2
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep01
@@ -0,0 +1,5 @@
rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"
diff --git a/episodes/files/snakefiles/Snakefile_ep02 b/episodes/files/snakefiles/Snakefile_ep02
new file mode 100644
index 0000000..6957cfb
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep02
@@ -0,0 +1,13 @@
localrules: hostname_login

rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"

rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"
diff --git a/episodes/files/snakefiles/Snakefile_ep04 b/episodes/files/snakefiles/Snakefile_ep04
new file mode 100644
index 0000000..b8c9897
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep04
@@ -0,0 +1,22 @@
localrules: hostname_login

rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"

rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
      "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/Snakefile_ep05 b/episodes/files/snakefiles/Snakefile_ep05
new file mode 100644
index 0000000..93ec684
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep05
@@ -0,0 +1,28 @@
localrules: hostname_login, generate_run_files

rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"

rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input: "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"

rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
      "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/Snakefile_ep06 b/episodes/files/snakefiles/Snakefile_ep06
new file mode 100644
index 0000000..73c17e4
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep06
@@ -0,0 +1,32 @@
NTASK_SIZES = [1, 2, 3, 4, 5]

localrules: hostname_login, generate_run_files

rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"

rule generate_run_files:
    output: "p_{parallel_proportion}_scalability.jpg"
    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    envmodules:
      "matplotlib"
    shell:
        "python plot_terse_amdahl_results.py {output} {input}"

rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
      "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/cluster_profile_ep02/config.yaml b/episodes/files/snakefiles/cluster_profile_ep02/config.yaml
new file mode 100644
index 0000000..60685b5
--- /dev/null
+++ b/episodes/files/snakefiles/cluster_profile_ep02/config.yaml
@@ -0,0 +1,6 @@
printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
  - runtime=2
diff --git a/episodes/files/snakefiles/cluster_profile_ep04/config.yaml b/episodes/files/snakefiles/cluster_profile_ep04/config.yaml
new file mode 100644
index 0000000..2fbcb60
--- /dev/null
+++ b/episodes/files/snakefiles/cluster_profile_ep04/config.yaml
@@ -0,0 +1,7 @@
printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
  - runtime=2
use-envmodules: True
diff --git a/episodes/snakemake_cluster.md b/episodes/snakemake_cluster.md
deleted file mode 100644
index c157a55..0000000
--- a/episodes/snakemake_cluster.md
+++ /dev/null
@@ -1,63 +0,0 @@
----
-title: "Snakemake and the Cluster"
-teaching: 10
-exercises: 2
----
-
-:::::::::::::::::::::::::::::: questions
-
-- How can we express a one-task cluster operation in Snakemake?
- -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a Snakefile that executes a job on the cluster -- Use MPI options to ensure the job runs in parallel - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake and the Cluster - -Snakemake has provisions for operating on an HPC cluster. - -Various command-line arguments can be provided to tell -Snakemake not to run things locally, but do run things -via the queuing system instead. - -In this lesson, we will repeat the first module, running -the admahl code on the cluster, but will use snakemake -to make it happen. - -## Write a cluster Snakemake rule file - -Open your favorite editor, do the thing. -Specify resources. Provide command line arguments -to do the cluster operations by hand. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -How can you control the degree of parallelism -of your cluster task? - -:::::::::::::::: solution - -Use the "mpi" option in the resource block of -the Snakemake rule, and specify the number of tasks. -This will be mapped to the `-n` argument of the -equivalent `sbatch` command. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake rule files can submit cluster jobs. -- There are a lot of options. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_multiple.md b/episodes/snakemake_multiple.md deleted file mode 100644 index 9967018..0000000 --- a/episodes/snakemake_multiple.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -title: "More Complicated Snakefiles" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What is a task graph? -- How does the Snakemake file express a task graph? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a multiple-rule Snakefile with dependent rules -- Translate between a task graph and rule set - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake and Workflow - -A Snakefile can contain multiple rules. In the trivial -case, there will be no dependencies between the rules, and -they can all run concurrently. - -A more interesting case is when there are dependencies between -the rules, e.g. when one rule takes the output of another rule -as its input. In this case, the dependent rule (the one that needs -another rule's output) cannot run until the rule it depends on -has completed. - -It's possible to express this relationship by means of -a task graph, whose nodes are tasks, and whose arcs are -input-output relationships between the tasks. - -A Snakemake file is textual description of a task -graph. - -## Write a multi-rule Snakemake rule file - -Open your favorite editor, do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Draw the task graph for your Snakefile. - -Given an example task graph, write a Snakefile that -implements it. - -:::::::::::::::: solution - -The rules in the snakefile are nodes in the task -graph. Two rules are connected by an arc in the task -graph if the output of one rule is the input to the -other. The task graph is directed, so the arc points -from the rule that generates a file as output to the rule -that consumes the same file as input. - -A rule with an output that no other rules consumes is -a terminal rule. 
- -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake rule files can be mapped to task graphs -- Tasks are executed as required in dependency order -- Where possible, tasks may run concurrently. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_profiles.md b/episodes/snakemake_profiles.md deleted file mode 100644 index 27c6702..0000000 --- a/episodes/snakemake_profiles.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -title: "Snakemake Profiles" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- How can we encapsulate our desired snakemake configuration? -- How do we balance non-reptition and customizability? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a Snakemake profile for the cluster -- Run the amdahl code with varying degrees of parallelism - with the cluster profile. - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake Profiles - -Snakemake has a provision for profiles, which allow users -to collect various common settings together in a special -file that snakemake examines when it runs. This lets users -avoid repetition and possible errors of omission for common -settings, and encapsulates some of the cluster complexity -we encoutered in the previous module. - -Not all settings should be in the profile. Users can -choose which ones to make static and which ones to make -adjustable. In our case, we will want to have the freedom -to choose the degree of parallelism, but most of the -cluster arguments will not change, and so can be static -in the profile. - -## Write a Profile - -Do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Write a profile that allows you to choose a -different partition, in addition to the level of -parallelism. - -:::::::::::::::: solution - -The profile files can have variables taken from -the rule file, and in particular can refer to -resources from a rule. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake profiles encapsulate cluster complexity. -- Retaining operational flexibliity is also important. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_single.md b/episodes/snakemake_single.md deleted file mode 100644 index f9a47e4..0000000 --- a/episodes/snakemake_single.md +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: "Introduction to Snakemake" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What are Snakemake rules? -- Why do Snakemake rules not always run? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a single-rule Snakefile and execute it with Snakemake -- Predict whether the rule will run or not - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake - -Snakemake is a workflow tool. It takes as input -a description of the work that you would like the computer -to do, and when run, does the work that you have -asked for. - -The description of the work takes the form of a -series of rules, written in a special format into a -Snakefile. Rules have outputs, and the Snakefile -and generated output files make up the system state. - -## Write a Snakemake rule file - -Open your favorite editor, do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Remove the output file, and run Snakemake. 
Then -run it again. Edit the output file, and run it -a third time. For which of these invocations -does Snakemake do non-trivial work? - -:::::::::::::::: solution - -The rule does not get executed the second time. The -Snakemake infrastructure is stateful, and knows that -the required outputs are up to date. - -The rule also does not get executed the third time. -The output is not the output from the rule, but the -Snakemake infrastructure doesn't know that, it only -checks the file time-stamp. Editing Snakemake-manipulated -files can get you into an inconsistent state. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake is an indirect way of running executables -- Snakemake has a notion of system state, and can be fooled. - -:::::::::::::::::::::::::::::::::::::::: From 8e0184d192bfd09b23d6751d2936fd671f646eb2 Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Mon, 29 Jan 2024 23:59:38 +0100 Subject: [PATCH 6/8] Add download link for plotting file --- episodes/06-expansion.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/06-expansion.md b/episodes/06-expansion.md index 80eaa54..6c2b49c 100644 --- a/episodes/06-expansion.md +++ b/episodes/06-expansion.md @@ -100,9 +100,9 @@ In our workflow the final step is to take all the generated files and combine them into a plot. To do that, you may have heard that some people use a python library called `matplotlib`. It's beyond the scope of this tutorial to write the python script to create a final plot, so we provide you with the script as -part of this lesson (at ). You can download it with +part of this lesson. You can download it with ```bash -curl -O +curl -O https://ocaisa.github.io/hpc-workflows/files/plot_terse_amdahl_results.py ``` The script `plot_terse_amdahl_results.py` needs a command line that looks like: From 4d7d7cb15a0e0d5ae202db5c64cf0983e7efd2b2 Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Tue, 30 Jan 2024 17:22:52 +0100 Subject: [PATCH 7/8] Tweak all episodes --- episodes/01-introduction.md | 49 +++++---- episodes/02-snakemake_on_the_cluster.md | 41 ++++---- episodes/03-placeholders.md | 5 +- episodes/04-snakemake_and_mpi.md | 133 +++++++++++++----------- episodes/05-chaining_rules.md | 45 ++++---- episodes/06-expansion.md | 43 +++++--- 6 files changed, 173 insertions(+), 143 deletions(-) diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index d4940ba..159154e 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -67,33 +67,37 @@ rule hostname_login: ## Key points about this file 1. The file is named `Snakefile` - with a capital `S` and no file extension. -1. Some lines are indented. Indents must be with space characters, not tabs. See the - setup section for how to make your text editor do this. -1. The rule definition starts with the keyword `rule` followed by the rule name, then a colon. -1. We named the rule `hostname`. You may use letters, numbers or underscores, but the rule name - must begin with a letter and may not be a keyword. +1. Some lines are indented. Indents must be with space characters, not tabs. See + the setup section for how to make your text editor do this. +1. The rule definition starts with the keyword `rule` followed by the rule name, + then a colon. +1. We named the rule `hostname_login`. You may use letters, numbers or + underscores, but the rule name must begin with a letter and may not be a + keyword. 1. 
The keywords `input`, `output`, `shell` are all followed by a colon. 1. The file names and the shell command are all in `"quotes"`. -1. The output filename is given before the input filename. In fact, Snakemake doesn't care what - order they appear in but we give the output first throughout this course. We'll see why soon. -1. In this use case there is no input file for the command so we leave this blank. +1. The output filename is given before the input filename. In fact, Snakemake + doesn't care what order they appear in but we give the output first + throughout this course. We'll see why soon. +1. In this use case there is no input file for the command so we leave this + blank. ::: -Back in the shell we'll run our new rule. At this point, if there were any missing quotes, bad -indents, etc. we may see an error. +Back in the shell we'll run our new rule. At this point, if there were any +missing quotes, bad indents, etc. we may see an error. ```bash -$ snakemake -j1 -p hostname_login.txt +$ snakemake -j1 -p hostname_login ``` ::: callout ## `bash: snakemake: command not found...` -If your shell tells you that it cannot find the command `snakemake` then we need to make the -software available somehow. In our case, this means searching for the module that we need to -load: +If your shell tells you that it cannot find the command `snakemake` then we need +to make the software available somehow. In our case, this means searching for +the module that we need to load: ```bash module spider snakemake ``` @@ -148,17 +152,16 @@ What does the `-p` option in the `snakemake` command above do? 1. Tells Snakemake to only run one process at a time 1. Prompts the user for the correct input file -*Hint: you can search in the text by pressing `/`, and quit back to the shell with `q`* +*Hint: you can search in the text by pressing `/`, and quit back to the shell +with `q`* :::::: solution - (2) Prints the shell commands that are being run to the terminal -This is such a useful thing we don't know why it isn't the default! The `-j1` option is what -tells Snakemake to only run one process at a time, and we'll stick with this for now as it -makes things simpler. The `-F` option tells Snakemake to always overwrite output files, and -we'll learn about protected outputs much later in the course. Answer 4 is a total red-herring, -as Snakemake never prompts interactively for user input. +This is such a useful thing we don't know why it isn't the default! The `-j1` +option is what tells Snakemake to only run one process at a time, and we'll +stick with this for now as it makes things simpler. Answer 4 is a total +red-herring, as Snakemake never prompts interactively for user input. :::::: ::: @@ -167,7 +170,7 @@ as Snakemake never prompts interactively for user input. - "Before running Snakemake you need to write a Snakefile" - "A Snakefile is a text file which defines a list of rules" - "Rules have inputs, outputs, and shell commands to be run" -- "You tell Snakemake what file to make and it will run the shell command defined in the - appropriate rule" +- "You tell Snakemake what file to make and it will run the shell command + defined in the appropriate rule" ::: diff --git a/episodes/02-snakemake_on_the_cluster.md b/episodes/02-snakemake_on_the_cluster.md index c7dd3ae..30eba35 100644 --- a/episodes/02-snakemake_on_the_cluster.md +++ b/episodes/02-snakemake_on_the_cluster.md @@ -35,12 +35,11 @@ Nothing to be done (all requested files are present and up to date). ``` Nothing happened! Why not? 
When it is asked to build a target, Snakemake checks -the 'last modification -time' of both the target and its dependencies. If any dependency has been -updated since the target, then the actions are re-run to update the target. -Using this approach, Snakemake knows to only rebuild the files that, either -directly or indirectly, depend on the file that changed. This is called an -_incremental build_. +the 'last modification time' of both the target and its dependencies. If any +dependency has been updated since the target, then the actions are re-run to +update the target. Using this approach, Snakemake knows to only rebuild the +files that, either directly or indirectly, depend on the file that changed. This +is called an _incremental build_. ::: callout ## Incremental Builds Improve Efficiency @@ -53,12 +52,11 @@ more efficient. ::: challenge ## Running on the cluster -We need another rule now that executes the `hostname` on the cluster. Create the -rule in your Snakefile and try to execute it on cluster with the options -`--executor slurm` to `snakemake` +We need another rule now that executes the `hostname` on the _cluster_. Create +a new rule in your Snakefile and try to execute it on cluster with the option +`--executor slurm` to `snakemake`. :::::: solution - The rule is almost identical to the previous rule save for the rule name and output file: @@ -109,14 +107,13 @@ Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log Note all the warnings that Snakemake is giving us about the fact that the rule may not be able to execute on our cluster as we may not have given enough information. Luckily for us, this actually works on our cluster and we can take -a look in the output file we asked for, `hostname_remote.txt`: +a look in the output file the new rule creates, `hostname_remote.txt`: ```bash [ocaisa@node1 ~]$ cat hostname_remote.txt ``` ```output tmpnode1.int.jetstream2.hpc-carpentry.org ``` - :::::: ::: @@ -167,8 +164,10 @@ the help of a translation table: | `--cpus-per-task` | `cpus_per_task` | number of cpus per task (in case of SMP, rather use `threads`) | | `--nodes` | `nodes` | number of nodes | -The warnings given by Snakemake hinted that we need to provide these options. -One way to do it is to provide them is as part of the Snakemake rule, e.g., +The warnings given by Snakemake hinted that we may need to provide these +options. One way to do it is to provide them is as part of the Snakemake rule +using the keyword `resources`, +e.g., ```python rule: input: ... @@ -178,8 +177,9 @@ rule: runtime: ``` and we can also use the profile to define default values for these options to -use with our project. For example, the available memory on our cluster is about -4GB per core, so we can add that to our profile: +use with our project, using the keyword `default-resources`. For example, the +available memory on our cluster is about 4GB per core, so we can add that to our +profile: ```yaml printshellcmds: True jobs: 3 @@ -189,7 +189,7 @@ default-resources: ``` :::challenge -We know that our problem runs in a very short time. Make the default length of +We know that our problem runs in a very short time. Change the default length of our jobs to two minutes for Slurm. ::::::solution @@ -227,10 +227,9 @@ Slurm executor (which is what we are doing via our new profile) this won't happen any more. So how do we force the rule to run on the login node? 
-Well, it's no surprise that some Snakemake rules perform trivial tasks where job -submission might be -overkill (e.g., less than 1 minute worth of compute time). Similar to our case, -it would be a better +Well, in the case where a Snakemake rule performs a trivial task job submission +might be overkill (e.g., less than 1 minute worth of compute time). Similar to +our case, it would be a better idea to have these rules execute locally (i.e. where the `snakemake` command is run) instead of as a job. Snakemake lets you indicate which rules should always run locally with the `localrules` keyword. Let's define `hostname_login` as a diff --git a/episodes/03-placeholders.md b/episodes/03-placeholders.md index 9dde975..8e93283 100644 --- a/episodes/03-placeholders.md +++ b/episodes/03-placeholders.md @@ -6,11 +6,9 @@ exercises: 30 ::: questions - "How do I make a generic rule?" -- "How does Snakemake decide what rule to run?" ::: ::: objectives -- "Understand the basic steps Snakemake goes through when running a workflow" - "See how Snakemake deals with some errors" ::: @@ -71,6 +69,9 @@ replace them with appropriate values - `{input}` with the full name of the input file, and `{output}` with the full name of the output file -- before running the command. +`{resources}` is also a placeholder, and we can access a named element of the +`{resources}` with the notation `{resources.runtime}` (for example). + :::keypoints - "Snakemake rules are made more generic with placeholders" - "Placeholders in the shell part of the rule are replaced with values based on the chosen diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md index 6fb3aad..0e7b41a 100644 --- a/episodes/04-snakemake_and_mpi.md +++ b/episodes/04-snakemake_and_mpi.md @@ -23,9 +23,9 @@ environment module. ::: challenge -Locate and load the `amdahl` module and then replace our `hostname_remote` rule -with a version that runs `amdahl`. (Don't worry about parallel MPI just yet, run -it with a single CPU, `mpiexec -n 1 amdahl`). +Locate and load the `amdahl` module and then _replace_ our `hostname_remote` +rule with a version that runs `amdahl`. (Don't worry about parallel MPI just +yet, run it with a single CPU, `mpiexec -n 1 amdahl`). Does your rule execute correctly? If not look through the log files to find out why? @@ -43,7 +43,7 @@ rule amdahl_run: output: "amdahl_run.txt" input: shell: - "mpiexec -n 1 amdahl > amdahl_run.txt" + "mpiexec -n 1 amdahl > {output}" ``` However, when we try to execute the rule we get an error (unless you already have a different version of `amdahl` available in your path). Snakemake @@ -68,7 +68,7 @@ Executable: amdahl ``` So, even though we loaded the module before running the workflow, our Snakemake rule didn't find the executable. That's because the Snakemake rule -is running in a clean runtime environment, and we need to somehow tell it to +is running in a clean _runtime environment_, and we need to somehow tell it to load the necessary environment module before trying to execute the rule. :::::: @@ -97,7 +97,7 @@ Adding these lines are not enough to make the rule execute however. Not only do you have to tell Snakemake what modules to load, but you also have to tell it to use environment modules in general (since the use of environment modules is considered to make your runtime environment less reproducible as the available -modules may differ from cluster to cluster). This require you to give Snakemake +modules may differ from cluster to cluster). 
This requires you to give Snakemake an additonal option ```bash snakemake --profile cluster_profile --use-envmodules amdahl_run @@ -167,7 +167,7 @@ file for every run. It would be great if we can somehow indicate in the `output` the value that we want to use for `tasks`...and have Snakemake pick that up. We could use a _wildcard_ in the `output` to allow us to -define `tasks` we wish to use. The syntax for such a wildcard looks like +define the `tasks` we wish to use. The syntax for such a wildcard looks like ```python output: "amdahl_run_{parallel_tasks}.txt" ``` @@ -187,15 +187,49 @@ input and output is what tells Snakemake how to match input files to output files. If two rules use a wildcard with the same name then Snakemake will treat them as -different entities -- rules in Snakemake are self-contained in this way. +different entities - rules in Snakemake are self-contained in this way. In the `shell` line you can reference the wildcard with `{wildcards.parallel_tasks}` ::: -We could use a wildcard in the `output` to allow us to -define `tasks` we wish to use. This could look like +## Snakemake order of operations + +We're only just getting started with some simple rules, but it's worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases: + +1. Prepares to run: + 1. Reads in all the rule definitions from the Snakefile +1. Plans what to do: + 1. Sees what file(s) you are asking it to make + 1. Looks for a matching rule by looking at the `output`s of all the rules it knows + 1. Fills in the wildcards to work out the `input` for this rule + 1. Checks that this input file (if required) is actually available +1. Runs the steps: + 1. Creates the directory for the output file, if needed + 1. Removes the old output file if it is already there + 1. Only then, runs the shell command with the placeholders replaced + 1. Checks that the command ran without errors *and* made the new output file as expected + +::: callout +## Dry-run (`-n`) mode + +It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to +run, and print them to the screen, but never actually run them. This is done with the `-n` +flag, eg: + +```bash +> $ snakemake -n ... +``` +::: + +The amount of checking may seem pedantic right now, but as the workflow gains more steps this will +become very useful to us indeed. + +## Using wildcards in our rule + +We would like to use a wildcard in the `output` to allow us to +define the number of `tasks` we wish to use. Based on what we've seen so far, +you might imagine this could look like ```python rule amdahl_run: output: "amdahl_run_{parallel_tasks}.txt" @@ -212,10 +246,12 @@ rule amdahl_run: but there are two problems with this: * The only way for Snakemake to know the value of the wildcard is for the user - to explicitly request a concrete output file: + to explicitly request a concrete output file (rather than call the rule): ```bash snakemake --profile cluster_profile amdahl_run_2.txt ``` + This is perfectly valid, as Snakemake can figure out that it has a rule that + can match that filename. * The bigger problem is that even doing that does not work, it seems we cannot use a wildcard for `tasks`: ```output @@ -223,21 +259,23 @@ but there are two problems with this: SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks. ``` -Unfortunately for us, there is no direct way for us to access the wildcards. 
The +Unfortunately for us, there is no direct way for us to access the wildcards +for `tasks`. The reason for this is that Snakemake tries to use the value of `tasks` during it's initialisation stage, which is before we know the value of the wildcard. We need to defer the determination of `tasks` to later on. This can be achieved by specifying an input function instead of a value for this -scenario. The solution then is to write a one-time use function that -has no name to manipulate Snakmake. These kinds of functions are called either -anonymous functions or lamdba functions (both mean the same thing), and are a -feature of Python (and other programming languages). +scenario. The solution then is to write a one-time use function to manipulate +Snakemake into doing this for us. Since the function is specifically for the +rule, we can use a one-line function without a name. These kinds of functions +are called either anonymous functions or lamdba functions (both mean the same +thing), and are a feature of Python (and other programming languages). To define a lambda function in python, the general syntax is as follows: ```python lambda x: x + 54 ``` -Since a function _can_ take the wildcards as arguments, we can use that to set +Since our function _can_ take the wildcards as arguments, we can use that to set the value for `tasks`: ```python rule amdahl_run: @@ -271,8 +309,9 @@ this is just as true with Snakefiles. Since our rule is now capable of generating an arbitrary number of output files things could get very crowded in our current directory. It's probably best then -to put the runs into a separate folder. We can just add the folder directly to -our `output`: +to put the runs into a separate folder to keep things tidy. We can add the +folder directly to our `output` and Snakemake will take of directory creation +for us: ```python rule amdahl_run: @@ -293,9 +332,12 @@ rule amdahl_run: ::: challenge -Create an output file (under the `run` folder) for the case where we have 6 +Create an output file (under the `runs` folder) for the case where we have 6 parallel tasks +(HINT: Remember that Snakemake needs to be able to match the requested file to +the `output` from a rule) + :::::: solution ```bash @@ -328,8 +370,9 @@ options: ``` The option we are looking for is `--terse`, and that will make `amdahl` print output in a format that is much easier to process, JSON. JSON format in a file -typically uses the file extension `.json` so let's add that option to our shell command -and change the file format of the output: +typically uses the file extension `.json` so let's add that option to our +`shell` command _and_ change the file format of the `output` to match our new +command: ```python rule amdahl_run: @@ -349,8 +392,9 @@ rule amdahl_run: ``` There was another parameter for `amdahl` that caught my eye. `amdahl` has an -option `--parallel-proportion` (or `-p`)which we might be interested in -changing. This has an impact on the values we get in our results so let's add +option `--parallel-proportion` (or `-p`) which we might be interested in +changing as it changes the behaviour of the code,and therefore has an impact on +the values we get in our results. Let's add another directory layer to our output format to reflect a particular choice for this value. 
We can use a wildcard so we done have to choose the value right away: @@ -387,43 +431,12 @@ snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json ::: -## Snakemake order of operations - -We're only just getting started with some simple rules, but it's worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases: - -1. Prepares to run: - 1. Reads in all the rule definitions from the Snakefile -1. Plans what to do: - 1. Sees what file(s) you are asking it to make - 1. Looks for a matching rule by looking at the `output`s of all the rules it knows - 1. Fills in the wildcards to work out the `input` for this rule - 1. Checks that this input file (if required) is actually available -1. Runs the steps: - 1. Creates the directory for the output file, if needed - 1. Removes the old output file if it is already there - 1. Only then, runs the shell command with the placeholders replaced - 1. Checks that the command ran without errors *and* made the new output file as expected - -::: callout -## Dry-run (`-n`) mode - -It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to -run, and print them to the screen, but never actually run them. This is done with the `-n` -flag, eg: - -```bash -> $ snakemake -n ... -``` -::: - -The amount of checking may seem pedantic right now, but as the workflow gains more steps this will -become very useful to us indeed. - ::: keypoints -- "Snakemake chooses the appropriate rule by replacing wildcards such that the output matches - the target" -- "Snakemake checks for various error conditions and will stop if it sees a problem" +- "Snakemake chooses the appropriate rule by replacing wildcards such that the + output matches the target" +- "Snakemake checks for various error conditions and will stop if it sees a + problem" ::: diff --git a/episodes/05-chaining_rules.md b/episodes/05-chaining_rules.md index 878ab13..b8cdbfb 100644 --- a/episodes/05-chaining_rules.md +++ b/episodes/05-chaining_rules.md @@ -10,15 +10,13 @@ exercises: 30 ::: ::: objectives -- "Use Snakemake to filter and then count the lines in a FASTQ file" -- "Add an RNA quantification step in the data analysis" -- "See how Snakemake deals with missing outputs" +- "" ::: ## A pipeline of multiple rules -We now have a rule that can generate output for any value of `p` and any number -tasks, we just need to call Snakemake with the parameters that we want: +We now have a rule that can generate output for any value of `-p` and any number +of tasks, we just need to call Snakemake with the parameters that we want: ```bash snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json ``` @@ -57,7 +55,8 @@ localrules: hostname_login, generate_run_files ::: -Now let's run the new rule: +Now let's run the new rule (remember we need to request the output file by name +as the `output` in our rule contains a wildcard pattern): ```bash [ocaisa@node1 ~]$ snakemake --profile cluster_profile/ p_0.999_runs.txt ``` @@ -128,7 +127,8 @@ Look at the logging messages that Snakemake prints in the terminal. What has hap This, in a nutshell, is how we build workflows in Snakemake. 1. Define rules for all the processing steps -1. Choose `input` and `output` naming patterns that allow Snakemake to link the rules +1. Choose `input` and `output` naming patterns that allow Snakemake to link the + rules 1. 
Tell Snakemake to generate the final output file(s) If you are used to writing regular scripts this takes a little @@ -157,34 +157,35 @@ you. Snakemake has a dedicated rule field for outputs that are [log files](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files), -and these are mostly treated as regular outputs except that log files are not removed if the job -produces an error. This means you can look at the log to help diagnose the error. In a real -workflow this can be very useful, but in terms of learning the fundementals of Snakemake we'll -stick with regular `input` and `output` fields here. +and these are mostly treated as regular outputs except that log files are not +removed if the job produces an error. This means you can look at the log to help +diagnose the error. In a real workflow this can be very useful, but in terms of +learning the fundamentals of Snakemake we'll stick with regular `input` and +`output` fields here. ::: - - ::: callout ## Errors are normal -Don't be disheartened if you see errors like the one above when first testing your new Snakemake -pipelines. There is a lot that can go wrong when writing a new workflow, and you'll normally need -several iterations to get things just right. One advantage of the Snakemake approach compared to -regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on -and potentially running junk calculations on partial or corrupted data. Another advantage is that -when a step fails we can safely resume from where we left off, as we'll see in the next episode. +Don't be disheartened if you see errors when first testing +your new Snakemake pipelines. There is a lot that can go wrong when writing a +new workflow, and you'll normally need several iterations to get things just +right. One advantage of the Snakemake approach compared to regular scripts is +that Snakemake fails fast when there is a problem, rather than ploughing on +and potentially running junk calculations on partial or corrupted data. Another +advantage is that when a step fails we can safely resume from where we left off. ::: ::: keypoints -- "Snakemake links rules by iteratively looking for rules that make missing inputs" +- "Snakemake links rules by iteratively looking for rules that make missing + inputs" - "Rules may have multiple named inputs and/or outputs" -- "If a shell command does not yield an expected output then Snakemake will regard that as a - failure" +- "If a shell command does not yield an expected output then Snakemake will + regard that as a failure" ::: diff --git a/episodes/06-expansion.md b/episodes/06-expansion.md index 6c2b49c..e332dbe 100644 --- a/episodes/06-expansion.md +++ b/episodes/06-expansion.md @@ -41,17 +41,20 @@ Global variables should be added before the rules in the Snakefile. NTASK_SIZES = [1, 2, 3, 4, 5] ``` -* Unlike with variables in shell scripts, we can put spaces around the `=` sign, but they are +* Unlike with variables in shell scripts, we can put spaces around the `=` sign, + but they are not mandatory. +* The lists of quoted strings are enclosed in square brackets and + comma-separated. If you know any Python you'll recognise this as Python list + syntax. +* A good convention is to use capitalized names for these variables, but this is not mandatory. -* The lists of quoted strings are enclosed in square brackets and comma-separated. If you know any - Python you'll recognise this as Python list syntax. 
-* A good convention is to use capitalized names for these variables, but this is not mandatory. -* Although these are referred to as variables, you can't actually change the values once the - workflow is running, so lists defined this way are more like constants. +* Although these are referred to as variables, you can't actually change the + values once the workflow is running, so lists defined this way are more like + constants. ## Using a Snakemake rule to define a batch of outputs -Now let's update our Snakefile to leverage the new global variable to create a +Now let's update our Snakefile to leverage the new global variable and create a list of files: ```python rule generate_run_files: @@ -76,23 +79,25 @@ to request a specific file: snakemake --profile cluster_profile/ p_0.999_runs.txt ``` -If you don't specify a target rule name or any file names on the command line when running -Snakemake, the default is to use **the first rule** in the Snakefile as the target. +If you don't specify a target rule name or any file names on the command line +when running Snakemake, the default is to use **the first rule** in the +Snakefile as the target. ::: callout ## Rules as targets -Giving the name of a rule to Snakemake on the command line only works when that rule has -*no wildcards* in the outputs, because Snakemake has no way to know what the desired wildcards -might be. You will see the error "Target rules may not contain wildcards." This can also happen -when you don't supply any explicit targets on the command line at all, and Snakemake tries to run -the first rule defined in the Snakefile. +Giving the name of a rule to Snakemake on the command line only works when that +rule has *no wildcards* in the outputs, because Snakemake has no way to know +what the desired wildcards might be. You will see the error "Target rules may +not contain wildcards." This can also happen when you don't supply any explicit +targets on the command line at all, and Snakemake tries to runthe first rule +defined in the Snakefile. ::: ## Rules that combine multiple inputs -Our *`generate_run_files`* rule is a rule which takes a list of input files. The +Our `generate_run_files` rule is a rule which takes a list of input files. The length of that list is not fixed by the rule, but can change based on `NTASK_SIZES`. @@ -174,6 +179,14 @@ snakemake --profile cluster_profile/ p_0.8_scalability.jpg ::: +::: challenge +## Bonus round + +Create a final rule that can be called directly and generates a scaling plot for +3 different values of `p`. + +::: + ::: keypoints - "Use the `expand()` function to generate lists of filenames you want to combine" - "Any `{input}` to a rule can be a variable-length list" From 81e86b3d67cc924f7a4ae24d1cf50cfb0ac36abb Mon Sep 17 00:00:00 2001 From: ocaisa Date: Mon, 8 Apr 2024 15:00:23 +0200 Subject: [PATCH 8/8] Add citation file to reference lesson material origins --- CITATION.cff | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) create mode 100644 CITATION.cff diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..09cb074 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,56 @@ +# This CITATION.cff file was generated with cffinit. +# Visit https://bit.ly/cffinit to generate yours today! + +cff-version: 1.2.0 +title: HPC Workflow Management with Snakemake +message: >- + If you use this software, please cite it using the + metadata from this file. 
type: software
authors:
  - given-names: Alan
    family-names: O'Cais
    email: alan.ocais@cecam.org
    affiliation: University of Barcelona
    orcid: 'https://orcid.org/0000-0002-8254-8752'
repository-code: 'https://github.com/carpentries-incubator/hpc-workflows'
url: 'https://carpentries-incubator.github.io/hpc-workflows/'
abstract: >-
  When using HPC resources, it's very common to need to
  carry out the same set of tasks over a set of data
  (commonly called a workflow or pipeline). In this lesson
  we will make an experiment that takes an application which
  runs in parallel and investigate its scalability. To do
  that we will need to gather data; in this case, that means
  running the application multiple times with different
  numbers of CPU cores and recording the execution time.
  Once we've done that we need to create a visualisation of
  the data to see how it compares against the ideal case.

  We could do all of this manually, but there are useful
  tools to help us manage data analysis pipelines like we
  have in our experiment. In the context of this lesson,
  we'll learn about one of those: Snakemake.
keywords:
  - HPC
  - Carpentries
  - Lesson
  - Workflow
  - Pipeline
license: CC-BY-4.0
references:
  - authors:
      - family-names: Collins
        given-names: Daniel
    title: "Getting Started with Snakemake"
    type: software
    repository-code: 'https://github.com/carpentries-incubator/workflows-snakemake/'
    url: 'https://carpentries-incubator.github.io/workflows-snakemake/'
  - authors:
      - family-names: Booth
        given-names: Tim
    title: "Snakemake for Bioinformatics"
    type: software
    repository-code: 'https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/'
    url: 'https://carpentries-incubator.github.io/snakemake-novice-bioinformatics'
\ No newline at end of file
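
As a quick sanity check of the citation metadata added above (a minimal,
hypothetical local check; `PyYAML` and the file path are assumed, and this only
tests that the file parses, not that it follows the CFF schema):

```python
import yaml  # PyYAML, assumed to be installed

# CITATION.cff is plain YAML, so it should at least load cleanly
with open("CITATION.cff") as fh:
    meta = yaml.safe_load(fh)

print(meta["title"], "/", meta["license"])
```

For full schema validation of `CITATION.cff` files, the `cffconvert` utility
can be used instead.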