From 9530f10df759cf37af82dd47c009b5c29558272a Mon Sep 17 00:00:00 2001
From: Tim Booth
Date: Mon, 29 Jul 2024 16:09:57 +0100
Subject: [PATCH] Add to the insturctor notes, as per #64

---
 instructors/instructor-notes.md | 15 +++++++++++++++
 learners/prereqs.md             | 16 ++++++++--------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md
index 44f4c61..36226eb 100644
--- a/instructors/instructor-notes.md
+++ b/instructors/instructor-notes.md
@@ -93,12 +93,27 @@ evaluation feature of Snakemake, until we are ready to properly introduce and un
 
 ## Episode 03 - Chaining rules
 
+### Illustrating the wildcard matching process
+
 There is a figure to illustrate the way Snakemake finds rules by wildcard matching and then tracks
 back until it runs out of rule matches and finds a file that it already has. You may find that
 showing an animated version of this is helpful, in which case
 [there are some slides here](
 https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/files/9299078/wildcard_demo.pptx).
 
+### Named inputs versus lists of inputs
+
+In this course, we introduce named inputs and outputs before lists of inputs and outputs. This
+results in shell commands like:
+
+`"kallisto quant -i {input.index} -o kallisto.{wildcards.sample} {input.fq1} {input.fq2}"`
+Rather than the less readable version with a simple list of inputs:
+`"kallisto quant -i {input[0]} -o kallisto.{wildcards.sample} {input[1]} {input[2]}"`
+Later, we introduce lists of inputs in tandem with the `expand()` function. Of course it is
+possible to have a list of outputs, but this is uncommon and not needed to solve any of the
+challenges in this course. In fact, introducing lists of outputs may confuse learners as they
+may think it is possible for a rule to yield a variable number of outputs in the manner of the old
+`dynamic()` behaviour, which is not a thing.
diff --git a/learners/prereqs.md b/learners/prereqs.md
index b3495b6..9f6884f 100644
--- a/learners/prereqs.md
+++ b/learners/prereqs.md
@@ -58,18 +58,18 @@ $ ls -ltrF --directory --human-readable foo_*
 * Know how to look up the meaning of such options in the `man` page
 * See that `foo_*` is a wildcard (aka. glob) pattern that may match multiple file names
 
-And these specific tasks occur within the course:
+And these specific concepts occur within the course:
 
 * Use `cd` to change the current working directory
 * Use `&` to run a command as a background job
-* Use `>` and `<` to redirect output to, and input from files
+* Use `>` and `<` to redirect terminal output to, and input from, files
 * Use `|` to redirect (pipe) output directly between commands
 * Use `wget` to fetch a file from a web URL
 * Create and remove directories with `mkdir` and `rmdir`
 * Use `rm` and `rm -r` to remove files and non-empty directories
 * Use `ln -s` to create a symbolic link
 * Use `cat` and `head` to show the contents of a text file in the terminal
-* Run a Bash script file containing shell commands
+* Run a Bash script file containing shell commands: `bash scriptname.sh`
 
 ## Biology and bioinformatics
 
@@ -87,14 +87,14 @@ overlap, but more typically the fragments loaded in the machine are longer, and
 middle of the fragment are never read.
 
 The data from the machine is saved into a file format named [FASTQ](
-https://en.wikipedia.org/wiki/FASTQ_format) which contains both the sequence (in *ATCG* letters)
+https://en.wikipedia.org/wiki/FASTQ_format) which contains both the sequence (of *ATCG* letters)
 and the per-base quality score, which is an estimate of the error rate recorded by the machine as
-it runs. Average quality drops off as the sequencer reads further into the fragment. It is
-generally desirable to discard low-quality data, either by trimming off bad bases or discarding
-the whole read.
+it runs. Average quality drops off as the sequencer reads further into the fragment. It is
+generally desirable to discard low-quality data, either by trimming off bad bases from the end or
+discarding the whole read.
 
 In the most common case, the next step in DNA analysis after quality filtering is to map the
-reads on a known reference genome, which is the job of [an aligner](
+reads onto a known reference genome, which is the job of [an aligner](
 https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-read_sequence_alignment).
 This finds the most likely location where the raw read came from, allowing for a certain degree of
 sequence mismatch. The second most common idea is to perform a [de-novo assembly](