Add to the instructor notes, as per #64
tbooth committed Jul 29, 2024
1 parent 04990bc commit 9530f10
Showing 2 changed files with 23 additions and 8 deletions.
15 changes: 15 additions & 0 deletions instructors/instructor-notes.md
@@ -93,12 +93,27 @@ evaluation feature of Snakemake, until we are ready to properly introduce and un

## Episode 03 - Chaining rules

### Illustrating the wildcard matching process

There is a figure to illustrate the way Snakemake finds rules by wildcard matching and then tracks
back until it runs out of rule matches and finds a file that it already has. You may find that
showing an animated version of this is helpful, in which case
[there are some slides here](
https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/files/9299078/wildcard_demo.pptx).

### Named inputs versus lists of inputs

In this course, we introduce named inputs and outputs before lists of inputs and outputs. This
results in shell commands like:

`"kallisto quant -i {input.index} -o kallisto.{wildcards.sample} {input.fq1} {input.fq2}"`

rather than the less readable version with a simple list of inputs:

`"kallisto quant -i {input[0]} -o kallisto.{wildcards.sample} {input[1]} {input[2]}"`
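
To convince yourself (or learners) that the two forms expand to the same command, the substitution
can be imitated in plain Python. This is only a rough sketch: it uses `str.format()` with a
`namedtuple` as a stand-in for the objects Snakemake supplies, it is not Snakemake's own code, and
the file names and sample name are hypothetical.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical stand-ins for the "input" and "wildcards" objects that are
# available inside a shell template. A namedtuple allows both attribute
# access (input.fq1) and positional access (input[1]).
Inputs = namedtuple("Inputs", ["index", "fq1", "fq2"])

# Assumed example values, not taken from the course data.
inputs = Inputs(index="transcriptome.idx", fq1="ref1_1.fq", fq2="ref1_2.fq")
wildcards = SimpleNamespace(sample="ref1")

# The two command templates quoted in the notes above.
named = "kallisto quant -i {input.index} -o kallisto.{wildcards.sample} {input.fq1} {input.fq2}"
positional = "kallisto quant -i {input[0]} -o kallisto.{wildcards.sample} {input[1]} {input[2]}"

named_cmd = named.format(input=inputs, wildcards=wildcards)
positional_cmd = positional.format(input=inputs, wildcards=wildcards)

print(named_cmd)
print(named_cmd == positional_cmd)
```

Both templates produce the same string, so the choice between them is about readability: the named
form documents which file plays which role, while the positional form depends on the order in which
the inputs were declared.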

Later, we introduce lists of inputs in tandem with the `expand()` function. It is, of course,
possible to have a list of outputs, but this is uncommon and not needed to solve any of the
challenges in this course. In fact, introducing lists of outputs may confuse learners, as they
may think it is possible for a rule to yield a variable number of outputs in the manner of the old
`dynamic()` behaviour, which is not supported.
16 changes: 8 additions & 8 deletions learners/prereqs.md
@@ -58,18 +58,18 @@ $ ls -ltrF --directory --human-readable foo_*
* Know how to look up the meaning of such options in the `man` page
* See that `foo_*` is a wildcard (aka. glob) pattern that may match multiple file names

-And these specific tasks occur within the course:
+And these specific concepts occur within the course:

* Use `cd` to change the current working directory
* Use `&` to run a command as a background job
-* Use `>` and `<` to redirect output to, and input from files
+* Use `>` and `<` to redirect terminal output to, and input from, files
* Use `|` to redirect (pipe) output directly between commands
* Use `wget` to fetch a file from a web URL
* Create and remove directories with `mkdir` and `rmdir`
* Use `rm` and `rm -r` to remove files and non-empty directories
* Use `ln -s` to create a symbolic link
* Use `cat` and `head` to show the contents of a text file in the terminal
-* Run a Bash script file containing shell commands
+* Run a Bash script file containing shell commands: `bash scriptname.sh`

## Biology and bioinformatics

@@ -87,14 +87,14 @@ overlap, but more typically the fragments loaded in the machine are longer, and
middle of the fragment are never read.

The data from the machine is saved into a file format named [FASTQ](
-https://en.wikipedia.org/wiki/FASTQ_format) which contains both the sequence (in *ATCG* letters)
+https://en.wikipedia.org/wiki/FASTQ_format) which contains both the sequence (of *ATCG* letters)
and the per-base quality score, which is an estimate of the error rate recorded by the machine as
-it runs. Average quality drops off as the sequencer reads further into the fragment. It is
-generally desirable to discard low-quality data, either by trimming off bad bases or discarding
-the whole read.
+it runs. Average quality drops off as the sequencer reads further into the fragment. It is
+generally desirable to discard low-quality data, either by trimming off bad bases from the end or
+discarding the whole read.

In the most common case, the next step in DNA analysis after quality filtering is to map the
-reads on a known reference genome, which is the job of [an aligner](
+reads onto a known reference genome, which is the job of [an aligner](
https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-read_sequence_alignment).
This finds the most likely location where the raw read came from, allowing for a certain degree
of sequence mismatch. The second most common idea is to perform a [de-novo assembly](