update for loops; fix #23

cambiotraining · Jul 18, 2024 · ce22145
1 parent 81c615c
Showing 1 changed file with 88 additions and 19 deletions.
diff --git a/materials/02-programming/03-loops.md b/materials/02-programming/03-loops.md
@@ -17,11 +17,11 @@ title: "Loops"
 As such they are key to productivity improvements through automation. 
 Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).
 
-Going back to our `molecules` directory, suppose we wanted to use our `count_atoms.sh` script to get the number of atoms in each of our molecules' PDB files. 
-We know how to run the script for a single file: 
+Going back to our `molecules` directory, suppose we wanted to count the number of atoms in each of our molecules' PDB files. 
+As a reminder, here is the command to do this for one of our files:
 
 ```bash
-bash   count_atoms.sh   cubane.pdb
+cat cubane.pdb | grep "ATOM" | wc -l
 ```
 
 Of course, we could manually then repeat this for each of our molecule files: `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb`, `propane.pdb`.  
@@ -31,31 +31,29 @@ We’ll use a loop to solve this problem, but first let’s look at the general
 ```bash
 for thing in list_of_things
 do
-  operation_using $thing  # Indentation within the loop is not required, but aids legibility
+  # Indentation within the loop is not required, but aids legibility
+  operation_using ${thing}
 done
 ```
 
-Let's create a new script called `count_loop.sh` (using `nano` or `gedit`), where we apply this idea to our example: 
+Taking our command above to count atoms, let's create a new script called `count_loop.sh`, where we apply this idea: 
 
 ```bash
 #!/bin/bash
 
 for filename in cubane.pdb ethane.pdb methane.pdb
 do
   # count the number of lines containing the word "ATOM"
-  natoms=$(cat ${filename} | grep "ATOM" | wc -l)
-
-  # print a message to the user
-  echo "The number of atoms in ${filename} is: ${natoms}"
+  cat ${filename} | grep "ATOM" | wc -l
 done
 ```
 
-If we run this script (`bash count_loop.sh`), we get the expected output: 
+If we ran this script (`bash count_loop.sh`), we would get the following output: 
 
 ```
-The number of atoms in cubane.pdb is: 16
-The number of atoms in ethane.pdb is: 8
-The number of atoms in methane.pdb is: 5
+16
+8
+5
 ```
 
 When the shell sees the keyword `for`, it knows to repeat a command (or group of commands) once for each item in a list. 
@@ -65,12 +63,42 @@ Inside the loop, we call for the variable's value `$filename`.
 
 In our example, at each iteration of the _for loop_, the variable `$filename` stored a different value, cycling through `cubane.pdb`, `ethane.pdb` and finally `methane.pdb`. 
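
To see the variable changing from one iteration to the next, here is a minimal sketch (not part of the course scripts) that prints the value of `filename` each time the loop runs. It only uses `echo`, so the PDB files do not need to exist for it to work.

```shell
#!/bin/bash

for filename in cubane.pdb ethane.pdb methane.pdb
do
  # show the value the variable holds during this iteration
  echo "Current value of filename: ${filename}"
done

# once the loop has finished, the variable keeps the last value it was given
echo "After the loop, filename is still: ${filename}"
```

The loop body runs three times, once per item in the list, and the final `echo` shows that `filename` is left holding `methane.pdb`.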
 
+At the moment, the output of our script does not tell us which file each count belongs to. 
+We can use some of the programming techniques we've already learned to make the output more informative.
+Here is an example of a modified script: 
+
+```bash
+#!/bin/bash
+
+for filename in cubane.pdb ethane.pdb methane.pdb
+do
+  # count the number of lines containing the word "ATOM"
+  # store the result inside a variable 'natoms'
+  natoms=$(cat ${filename} | grep "ATOM" | wc -l)
+
+  # print a message to the user
+  echo "The number of atoms in ${filename} is: ${natoms}"
+done
+```
+
+If we run this script (`bash count_loop.sh`), we get a more informative output than before: 
+
+```
+The number of atoms in cubane.pdb is: 16
+The number of atoms in ethane.pdb is: 8
+The number of atoms in methane.pdb is: 5
+```
 
 ::: {.callout-note}
 - Do not use spaces, quotes, or wildcard characters such as '*' or '?' in filenames, as they complicate variable expansion.
 - Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
 :::
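
As a rough illustration of why spaces in filenames cause trouble, the sketch below counts how many words the shell sees when a variable is expanded without and with quotes. The file name `my molecule.pdb`, its two made-up `ATOM` lines and the temporary directory are assumptions chosen for the demonstration, not part of the course data.

```shell
#!/bin/bash

# work in a throwaway directory so no real files are touched
workdir=$(mktemp -d)
cd "${workdir}"

# a hypothetical file whose name contains a space, with two made-up lines
printf 'ATOM 1\nATOM 2\n' > "my molecule.pdb"
filename="my molecule.pdb"

# unquoted: the shell splits the value at the space into two separate words
unquoted_words=0
for word in ${filename}
do
  unquoted_words=$((unquoted_words + 1))
done

# quoted: the value is kept together as a single word
quoted_words=0
for word in "${filename}"
do
  quoted_words=$((quoted_words + 1))
done

echo "Words seen without quotes: ${unquoted_words}"
echo "Words seen with quotes: ${quoted_words}"

# with quotes the file is found and its lines containing "ATOM" can be counted
natoms=$(cat "${filename}" | grep "ATOM" | wc -l)
echo "The number of atoms in ${filename} is: ${natoms}"
```

Without quotes, the shell would look for two files called `my` and `molecule.pdb`, neither of which exists. Quoting works around the problem, but avoiding spaces in filenames in the first place is simpler.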
 
+:::{.callout-exercise}
+
+See the [loop multiple files](#wildcard-loops-exr) and [searching for text](#grep-loop-exr) exercises to test your knowledge.
+:::
+
 
 ## Scripts within scripts
 
@@ -117,11 +145,15 @@ bash count_atoms.sh methane.pdb
 So, it wouldn't actually run the command within the loop, but rather tell us what would have been run. 
 This is a good practice when building scripts that include a _for loop_, because it lets us check that our code is all correct. 
 
+:::{.callout-exercise}
+Try the [dry run exercise](#dry-run-exr) to test your knowledge.
+:::
+
 
 ## Exercises
 
-:::{.callout-exercise}
-#### Multiple files
+:::{.callout-exercise #wildcard-loops-exr}
+#### Looping multiple files
 {{< level 1 >}}
 
 Can you think of a way to improve our `count_loop.sh` script, so that every file gets processed, but without having to type all the individual files' names?
@@ -142,7 +174,7 @@ done
 :::
 :::
 
-:::{.callout-exercise}
+:::{.callout-exercise #grep-loop-exr}
 #### Searching for text
 {{< level 2 >}}
 
@@ -158,7 +190,9 @@ Write a _for loop_ to search for several variants:
 
 - Use `nano` to create a new script called `count_variants.sh`. 
 - Adapt the commands shown above to write a _for loop_ to search for the variants "Alpha", "Delta" and "Omicron".
-- Bonus: print a message indicating which of the variants is being searched for. 
+- Print a message indicating which of the variants is being searched for. 
+
+**Bonus (optional):** modify the script to output the results to a CSV file called `variant_counts.csv` with the name of the variant as the first column and the count as the second column. 
 
 ::: {.callout-answer collapse=true}
 We can write the following script: 
@@ -189,13 +223,48 @@ The number of Delta samples is: 75
 The number of Omicron samples is: 93
 ```
 
+The bonus task asked us to modify the code so that it writes the results to a file. 
+We can use the redirection operators (`>` / `>>`) to achieve this: 
+
+```bash
+#!/bin/bash
+
+# before the loop starts, we create a new file containing the column names
+echo "variant,count" > variant_counts.csv
+
+for variant in Alpha Delta Omicron
+do
+  # count the variant occurrence across all files - save the result in a variable called "n"
+  n=$(cat *_variants.csv | grep "${variant}" | wc -l)
+
+  # we append the variant name and its count to our file, each separated by a comma
+  echo "${variant},${n}" >> variant_counts.csv
+done
+```
+
+If we run this modified script (`bash count_variants.sh`), nothing is printed to the terminal. 
+However, a file is created in our directory, which contains the results of our analysis: 
+
+```bash
+cat variant_counts.csv
+```
+
+```
+variant,count
+Alpha,38
+Delta,75
+Omicron,93
+```
+
+Because this is a CSV file, we could easily import it into a data analysis package (e.g. R or Python) to produce some visualisations. 
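
The difference between the two redirection operators can be checked with a small sketch. It assumes a temporary file created with `mktemp` (so nothing in the working directory is overwritten) and uses placeholder rows that are not real results.

```shell
#!/bin/bash

# a throwaway file for the demonstration
outfile=$(mktemp)

# '>' creates the file, or empties it first if it already exists,
# so running this line twice still leaves a single header line
echo "variant,count" > "${outfile}"
echo "variant,count" > "${outfile}"

# '>>' adds a new line to the end of the file each time it is used
echo "Example1,1" >> "${outfile}"
echo "Example2,2" >> "${outfile}"

# show the contents and count the lines
cat "${outfile}"
nlines=$(cat "${outfile}" | wc -l)
echo "Number of lines in the file: ${nlines}"
```

The file ends up with three lines: one header and two appended rows. This is why the script writes the header with `>` before the loop and uses `>>` inside it: using `>` inside the loop would leave only the last variant in the file.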
+
 :::
 :::
 
 
-:::{.callout-exercise}
+:::{.callout-exercise #dry-run-exr}
 #### Dry run
-{{< level 3 >}}
+{{< level 2 >}}
 
 Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures.  
 Modify the following code to run as a _dry-run_ (i.e. not actually execute the command inside the loop) and try to understand what would happen: