From 7b77fa548373ba860f6f40a47da7b4b34a131c30 Mon Sep 17 00:00:00 2001
From: Jose
Date: Fri, 19 Jan 2024 08:29:36 -0600
Subject: [PATCH] refactoring lessons structure, moved code migration to HPC to the end of advanced Python and moved other lessons to intro to Python
---
 _episodes/05-code-migration-01.md | 532 ------------------------------
 _episodes/05-loop.md              | 422 ++++++++++++++++++++++++
 _episodes/06-files.md             | 246 ++++++++++++++
 3 files changed, 668 insertions(+), 532 deletions(-)
 delete mode 100644 _episodes/05-code-migration-01.md
 create mode 100644 _episodes/05-loop.md
 create mode 100644 _episodes/06-files.md

diff --git a/_episodes/05-code-migration-01.md b/_episodes/05-code-migration-01.md
deleted file mode 100644
index 974127f..0000000
--- a/_episodes/05-code-migration-01.md
+++ /dev/null
@@ -1,532 +0,0 @@
----
-title: Code migration to HPC systems
-teaching: 40
-exercises: 0
-questions:
-- "I would like to try Hawk, can I keep working with Jupyter Notebooks?
-  (yes, but ...)"
-- "How to access Jupyter Notebooks on Hawk?"
-- "How to transition from interactive Jupyter Notebooks to automated Python scripts?"
-objectives:
-- "Learn how to set up your work environment (transfer files, install libraries)."
-- "Understand the differences and trade-offs between Jupyter Notebooks and
-  Python scripts."
-keypoints:
-- "It is possible to use Jupyter Notebooks on Hawk via OnDemand and ssh tunnels."
-- "The recommended method to set up your work environment (installing libraries)
-  is using Anaconda virtual environments."
-- "Include the library ipykernel to make the environment reachable from Jupyter
-  Notebooks."
-- "Use Jupyter Notebooks in the early (development) stages of your project when
-  a lot of debugging is necessary. Move towards automated Python scripts in later
-  stages and submit them via SLURM job scripts."
-
----
-
-## Running Jupyter Notebooks on remote HPC systems
-
-Although the most traditional way to interact with remote HPC and cloud systems
-is through the command line (via the `ssh` and `scp` commands), some systems also
-offer graphical user interfaces for some services. Specifically, on Hawk you can
-deploy Jupyter Notebooks via [OnDemand](https://openondemand.org/) (a web portal
-that allows you to work with HPC systems interactively). The notes below provide
-instructions for both methods of access: through OnDemand and through an `ssh`
-*tunnel*.
-
-{::options parse_block_html="true" /}
-
- - -
-
-
- To access a Jupyter Notebook server via an `ssh` tunnel, you first need to
- log in to Hawk:
-
- ```
- $ ssh hawk-username@hawklogin.cf.ac.uk
- ```
-
- Once logged in, confirm that Python 3 is accessible:
-
- ```
- $ module load compiler/gnu/9
- $ module load python/3.7.0
- $ python3 --version
- ```
- ~~~
- Python 3.7.0
- ~~~
- {: .output}
-
- We need to install Jupyter Notebooks on our user account on the remote server
- (we will discuss more about installing Python packages later on):
-
- ```
- $ python3 -m venv my_venv
- $ . my_venv/bin/activate
- $ python3 -m pip install jupyterlab
- ```
-
- The installation process will try to download several dependencies from the
- internet. Be patient, it shouldn't take more than a couple of minutes.
-
- Now, this is important: the Jupyter Notebook server must be run on a *compute
- node*, so please take a look at our best practices guidelines in the
- [SCW portal](https://portal.supercomputing.wales/index.php/best-practice/).
- If the concept of login and compute nodes is still not clear at this point
- don't worry too much (but you can find out more in our
- [Supercomputing for Beginners training course](https://arcca.github.io/hpc-intro/)).
- For now, run the following command to instruct the Hawk job scheduler to run
- a Jupyter Notebook server on a compute node:
-
- ```
- $ srun -n 1 -p htc --account=scwXXXX -t 1:00:00 jupyter-lab --ip=0.0.0.0
- ```
-
- ~~~
- http://ccs1015:8888/?token=77777add13ab93a0c408c287a630249c2dba93efdd3fae06
-  or http://127.0.0.1:8888/?token=77777add13ab93a0c408c287a630249c2dba93efdd3fae06
- ~~~
- {: .output}
-
- Next, open a new terminal and create an `ssh` tunnel using the node and port
- obtained in the previous step (e.g. ccs1015:8888):
-
- ```
- $ ssh -L8888:ccs1015:8888 hawk-username@hawklogin.cf.ac.uk
- ```
-
- You should be able to navigate to http://localhost:8888 in your web browser
- (use the token provided in the output if needed). 
If everything went well, you
- should see something like:
-
- Jupyter Lab Home
-
- Here you should be able to access the files stored in your Hawk user account.
-
- -
- 1. Go to the [ARCCA OnDemand](https://arcondemand.cardiff.ac.uk) portal (this
-    requires access to the [Cardiff University VPN](https://intranet.cardiff.ac.uk/staff/supporting-your-work/it-support/wireless-and-remote-access/off-campus-access/virtual-private-network-vpn)).
- 2. Enter your details: Hawk username and password. Once logged in you should
-    land on a page with useful information including the usual Message of the
-    Day (MOTD) commonly seen when logging in to Hawk via the terminal.
-
-    | | |
-    |:--------:|:--------:|
-    | ARCCA OnDemand login page | ARCCA landing page |
-    | | |
-
- 3. Go to "Interactive Apps" in the top menu and select "Jupyter Notebook/Lab".
-    This will bring you to a form where you can specify for how much time the
-    session is required, number of CPUs, partition, etc. You can also choose
-    to receive an email once the session is ready for you. Click the *Launch*
-    button to submit the request.
-
-    | | |
-    |:--------:|:--------:|
-    | ARCCA OnDemand login page | OnDemand JN requirements |
-    | | |
-
- 4. After submission your request will be placed in the queue and will wait
-    for resources, hopefully for a short period, but this *depends on the
-    number of cores as well as time requested*, so please be patient. At this
-    point you can close the OnDemand website and come back at a later point
-    to check progress, or wait for the email notification if the option was
-    selected.
-
-    Once your request is granted you should be able to see a *Running* message,
-    the amount of resources granted and the time remaining.
-
-    Click *Connect to Jupyter* to launch Jupyter in a new web browser tab.
-
-    | | |
-    |:--------:|:--------:|
-    | OnDemand JN queued | OnDemand JN running |
-    | | |
-
- 5. You should now have the familiar interface of Jupyter Notebooks in front of
-    you. It will show the documents and directories in your user account on
-    Hawk. 
To create a new Notebook, go to the dropdown menu *New* on the right - side and click on *Python 3 (ipykernel)*. A new tab will open with a new - notebook ready for you to start working. - - | | | - |:--------:|:--------:| - | OnDemand JN main | OnDemand JN new notebook | - | | | - -
-
-
-
-{% include links.md %}
-
-
-## Copying data
-
-To keep working on Hawk with the Notebooks we have written locally on our
-desktop computer, we need to transfer them over. Depending on our platform we
-can do this in a couple of ways:
-
-{::options parse_block_html="true" /}
-
- - -
-
- On Windows you can use [MobaXterm](https://mobaxterm.mobatek.net/) to - transfer files to Hawk from your local computer. - - | | | - |:----------------:|:----------------:| - |
Open SCP session on MobaXterm

Click on **Session** to open the different connection methods available in MobaXterm | Enter details to start SFTP session on MobaXterm
Select **SFTP** and enter the Remote Host (*hawklogin.cf.ac.uk*) and your **Hawk username** | - | | | - |
Open SCP session on MobaXterm

Locate the directory in your local computer and drag and drop to the remote server on the right pane. || - -
- -
- macOS and Linux provide the command `scp -r`, which can be used to recursively
- copy your work directory over to your home directory on Hawk:
-
- ```
- $ scp -r arcca-python hawk-username@hawklogin.cf.ac.uk:/home/hawk-username
- python-novice-inflammation-code.zip  100% 7216  193.0KB/s  00:00
- Untitled.ipynb                       100%   67KB 880.2KB/s 00:00
- inflammation.png                     100%   13KB 315.6KB/s 00:00
- argv_list.py                         100%   42    0.4KB/s  00:00
- readings_08.py                       100% 1097   10.6KB/s  00:00
- readings_09.py                       100%  851   24.8KB/s  00:00
- ```
-
- -
- With OnDemand you can also upload files to and download files from Hawk. In
- this example we will upload the directory with the Jupyter Notebooks we have
- created so far. Go to `Files` and select the directory where you wish to
- upload the files (our home directory in this case), then select `Upload`
- and locate the directory in your local computer. Once uploaded, the files
- should be available on Hawk:
-
- | | |
- |:----------------:|:----------------:|
- | Go to Files and select directory where to upload. | Click Upload |
- | Locate files in your local computer. | Locate files in your local computer. |
-
- -
-
-
-## Set up your Jupyter Notebook work environment
-
-Depending on how you started your Jupyter Notebook you should have access to
-some default packages. But these are not guaranteed to be the same (this also
-applies to the version of Python) between the OnDemand and the `ssh` tunnel
-methods. Moreover, it is unlikely that the remote HPC system would provide
-every package you need by default.
-
-### Installing Python libraries
-
-The **recommended approach** is to create a conda virtual environment with an
-`environment.yml` file which includes a list of all packages (and versions)
-needed for your work. This file can be created and used on your local computer
-and then copied to Hawk to reproduce the same environment. An example file is:
-
-~~~
-name: my-conda-env
-dependencies:
-  - python=3.9.2
-  - numpy
-  - pandas
-  - ipykernel
-~~~
-{: .language-yaml}
-
-The package `ipykernel` is required here to make the environment reachable from
-Jupyter Notebooks. You can find more about creating an `environment.yml` file
-in the [Anaconda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
-
-On Hawk you need to first load Anaconda:
-
-```
-$ module load anaconda/2020.02
-$ source activate
-$ which conda
-```
-
-~~~
-/apps/languages/anaconda/2020.02/bin/conda
-~~~
-{: .output}
-
-And then proceed to install the virtual environment:
-
-```
-$ conda env create -f environment.yml
-```
-
-~~~
-...
-libffi-3.3          | 50 KB    | ##################################### | 100%
-numpy-1.21.2        | 23 KB    | ##################################### | 100%
-pandas-1.3.4        | 9.6 MB   | ##################################### | 100%
-mkl-2021.4.0        | 142.6 MB | ##################################### | 100%
-six-1.16.0          | 18 KB    | ##################################### | 100%
-Preparing transaction: done
-Verifying transaction: done
-Executing transaction: done
-#
-# To activate this environment, use
-#
-#     $ conda activate my-conda-env
-#
-# To deactivate an active environment, use
-#
-#     $ conda deactivate
-~~~
-{: .output}
-
-We can then follow the instructions printed at the end of the installation
-process to activate our environment:
-
-```
-$ conda activate my-conda-env
-$ which python
-```
-
-~~~
-~/.conda/envs/my-conda-env/bin/python
-~~~
-{: .output}
-
-We can further confirm the version of Python used by the environment:
-
-```
-$ python --version
-```
-
-~~~
-Python 3.9.2
-~~~
-{: .output}
-
-To deactivate the environment and return to the default Python provided by the
-system (or the loaded module):
-
-```
-$ conda deactivate
-$ python --version
-```
-
-~~~
-Python 3.7.6
-~~~
-{: .output}
-
-### Using an Anaconda virtual environment from Jupyter Notebooks
-
-We can access our newly installed Anaconda environment from Jupyter Notebooks
-on OnDemand. For this, create a new session and when the resources are granted
-click on `Connect to Jupyter`. On Jupyter Lab you might be asked to choose
-which kernel to start; if so, select the name given to your virtual environment
-(*my-conda-env* in this example):
-
-
-
-If another kernel is loaded by default, you can still change it by clicking on
-the top right corner of your Notebook; a similar menu should appear:
-
- | | |
- |:----------------:|:----------------:|
- | Change JN kernel manually. 
| Select JN kernel from menu |
-
-If all goes well you should be able to confirm the Python versions and path, as
-well as the location of the installed libraries:
-
-
-
-At this point you should have all the packages required to continue working on
-Hawk as if you were working on your local computer.
-
-
-### A more efficient approach
-
-During these examples we have been requesting only 1 CPU when we launch our
-Jupyter Notebook and that, hopefully, has caused our request to be fulfilled
-fairly quickly. However, there will be a point where 1 CPU is no longer enough
-(maybe the application has become more complex or there is more data to analyse,
-and memory requirements have increased). At that point you can modify the
-requirements and increase the number of CPUs, memory, time or devices (GPUs).
-One point to keep in mind when increasing requirements is that this will impact
-the time it takes for the system scheduler to deliver your request and allocate
-you the resources: the higher the requirements, the longer it will take.
-
-When the time spent waiting in the queue becomes excessive it is worth considering
-moving away from the Jupyter Notebook workflow towards a more traditional
-Python script approach (**recommended for HPC systems**). The main difference
-between them is that while Jupyter Notebooks are ideal for the development
-stages of a project (since you can test things out in real time and debug if
-needed), the Python script approach is better suited for the production stages
-where the need for supervision and debugging is reduced. Python scripts also
-have the advantage, on HPC systems, of being able to be queued for resources
-and automatically executed when these are granted without you needing to be
-logged in to the system.
-
-So, how do we actually convert our Jupyter Notebook to a Python script? 
-Fortunately, Jupyter Notebook developers thought of this requirement and added
-a convenient export method to the Notebooks (the menus might be different
-depending on whether Jupyter Notebooks or Jupyter Lab was launched from OnDemand):
-
-| | |
-|:----------------:|:----------------:|
-| ![Download Notebook from JN as Py script](../fig/jupyter-notebook-download-as-python-script.png) | ![Download Notebook from JL as Py script](../fig/jupyter-lab-download-as-python-script.png) |
-| Download from Jupyter Notebook | Download from Jupyter Lab |
-
-After choosing an appropriate name and saving the file, we should have a Python
-script (a text file) with entries similar to:
-
-~~~
-#!/usr/bin/env python
-# coding: utf-8
-
-# In[1]:
-
-
-3 + 5 * 8
-
-
-# In[2]:
-
-
-weight_kg=60
-
-~~~
-{: .language-python}
-
-Notice the `# In[X]:` comments that mark the position of the corresponding cells
-in our Jupyter Notebook and are kept for reference. We can keep them in place;
-they won't cause any trouble as they are included as comments (due to the initial
-`#`). But if we wanted to remove them we could do it by hand or, more
-efficiently, by using the command line tool `sed` to find and delete lines that
-contain the characters `# In` and to delete empty lines (`^` is used by `sed`
-to indicate the beginning of a line and `$` to indicate the end):
-
-**(from this point onwards we move away from Jupyter Notebooks and start typing
-commands on a terminal connected to Hawk)**
-
-```
-sed -e '/# In/d' -e '/^$/d' lesson1.py > lesson1_cleaned.py
-```
-
-This will produce the `lesson1_cleaned.py` file with entries similar to:
-
-~~~
-#!/usr/bin/env python
-# coding: utf-8
-3 + 5 * 8
-weight_kg=60
-~~~
-{: .language-python}
-
-Now that we have our Python script, we need to create an additional file (job
-script) to place it in the queue (submit the job). 
Make sure to remove any
-commands from the Python script that might need additional confirmation or user
-interaction, as you won't be able to provide it with this method of execution.
-The following is the content of a job script that is equivalent to how we have
-been requesting resources through OnDemand:
-
-~~~
-#!/bin/bash
-
-#SBATCH -J test              # job name
-#SBATCH -n 1                 # number of tasks needed
-#SBATCH -p htc               # partition
-#SBATCH --time=01:00:00      # time limit
-#SBATCH -A scwXXXX           # account number
-
-set -eu
-
-module purge
-module load anaconda/2020.02
-module list
-
-# Load conda
-source activate
-
-# Load our environment
-conda activate my-conda-env
-
-which python
-python --version
-
-python my-python-script.py
-
-~~~
-{: .language-bash}
-
-To submit (put in the queue) the above script, on Hawk:
-
-```
-$ sbatch my-job-script.sh
-```
-
-~~~
-Submitted batch job 25859860
-~~~
-{: .output}
-
-You can query the current state of this job with:
-
-```
-$ squeue -u $USER
-```
-
-~~~
-   JOBID PARTITION  NAME      USER ST  TIME  NODES NODELIST(REASON)
-25860025       htc  test c.xxxxxxx PD  0:00      1 ccs3004
-~~~
-{: .output}
-
-This particular job might not spend a long time in the queue and the above output
-might not show it, but on completion there should be a `slurm-<jobid>.out` file
-created in the current directory with the output produced by our script.
-
-There is a lot more to know about working with HPC systems and job schedulers;
-once you are ready to go this route, take a look at our documentation and
-training courses on these topics:
-
-- [**Supercomputing for Beginners**](https://arcca.github.io/hpc-intro): Why
-  use HPC? Accessing systems, using SLURM, loading software, file transfer and
-  optimising resources.
-- [**Slurm: Advanced Topics**](https://arcca.github.io/slurm_advanced_topics):
-  Additional material to interface with HPC more effectively.
-
-> ## Need help?
-> -> If during the above steps you found any issues or have doubts regarding your -> specific work environment, get in touch with us at arcca-help@cardiff.ac.uk. -{: .callout} - - - -{% include links.md %} diff --git a/_episodes/05-loop.md b/_episodes/05-loop.md new file mode 100644 index 0000000..b5812f3 --- /dev/null +++ b/_episodes/05-loop.md @@ -0,0 +1,422 @@ +--- +title: Repeating Actions with Loops +teaching: 30 +exercises: 0 +questions: +- "How can I do the same operations on many different values?" +objectives: +- "Explain what a `for` loop does." +- "Correctly write `for` loops to repeat simple calculations." +- "Trace changes to a loop variable as the loop runs." +- "Trace changes to other variables as they are updated by a `for` loop." +keypoints: +- "Use `for variable in sequence` to process the elements of a sequence one at a time." +- "The body of a `for` loop must be indented." +- "Use `len(thing)` to determine the length of something that contains other values." +--- + +In the episode about visualizing data, +we wrote Python code that plots values of interest from our first +inflammation dataset (`inflammation-01.csv`), which revealed some suspicious features in it. + +![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day +period.](../fig/03-loop_2_0.png) + +We have a dozen data sets right now and potentially more on the way if Dr. Maverick +can keep up their surprisingly fast clinical trial rate. We want to create plots for all of +our data sets with a single statement. To do that, we'll have to teach the computer how to +repeat things. + +An example task that we might want to repeat is accessing numbers in a list, +which we +will do by printing each number on a line of its own. + +~~~ +odds = [1, 3, 5, 7] +~~~ +{: .language-python} + +In Python, a list is basically an ordered collection of elements, and every +element has a unique number associated with it --- its index. 
This means that +we can access elements in a list using their indices. +For example, we can get the first number in the list `odds`, +by using `odds[0]`. One way to print each number is to use four `print` statements: + +~~~ +print(odds[0]) +print(odds[1]) +print(odds[2]) +print(odds[3]) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +~~~ +{: .output} + +This is a bad approach for three reasons: + +1. **Not scalable**. Imagine you need to print a list that has hundreds + of elements. It might be easier to type them in manually. + +2. **Difficult to maintain**. If we want to decorate each printed element with an + asterisk or any other character, we would have to change four lines of code. While + this might not be a problem for small lists, it would definitely be a problem for + longer ones. + +3. **Fragile**. If we use it with a list that has more elements than what we initially + envisioned, it will only display part of the list's elements. A shorter list, on + the other hand, will cause an error because it will be trying to display elements of the + list that do not exist. 
+ +~~~ +odds = [1, 3, 5] +print(odds[0]) +print(odds[1]) +print(odds[2]) +print(odds[3]) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +~~~ +{: .output} + +~~~ +--------------------------------------------------------------------------- +IndexError Traceback (most recent call last) + in () + 3 print(odds[1]) + 4 print(odds[2]) +----> 5 print(odds[3]) + +IndexError: list index out of range +~~~ +{: .error} + +Here's a better approach: a [for loop]({{ page.root }}/reference.html#for-loop) + +~~~ +odds = [1, 3, 5, 7] +for num in odds: + print(num) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +~~~ +{: .output} + +This is shorter --- certainly shorter than something that prints every number in a +hundred-number list --- and more robust as well: + +~~~ +odds = [1, 3, 5, 7, 9, 11] +for num in odds: + print(num) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +9 +11 +~~~ +{: .output} + +The improved version uses a [for loop]({{ page.root }}/reference.html#for-loop) +to repeat an operation --- in this case, printing --- once for each thing in a sequence. +The general form of a loop is: + +~~~ +for variable in collection: + # do things using variable, such as print +~~~ +{: .language-python} + +Using the odds example above, the loop might look like this: + +![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and +then being printed](../fig/05-loops_image_num.png) + +where each number (`num`) in the variable `odds` is looped through and printed one number after +another. The other numbers in the diagram denote which loop cycle the number was printed in (1 +being the first loop cycle, and 6 being the final loop cycle). + +We can call the [loop variable]({{ page.root }}/reference.html#loop-variable) anything we like, but +there must be a colon at the end of the line starting the loop, and we must indent anything we +want to run inside the loop. Unlike many other languages, there is no command to signify the end +of the loop body (e.g. 
`end for`); what is indented after the `for` statement belongs to the loop. + + +> ## What's in a name? +> +> +> In the example above, the loop variable was given the name `num` as a mnemonic; +> it is short for 'number'. +> We can choose any name we want for variables. We might just as easily have chosen the name +> `banana` for the loop variable, as long as we use the same name when we invoke the variable inside +> the loop: +> +> ~~~ +> odds = [1, 3, 5, 7, 9, 11] +> for banana in odds: +> print(banana) +> ~~~ +> {: .language-python} +> +> ~~~ +> 1 +> 3 +> 5 +> 7 +> 9 +> 11 +> ~~~ +> {: .output} +> +> It is a good idea to choose variable names that are meaningful, otherwise it would be more +> difficult to understand what the loop is doing. +{: .callout} + +Here's another loop that repeatedly updates a variable: + +~~~ +length = 0 +names = ['Curie', 'Darwin', 'Turing'] +for value in names: + length = length + 1 +print('There are', length, 'names in the list.') +~~~ +{: .language-python} + +~~~ +There are 3 names in the list. +~~~ +{: .output} + +It's worth tracing the execution of this little program step by step. +Since there are three names in `names`, +the statement on line 4 will be executed three times. +The first time around, +`length` is zero (the value assigned to it on line 1) +and `value` is `Curie`. +The statement adds 1 to the old value of `length`, +producing 1, +and updates `length` to refer to that new value. +The next time around, +`value` is `Darwin` and `length` is 1, +so `length` is updated to be 2. +After one more update, +`length` is 3; +since there is nothing left in `names` for Python to process, +the loop finishes +and the `print` function on line 5 tells us our final answer. + +Note that a loop variable is a variable that is being used to record progress in a loop. 
+It still exists after the loop is over, +and we can re-use variables previously defined as loop variables as well: + +~~~ +name = 'Rosalind' +for name in ['Curie', 'Darwin', 'Turing']: + print(name) +print('after the loop, name is', name) +~~~ +{: .language-python} + +~~~ +Curie +Darwin +Turing +after the loop, name is Turing +~~~ +{: .output} + +Note also that finding the length of an object is such a common operation +that Python actually has a built-in function to do it called `len`: + +~~~ +print(len([0, 1, 2, 3])) +~~~ +{: .language-python} + +~~~ +4 +~~~ +{: .output} + +`len` is much faster than any function we could write ourselves, +and much easier to read than a two-line loop; +it will also give us the length of many other things that we haven't met yet, +so we should always use it when we can. + +> ## From 1 to N +> +> Python has a built-in function called `range` that generates a sequence of numbers. `range` can +> accept 1, 2, or 3 parameters. +> +> * If one parameter is given, `range` generates a sequence of that length, +> starting at zero and incrementing by 1. +> For example, `range(3)` produces the numbers `0, 1, 2`. +> * If two parameters are given, `range` starts at +> the first and ends just before the second, incrementing by one. +> For example, `range(2, 5)` produces `2, 3, 4`. +> * If `range` is given 3 parameters, +> it starts at the first one, ends just before the second one, and increments by the third one. +> For example, `range(3, 10, 2)` produces `3, 5, 7, 9`. 
+>
+> Using `range`, write a loop that prints the first 3 natural numbers:
+>
+> ~~~
+> 1
+> 2
+> 3
+> ~~~
+> {: .output}
+>
+> > ## Solution
+> > ~~~
+> > for number in range(1, 4):
+> >     print(number)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+
+
+> ## Understanding the loops
+>
+> Given the following loop:
+> ~~~
+> word = 'oxygen'
+> for char in word:
+>     print(char)
+> ~~~
+> {: .language-python}
+>
+> How many times is the body of the loop executed?
+>
+> * 3 times
+> * 4 times
+> * 5 times
+> * 6 times
+>
+> > ## Solution
+> >
+> > The body of the loop is executed 6 times, once for each of the six
+> > characters in `'oxygen'`.
+> >
+> {: .solution}
+{: .challenge}
+
+
+
+> ## Computing Powers With Loops
+>
+> Exponentiation is built into Python:
+>
+> ~~~
+> print(5 ** 3)
+> ~~~
+> {: .language-python}
+>
+> ~~~
+> 125
+> ~~~
+> {: .output}
+>
+> Write a loop that calculates the same result as `5 ** 3` using
+> multiplication (and without exponentiation).
+>
+> > ## Solution
+> > ~~~
+> > result = 1
+> > for number in range(0, 3):
+> >     result = result * 5
+> > print(result)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+> ## Summing a list
+>
+> Write a loop that calculates the sum of elements in a list
+> by adding each element and printing the final value,
+> so `[124, 402, 36]` prints 562.
+>
+> > ## Solution
+> > ~~~
+> > numbers = [124, 402, 36]
+> > summed = 0
+> > for num in numbers:
+> >     summed = summed + num
+> > print(summed)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+> ## Computing the Value of a Polynomial
+>
+> The built-in function `enumerate` takes a sequence (e.g. a [list]({{ page.root }}/04-lists/)) and
+> generates a new sequence of the same length. Each element of the new sequence is a pair composed
+> of the index (0, 1, 2,...) 
and the value from the original sequence: +> +> ~~~ +> for idx, val in enumerate(a_list): +> # Do something using idx and val +> ~~~ +> {: .language-python} +> +> The code above loops through `a_list`, assigning the index to `idx` and the value to `val`. +> +> Suppose you have encoded a polynomial as a list of coefficients in +> the following way: the first element is the constant term, the +> second element is the coefficient of the linear term, the third is the +> coefficient of the quadratic term, etc. +> +> ~~~ +> x = 5 +> coefs = [2, 4, 3] +> y = coefs[0] * x**0 + coefs[1] * x**1 + coefs[2] * x**2 +> print(y) +> ~~~ +> {: .language-python} +> +> ~~~ +> 97 +> ~~~ +> {: .output} +> +> Write a loop using `enumerate(coefs)` which computes the value `y` of any +> polynomial, given `x` and `coefs`. +> +> > ## Solution +> > ~~~ +> > y = 0 +> > for idx, coef in enumerate(coefs): +> > y = y + coef * x**idx +> > ~~~ +> > {: .language-python} +> {: .solution} +{: .challenge} + +{% include links.md %} diff --git a/_episodes/06-files.md b/_episodes/06-files.md new file mode 100644 index 0000000..05fcb34 --- /dev/null +++ b/_episodes/06-files.md @@ -0,0 +1,246 @@ +--- +title: Analyzing Data from Multiple Files +teaching: 20 +exercises: 0 +questions: +- "How can I do the same operations on many different files?" +objectives: +- "Use a library function to get a list of filenames that match a wildcard pattern." +- "Write a `for` loop to process multiple files." +keypoints: +- "Use `glob.glob(pattern)` to create a list of files whose names match a pattern." +- "Use `*` in a pattern to match zero or more characters, and `?` to match any single character." +--- + +As a final piece to processing our inflammation data, we need a way to get a list of all the files +in our `data` directory whose names start with `inflammation-` and end with `.csv`. 
+The following library will help us to achieve this: +~~~ +import glob +~~~ +{: .language-python} + +The `glob` library contains a function, also called `glob`, +that finds files and directories whose names match a pattern. +We provide those patterns as strings: +the character `*` matches zero or more characters, +while `?` matches any one character. +We can use this to get the names of all the CSV files in the current directory: + +~~~ +print(glob.glob('inflammation*.csv')) +~~~ +{: .language-python} + +~~~ +['inflammation-05.csv', 'inflammation-11.csv', 'inflammation-12.csv', 'inflammation-08.csv', +'inflammation-03.csv', 'inflammation-06.csv', 'inflammation-09.csv', 'inflammation-07.csv', +'inflammation-10.csv', 'inflammation-02.csv', 'inflammation-04.csv', 'inflammation-01.csv'] +~~~ +{: .output} + +As these examples show, +`glob.glob`'s result is a list of file and directory paths in arbitrary order. +This means we can loop over it +to do something with each filename in turn. +In our case, +the "something" we want to do is generate a set of plots for each file in our inflammation dataset. 
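The `?` wildcard mentioned above works the same way; here is a quick sketch (the exact result depends on which files are present in the current directory):

```python
import glob

# '?' matches exactly one character, so this pattern matches two-digit
# names such as inflammation-01.csv, but not inflammation-100.csv.
print(sorted(glob.glob('inflammation-??.csv')))
```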
+ +If we want to start by analyzing just the first three files in alphabetical order, we can use the +`sorted` built-in function to generate a new sorted list from the `glob.glob` output: + +~~~ +import glob +import numpy +import matplotlib.pyplot + +filenames = sorted(glob.glob('inflammation*.csv')) +filenames = filenames[0:3] +for filename in filenames: + print(filename) + + data = numpy.loadtxt(fname=filename, delimiter=',') + + fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) + + axes1 = fig.add_subplot(1, 3, 1) + axes2 = fig.add_subplot(1, 3, 2) + axes3 = fig.add_subplot(1, 3, 3) + + axes1.set_ylabel('average') + axes1.plot(numpy.mean(data, axis=0)) + + axes2.set_ylabel('max') + axes2.plot(numpy.max(data, axis=0)) + + axes3.set_ylabel('min') + axes3.plot(numpy.min(data, axis=0)) + + fig.tight_layout() + matplotlib.pyplot.show() +~~~ +{: .language-python} + +~~~ +inflammation-01.csv +~~~ +{: .output} + +![Output from the first iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the first dataset.]( +../fig/03-loop_49_1.png) + +~~~ +inflammation-02.csv +~~~ +{: .output} + +![Output from the second iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the second +dataset.](../fig/03-loop_49_3.png) + +~~~ +inflammation-03.csv +~~~ +{: .output} + +![Output from the third iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the third +dataset.](../fig/03-loop_49_5.png) + + +The plots generated for the second clinical trial file look very similar to the plots for +the first file: their average plots show similar "noisy" rises and falls; their maxima plots +show exactly the same linear rise and fall; and their minima plots show similar staircase +structures. 
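As a reminder of what each of those per-day curves is, here is the `axis=0` convention on a tiny made-up array (not the real inflammation data):

```python
import numpy

# Rows are patients, columns are days: axis=0 collapses the patient axis,
# leaving one value per day, which is what each line plot above shows.
data = numpy.array([[0, 1, 2],
                    [1, 2, 1],
                    [2, 1, 0]])
print(numpy.mean(data, axis=0))  # per-day average
print(numpy.max(data, axis=0))   # per-day maximum
print(numpy.min(data, axis=0))   # per-day minimum
```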
+
+The third dataset shows much noisier average and maxima plots that are far less suspicious than
+the first two datasets; however, the minima plot shows that the third dataset's minima are
+consistently zero across every day of the trial. If we produce a heat map for the third data file
+we see the following:
+
+![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
+the entire dataset, and the last patient only has zero values over the 40-day study.
+](../fig/inflammation-03-imshow.svg)
+
+We can see that there are zero values sporadically distributed across all patients and days of the
+clinical trial, suggesting that there were potential issues with data collection throughout the
+trial. In addition, we can see that the last patient in the study didn't have any inflammation
+flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis!
+
+
+> ## Plotting Differences
+>
+> Plot the difference between the average inflammations reported in the first and second datasets
+> (stored in `inflammation-01.csv` and `inflammation-02.csv`, respectively),
+> i.e., the difference between the leftmost plots of the first two figures.
+> +> > ## Solution +> > ~~~ +> > import glob +> > import numpy +> > import matplotlib.pyplot +> > +> > filenames = sorted(glob.glob('inflammation*.csv')) +> > +> > data0 = numpy.loadtxt(fname=filenames[0], delimiter=',') +> > data1 = numpy.loadtxt(fname=filenames[1], delimiter=',') +> > +> > fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) +> > +> > matplotlib.pyplot.ylabel('Difference in average') +> > matplotlib.pyplot.plot(numpy.mean(data0, axis=0) - numpy.mean(data1, axis=0)) +> > +> > fig.tight_layout() +> > matplotlib.pyplot.show() +> > ~~~ +> > {: .language-python} +> {: .solution} +{: .challenge} + +> ## Generate Composite Statistics +> +> Use each of the files once to generate a dataset containing values averaged over all patients: +> +> ~~~ +> filenames = glob.glob('inflammation*.csv') +> composite_data = numpy.zeros((60,40)) +> for filename in filenames: +> # sum each new file's data into composite_data as it's read +> # +> # and then divide the composite_data by number of samples +> composite_data = composite_data / len(filenames) +> ~~~ +> {: .language-python} +> +> Then use pyplot to generate average, max, and min for all patients. 
+>
+> > ## Solution
+> > ~~~
+> > import glob
+> > import numpy
+> > import matplotlib.pyplot
+> >
+> > filenames = glob.glob('inflammation*.csv')
+> > composite_data = numpy.zeros((60,40))
+> >
+> > for filename in filenames:
+> >     data = numpy.loadtxt(fname=filename, delimiter=',')
+> >     composite_data = composite_data + data
+> >
+> > composite_data = composite_data / len(filenames)
+> >
+> > fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))
+> >
+> > axes1 = fig.add_subplot(1, 3, 1)
+> > axes2 = fig.add_subplot(1, 3, 2)
+> > axes3 = fig.add_subplot(1, 3, 3)
+> >
+> > axes1.set_ylabel('average')
+> > axes1.plot(numpy.mean(composite_data, axis=0))
+> >
+> > axes2.set_ylabel('max')
+> > axes2.plot(numpy.max(composite_data, axis=0))
+> >
+> > axes3.set_ylabel('min')
+> > axes3.plot(numpy.min(composite_data, axis=0))
+> >
+> > fig.tight_layout()
+> >
+> > matplotlib.pyplot.show()
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+After spending some time investigating the heat map and statistical plots, as well as
+doing the above exercises to plot differences between datasets and to generate composite
+patient statistics, we gain some insight into the twelve clinical trial datasets.
+
+The datasets appear to fall into two categories:
+
+* seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
+  but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
+* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
+  data collection issues such as sporadic missing values and even an unsuitable candidate
+  making it into the clinical trial.
+
+In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
+`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
+Armed with this information, we confront Dr. Maverick about the suspicious data and
+duplicated files.
+
+Dr. 
Maverick confesses that they fabricated the clinical data after they found out
+that the initial trial suffered from a number of issues, including unreliable data recording and
+poor participant selection. They created fake data to prove their drug worked, and when we asked
+for more data they tried to generate more fake datasets, as well as throwing in the original
+poor-quality dataset a few times to try to make all the trials seem a bit more "realistic".
+
+Congratulations! We've investigated the inflammation data and proven that the datasets have been
+synthetically generated.
+
+But it would be a shame to throw away the synthetic datasets that have taught us so much
+already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
+how to program.
+
+{% include links.md %}