From 7b77fa548373ba860f6f40a47da7b4b34a131c30 Mon Sep 17 00:00:00 2001
From: Jose
Date: Fri, 19 Jan 2024 08:29:36 -0600
Subject: [PATCH] refactoring lessons structure, moved code migration to HPC to the end of advanced Python and moved other lessons to intro to Python
---
 _episodes/05-code-migration-01.md | 532 ------------------------------
 _episodes/05-loop.md              | 422 ++++++++++++++++++++++++
 _episodes/06-files.md             | 246 ++++++++++++++
 3 files changed, 668 insertions(+), 532 deletions(-)
 delete mode 100644 _episodes/05-code-migration-01.md
 create mode 100644 _episodes/05-loop.md
 create mode 100644 _episodes/06-files.md

diff --git a/_episodes/05-code-migration-01.md b/_episodes/05-code-migration-01.md
deleted file mode 100644
index 974127f..0000000
--- a/_episodes/05-code-migration-01.md
+++ /dev/null
@@ -1,532 +0,0 @@
----
-title: Code migration to HPC systems
-teaching: 40
-exercises: 0
-questions:
-- "I would like to try Hawk, can I keep working with Jupyter Notebooks?
-  (yes, but ...)"
-- "How to access Jupyter Notebooks on Hawk?"
-- "How to transition from interactive Jupyter Notebooks to automated Python scripts?"
-objectives:
-- "Learn how to set up your work environment (transfer files, install libraries)."
-- "Understand the differences and trade-offs between Jupyter Notebooks and
-  Python scripts."
-keypoints:
-- "It is possible to use Jupyter Notebooks on Hawk via OnDemand and ssh tunnels."
-- "The recommended method to set up your work environment (installing libraries)
-  is using Anaconda virtual environments."
-- "Include the library ipykernel to make the environment reachable from Jupyter
-  Notebooks."
-- "Use Jupyter Notebooks in the early (development) stages of your project when
-  a lot of debugging is necessary. Move towards automated Python scripts in later
-  stages and submit them via SLURM job scripts."
-
----
-
-## Running Jupyter Notebooks on remote HPC systems
-
-Although the most traditional way to interact with remote HPC and cloud systems
-is through the command line (via the `ssh` and `scp` commands), some systems also
-offer graphical user interfaces for some services. Specifically, on Hawk you can
-deploy Jupyter Notebooks via [OnDemand](https://openondemand.org/) (a web portal
-that allows you to work with HPC systems interactively). The notes below provide
-instructions for both methods of access: through OnDemand and through an `ssh`
-*tunnel*.
-
-{::options parse_block_html="true" /}
-
- - -
-
-
- To access a Jupyter Notebook server via an `ssh` tunnel, you first need to
- log in to Hawk:
-
- ```
- $ ssh hawk-username@hawklogin.cf.ac.uk
- ```
-
- Once logged in, confirm that Python 3 is accessible:
-
- ```
- $ module load compiler/gnu/9
- $ module load python/3.7.0
- $ python3 --version
- ```
- ~~~
- Python 3.7.0
- ~~~
- {: .output}
-
- We need to install Jupyter Notebooks on our user account on the remote server
- (we will discuss more about installing Python packages later on):
-
- ```
- $ python3 -m venv my_venv
- $ . my_venv/bin/activate
- $ python3 -m pip install jupyterlab
- ```
-
- The installation process will try to download several dependencies from the
- internet. Be patient, it shouldn't take more than a couple of minutes.
-
- Now, this is important: the Jupyter Notebook server must be run on a *compute
- node*, so please take a look at our best practices guidelines in the
- [SCW portal](https://portal.supercomputing.wales/index.php/best-practice/).
- If the concept of login and compute nodes is still not clear at this point
- don't worry too much (but you can find out more in our
- [Supercomputing for Beginners training course](https://arcca.github.io/hpc-intro/)).
- For now, run the following command to instruct the Hawk job scheduler to run
- a Jupyter Notebook server on a compute node:
-
- ```
- $ srun -n 1 -p htc --account=scwXXXX -t 1:00:00 jupyter-lab --ip=0.0.0.0
- ```
-
- ~~~
- http://ccs1015:8888/?token=77777add13ab93a0c408c287a630249c2dba93efdd3fae06
-  or http://127.0.0.1:8888/?token=77777add13ab93a0c408c287a630249c2dba93efdd3fae06
- ~~~
- {: .output}
-
- Next, open a new terminal and create an `ssh` tunnel using the node and port
- obtained in the previous step (e.g. ccs1015:8888):
-
- ```
- $ ssh -L8888:ccs1015:8888 hawk-username@hawklogin.cf.ac.uk
- ```
-
- You should be able to navigate to http://localhost:8888 in your web browser
- (use the token provided in the output if needed). 
If everything went well, you
- should see something like:
-
- Jupyter Lab Home
-
- Here you should be able to access the files stored in your Hawk user account.
-
- -
- 1. Go to the [ARCCA OnDemand](https://arcondemand.cardiff.ac.uk) portal (this
-    requires access to the [Cardiff University VPN](https://intranet.cardiff.ac.uk/staff/supporting-your-work/it-support/wireless-and-remote-access/off-campus-access/virtual-private-network-vpn)).
- 2. Enter your details: Hawk username and password. Once logged in you should
-    land on a page with useful information including the usual Message of the
-    Day (MOTD) commonly seen when logging in to Hawk via the terminal.
-
-    | | |
-    |:--------:|:--------:|
-    | ARCCA OnDemand login page | ARCCA landing page |
-    | | |
-
- 3. Go to "Interactive Apps" in the top menu and select "Jupyter Notebook/Lab".
-    This will bring you to a form where you can specify for how much time the
-    session is required, number of CPUs, partition, etc. You can also choose
-    to receive an email once the session is ready for you. Click the *Launch*
-    button to submit the request.
-
-    | | |
-    |:--------:|:--------:|
-    | ARCCA OnDemand login page | OnDemand JN requirements |
-    | | |
-
- 4. After submission your request will be placed in the queue and will wait
-    for resources, hopefully for a short period, but this *depends on the
-    number of cores as well as time requested*, so please be patient. At this
-    point you can close the OnDemand website and come back at a later point
-    to check progress, or wait for the email notification if the option was
-    selected.
-
-    Once your request is granted you should be able to see a *Running* message,
-    the amount of resources granted and the time remaining.
-
-    Click *Connect to Jupyter* to launch Jupyter in a new web browser tab.
-
-    | | |
-    |:--------:|:--------:|
-    | OnDemand JN queued | OnDemand JN running |
-    | | |
-
- 5. You should now have the familiar interface of Jupyter Notebooks in front of
-    you. It will show the documents and directories in your user account on
-    Hawk. 
To create a new Notebook, go to the dropdown menu *New* on the right - side and click on *Python 3 (ipykernel)*. A new tab will open with a new - notebook ready for you to start working. - - | | | - |:--------:|:--------:| - | OnDemand JN main | OnDemand JN new notebook | - | | | - -
-
-
-
-{% include links.md %}
-
-
-## Copying data
-
-To keep working on Hawk with the Notebooks we have written locally on our
-desktop computer, we need to transfer them over. Depending on our platform we
-can do this in a couple of ways:
-
-{::options parse_block_html="true" /}
-
- - -
-
- On Windows you can use [MobaXterm](https://mobaxterm.mobatek.net/) to - transfer files to Hawk from your local computer. - - | | | - |:----------------:|:----------------:| - |
Open SCP session on MobaXterm

Click on **Session** to open the different connection methods available in MobaXterm | Enter details to start SFTP session on MobaXterm
Select **SFTP** and enter the Remote Host (*hawklogin.cf.ac.uk*) and your **Hawk username** | - | | | - |
Open SCP session on MobaXterm

Locate the directory in your local computer and drag and drop to the remote server on the right pane. || - -
- -
- macOS and Linux provide the command `scp -r`, which can be used to recursively
- copy your work directory over to your home directory on Hawk:
-
- ```
- $ scp -r arcca-python hawk-username@hawklogin.cf.ac.uk:/home/hawk-username
- python-novice-inflammation-code.zip  100% 7216  193.0KB/s  00:00
- Untitled.ipynb                       100%   67KB 880.2KB/s 00:00
- inflammation.png                     100%   13KB 315.6KB/s 00:00
- argv_list.py                         100%   42    0.4KB/s  00:00
- readings_08.py                       100% 1097   10.6KB/s  00:00
- readings_09.py                       100%  851   24.8KB/s  00:00
- ```
-
- -
- With OnDemand you can also upload files to and download files from Hawk. In
- this example we will upload the directory with the Jupyter Notebooks we have
- created so far. Go to `Files` and select the directory where you wish to
- upload the files (our home directory in this case), then select `Upload`
- and locate the directory in your local computer. Once uploaded, the files
- should be available on Hawk:
-
- | | |
- |:----------------:|:----------------:|
- | Go to Files and select directory where to upload. | Click Upload |
- | Locate files in your local computer. | Locate files in your local computer. |
-
- -
-
-
-## Set up your Jupyter Notebook work environment
-
-Depending on how you started your Jupyter Notebook you should have access to
-some default packages. But these are not guaranteed to be the same (this also
-applies to the version of Python) between the OnDemand and the `ssh` tunnel
-methods. Moreover, it is unlikely that the remote HPC system would provide
-every package you need by default.
-
-### Installing Python libraries
-
-The **recommended approach** is to create a conda virtual environment with an
-`environment.yml` file which includes a list of all packages (and versions)
-needed for your work. This file can be created and used on your local computer
-and then copied to Hawk to reproduce the same environment. An example file is:
-
-~~~
-name: my-conda-env
-dependencies:
-  - python=3.9.2
-  - numpy
-  - pandas
-  - ipykernel
-~~~
-{: .language-yaml}
-
-The package `ipykernel` is required here to make the environment reachable from
-Jupyter Notebooks. You can find more about creating an `environment.yml` file
-in the [Anaconda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
-
-On Hawk you need to first load Anaconda:
-
-```
-$ module load anaconda/2020.02
-$ source activate
-$ which conda
-```
-
-~~~
-/apps/languages/anaconda/2020.02/bin/conda
-~~~
-{: .output}
-
-And then proceed to install the virtual environment:
-
-```
-$ conda env create -f environment.yml
-```
-
-~~~
-...
-libffi-3.3          | 50 KB    | ##################################### | 100%
-numpy-1.21.2        | 23 KB    | ##################################### | 100%
-pandas-1.3.4        | 9.6 MB   | ##################################### | 100%
-mkl-2021.4.0        | 142.6 MB | ##################################### | 100%
-six-1.16.0          | 18 KB    | ##################################### | 100%
-Preparing transaction: done
-Verifying transaction: done
-Executing transaction: done
-#
-# To activate this environment, use
-#
-#     $ conda activate my-conda-env
-#
-# To deactivate an active environment, use
-#
-#     $ conda deactivate
-~~~
-{: .output}
-
-We can then follow the instructions printed at the end of the installation
-process to activate our environment:
-
-```
-$ conda activate my-conda-env
-$ which python
-```
-
-~~~
-~/.conda/envs/my-conda-env/bin/python
-~~~
-{: .output}
-
-We can further confirm the version of Python used by the environment:
-
-```
-$ python --version
-```
-
-~~~
-Python 3.9.2
-~~~
-{: .output}
-
-To deactivate the environment and return to the default Python provided by the
-system (or the loaded module):
-
-```
-$ conda deactivate
-$ python --version
-```
-
-~~~
-Python 3.7.6
-~~~
-{: .output}
-
-### Using an Anaconda virtual environment from Jupyter Notebooks
-
-We can access our newly installed Anaconda environment from Jupyter Notebooks
-on OnDemand. For this, create a new session and when the resources are granted
-click on `Connect to Jupyter`. On Jupyter Lab you might be asked to choose
-which kernel to start; if so, select the name given to your virtual environment
-(*my-conda-env* in this example):
-
-
-
-If another kernel is loaded by default, you can still change it by clicking on
-the top right corner of your Notebook; a similar menu should appear:
-
- | | |
- |:----------------:|:----------------:|
- | Change JN kernel manually. 
| Select JN kernel from menu |
-
-If all goes well you should be able to confirm the Python versions and path, as
-well as the location of the installed libraries:
-
-
-
-At this point you should have all the packages required to continue working on
-Hawk as if you were working on your local computer.
-
-
-### A more efficient approach
-
-During these examples we have been requesting only 1 CPU when we launch our
-Jupyter Notebook and that, hopefully, has caused our request to be fulfilled
-fairly quickly. However, there will be a point where 1 CPU is no longer enough
-(maybe the application has become more complex or there is more data to analyse,
-and memory requirements have increased). At that point you can modify the
-requirements and increase the number of CPUs, memory, time or devices (GPUs).
-One point to keep in mind when increasing requirements is that this will impact
-the time it takes for the system scheduler to deliver your request and allocate
-you the resources: the higher the requirements, the longer it will take.
-
-When the time spent waiting in the queue becomes excessive it is worth considering
-moving away from the Jupyter Notebook workflow towards a more traditional
-Python script approach (**recommended for HPC systems**). The main difference
-between them is that while Jupyter Notebooks are ideal for the development
-stages of a project (since you can test things out in real time and debug if
-needed), the Python script approach is better suited for the production stages
-where the need for supervision and debugging is reduced. Python scripts also
-have the advantage, on HPC systems, of being able to be queued for resources
-and automatically executed when these are granted without you needing to be
-logged in to the system.
-
-So, how do we actually convert our Jupyter Notebook to a Python script? 
-Fortunately, Jupyter Notebook developers thought of this requirement and added
-a convenient export method to the Notebooks (the menus might be different
-depending on whether Jupyter Notebooks or Jupyter Lab was launched from OnDemand):
-
-| | |
-|:----------------:|:----------------:|
-| ![Download Notebook from JN as Py script](../fig/jupyter-notebook-download-as-python-script.png) | ![Download Notebook from JL as Py script](../fig/jupyter-lab-download-as-python-script.png) |
-| Download from Jupyter Notebook | Download from Jupyter Lab |
-
-After choosing an appropriate name and saving the file, we should have a Python
-script (a text file) with entries similar to:
-
-~~~
-#!/usr/bin/env python
-# coding: utf-8
-
-# In[1]:
-
-
-3 + 5 * 8
-
-
-# In[2]:
-
-
-weight_kg=60
-
-~~~
-{: .language-python}
-
-Notice the `# In[X]:` comments that mark the position of the corresponding cells
-in our Jupyter Notebook and are kept for reference. We can keep them in place;
-they won't cause any trouble as they are included as comments (due to the initial
-`#`). But if we wanted to remove them we could do it by hand or, more
-efficiently, by using the command line tool `sed` to find and delete lines that
-contain the characters `# In` and to delete empty lines (`^` is used by `sed`
-to indicate the beginning of a line and `$` to indicate the end):
-
-**(from this point onwards we move away from Jupyter Notebooks and start typing
-commands on a terminal connected to Hawk)**
-
-```
-sed -e '/# In/d' -e '/^$/d' lesson1.py > lesson1_cleaned.py
-```
-
-This will produce the `lesson1_cleaned.py` file with entries similar to:
-
-~~~
-#!/usr/bin/env python
-# coding: utf-8
-3 + 5 * 8
-weight_kg=60
-~~~
-{: .language-python}
-
-Now that we have our Python script, we need to create an additional file (job
-script) to place it in the queue (submit the job). 
Make sure to remove any
-commands from the Python script that might need additional confirmation or user
-interaction, as you won't be able to provide it with this method of execution.
-The following is the content of a job script that is equivalent to how we have
-been requesting resources through OnDemand:
-
-~~~
-#!/bin/bash
-
-#SBATCH -J test              # job name
-#SBATCH -n 1                 # number of tasks needed
-#SBATCH -p htc               # partition
-#SBATCH --time=01:00:00      # time limit
-#SBATCH -A scwXXXX           # account number
-
-set -eu
-
-module purge
-module load anaconda/2020.02
-module list
-
-# Load conda
-source activate
-
-# Load our environment
-conda activate my-conda-env
-
-which python
-python --version
-
-python my-python-script.py
-
-~~~
-{: .language-bash}
-
-To submit (put in the queue) the above script, on Hawk:
-
-```
-$ sbatch my-job-script.sh
-```
-
-~~~
-Submitted batch job 25859860
-~~~
-{: .output}
-
-You can query the current state of this job with:
-
-```
-$ squeue -u $USER
-```
-
-~~~
-   JOBID PARTITION  NAME      USER ST  TIME  NODES NODELIST(REASON)
-25860025       htc  test c.xxxxxxx PD  0:00      1 ccs3004
-~~~
-{: .output}
-
-This particular job might not spend a long time in the queue and the above output
-might not show it, but on completion there should be a `slurm-<jobid>.out` file
-created in the current directory with the output produced by our script.
-
-There is a lot more to know about working with HPC systems and job schedulers;
-once you are ready to go this route, take a look at our documentation and
-training courses on these topics:
-
-- [**Supercomputing for Beginners**](https://arcca.github.io/hpc-intro): Why
-  use HPC? Accessing systems, using SLURM, loading software, file transfer and
-  optimising resources.
-- [**Slurm: Advanced Topics**](https://arcca.github.io/slurm_advanced_topics):
-  Additional material to interface with HPC more effectively.
-
-> ## Need help?
-> -> If during the above steps you found any issues or have doubts regarding your -> specific work environment, get in touch with us at arcca-help@cardiff.ac.uk. -{: .callout} - - - -{% include links.md %} diff --git a/_episodes/05-loop.md b/_episodes/05-loop.md new file mode 100644 index 0000000..b5812f3 --- /dev/null +++ b/_episodes/05-loop.md @@ -0,0 +1,422 @@ +--- +title: Repeating Actions with Loops +teaching: 30 +exercises: 0 +questions: +- "How can I do the same operations on many different values?" +objectives: +- "Explain what a `for` loop does." +- "Correctly write `for` loops to repeat simple calculations." +- "Trace changes to a loop variable as the loop runs." +- "Trace changes to other variables as they are updated by a `for` loop." +keypoints: +- "Use `for variable in sequence` to process the elements of a sequence one at a time." +- "The body of a `for` loop must be indented." +- "Use `len(thing)` to determine the length of something that contains other values." +--- + +In the episode about visualizing data, +we wrote Python code that plots values of interest from our first +inflammation dataset (`inflammation-01.csv`), which revealed some suspicious features in it. + +![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day +period.](../fig/03-loop_2_0.png) + +We have a dozen data sets right now and potentially more on the way if Dr. Maverick +can keep up their surprisingly fast clinical trial rate. We want to create plots for all of +our data sets with a single statement. To do that, we'll have to teach the computer how to +repeat things. + +An example task that we might want to repeat is accessing numbers in a list, +which we +will do by printing each number on a line of its own. + +~~~ +odds = [1, 3, 5, 7] +~~~ +{: .language-python} + +In Python, a list is basically an ordered collection of elements, and every +element has a unique number associated with it --- its index. 
This means that +we can access elements in a list using their indices. +For example, we can get the first number in the list `odds`, +by using `odds[0]`. One way to print each number is to use four `print` statements: + +~~~ +print(odds[0]) +print(odds[1]) +print(odds[2]) +print(odds[3]) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +~~~ +{: .output} + +This is a bad approach for three reasons: + +1. **Not scalable**. Imagine you need to print a list that has hundreds + of elements. It might be easier to type them in manually. + +2. **Difficult to maintain**. If we want to decorate each printed element with an + asterisk or any other character, we would have to change four lines of code. While + this might not be a problem for small lists, it would definitely be a problem for + longer ones. + +3. **Fragile**. If we use it with a list that has more elements than what we initially + envisioned, it will only display part of the list's elements. A shorter list, on + the other hand, will cause an error because it will be trying to display elements of the + list that do not exist. 
+ +~~~ +odds = [1, 3, 5] +print(odds[0]) +print(odds[1]) +print(odds[2]) +print(odds[3]) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +~~~ +{: .output} + +~~~ +--------------------------------------------------------------------------- +IndexError Traceback (most recent call last) + in () + 3 print(odds[1]) + 4 print(odds[2]) +----> 5 print(odds[3]) + +IndexError: list index out of range +~~~ +{: .error} + +Here's a better approach: a [for loop]({{ page.root }}/reference.html#for-loop) + +~~~ +odds = [1, 3, 5, 7] +for num in odds: + print(num) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +~~~ +{: .output} + +This is shorter --- certainly shorter than something that prints every number in a +hundred-number list --- and more robust as well: + +~~~ +odds = [1, 3, 5, 7, 9, 11] +for num in odds: + print(num) +~~~ +{: .language-python} + +~~~ +1 +3 +5 +7 +9 +11 +~~~ +{: .output} + +The improved version uses a [for loop]({{ page.root }}/reference.html#for-loop) +to repeat an operation --- in this case, printing --- once for each thing in a sequence. +The general form of a loop is: + +~~~ +for variable in collection: + # do things using variable, such as print +~~~ +{: .language-python} + +Using the odds example above, the loop might look like this: + +![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and +then being printed](../fig/05-loops_image_num.png) + +where each number (`num`) in the variable `odds` is looped through and printed one number after +another. The other numbers in the diagram denote which loop cycle the number was printed in (1 +being the first loop cycle, and 6 being the final loop cycle). + +We can call the [loop variable]({{ page.root }}/reference.html#loop-variable) anything we like, but +there must be a colon at the end of the line starting the loop, and we must indent anything we +want to run inside the loop. Unlike many other languages, there is no command to signify the end +of the loop body (e.g. 
`end for`); what is indented after the `for` statement belongs to the loop. + + +> ## What's in a name? +> +> +> In the example above, the loop variable was given the name `num` as a mnemonic; +> it is short for 'number'. +> We can choose any name we want for variables. We might just as easily have chosen the name +> `banana` for the loop variable, as long as we use the same name when we invoke the variable inside +> the loop: +> +> ~~~ +> odds = [1, 3, 5, 7, 9, 11] +> for banana in odds: +> print(banana) +> ~~~ +> {: .language-python} +> +> ~~~ +> 1 +> 3 +> 5 +> 7 +> 9 +> 11 +> ~~~ +> {: .output} +> +> It is a good idea to choose variable names that are meaningful, otherwise it would be more +> difficult to understand what the loop is doing. +{: .callout} + +Here's another loop that repeatedly updates a variable: + +~~~ +length = 0 +names = ['Curie', 'Darwin', 'Turing'] +for value in names: + length = length + 1 +print('There are', length, 'names in the list.') +~~~ +{: .language-python} + +~~~ +There are 3 names in the list. +~~~ +{: .output} + +It's worth tracing the execution of this little program step by step. +Since there are three names in `names`, +the statement on line 4 will be executed three times. +The first time around, +`length` is zero (the value assigned to it on line 1) +and `value` is `Curie`. +The statement adds 1 to the old value of `length`, +producing 1, +and updates `length` to refer to that new value. +The next time around, +`value` is `Darwin` and `length` is 1, +so `length` is updated to be 2. +After one more update, +`length` is 3; +since there is nothing left in `names` for Python to process, +the loop finishes +and the `print` function on line 5 tells us our final answer. + +Note that a loop variable is a variable that is being used to record progress in a loop. 
+It still exists after the loop is over, +and we can re-use variables previously defined as loop variables as well: + +~~~ +name = 'Rosalind' +for name in ['Curie', 'Darwin', 'Turing']: + print(name) +print('after the loop, name is', name) +~~~ +{: .language-python} + +~~~ +Curie +Darwin +Turing +after the loop, name is Turing +~~~ +{: .output} + +Note also that finding the length of an object is such a common operation +that Python actually has a built-in function to do it called `len`: + +~~~ +print(len([0, 1, 2, 3])) +~~~ +{: .language-python} + +~~~ +4 +~~~ +{: .output} + +`len` is much faster than any function we could write ourselves, +and much easier to read than a two-line loop; +it will also give us the length of many other things that we haven't met yet, +so we should always use it when we can. + +> ## From 1 to N +> +> Python has a built-in function called `range` that generates a sequence of numbers. `range` can +> accept 1, 2, or 3 parameters. +> +> * If one parameter is given, `range` generates a sequence of that length, +> starting at zero and incrementing by 1. +> For example, `range(3)` produces the numbers `0, 1, 2`. +> * If two parameters are given, `range` starts at +> the first and ends just before the second, incrementing by one. +> For example, `range(2, 5)` produces `2, 3, 4`. +> * If `range` is given 3 parameters, +> it starts at the first one, ends just before the second one, and increments by the third one. +> For example, `range(3, 10, 2)` produces `3, 5, 7, 9`. 
+>
+> Using `range`, write a loop that prints the first 3 natural numbers:
+>
+> ~~~
+> 1
+> 2
+> 3
+> ~~~
+> {: .output}
+>
+> > ## Solution
+> > ~~~
+> > for number in range(1, 4):
+> >     print(number)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+
+
+> ## Understanding the loops
+>
+> Given the following loop:
+> ~~~
+> word = 'oxygen'
+> for char in word:
+>     print(char)
+> ~~~
+> {: .language-python}
+>
+> How many times is the body of the loop executed?
+>
+> * 3 times
+> * 4 times
+> * 5 times
+> * 6 times
+>
+> > ## Solution
+> >
+> > The body of the loop is executed 6 times, once for each of the six
+> > characters in `'oxygen'`.
+> >
+> {: .solution}
+{: .challenge}
+
+
+
+> ## Computing Powers With Loops
+>
+> Exponentiation is built into Python:
+>
+> ~~~
+> print(5 ** 3)
+> ~~~
+> {: .language-python}
+>
+> ~~~
+> 125
+> ~~~
+> {: .output}
+>
+> Write a loop that calculates the same result as `5 ** 3` using
+> multiplication (and without exponentiation).
+>
+> > ## Solution
+> > ~~~
+> > result = 1
+> > for number in range(0, 3):
+> >     result = result * 5
+> > print(result)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+> ## Summing a list
+>
+> Write a loop that calculates the sum of elements in a list
+> by adding each element and printing the final value,
+> so `[124, 402, 36]` prints 562.
+>
+> > ## Solution
+> > ~~~
+> > numbers = [124, 402, 36]
+> > summed = 0
+> > for num in numbers:
+> >     summed = summed + num
+> > print(summed)
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+> ## Computing the Value of a Polynomial
+>
+> The built-in function `enumerate` takes a sequence (e.g. a [list]({{ page.root }}/04-lists/)) and
+> generates a new sequence of the same length. Each element of the new sequence is a pair composed
+> of the index (0, 1, 2,...) 
and the value from the original sequence: +> +> ~~~ +> for idx, val in enumerate(a_list): +> # Do something using idx and val +> ~~~ +> {: .language-python} +> +> The code above loops through `a_list`, assigning the index to `idx` and the value to `val`. +> +> Suppose you have encoded a polynomial as a list of coefficients in +> the following way: the first element is the constant term, the +> second element is the coefficient of the linear term, the third is the +> coefficient of the quadratic term, etc. +> +> ~~~ +> x = 5 +> coefs = [2, 4, 3] +> y = coefs[0] * x**0 + coefs[1] * x**1 + coefs[2] * x**2 +> print(y) +> ~~~ +> {: .language-python} +> +> ~~~ +> 97 +> ~~~ +> {: .output} +> +> Write a loop using `enumerate(coefs)` which computes the value `y` of any +> polynomial, given `x` and `coefs`. +> +> > ## Solution +> > ~~~ +> > y = 0 +> > for idx, coef in enumerate(coefs): +> > y = y + coef * x**idx +> > ~~~ +> > {: .language-python} +> {: .solution} +{: .challenge} + +{% include links.md %} diff --git a/_episodes/06-files.md b/_episodes/06-files.md new file mode 100644 index 0000000..05fcb34 --- /dev/null +++ b/_episodes/06-files.md @@ -0,0 +1,246 @@ +--- +title: Analyzing Data from Multiple Files +teaching: 20 +exercises: 0 +questions: +- "How can I do the same operations on many different files?" +objectives: +- "Use a library function to get a list of filenames that match a wildcard pattern." +- "Write a `for` loop to process multiple files." +keypoints: +- "Use `glob.glob(pattern)` to create a list of files whose names match a pattern." +- "Use `*` in a pattern to match zero or more characters, and `?` to match any single character." +--- + +As a final piece to processing our inflammation data, we need a way to get a list of all the files +in our `data` directory whose names start with `inflammation-` and end with `.csv`. 
+The following library will help us to achieve this: +~~~ +import glob +~~~ +{: .language-python} + +The `glob` library contains a function, also called `glob`, +that finds files and directories whose names match a pattern. +We provide those patterns as strings: +the character `*` matches zero or more characters, +while `?` matches any one character. +We can use this to get the names of all the CSV files in the current directory: + +~~~ +print(glob.glob('inflammation*.csv')) +~~~ +{: .language-python} + +~~~ +['inflammation-05.csv', 'inflammation-11.csv', 'inflammation-12.csv', 'inflammation-08.csv', +'inflammation-03.csv', 'inflammation-06.csv', 'inflammation-09.csv', 'inflammation-07.csv', +'inflammation-10.csv', 'inflammation-02.csv', 'inflammation-04.csv', 'inflammation-01.csv'] +~~~ +{: .output} + +As these examples show, +`glob.glob`'s result is a list of file and directory paths in arbitrary order. +This means we can loop over it +to do something with each filename in turn. +In our case, +the "something" we want to do is generate a set of plots for each file in our inflammation dataset. 
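The `?` wildcard mentioned above works the same way; here is a quick sketch (the exact result depends on which files are present in the current directory):

```python
import glob

# '?' matches exactly one character, so this pattern matches two-digit
# names such as inflammation-01.csv, but not inflammation-100.csv.
print(sorted(glob.glob('inflammation-??.csv')))
```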
+ +If we want to start by analyzing just the first three files in alphabetical order, we can use the +`sorted` built-in function to generate a new sorted list from the `glob.glob` output: + +~~~ +import glob +import numpy +import matplotlib.pyplot + +filenames = sorted(glob.glob('inflammation*.csv')) +filenames = filenames[0:3] +for filename in filenames: + print(filename) + + data = numpy.loadtxt(fname=filename, delimiter=',') + + fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) + + axes1 = fig.add_subplot(1, 3, 1) + axes2 = fig.add_subplot(1, 3, 2) + axes3 = fig.add_subplot(1, 3, 3) + + axes1.set_ylabel('average') + axes1.plot(numpy.mean(data, axis=0)) + + axes2.set_ylabel('max') + axes2.plot(numpy.max(data, axis=0)) + + axes3.set_ylabel('min') + axes3.plot(numpy.min(data, axis=0)) + + fig.tight_layout() + matplotlib.pyplot.show() +~~~ +{: .language-python} + +~~~ +inflammation-01.csv +~~~ +{: .output} + +![Output from the first iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the first dataset.]( +../fig/03-loop_49_1.png) + +~~~ +inflammation-02.csv +~~~ +{: .output} + +![Output from the second iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the second +dataset.](../fig/03-loop_49_3.png) + +~~~ +inflammation-03.csv +~~~ +{: .output} + +![Output from the third iteration of the for loop. Three line graphs showing the daily average, +maximum and minimum inflammation over a 40-day period for all patients in the third +dataset.](../fig/03-loop_49_5.png) + + +The plots generated for the second clinical trial file look very similar to the plots for +the first file: their average plots show similar "noisy" rises and falls; their maxima plots +show exactly the same linear rise and fall; and their minima plots show similar staircase +structures. 
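As a reminder of what each of those per-day curves is, here is the `axis=0` convention on a tiny made-up array (not the real inflammation data):

```python
import numpy

# Rows are patients, columns are days: axis=0 collapses the patient axis,
# leaving one value per day, which is what each line plot above shows.
data = numpy.array([[0, 1, 2],
                    [1, 2, 1],
                    [2, 1, 0]])
print(numpy.mean(data, axis=0))  # per-day average
print(numpy.max(data, axis=0))   # per-day maximum
print(numpy.min(data, axis=0))   # per-day minimum
```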
+
+The third dataset shows much noisier average and maxima plots that are far less suspicious than
+the first two datasets; however, the minima plot shows that the third dataset's minima are
+consistently zero across every day of the trial. If we produce a heat map for the third data file
+we see the following:
+
+![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout
+the entire dataset, and the last patient only has zero values over the 40-day study.
+](../fig/inflammation-03-imshow.svg)
+
+We can see that there are zero values sporadically distributed across all patients and days of the
+clinical trial, suggesting that there were potential issues with data collection throughout the
+trial. In addition, we can see that the last patient in the study didn't have any inflammation
+flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis!
+
+
+> ## Plotting Differences
+>
+> Plot the difference between the average inflammations reported in the first and second datasets
+> (stored in `inflammation-01.csv` and `inflammation-02.csv`, respectively),
+> i.e., the difference between the leftmost plots of the first two figures.
+> +> > ## Solution +> > ~~~ +> > import glob +> > import numpy +> > import matplotlib.pyplot +> > +> > filenames = sorted(glob.glob('inflammation*.csv')) +> > +> > data0 = numpy.loadtxt(fname=filenames[0], delimiter=',') +> > data1 = numpy.loadtxt(fname=filenames[1], delimiter=',') +> > +> > fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) +> > +> > matplotlib.pyplot.ylabel('Difference in average') +> > matplotlib.pyplot.plot(numpy.mean(data0, axis=0) - numpy.mean(data1, axis=0)) +> > +> > fig.tight_layout() +> > matplotlib.pyplot.show() +> > ~~~ +> > {: .language-python} +> {: .solution} +{: .challenge} + +> ## Generate Composite Statistics +> +> Use each of the files once to generate a dataset containing values averaged over all patients: +> +> ~~~ +> filenames = glob.glob('inflammation*.csv') +> composite_data = numpy.zeros((60,40)) +> for filename in filenames: +> # sum each new file's data into composite_data as it's read +> # +> # and then divide the composite_data by number of samples +> composite_data = composite_data / len(filenames) +> ~~~ +> {: .language-python} +> +> Then use pyplot to generate average, max, and min for all patients. 
+>
+> > ## Solution
+> > ~~~
+> > import glob
+> > import numpy
+> > import matplotlib.pyplot
+> >
+> > filenames = glob.glob('inflammation*.csv')
+> > composite_data = numpy.zeros((60,40))
+> >
+> > for filename in filenames:
+> >     data = numpy.loadtxt(fname=filename, delimiter=',')
+> >     composite_data = composite_data + data
+> >
+> > composite_data = composite_data / len(filenames)
+> >
+> > fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))
+> >
+> > axes1 = fig.add_subplot(1, 3, 1)
+> > axes2 = fig.add_subplot(1, 3, 2)
+> > axes3 = fig.add_subplot(1, 3, 3)
+> >
+> > axes1.set_ylabel('average')
+> > axes1.plot(numpy.mean(composite_data, axis=0))
+> >
+> > axes2.set_ylabel('max')
+> > axes2.plot(numpy.max(composite_data, axis=0))
+> >
+> > axes3.set_ylabel('min')
+> > axes3.plot(numpy.min(composite_data, axis=0))
+> >
+> > fig.tight_layout()
+> >
+> > matplotlib.pyplot.show()
+> > ~~~
+> > {: .language-python}
+> {: .solution}
+{: .challenge}
+
+After spending some time investigating the heat map and statistical plots, as well as
+doing the above exercises to plot differences between datasets and to generate composite
+patient statistics, we gain some insight into the twelve clinical trial datasets.
+
+The datasets appear to fall into two categories:
+
+* seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims,
+  but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
+* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning
+  data collection issues such as sporadic missing values and even an unsuitable candidate
+  making it into the clinical trial.
+
+In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`,
+`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value.
+Armed with this information, we confront Dr. Maverick about the suspicious data and
+duplicated files.
+
+Dr. 
Maverick confesses that they fabricated the clinical data after they found out
+that the initial trial suffered from a number of issues, including unreliable data recording and
+poor participant selection. They created fake data to prove their drug worked, and when we asked
+for more data they tried to generate more fake datasets, as well as throwing in the original
+poor-quality dataset a few times to try to make all the trials seem a bit more "realistic".
+
+Congratulations! We've investigated the inflammation data and proven that the datasets have been
+synthetically generated.
+
+But it would be a shame to throw away the synthetic datasets that have taught us so much
+already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn
+how to program.
+
+{% include links.md %}