From bb3cf8107f77e07b72c67dd0c71334debdee2cf4 Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Tue, 17 Sep 2024 11:59:40 +0000 Subject: [PATCH] site deploy Auto-generated via {sandpaper} Source : ba63695975b53ab8aa5c17073f8f3e6a1f0a9b9e Branch : md-outputs Author : GitHub Actions Time : 2024-09-17 11:59:18 +0000 Message : markdown source builds Auto-generated via {sandpaper} Source : ff260cc03fd35087399ace76b6afa18d5ecb22cb Branch : main Author : Aleksandra Nenadic Time : 2024-09-17 11:57:43 +0000 Message : Merge pull request #134 from carpentries-incubator/post-pilot1-restructuring Initial restructuring of episodes post pilot 1 --- 01-introduction.html | 27 +- 02-fair-research-software.html | 31 +- 03-tools.html | 262 +- 04-version-control.html | 99 +- 05-code-environment.html | 883 +++ 06-code-readability.html | 1360 ++++ 07-code-structure.html | 1120 +++ 08-code-correctness.html | 2103 ++++++ 09-code-documentation.html | 1569 +++++ 10-open-collaboration.html | 1087 +++ 11-wrap-up.html | 531 ++ 404.html | 33 +- AUTHORS.html | 27 +- CODE_OF_CONDUCT.html | 27 +- LICENSE.html | 27 +- aio.html | 5592 ++++++++------- ci-for-testing.html | 620 ++ config.yaml | 17 +- ethical-environmental-considerations.html | 815 +++ images.html | 57 +- index.html | 27 +- instructor-notes.html | 32 +- instructor/01-introduction.html | 27 +- instructor/02-fair-research-software.html | 31 +- instructor/03-tools.html | 264 +- instructor/04-version-control.html | 99 +- instructor/05-code-environment.html | 885 +++ instructor/06-code-readability.html | 1462 ++++ instructor/07-code-structure.html | 1227 ++++ instructor/08-code-correctness.html | 2266 ++++++ instructor/09-code-documentation.html | 1586 +++++ instructor/10-open-collaboration.html | 1089 +++ instructor/11-wrap-up.html | 533 ++ instructor/404.html | 33 +- instructor/AUTHORS.html | 27 +- instructor/CODE_OF_CONDUCT.html | 27 +- instructor/LICENSE.html | 27 +- instructor/aio.html | 6264 ++++++++++------- 
instructor/ci-for-testing.html | 622 ++ .../ethical-environmental-considerations.html | 817 +++ instructor/images.html | 55 +- instructor/index.html | 93 +- instructor/instructor-notes.html | 400 +- instructor/key-points.html | 90 +- instructor/licensing.html | 27 +- instructor/profiles.html | 27 +- key-points.html | 92 +- licensing.html | 27 +- md5sum.txt | 23 +- pkgdown.yml | 2 +- profiles.html | 27 +- sitemap.xml | 54 + 52 files changed, 28658 insertions(+), 5891 deletions(-) create mode 100644 05-code-environment.html create mode 100644 06-code-readability.html create mode 100644 07-code-structure.html create mode 100644 08-code-correctness.html create mode 100644 09-code-documentation.html create mode 100644 10-open-collaboration.html create mode 100644 11-wrap-up.html create mode 100644 ci-for-testing.html create mode 100644 ethical-environmental-considerations.html create mode 100644 instructor/05-code-environment.html create mode 100644 instructor/06-code-readability.html create mode 100644 instructor/07-code-structure.html create mode 100644 instructor/08-code-correctness.html create mode 100644 instructor/09-code-documentation.html create mode 100644 instructor/10-open-collaboration.html create mode 100644 instructor/11-wrap-up.html create mode 100644 instructor/ci-for-testing.html create mode 100644 instructor/ethical-environmental-considerations.html diff --git a/01-introduction.html b/01-introduction.html index 115a198f..8b36fe38 100644 --- a/01-introduction.html +++ b/01-introduction.html @@ -99,7 +99,7 @@ - @@ -258,7 +258,7 @@


@@ -578,7 +587,7 @@

Discussion

-
+

Here are some questions to help you assess where on the FAIR spectrum the code is:

@@ -628,7 +637,7 @@

Give me a hint

-
+

I would give the following scores:

F - 1/5

diff --git a/03-tools.html b/03-tools.html index 46916feb..fe9634aa 100644 --- a/03-tools.html +++ b/03-tools.html @@ -1,5 +1,5 @@ -Tools and practices for FAIR research software: Tools and practices for research software development +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Reproducible development environment

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What are virtual environments in software development and why use +them?
  • +
  • How can we manage Python virtual coding environments and external +(third-party) libraries on our machines?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Set up a Python virtual coding environment for a software project +using venv and pip.
  • +
+
+
+
+
+

So far we have created a local Git repository to track changes in our +software project and pushed it to GitHub to enable others to see and +contribute to it.

+

We now want to start developing the code further. If we have a look +at our script, we may notice a few import lines like the +following:

+
+

PYTHON +

+
import json
+
+
+

PYTHON +

+
import csv
+
+
+

PYTHON +

+
import datetime as dt
+
+
+

PYTHON +

+
import matplotlib.pyplot as plt
+
+

This means that our code requires several external +libraries (also called third-party packages or dependencies) - +json, csv, datetime and +matplotlib.

+

Python applications often use external libraries that do not come as part of the standard Python distribution. This means that you will have to use a package manager tool to install them on your system. Applications will also sometimes need a specific version of an external library (e.g. because they were written to work with a feature, class, or function that may have been updated in more recent versions), or a specific version of the Python interpreter. This means that each Python application you work with may require a different setup and set of dependencies, so it is useful to keep these configurations separate to avoid confusion between projects. The solution to this problem is to create a self-contained virtual environment per project, which contains a particular version of the Python installation plus a number of additional external libraries.

+

Virtual development environments

+

So what exactly are virtual environments, and why use them?

+

A Python virtual environment helps us create an isolated +working copy of a software project that uses a specific version +of Python interpreter together with specific versions of a number of +external libraries installed into that virtual environment. Python +virtual environments are implemented as directories with a particular +structure within software projects, containing links to specified +dependencies allowing isolation from other software projects on your +machine that may require different versions of Python or external +libraries.

+

It is recommended to create a separate virtual environment for each +project. Then you do not have to worry about changes to the environment +of the current project you are working on affecting other projects - you +can use different Python versions and different versions of the same +third party dependency by different projects on your machine +independently from one another.
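One way to see this isolation at work is a quick check you can run anywhere, using only Python's standard library - a small sketch:

```python
import sys

# Inside an activated virtual environment, sys.prefix points at the
# environment's own directory, while sys.base_prefix still points at the
# underlying Python installation. Outside a virtual environment, they match.
in_virtual_env = sys.prefix != sys.base_prefix
print("sys.prefix:", sys.prefix)
print("Running inside a virtual environment:", in_virtual_env)
```

Running this once with your project's environment activated and once after deactivating it shows the prefix switching between the two locations.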

+

Another big motivator for using virtual environments is that they +make sharing your code with others much easier - as we will see shortly +you can record your virtual environment in a special file and share it +with your collaborators who can then recreate the same development +environment on their machines.

+

Most of the time you do not have to worry about the specific versions of the external libraries your project depends on. Virtual environments enable you to always use the latest available version without specifying it explicitly, and they also enable you to use a specific older version of a package for your project, should you need to.

+

Managing virtual environments

+

There are several command line tools used for managing Python virtual +environments - we will use venv, available by default from +the standard Python distribution since +Python 3.3.

+

Part of managing your (virtual) working environment involves installing, updating and removing external packages on your system. The Python package manager tool pip is most commonly used for this - it interacts with, and obtains packages from, the central repository called the Python Package Index (PyPI).

+

So, we will use venv and pip in combination +to help us create and share our virtual development environments.

+

Creating virtual environments

+

Creating a virtual environment with venv is done by +executing the following command:

+
+

BASH +

+
$ python -m venv /path/to/new/virtual/environment
+
+

where /path/to/new/virtual/environment is the path to the directory where you want to place the virtual environment - conventionally within your software project so that they are co-located. This command will create the target directory for the virtual environment.

+

For our project let’s create a virtual environment called +“venv_spacewalks” from our project’s root directory.

+

Firstly, ensure you are located within the project’s root +directory:

+
+

BASH +

+
$ cd /path/to/spacewalks
+
+
+

BASH +

+
$ python -m venv venv_spacewalks
+
+

If you list the contents of the newly created directory +“venv_spacewalks”, on a Mac or Linux system (slightly different on +Windows as explained below) you should see something like:

+
+

BASH +

+
$ ls -l venv_spacewalks
+
+
+

OUTPUT +

+
total 8
+drwxr-xr-x  12 alex  staff  384  5 Oct 11:47 bin
+drwxr-xr-x   2 alex  staff   64  5 Oct 11:47 include
+drwxr-xr-x   3 alex  staff   96  5 Oct 11:47 lib
+-rw-r--r--   1 alex  staff   90  5 Oct 11:47 pyvenv.cfg
+
+

So, running the python -m venv venv_spacewalks command +created the target directory called “venv_spacewalks” containing:

+
  • +pyvenv.cfg configuration file with a home key pointing +to the Python installation from which the command was run,
  • +
  • bin subdirectory (called Scripts on Windows) containing a symlink to (or a copy of) the Python interpreter binary used to create the environment, along with the environment's activation scripts,
  • +
  • +lib/pythonX.Y/site-packages subdirectory (called +Lib\site-packages on Windows) to contain its own +independent set of installed Python packages isolated from other +projects, and
  • +
  • various other configuration and supporting files and +subdirectories.
  • +

Once you’ve created a virtual environment, you will need to activate +it.

+

On Mac or Linux, it is done as:

+
+

BASH +

+
$ source venv_spacewalks/bin/activate
+(venv_spacewalks) $
+
+

On Windows, recall that we have a Scripts directory instead of bin, so activating a virtual environment (e.g. from Git Bash) is done as:

+
+

BASH +

+
$ source venv_spacewalks/Scripts/activate
+(venv_spacewalks) $
+
+

Activating the virtual environment will change your command line’s +prompt to show what virtual environment you are currently using +(indicated by its name in round brackets at the start of the prompt), +and modify the environment so that running Python will get you the +particular version of Python configured in your virtual environment.

+

You can verify you are using your virtual environment’s version of +Python by checking the path using the command which:

+
+

BASH +

+
(venv_spacewalks) $ which python
+
+

When you’re done working on your project, you can exit the +environment with:

+
+

BASH +

+
(venv_spacewalks) $ deactivate
+
+

If you’ve just done the deactivate, ensure you +reactivate the environment ready for the next part:

+
+

BASH +

+
$ source venv_spacewalks/bin/activate
+(venv_spacewalks) $
+
+

Note that, since our software project is being tracked by Git, the +newly created virtual environment will show up in version control - we +will see how to handle it using Git in one of the subsequent +episodes.

+

Installing external packages

+

We noticed earlier that our code imports four libraries - json, csv, datetime and matplotlib. The first three - json, csv and datetime - come as part of Python's standard library with every Python installation, so there is nothing to install (although you still need to import them in any script that uses them). Beware that running pip install datetime would not give you the standard library module - it fetches an unrelated third-party package called DateTime from PyPI. The only external dependency we need to install is matplotlib, as it does not come as standard with the Python distribution.

To install the latest version of a package with pip you use pip's install command and specify the package's name, e.g.:

BASH 

(venv_spacewalks) $ python -m pip install matplotlib

You can also install several packages with a single command by listing them all, e.g. python -m pip install <package-one> <package-two>.

The above command has installed the package matplotlib in our currently active venv_spacewalks environment and will not affect any other Python projects we may have on our machines.
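If you are ever unsure whether a module is already importable in your environment (from the standard library or an installed package), you can check from Python itself - a quick sketch using the standard importlib machinery:

```python
import importlib.util

# find_spec returns None when a module cannot be found; json, csv and
# datetime ship with Python's standard library, so they are always present.
for module_name in ("json", "csv", "datetime"):
    available = importlib.util.find_spec(module_name) is not None
    print(f"{module_name}: {'available' if available else 'not installed'}")
```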

+

If you run the python -m pip install command on a +package that is already installed, pip will notice this and +do nothing.

+

To install a specific version of a Python package give the package +name followed by == and the version number, +e.g. python -m pip install matplotlib==3.5.3.

+

To specify a minimum version of a Python package, you can do python -m pip install 'matplotlib>=3.5.1' (the quotes are needed so that the shell does not interpret >= as output redirection).

+

To upgrade a package to the latest version, +e.g. python -m pip install --upgrade matplotlib.

+

To display information about a particular installed package do:

+
+

BASH +

+
(venv_spacewalks) $ python -m pip show matplotlib
+
+
+

OUTPUT +

+
Name: matplotlib
+Version: 3.9.0
+Summary: Python plotting package
+Home-page:
+Author: John D. Hunter, Michael Droettboom
+Author-email: Unknown <matplotlib-users@python.org>
+License: License agreement for matplotlib versions 1.3.0 and later
+=========================================================
+...
+Location: /opt/homebrew/lib/python3.11/site-packages
+Requires: contourpy, cycler, fonttools, kiwisolver, numpy, packaging, pillow, pyparsing, python-dateutil
+Required-by: 
+
+

To list all packages installed with pip (in your current +virtual environment):

+
+

BASH +

+
(venv_spacewalks) $ python -m pip list
+
+
+

OUTPUT +

+
Package         Version
+--------------- -----------
+contourpy       1.2.1
+cycler          0.12.1
+fonttools       4.53.1
+kiwisolver      1.4.5
+matplotlib      3.9.2
+numpy           2.0.1
+packaging       24.1
+pillow          10.4.0
+pip             23.3.1
+pyparsing       3.1.2
+python-dateutil 2.9.0.post0
+setuptools      69.0.2
+six             1.16.0
+
+

To uninstall a package installed in the virtual environment do: +python -m pip uninstall <package-name>. You can also +supply a list of packages to uninstall at the same time.

+

Sharing virtual environments

+

You are collaborating on a project with a team so, naturally, you +will want to share your environment with your collaborators so they can +easily ‘clone’ your software project with all of its dependencies and +everyone can replicate equivalent virtual environments on their +machines. pip has a handy way of exporting, saving and +sharing virtual environments.

+

To export your active environment, use the python -m pip freeze command to produce a list of packages installed in the virtual environment. A common convention is to put this list in a requirements.txt file in your project's root directory:

+
+

BASH +

+
(venv_spacewalks) $ python -m pip freeze > requirements.txt
+(venv_spacewalks) $ cat requirements.txt
+
+
+

OUTPUT +

+
contourpy==1.2.1
+cycler==0.12.1
+fonttools==4.53.1
+kiwisolver==1.4.5
+matplotlib==3.9.2
+numpy==2.0.1
+packaging==24.1
+pillow==10.4.0
+pyparsing==3.1.2
+python-dateutil==2.9.0.post0
+six==1.16.0
+
+

The first of the above commands will create a +requirements.txt file in your current directory. Yours may +look a little different, depending on the version of the packages you +have installed, as well as any differences in the packages that they +themselves use.

+

The requirements.txt file can then be committed to a +version control system (we will see how to do this using Git in a +moment) and get shipped as part of your software and shared with +collaborators and/or users.

+

Note that you only need to share the small +requirements.txt file with your collaborators - and not the +entire
venv_spacewalks directory with packages contained in your +virtual environment. We need to tell Git to ignore that directory, so it +is not tracked and shared - we do this by creating a file +.gitignore in the root directory of our project and adding +a line venv_spacewalks to it.
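A minimal sketch of setting this up from the project root (the directory name matches the environment created above):

```shell
# Append the environment directory to .gitignore (creating the file if absent)
echo "venv_spacewalks" >> .gitignore
cat .gitignore
```

After committing .gitignore (git add .gitignore followed by git commit), git status will no longer report the environment directory as untracked.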

+

Let’s now put requirements.txt under version control and +share it along with our code.

+
+

BASH +

+
(venv_spacewalks) $ git add requirements.txt
+(venv_spacewalks) $ git commit -m "Initial commit of requirements.txt."
+(venv_spacewalks) $ git push origin main
+
+

Your collaborators or users of your software can now download your software's source code and replicate the same virtual software environment for running your code on their machines, using requirements.txt to install all the necessary dependencies.

+

To recreate a matching environment from requirements.txt, a collaborator would create and activate a fresh virtual environment (as shown earlier) and then, from the project root, run:

+
+

BASH +

+
(venv_spacewalks) $ python -m pip install -r requirements.txt
+
+

As your project grows - you may need to update your environment for a +variety of reasons, e.g.:

+
  • one of your project’s dependencies has just released a new version +(dependency version number update),
  • +
  • you need an additional package for data analysis (adding a new +dependency), or
  • +
  • you have found a better package and no longer need the older package +(adding a new and removing an old dependency).
  • +

In this case - apart from installing the new packages into, and removing the no-longer-needed packages from, your virtual environment - you need to update the contents of the requirements.txt file accordingly by re-issuing the pip freeze command, and then propagate the updated requirements.txt file to your collaborators via your code sharing platform.

+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  • Virtual environments keep Python versions and dependencies required +by different projects separate.
  • +
  • A Python virtual environment is itself a directory structure.
  • +
  • You can use venv to create and manage Python virtual +environments, and pip to install and manage Python external +(third-party) libraries.
  • +
  • By convention, you can save and export your Python virtual +environment in a requirements.txt in your project’s root +directory, which can then be shared with collaborators/users and used to +replicate your virtual environment elsewhere.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/06-code-readability.html b/06-code-readability.html new file mode 100644 index 00000000..b331e55b --- /dev/null +++ b/06-code-readability.html @@ -0,0 +1,1360 @@ + +Tools and practices for FAIR research software: Code readability +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Code readability

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • Why does code readability matter?
  • +
  • How can I organise my code to be more readable?
  • +
  • What types of documentation can I include to improve the readability +of my code?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Organise code into reusable functions that achieve a singular +purpose
  • +
  • Choose function and variable names that help explain the purpose of +the function or variable
  • +
  • Write informative inline comments and docstrings to provide more +detail about what the code is doing
  • +
+
+
+
+
+

In this episode, we will introduce the concept of readable code and +consider how it can help create reusable scientific software and empower +collaboration between researchers.

+

When someone writes code, they do so based on requirements that are +likely to change in the future. Requirements change because software +interacts with the real world, which is dynamic. When these requirements +change, the developer (who is not necessarily the same person who wrote +the original code) must implement the new requirements. They do this by +reading the original code to understand the different abstractions, and +identify what needs to change. Readable code facilitates the reading and +understanding of the abstraction phases and, as a result, facilitates +the evolution of the codebase. Readable code saves future developers’ +time and effort.

+

In order to develop readable code, we should ask ourselves: “If I +re-read this piece of code in fifteen days or one year, will I be able +to understand what I have done and why?” Or even better: “If a new +person who just joined the project reads my software, will they be able +to understand what I have written here?”

+

We will now learn about a few software best practices we can follow to help create more readable code.

+

Before we move on with further code modifications, make sure your +virtual development environment is active.

+ +

Place import statements at the top

+

Let’s have a look at our code again - the first thing we may notice is that our script currently places import statements throughout the code. Conventionally, all import statements are placed at the top of the script so that dependent libraries are clearly visible and not buried inside the code (even though there are standard ways of describing dependencies - e.g. using a requirements.txt file). This will help the readability (accessibility) and reusability of our code.

+

Our code after the modification should look like the following.

+
+

PYTHON +

+

+# https://data.nasa.gov/resource/eva.json (with modifications)
+import json
+import csv
+import datetime as dt
+import matplotlib.pyplot as plt
+
+data_f = open('./eva-data.json', 'r')
+data_t = open('./eva-data.csv','w')
+g_file = './cumulative_eva_graph.png'  
+
+fieldnames = ("EVA #", "Country", "Crew    ", "Vehicle", "Date", "Duration", "Purpose")
+
+data=[]
+
+for i in range(374):
+    line=data_f.readline()
+    print(line)
+    data.append(json.loads(line[1:-1]))
+#data.pop(0)
+## Comment out this bit if you don't want the spreadsheet
+
+w=csv.writer(data_t)
+
+time = []
+date =[]
+
+j=0
+for i in data:
+    print(data[j])
+    # and this bit
+    w.writerow(data[j].values())
+    if 'duration' in data[j].keys():
+        tt=data[j]['duration']
+        if tt == '':
+            pass
+        else:
+            t=dt.datetime.strptime(tt,'%H:%M')
+            ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60)
+            print(t,ttt)
+            time.append(ttt)
+            if 'date' in data[j].keys():
+                date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d'))
+                #date.append(data[j]['date'][0:10])
+
+            else:
+                time.pop(0)
+    j+=1
+
+t=[0]
+for i in time:
+    t.append(t[-1]+i)
+
+date,time = zip(*sorted(zip(date, time)))
+
+plt.plot(date,t[1:], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(g_file)
+plt.show()
+
+
+

Let’s make sure we commit our changes.

+
+

BASH +

+
(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Move import statements to the top of the script"
+
+

Use meaningful variable names

+

Variables are the most common thing you will assign when coding, and +it’s really important that it is clear what each variable means in order +to understand what the code is doing. If you return to your code after a +long time doing something else, or share your code with a colleague, it +should be easy enough to understand what variables are involved in your +code from their names. Therefore we need to give them clear names, but +we also want to keep them concise so the code stays readable. There are +no “hard and fast rules” here, and it’s often a case of using your best +judgment.

+

Some useful tips for naming variables are:

+
  • Short words are better than single character names. For example, if +we were creating a variable to store the speed to read a file, +s (for ‘speed’) is not descriptive enough but +MBReadPerSecondAverageAfterLastFlushToLog is too long to +read and prone to misspellings. ReadSpeed (or +read_speed) would suffice.
  • +
  • If you are finding it difficult to come up with a variable name that +is both short and descriptive, go with the short version and use an +inline comment to describe it further (more on those in the next +section). This guidance does not necessarily apply if your variable is a +well-known constant in your domain - for example, c represents +the speed of light in physics.
  • +
  • Try to be descriptive where possible and avoid meaningless or funny +names like foo, bar, var, +thing, etc.
  • +

There are also some restrictions to consider when naming variables in +Python:

+
  • Only alphanumeric characters and underscores are permitted in +variable names.
  • +
  • You cannot begin your variable names with a numerical character as +this will raise a syntax error. Numerical characters can be included in +a variable name, just not as the first character. For example, +read_speed1 is a valid variable name, but +1read_speed isn’t. (This behaviour may be different for +other programming languages.)
  • +
  • Variable names are case sensitive. So speed_of_light +and Speed_Of_Light are not the same.
  • +
  • Programming languages often have built-in functions, such as input, which you may accidentally shadow if you assign a variable with the same name - after which you can no longer access the original input function. In this case, opting for something like input_data would be preferable. Note that other programming languages may explicitly disallow this, but Python does not.
  • +
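The last point - accidentally shadowing a built-in - is easy to demonstrate (the filename here is just for illustration):

```python
# Assigning to the name `input` shadows Python's built-in input() function...
input = "./eva-data.json"  # now `input` is just a string
# ...so calling input("prompt") here would raise:
# TypeError: 'str' object is not callable

# A descriptive, non-clashing name avoids the problem entirely:
input_file_path = "./eva-data.json"
print(input_file_path)
```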
+
+ +
+
+

Give a descriptive name to a variable

+
+

Below we have a variable called var being set the value +of 9.81. var is not a very descriptive name here as it +doesn’t tell us what 9.81 means, yet it is a very common constant in +physics! Go online and find out which constant 9.81 relates to and +suggest a new name for this variable.

+

Hint: the units are metres per second squared!

+
+

PYTHON +

+
var = 9.81
+
+
+
+
+
+
+ +
+
+

Yes, \[9.81 m/s^2 \] is the acceleration due to gravity at the Earth's surface (an acceleration rather than a force - forces are measured in newtons). It is often referred to as “little g” to distinguish it from “big G”, the Gravitational Constant. A more descriptive name for this variable therefore might be:

+
+

PYTHON +

+
g_earth = 9.81
+
+
+
+
+
+
+
+ +
+
+

Challenge

+
+

Let’s apply this to eva_data_analysis.py.

+
  1. +

    Edit the code as follows to use descriptive variable names:

    +
    • Change data_f to input_file
    • +
    • Change data_t to output_file
    • +
    • Change g_file to graph_file
    • +
  2. +
  3. What other variable names in our code would benefit from +renaming?

  4. +
  5. Commit your changes to your repository. Remember to use an +informative commit message.

  6. +
+
+
+
+
+ +
+
+

Updated code:

+
+

PYTHON +

+

+import json
+import csv
+import datetime as dt
+import matplotlib.pyplot as plt
+
+# Data source: https://data.nasa.gov/resource/eva.json (with modifications)
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+fieldnames = ("EVA #", "Country", "Crew    ", "Vehicle", "Date", "Duration", "Purpose")
+
+data=[]
+
+for i in range(374):
+    line=input_file.readline()
+    print(line)
+    data.append(json.loads(line[1:-1]))
+#data.pop(0)
+## Comment out this bit if you don't want the spreadsheet
+
+w=csv.writer(output_file)
+
+time = []
+date =[]
+
+j=0
+for i in data:
+    print(data[j])
+    # and this bit
+    w.writerow(data[j].values())
+    if 'duration' in data[j].keys():
+        tt=data[j]['duration']
+        if tt == '':
+            pass
+        else:
+            t=dt.datetime.strptime(tt,'%H:%M')
+            ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60)
+            print(t,ttt)
+            time.append(ttt)
+            if 'date' in data[j].keys():
+                date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d'))
+                #date.append(data[j]['date'][0:10])
+
+            else:
+                time.pop(0)
+    j+=1
+
+t=[0]
+for i in time:
+    t.append(t[-1]+i)
+
+date,time = zip(*sorted(zip(date, time)))
+
+plt.plot(date,t[1:], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+
+

We should also rename variables w, t, +ttt to be more descriptive.
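As a sketch of what such renames might look like for one fragment of the script (the replacement names are our own suggestions, not prescribed by the lesson):

```python
import csv
import datetime as dt
import io

# was: w - the name now says what the object is
output_buffer = io.StringIO()            # stands in for the open output file
csv_writer = csv.writer(output_buffer)

# was: tt and ttt - the names now say what the values represent
duration_text = "7:45"                   # an "HH:MM" duration string
parsed_time = dt.datetime.strptime(duration_text, '%H:%M')
duration_hours = dt.timedelta(hours=parsed_time.hour,
                              minutes=parsed_time.minute).total_seconds() / 3600
print(duration_hours)  # 7.75
```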

+

Commit changes:

+
+

BASH +

+
(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Use descriptive variable names"
+
+
+
+
+
+

Use standard libraries

+

Our script currently reads the data line-by-line from the JSON data +file and uses custom code to manipulate the data. Variables of interest +are stored in lists but there are more suitable data structures +(e.g. data frames) to store data in our case. By choosing custom code +over standard and well-tested libraries, we are making our code less +readable and understandable and more error-prone.

+

The main functionality of our code can be rewritten as follows using +the Pandas library to load and manipulate the data in data +frames.

+

First, we need to install this dependency into our virtual +environment (which should be active at this point).

+
+

BASH +

+
(venv_spacewalks) $ python -m pip install pandas
+
+

The code should now look like:

+
+

PYTHON +

+
import matplotlib.pyplot as plt
+import pandas as pd
+
+# Data source: https://data.nasa.gov/resource/eva.json (with modifications)
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+eva_df = pd.read_json(input_file, convert_dates=['date'])
+eva_df['eva'] = eva_df['eva'].astype(float)
+eva_df.dropna(axis=0, inplace=True)
+eva_df.sort_values('date', inplace=True)
+
+eva_df.to_csv(output_file, index=False)
+
+eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
+eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
+plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+
+
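The densest line in this version is the duration conversion. Unpacked into a small helper it reads as follows (a sketch - the function name is our own suggestion):

```python
def text_to_duration(duration: str) -> float:
    """Convert a duration string in "HH:MM" format into hours as a float."""
    hours, minutes = duration.split(":")
    # e.g. "7:45" -> 7 + 45/60 hours
    return int(hours) + int(minutes) / 60

print(text_to_duration("7:45"))  # 7.75
```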

We should replace the existing code in our Python script +eva_data_analysis.py with the above code and commit the +changes. Remember to use an informative commit message.

+
+

BASH +

+
(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Refactor code to use standard libraries"
+
+

Make sure to capture the changes to your virtual development +environment too.

+
+

BASH +

+
(venv_spacewalks) $ python -m pip freeze > requirements.txt
+(venv_spacewalks) $ git add requirements.txt
+(venv_spacewalks) $ git commit -m "Added Pandas library."
+(venv_spacewalks) $ git push origin main
+
+

Use comments to explain functionality

+

Commenting is a very useful practice to help convey the context of +the code. It can be helpful as a reminder for your future self or your +collaborators as to why code is written in a certain way, how it is +achieving a specific task, or the real-world implications of your +code.

+

There are several ways to add comments to code:

+
  • An inline comment is a comment on the same line as a code statement. It typically comes after the code statement, runs to the end of the line, and is useful for a brief explanation of that line. Inline comments in Python should be separated from the statement by at least two spaces; they start with a # followed by a single space, and have no end delimiter.
  • +
  • A multi-line or block comment can span multiple lines and has a start and end sequence. To comment out a block of code in Python, you can either add a # at the beginning of each line of the block or surround the entire block with three single (''') or double (""") quotes (strictly speaking, a triple-quoted block is a string literal rather than a true comment, but it is commonly used for this purpose).
  • +
+

PYTHON +

+
x = 5  # In Python, inline comments begin with the `#` symbol and a single space.
+
+'''
+This is a multiline
+comment
+in Python.
+'''
+
+

Here are a few things to keep in mind when commenting your code:

+
  • Focus on the why and the how of +your code - avoid using comments to explain what your +code does. If your code is too complex for other programmers to +understand, consider rewriting it for clarity rather than adding +comments to explain it.
  • +
  • Make sure you are not reiterating something that your code already +conveys on its own. Comments should not echo your code.
  • +
  • Keep comments short and concise. Large blocks of text quickly become +unreadable and difficult to maintain.
  • +
  • Comments that contradict the code are worse than no comments. Always +make a priority of keeping comments up-to-date when code changes.
  • +
+

Examples of unhelpful comments

+
+

PYTHON +

+
statetax = 1.0625  # Assigns the float 1.0625 to the variable 'statetax'
+citytax = 1.01  # Assigns the float 1.01 to the variable 'citytax'
+specialtax = 1.01  # Assigns the float 1.01 to the variable 'specialtax'
+
+

The comments in this code simply tell us what the code does, which is +easy enough to figure out without the inline comments.

+
+
+

Examples of helpful comments

+
+

PYTHON +

+
statetax = 1.0625  # State sales tax rate is 6.25% through Jan. 1
+citytax = 1.01  # City sales tax rate is 1% through Jan. 1
+specialtax = 1.01  # Special sales tax rate is 1% through Jan. 1
+
+

In this case, it might not be immediately obvious what each variable +represents, so the comments offer helpful, real-world context. The date +in the comment also indicates when the code might need to be +updated.

+
+
+ +
+
+

Add comments to our code

+
+
  1. Examine eva_data_analysis.py. Add as many comments as you think are required to help yourself and others understand what that code is doing.
  2. +
  3. Commit your changes to your repository. Remember to use an +informative commit message.
  4. +
+
+
+
+
+ +
+
+

Some good comments may look like the example below.

+
+

PYTHON +

+

+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+# https://data.nasa.gov/resource/eva.json (with modifications)
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+print("--START--")
+print(f'Reading JSON file {input_file}')
+# Read the data from a JSON file into a Pandas dataframe
+eva_df = pd.read_json(input_file, convert_dates=['date'])
+eva_df['eva'] = eva_df['eva'].astype(float)
+# Clean the data by removing any incomplete rows and sort by date
+eva_df.dropna(axis=0, inplace=True)
+eva_df.sort_values('date', inplace=True)
+
+print(f'Saving to CSV file {output_file}')
+# Save dataframe to CSV file for later analysis
+eva_df.to_csv(output_file, index=False)
+
+print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+# Plot cumulative time spent in space over years
+eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
+eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
+plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+print("--END--")
+
+

Commit changes:

+
+

BASH +

+
(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Add inline comments to the code"
+
+
+
+
+
+
+

Separate units of functionality

+

Functions are a fundamental concept in writing software and are one +of the core ways you can organise your code to improve its readability. +A function is an isolated section of code that performs a single, +specific task that can be simple or complex. It can then be +called multiple times with different inputs throughout a codebase, but +its definition only needs to appear once.

+

Breaking up code into functions in this manner benefits readability +since the smaller sections are easier to read and understand. Since +functions can be reused, codebases naturally begin to follow the Don’t +Repeat Yourself principle which prevents software from becoming +overly long and confusing. The software also becomes easier to maintain +because, if the code encapsulated in a function needs to change, it only +needs updating in one place instead of many. As we will learn in a +future episode, testing code also becomes simpler when code is written +in functions. Each function can be individually checked to ensure it is +doing what is intended, which improves confidence in the software as a +whole.

+
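To see why this matters, here is a small toy example (not part of our Spacewalks code) showing a repeated calculation extracted into a function, following the Don’t Repeat Yourself principle:

```python
import math

# Without a function the formula is repeated, and a change to it
# would need to be made in every place it appears:
area1 = math.pi * 2.0 ** 2
area2 = math.pi * 3.5 ** 2

# With a function the logic is defined once and reused:
def circle_area(radius):
    return math.pi * radius ** 2

area1 = circle_area(2.0)
area2 = circle_area(3.5)
```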
+
+ +
+
+

Callout

+
+

Decomposing code into functions helps with reusability of blocks of +code and eliminating repetition, but, equally importantly, it helps with +code readability and testing.

+
+
+
+

Looking at our code, you may notice it contains different pieces of +functionality:

+
  1. reading the data from a JSON file
  2. +
  3. processing/cleaning the data and preparing it for analysis
  4. +
  5. data analysis and visualising the results
  6. +
  7. converting and saving the data in the CSV format
  8. +

Let’s refactor our code so that reading the data in JSON format into +a dataframe (step 1.) and converting it and saving to the CSV format +(step 4.) are extracted into separate functions. Let’s name those +functions read_json_to_dataframe and +write_dataframe_to_csv respectively. The main part of the +script should then be simplified to invoke these new functions, while +the functions themselves contain the complexity of each of these two +steps.

+

Our code may look something like the following.

+
+

PYTHON +

+

+import matplotlib.pyplot as plt
+import pandas as pd
+
+def read_json_to_dataframe(input_file):
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    print(f'Saving to CSV file {output_file}')
+    # Save dataframe to CSV file for later analysis
+    df.to_csv(output_file, index=False)
+
+
+# Main code
+
+print("--START--")
+
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+# Read the data from JSON file
+eva_data = read_json_to_dataframe(input_file)
+
+# Convert and export data to CSV file
+write_dataframe_to_csv(eva_data, output_file)
+
+print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+# Plot cumulative time spent in space over years
+eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
+eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum()
+plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+
+print("--END--")
+
+

We have chosen to create functions for reading in and writing out data files since this is a very common task within research software. While these functions do not contain many lines of code, because the pandas built-in methods do all the complex data reading, converting and writing operations, it can be useful to package these steps together into reusable functions if you need to read in or write out a lot of similarly structured files and process them in the same way.

+

Use docstrings to document functions

+

Docstrings are a specific type of documentation that are provided +within functions and Python +classes. A function docstring should explain what that particular +code is doing, what parameters the function needs (inputs) and what form +they should take, what the function outputs (you may see words like +‘returns’ or ‘yields’ here), and errors (if any) that might be +raised.

+

Providing these docstrings helps improve code readability since it +makes the function code more transparent and aids understanding. +Particularly, docstrings that provide information on the input and +output of functions makes it easier to reuse them in other parts of the +code, without having to read the full function to understand what needs +to be provided and what will be returned.

+

Python docstrings are defined by enclosing the text with 3 double +quotes ("""). This text is also indented to the same level +as the code defined beneath it, which is 4 whitespaces by +convention.

+
+

Example of a single-line docstring

+
+

PYTHON +

+
def add(x, y):
+    """Add two numbers together"""
+    return x + y
+
+
+
+

Example of a multi-line docstring

+
+

PYTHON +

+
def divide(x, y):
+    """
+    Divide number x by number y.
+
+    Args:
+        x: A number to be divided.
+        y: A number to divide by.
+
+    Returns:
+        float: The division of x by y.
+        
+    Raises:
+        ZeroDivisionError: Cannot divide by zero.
+    """
+    return x / y
+
+

Some projects may have their own guidelines on how to write +docstrings, such as numpy. +If you are contributing code to a wider project or community, try to +follow the guidelines and standards they provide for code style.

+

As your code grows and becomes more complex, the docstrings can form +the content of a reference guide allowing developers to quickly look up +how to use the APIs, functions, and classes defined in your codebase. +Hence, it is common to find tools that will automatically extract +docstrings from your code and generate a website where people can learn +about your code without downloading/installing and reading the code +files - such as MkDocs.

+
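The docstring text is attached to the function object itself at runtime, via its __doc__ attribute, and this is also what Python’s built-in help() displays. For example:

```python
def add(x, y):
    """Add two numbers together"""
    return x + y

# The docstring is stored on the function object and can be
# inspected at runtime - this is what documentation tools read.
print(add.__doc__)
help(add)  # shows the same text as a formatted help page
```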

Let’s write a docstring for the function +read_json_to_dataframe we introduced in the previous +exercise using the Google +Style Python Docstrings Convention. Remember, questions we want to +answer when writing the docstring include:

+
  • What does the function do?
  • +
  • What kind of inputs does the function take? Are they required or +optional? Do they have default values?
  • +
  • What output will the function produce?
  • +
  • What exceptions/errors, if any, can it produce?
  • +

Our read_json_to_dataframe function fully described by a +docstring may look like:

+
+

PYTHON +

+
def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+
+ +
+
+

Writing docstrings

+
+

Write a docstring for the function +write_dataframe_to_csv we introduced earlier.

+
+
+
+
+
+ +
+
+

Our write_dataframe_to_csv function fully described by a +docstring may look like:

+
+

PYTHON +

+
def write_dataframe_to_csv(df, output_file):
+"""
+Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    # Save dataframe to CSV file for later analysis
+    df.to_csv(output_file, index=False)
+
+
+
+
+
+

Finally, our code may look something like the following:

+
+

PYTHON +

+

+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    # Save dataframe to CSV file for later analysis
+    df.to_csv(output_file, index=False)
+
+
+# Main code
+
+print("--START--")
+
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+# Read the data from JSON file
+eva_data = read_json_to_dataframe(input_file)
+
+# Convert and export data to CSV file
+write_dataframe_to_csv(eva_data, output_file)
+
+print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+# Plot cumulative time spent in space over years
+eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
+eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum()
+plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+
+print("--END--")
+
+

Do not forget to commit any uncommitted changes you may have and then +push your work to GitHub.

+
+

BASH +

+
(venv_spacewalks) $ git add <your_changed_files>
+(venv_spacewalks) $ git commit -m "Your commit message"
+(venv_spacewalks) $ git push origin main
+
+
+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  • Readable code is easier to understand, maintain, debug and extend +(reuse) - saving time and effort.
  • +
  • Choosing descriptive variable and function names will communicate +their purpose more effectively.
  • +
  • Using inline comments and docstrings to describe parts of the code +will help transmit understanding and context.
  • +
  • Use libraries or packages for common functionality to avoid +duplication.
  • +
  • Creating functions from the smallest, reusable units of code will make the code more readable, help compartmentalise which parts of the code are doing what, and isolate specific code sections for reuse.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/07-code-structure.html b/07-code-structure.html new file mode 100644 index 00000000..712cc605 --- /dev/null +++ b/07-code-structure.html @@ -0,0 +1,1120 @@ + +Tools and practices for FAIR research software: Code structure +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Code structure

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can we best structure code into reusable components with a single responsibility?
  • +
  • What is a common code structure (pattern) for creating software that can read input from the command line?
  • +
  • What are conventional places to store data, code, results, tests and +auxiliary information and metadata within our software or research +project?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Structure code that is modular and split into small, reusable +functions.
  • +
  • Use the common code pattern for creating software that can read input from the command line
  • +
  • Follow best practices in structuring code and organising +software/research project directories for improved readability, +accessibility and reproducibility.
  • +
+
+
+
+
+

In the previous episode we saw some tools and practices that can help us improve the readability of our code - including breaking our code into small, reusable functions that perform one specific task. We are going to explore a bit further how using common code structures can improve the readability, accessibility and reusability of our code, and will extend these practices to our (research or code) projects as a whole.

+

Before we move on with further code modifications, make sure your +virtual development environment is active.

+ +

Functions for Modular and Reusable Code

+

As we have already seen in the previous episode - functions play a +key role in creating modular and reusable code. We are going to carry on +improving our code following these principles:

+
  • Each function should have a single, clear responsibility. This makes +functions easier to understand, test, and reuse.
  • +
  • Functions should accept parameters to allow flexibility and +reusability in different contexts; avoid hard-coding values inside +functions/code (e.g. data files to read from/write to); instead, pass +them as arguments.
  • +
  • Split complex tasks into smaller, simpler functions that can be +composed; each function should handle a distinct part of a larger +task.
  • +
  • Write functions that can be easily combined or reused with other +functions to build more complex functionality.
  • +
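As a small illustrative sketch of the second principle above (the file name here is hypothetical), compare a hard-coded function with a parameterised one:

```python
# Hard-coded - this function can only ever read one specific file:
# def read_data():
#     with open('./some-fixed-file.json') as f:
#         return f.read()

# Parameterised - the same function now works for any file path
# passed in by the caller:
def read_data(path):
    with open(path) as f:
        return f.read()
```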

Bearing in mind the above principles, we can further simplify our code by extracting the code that processes and analyses our data and plots a graph into a separate function plot_cumulative_time_in_space, and then break that down further with code that converts the data column containing spacewalk durations as text into numbers we can perform arithmetic operations on, adding that numerical data as a new column in our dataset.

+

The main part of our code then becomes much simpler and more +readable, only containing the invocation of the following three +functions:

+
+

PYTHON +

+
eva_data = read_json_to_dataframe(input_file)
+write_dataframe_to_csv(eva_data, output_file)
+plot_cumulative_time_in_space(eva_data, graph_file)
+
+

Remember to add docstrings and comments to the new functions to +explain their functionalities.

+

Our new code may look like the following.

+
+

PYTHON +

+

+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+
+    Returns:
+        None
+    """
+    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+    df = add_duration_hours_variable(df)
+    df['cumulative_time'] = df['duration_hours'].cumsum()
+    plt.plot(df.date, df.cumulative_time, 'ko-')
+    plt.xlabel('Year')
+    plt.ylabel('Total time spent in space to date (hours)')
+    plt.tight_layout()
+    plt.savefig(graph_file)
+    plt.show()
+
+
+# Main code
+
+print("--START--")
+
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+eva_data = read_json_to_dataframe(input_file)
+
+write_dataframe_to_csv(eva_data, output_file)
+
+plot_cumulative_time_in_space(eva_data, graph_file)
+
+print("--END--")
+
+
+
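We can sanity-check the new text_to_duration helper by calling it with a known value - shown here as a self-contained snippet that repeats the function definition:

```python
def text_to_duration(duration):
    """Convert a text format duration "HH:MM" to duration in hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60

# 7 hours 21 minutes is 7 + 21/60 hours
print(text_to_duration("07:21"))  # 7.35
```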

Command-line interface to code

+

A common way to structure code is to provide a command-line interface, so that the input data file to read and the output file to write are passed to our script as parameters rather than hard-coded. This improves the interoperability and reusability of our code, as it can now be run from the command-line terminal and integrated into other scripts or workflows/pipelines. For example, another script can produce our input data and be “chained” with our code in a more complex data analysis pipeline, or we can invoke our script in a loop to quickly analyse a number of input data files from a directory.

+

There is a common code structure (pattern) for this:

+
+

PYTHON +

+
# import modules
+
+def main(args):
+    # perform some actions
+
+if __name__ == "__main__":
+    # perform some actions before main()
+    main(args)
+
+

In this pattern the actions performed by the script are contained within the main function (which does not need to be called main, but using this convention helps others understand your code). The main function is then called within the if __name__ == "__main__": statement, after some other actions have been performed (usually the parsing of command-line arguments, which will be explained below). __name__ is a special variable which is set by the Python interpreter before the execution of any code in the source file. The value the interpreter gives to __name__ is determined by the manner in which the script is loaded.

+

If we run the source file directly using the Python interpreter, +e.g.:

+
+

BASH +

+
$ python3 eva_data_analysis.py
+
+

then the interpreter will assign the hard-coded string +"__main__" to the __name__ variable:

+
+

PYTHON +

+
__name__ = "__main__"
+...
+# rest of your code
+
+

However, if your source file is imported by another Python script, +e.g:

+
+

PYTHON +

+
import eva_data_analysis
+
+

then the Python interpreter will assign the name “eva_data_analysis” +from the import statement to the __name__ variable:

+
+

PYTHON +

+
__name__ = "eva_data_analysis"
+...
+# rest of your code
+
+

Because of this behaviour of the Python interpreter, we can put any +code that should only be executed when running the script directly +within the if __name__ == "__main__": structure, allowing +the rest of the code within the script to be safely imported by another +script if we so wish.

+

While it may not seem very useful to have your script importable by +another script, there are a number of situations in which you would want +to do this:

+
  • for testing of your code, you can have your testing framework import +the main script, and run special test functions which then call the +main function directly;
  • +
  • where you want to not only be able to run your script from the +command-line, but also provide a programmer-friendly application +programming interface (API) for advanced users.
  • +
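A minimal self-contained sketch of this layout (using a toy function, not our Spacewalks code) looks like the following - the guarded branch runs only when the file is executed directly, so importing the file has no side effects:

```python
def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

def main():
    # actions that should only happen when run as a script
    print(greet("world"))

if __name__ == "__main__":
    # this branch is skipped when the file is imported by another
    # script (e.g. a test module), so importing has no side effects
    main()
```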

We will use the sys library to read the command-line arguments passed to our script and make them available in our code as a list - remember to import this library first.

+
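As a brief illustration, sys.argv is simply a list of strings: the script name first, then each argument as typed on the command line. Running python eva_data_analysis.py eva-data.json eva-data.csv would give sys.argv == ['eva_data_analysis.py', 'eva-data.json', 'eva-data.csv']:

```python
import sys

# sys.argv[0] is the script name; any remaining items are the
# command-line arguments, always provided as strings
print(sys.argv)
```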

Our modified code will now look as follows.

+
+

PYTHON +

+
import matplotlib.pyplot as plt
+import pandas as pd
+import sys
+
+def main(input_file, output_file, graph_file):
+    print("--START--")
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    write_dataframe_to_csv(eva_data, output_file)
+
+    plot_cumulative_time_in_space(eva_data, graph_file)
+
+    print("--END--")
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+
+    Returns:
+        None
+    """
+    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+    df = add_duration_hours_variable(df)
+    df['cumulative_time'] = df['duration_hours'].cumsum()
+    plt.plot(df.date, df.cumulative_time, 'ko-')
+    plt.xlabel('Year')
+    plt.ylabel('Total time spent in space to date (hours)')
+    plt.tight_layout()
+    plt.savefig(graph_file)
+    plt.show()
+
+
+if __name__ == "__main__":
+
+    if len(sys.argv) < 3:
+        input_file = './eva-data.json'
+        output_file = './eva-data.csv'
+        print('Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = './cumulative_eva_graph.png'
+    main(input_file, output_file, graph_file)
+
+

We can now run our script from the command line passing the JSON +input data file and CSV output data file as:

+
+

BASH +

+
(venv_spacewalks) $ python eva_data_analysis.py eva_data.json eva_data.csv
+
+
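Because the input and output file names are now parameters, we can also invoke the script in a shell loop to analyse a whole directory of input files (the data/ directory name here is illustrative):

```shell
# analyse every JSON file in data/, writing a matching CSV for each
for f in data/*.json; do
    [ -e "$f" ] || continue   # skip if no files match the pattern
    python eva_data_analysis.py "$f" "${f%.json}.csv"
done
```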

Remember to commit our changes.

+
+

BASH +

+
(venv_spacewalks) $ git status
+(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Add command line functionality to script"
+
+

Directory structure for software projects

+

One of the steps to make your work more easily readable and +reproducible is to organise your software projects following certain +conventions on consistent and informative directory structure. This way, +people will immediately know where to find things within your project. +Here are some general guidelines that apply to all research projects +(including software projects):

+
  • Put all files related to a project into a single directory
  • +
  • Do not mix project files - different projects should have separate +directories and repositories (it is OK to copy files into multiple +places if both projects require them)
  • +
  • Avoid spaces in directory and file names – they can cause errors +when read by computers
  • +
  • If you have sensitive data - you can put it in a private repository +on GitHub
  • +
  • Use .gitignore to specify what files should not be tracked - +e.g. passwords, local configuration, etc.
  • +
  • Add a README file to your repository to describe the project and give instructions on running the code or reproducing the results (we will cover this later in this course).
  • +
+

OUTPUT +

+
project_name/
+├── README.md             # overview of the project
+├── data/                 # data files used in the project
+│   ├── README.md         # describes where data came from
+│   └── sub-folder/       # may contain subdirectories, e.g. for intermediate files from the analysis
+├── manuscript/           # manuscript describing the results
+├── results/              # results of the analysis (data, tables, figures)
+├── src/                  # source code for the project
+│   └── ...
+├── doc/                  # documentation for the project
+│   └── index.rst
+├── LICENSE               # license for your code
+├── requirements.txt      # software requirements and dependencies
+├── main_script.py        # main script/code entry point
+└── ...
+
+
  • Source code is typically placed in the src/ or +source/ directory (and its subdirectories containing +hierarchical libraries of your code). The main script or the main entry +to your code may remain in the project root.
  • +
  • Data is typically placed under data/ +
    • Raw data or input files can also be placed under +raw_data/ - original data should not be modified and should +be kept raw
    • +
    • Processed or cleaned data or intermediate results from data analysis +can be placed under processed_data/ +
    • +
  • +
  • Documentation is typically placed or compiled into doc/ +or docs/ +
  • +
  • Results are typically placed under results/
  • +
  • Tests are typically placed under tests/
  • +
+
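As a quick sketch, the conventional layout above could also be created programmatically. The folder names below follow the conventions just described; `create_project_layout` is a hypothetical helper for illustration, not part of the Spacewalks project:

```python
from pathlib import Path

def create_project_layout(root):
    """Create a conventional research project layout under `root`.

    Hypothetical helper: folder names follow the conventions described
    above (data/, results/, src/, doc/, tests/ plus a top-level README).
    """
    root = Path(root)
    for sub in ["data", "results", "src", "doc", "tests"]:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Create an empty README at the project root
    (root / "README.md").touch()
    return sorted(p.name for p in root.iterdir())

print(create_project_layout("project_name"))
```

Running this once at the start of a project gives every collaborator the same predictable structure.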
+ +
+
+

Challenge

+
+

Refactor your software project so that input data is stored in the +data/ directory and results (the graph and CSV data files) are +saved in the results/ directory.

+
+
+
+
+
+ +
+
+
+

PYTHON +

+
import matplotlib.pyplot as plt
+import pandas as pd
+import sys
+
+# https://data.nasa.gov/resource/eva.json (with modifications)
+
+def main(input_file, output_file, graph_file):
+    print("--START--")
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    write_dataframe_to_csv(eva_data, output_file)
+
+    plot_cumulative_time_in_space(eva_data, graph_file)
+
+    print("--END--")
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+
+    Returns:
+        None
+    """
+    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+    df = add_duration_hours_variable(df)
+    df['cumulative_time'] = df['duration_hours'].cumsum()
+    plt.plot(df.date, df.cumulative_time, 'ko-')
+    plt.xlabel('Year')
+    plt.ylabel('Total time spent in space to date (hours)')
+    plt.tight_layout()
+    plt.savefig(graph_file)
+    plt.show()
+
+
+if __name__ == "__main__":
+
+    if len(sys.argv) < 3:
+        input_file = 'data/eva-data.json'
+        output_file = 'results/eva-data.csv'
+        print(f'Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = 'results/cumulative_eva_graph.png'
+    main(input_file, output_file, graph_file)
+
+
+

BASH +

+
(venv_spacewalks) $ git status
+(venv_spacewalks) $ git add eva_data_analysis.py data results
+(venv_spacewalks) $ git commit -m "Update project's directory structure"
+
+
+
+
+
+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  • Good practices for code and project structure are essential for +creating readable, accessible and reproducible projects.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/08-code-correctness.html b/08-code-correctness.html new file mode 100644 index 00000000..0225ba5a --- /dev/null +++ b/08-code-correctness.html @@ -0,0 +1,2103 @@ + +Tools and practices for FAIR research software: Code correctness +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Code correctness

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can we verify that our code is correct?
  • +
  • How can we automate our software tests?
  • +
  • What makes a “good” test?
  • +
  • Which parts of our code should we prioritize for testing?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Explain why code testing is important and how this supports FAIR +software.
  • +
  • Describe the different types of software tests (unit tests, +integration tests, regression tests).
  • +
  • Implement unit tests to verify that function(s) behave as expected +using the Python testing framework pytest.
  • +
  • Interpret the output from pytest to identify which +function(s) are not behaving as expected.
  • +
  • Write tests using typical values, edge cases and invalid inputs to +ensure that the code can handle extreme values and invalid inputs +appropriately.
  • +
  • Evaluate code coverage to identify how much of the codebase is being +tested and identify areas that need further tests.
  • +
+
+
+
+
+

Now that we have improved the structure and readability of our code - +it is much easier to test its functionality and improve it further. The +goal of software testing is to check that the actual results produced by +a piece of code meet our expectations, i.e. are correct.

+

Before we move on with further code modifications, make sure your +virtual development environment is active.

+ +

Why use software testing?

+

Adopting software testing as part of our research workflow helps us +to conduct better research and produce FAIR +software:

+
  • Software testing can help us be more productive as it helps us to +identify and fix problems with our code early and quickly and allows us +to demonstrate to ourselves and others that our code does what we claim. +More importantly, we can share our tests alongside our code, allowing +others to verify our software for themselves.
  • +
  • The act of writing tests encourages us to structure our code as +individual functions and often results in a more +readable, modular and maintainable codebase that is +easier to extend or repurpose.
  • +
  • Software testing improves the accessibility and +reusability of our code - well-written software tests +capture the expected behaviour of our code and can be used alongside +documentation to help other developers quickly make sense of our code. +In addition, a well tested codebase allows developers to experiment with +new features safe in the knowledge that tests will reveal if their +changes have broken any existing functionality.
  • +
  • Software testing underpins the FAIR process by giving us the +confidence to engage in open research practices - if we are not sure +that our code works as intended and produces accurate results, we are +unlikely to feel confident about sharing our code with others. Software +testing brings peace of mind by providing a step-by-step approach that +we can apply to verify that our code is correct.
  • +

Types of software tests

+

There are many different types of software testing.

+
  • Unit tests focus on testing individual functions +in isolation. They ensure that each small part of the software performs +as intended. By verifying the correctness of these individual units, we +can catch errors early in the development process.

  • +
  • Integration tests check how different parts of +the code e.g. functions work together.

  • +
  • Regression tests are used to ensure that new +changes or updates to the codebase do not adversely affect the existing +functionality. They involve checking whether a program or part of a +program still generates the same results after changes have been +made.

  • +
  • End-to-end tests are a special type of +integration testing which checks that a program as a whole behaves as +expected.

  • +

In this course, our primary focus will be on unit testing. However, +the concepts and techniques we cover will provide a solid foundation +applicable to other types of testing.

+
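The distinction between a unit test and an integration test can be sketched with two hypothetical helper functions (`clean` and `word_count` below are for illustration only and are not part of the Spacewalks code):

```python
# Hypothetical helpers to illustrate the difference between test types
def clean(text):
    """Normalise a string: strip surrounding whitespace and lowercase it."""
    return text.strip().lower()

def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())

# Unit test: checks one function in isolation
def test_clean():
    assert clean("  Hello ") == "hello"

# Integration test: checks that the two functions work together
def test_clean_then_count():
    assert word_count(clean("  Hello world  ")) == 2

test_clean()
test_clean_then_count()
print("all checks passed")
```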
+
+ +
+
+

Types of software tests

+
+

Fill in the blanks in the sentences below:

+
  • __________ tests compare the ______ output of a program to its +________ output to demonstrate correctness.
  • +
  • Unit tests compare the actual output of a ______ ________ to the +expected output to demonstrate correctness.
  • +
  • __________ tests check that results have not changed since the +previous test run.
  • +
  • __________ tests check that two or more parts of a program are +working together correctly.
  • +
+
+
+
+
+ +
+
+
  • End-to-end tests compare the actual output of a program to the +expected output to demonstrate correctness.
  • +
  • Unit tests compare the actual output of a single function to the +expected output to demonstrate correctness.
  • +
  • Regression tests check that results have not changed since the +previous test run.
  • +
  • Integration tests check that two or more parts of a program are +working together correctly.
  • +
+
+
+
+

Informal testing

+

How should we test our code? Let’s start by +considering the following scenario. A collaborator on our project has +sent us the following code to add a crew_size variable to +our data frame - a column which captures the number of astronauts +participating in a given spacewalk. How do we know that it works as +intended?

+
+

PYTHON +

+
import re
+import pandas as pd
+
+def calculate_crew_size(crew):
+    """
+    Calculate crew_size for a single crew entry
+
+    Args:
+        crew (str): The text entry in the crew column
+
+    Returns:
+        int: The crew size
+    """
+    if crew.split() == []:
+        return None
+    else:
+        return len(re.split(r';', crew))-1
+
+
+def add_crew_size_variable(df_):
+    """
+    Add crew size (crew_size) variable to the dataset
+
+    Args:
+        df_ (pd.DataFrame): The input data frame.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df_ with the new crew_size variable added
+    """
+    print('Adding crew size variable (crew_size) to dataset')
+    df_copy = df_.copy()
+    df_copy["crew_size"] = df_copy["crew"].apply(
+        calculate_crew_size
+    )
+    return df_copy
+    
+
+

One approach is to copy/paste the function(s) into a python +interpreter and check that they behave as expected with some input +values where we know what the correct return value should be.

+

Since add_crew_size_variable contains boilerplate code +for deriving one column from another, let’s start with +calculate_crew_size:

+
+

PYTHON +

+
calculate_crew_size("Valentina Tereshkova;")
+calculate_crew_size("Judith Resnik; Sally Ride;")
+
+
+

OUTPUT +

+
1
+2
+
+

We can then explore the behaviour of +add_crew_size_variable by creating a toy data frame:

+
+

PYTHON +

+
# Create a toy DataFrame
+data = pd.DataFrame({
+    'crew': ['Anna Lee Fisher;', 'Marsha Ivins; Helen Sharman;']
+})
+
+add_crew_size_variable(data)
+
+
+

OUTPUT +

+
Adding crew size variable (crew_size) to dataset
+                           crew  crew_size
+0              Anna Lee Fisher;          1
+1  Marsha Ivins; Helen Sharman;          2
+
+

Although this is an important process to go through as we draft our +code for the first time, there are some serious drawbacks to this +approach if used as our only form of testing.

+
+
+ +
+
+

What are the limitations of informally testing code? (5 minutes)

+
+

Think about the questions below. Your instructors may ask you to +share your answers in a shared notes document and/or discuss them with +other participants.

+
  • Why might we choose to test our code informally?
  • +
  • What are the limitations of relying solely on informal tests to +verify that a piece of code is behaving as expected?
  • +
+
+
+
+
+ +
+
+

It can be tempting to test our code informally because this +approach:

+
  • is quick and easy
  • +
  • provides immediate feedback
  • +

However, there are limitations to this approach:

+
  • Working interactively is error prone
  • +
  • We must repeat our tests every time we update our code; this is time +consuming
  • +
  • We must rely on memory to keep track of how we have tested our code +e.g. what input values we tried
  • +
  • We must rely on memory to keep track of which functions have been +tested and which have not
  • +
+
+
+
+

Formal testing

+

We can overcome some of these limitations by formalising our testing +process. A formal approach to testing our function(s) is to write +dedicated test functions to check our code. These test functions:

+
  • Run the function we want to test - the target function with known +inputs
  • +
  • Compare the output to known, valid results
  • +
  • Raise an error if the function’s actual output does not match the +expected output
  • +
  • Are recorded in a test script that can be re-run on demand.
  • +

Let’s explore this process by writing some formal tests for our +text_to_duration function. (We’ll come back to our +colleague’s calculate_crew_size function later).

+

The text_to_duration function converts a duration stored +as a string (HH:MM) to a duration in hours e.g. duration “1:15” should +return a value of 1.25.

+
+

PYTHON +

+
def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        float: The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+

Let’s create a new python file test_code.py in the root +of our project folder to store our tests.

+
+

BASH +

+
(venv_spacewalks) $ cd spacewalks
+(venv_spacewalks) $ touch test_code.py
+
+

First, we import text_to_duration into our test script. Then, we +add our first test function:

+
+

PYTHON +

+

+from eva_data_analysis import text_to_duration
+
+def test_text_to_duration_integer():
+    test_result = text_to_duration("10:00") == 10
+    print(f"text_to_duration('10:00') == 10? {test_result}")
+
+test_text_to_duration_integer()
+
+

This test checks that when we apply text_to_duration to input value +“10:00”, the output matches the expected value of 10.

+

In this example, we use a print statement to report whether the +actual output from text_to_duration meets our expectations.

+

However, this does not meet our requirement to “Raise an error if the +function’s output does not match the expected output” and means that we +must carefully read our test function’s output to identify whether it +has failed.

+

To ensure that our code raises an error if the function’s output does +not match the expected output, we can use an assert statement.

+

The assert statement in Python checks whether a condition is True or +False. If the condition is True, then assert does not return a value, but +if the condition is False, then assert raises an AssertionError.

+
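To see this behaviour in isolation, try the following in a Python interpreter (a standalone sketch, independent of the Spacewalks code):

```python
# A True condition: assert passes silently and execution continues
assert 1 + 1 == 2

# A False condition: assert raises AssertionError; here we catch it so
# we can print the optional message attached to the assertion
try:
    assert 1 + 1 == 3, "1 + 1 should equal 2"
except AssertionError as error:
    print(error)  # prints: 1 + 1 should equal 2
```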

Let’s rewrite our test with an assert statement:

+
+

PYTHON +

+

+from eva_data_analysis import text_to_duration
+
+def test_text_to_duration_integer():
+    assert text_to_duration("10:00") == 10
+
+test_text_to_duration_integer()
+
+

Notice that when we run test_text_to_duration_integer(), nothing +happens - there is no output. That is because our function is working +correctly and returning the expected value of 10.

+

Let’s see what happens when we deliberately introduce a bug into +text_to_duration: In the Spacewalks data analysis script +let’s change int(hours) to int(hours)/60 and +int(minutes)/60 to int(minutes) to mimic a +simple mistake in our code where the wrong element is divided by 60.

+
+

PYTHON +

+
def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours)/60 + int(minutes) # Divide the wrong element by 60
+    return duration_hours
+
+

Notice that this time, our test fails noisily. Our assert statement +has raised an AssertionError - a clear signal that there is +a problem in our code that we need to fix.

+
+

PYTHON +

+
test_text_to_duration_integer()
+
+
+

ERROR +

+
Traceback (most recent call last):
+  File "/Users/AnnResearchers/Desktop/Spacewalks/test_code.py", line 7, in <module>
+    test_text_to_duration_integer()
+  File "/Users/AnnResearchers/Desktop/Spacewalks/test_code.py", line 5, in test_text_to_duration_integer
+    assert text_to_duration("10:00") == 10
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+AssertionError
+
+

What happens if we add another test to our test script? This time +we’ll check that our function can handle durations with a non-zero +minute component. Notice that this time our expected value is a +floating-point number. Importantly, we cannot use a simple double equals +sign (==) to compare the equality of floating-point +numbers. Floating-point arithmetic can introduce very small differences +due to how computers represent these numbers internally - as a result, +we check that our floating point numbers are equal within a very small +tolerance (1e-5).

+
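A quick standalone illustration of why `==` is unreliable for floating-point results:

```python
# 0.1 and 0.2 have no exact binary representation, so their sum is not
# exactly 0.3
total = 0.1 + 0.2
print(total == 0.3)             # False
print(abs(total - 0.3) < 1e-5)  # True: equal within a small tolerance
```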
+

PYTHON +

+
from eva_data_analysis import text_to_duration
+
+def test_text_to_duration_integer():
+    assert text_to_duration("10:00") == 10
+    
+def test_text_to_duration_float():
+    assert abs(text_to_duration("10:20") - 10.33333333) < 1e-5
+
+test_text_to_duration_integer()
+test_text_to_duration_float()
+
+
+

OUTPUT +

+
Traceback (most recent call last):
+  File "/Users/AnnResearcher/Desktop/Spacewalks/test_code.py", line 9, in <module>
+    test_text_to_duration_integer()
+  File "/Users/AnnResearcher/Desktop/Spacewalks/test_code.py", line 4, in test_text_to_duration_integer
+    assert text_to_duration("10:00") == 10
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+AssertionError
+
+

What happens when we run our updated test script? Our script stops +after the first test failure and the second test is not run. To run our +remaining tests we would have to manually comment out our failing test +and re-run the test script. As our code base and tests grow, this will +become cumbersome. This is not ideal and can be overcome by automating +our tests using a testing framework.

+

Using a testing framework

+

Our approach so far has had two major limitations:

+
  • We had to carefully examine the output of our test script to work +out if our test failed.
  • +
  • Our test script only ran our tests up to the first test +failure.
  • +

We can do better than this! Testing frameworks can automatically find +all the tests in our code base, run all of them and present the test +results as a readable summary.

+

We will use the python testing framework pytest with its code +coverage plugin pytest-cov. To install these libraries, open a terminal +and type:

+
+

BASH +

+
(venv_spacewalks) $ python -m pip install pytest pytest-cov
+
+

Make sure to also capture the changes to our virtual development +environment.

+
+

BASH +

+
(venv_spacewalks) $ python -m pip freeze > requirements.txt
+(venv_spacewalks) $ git add requirements.txt
+(venv_spacewalks) $ git commit -m "Added pytest and pytest-cov libraries."
+(venv_spacewalks) $ git push origin main
+
+

Let’s make sure that our tests are ready to work with pytest.

+
  • +

    pytest automatically discovers tests based on specific naming +patterns. pytest looks for files whose names start with test_ or +end with _test.py. Then, within these files, pytest looks +for functions that start with test_.
    +Our test file already meets these requirements, so there is nothing to +do here. However, our script does contain lines to run each of our test +functions. These are no longer required as pytest will run our tests, so +we will remove them:

    +
    +

    PYTHON +

    +
    # Delete
    +test_text_to_duration_integer()
    +test_text_to_duration_float()
    +
    +
  • +
  • It is also conventional when working with a testing framework to +place test files in a tests directory at the root of our project and to +name each test file after the code file that it targets. This helps in +maintaining a clean structure and makes it easier for others to +understand where the tests are located.

  • +

A set of tests for a given piece of code is called a test suite. Our +test suite is currently located in the root folder of our project. Let’s +move it to a dedicated test folder and rename our test_code.py file to +test_eva_data_analysis.py.

+
+

BASH +

+
(venv_spacewalks) $ mkdir tests
+(venv_spacewalks) $ mv test_code.py tests/test_eva_data_analysis.py
+
+

Before we re-run our tests using pytest, let’s update our second +test to use pytest’s approx function, which is specifically +intended for comparing floating point numbers within a tolerance.

+
+

PYTHON +

+
import pytest
+from eva_data_analysis import text_to_duration
+
+def test_text_to_duration_integer():
+    assert text_to_duration("10:00") == 10
+    
+def test_text_to_duration_float():
+    assert text_to_duration("10:20") == pytest.approx(10.33333333)
+
+

Let’s also add docstrings to clarify what each test is +doing and expand our syntax to highlight the logic behind our +approach:

+
+

PYTHON +

+
import pytest
+from eva_data_analysis import text_to_duration
+
+def test_text_to_duration_integer():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical whole hour durations 
+    """
+    actual_result =  text_to_duration("10:00")
+    expected_result = 10
+    assert actual_result == expected_result
+    
+def test_text_to_duration_float():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical durations with a non-zero minute component
+    """
+    actual_result = text_to_duration("10:20") 
+    expected_result = 10.33333333
+    assert actual_result == pytest.approx(expected_result)
+
+

Writing our tests this way highlights the key idea that each test +should compare the actual results returned by our function with expected +values.

+

Similarly, writing docstrings for our tests that complete the +sentence “Test that …” helps us to understand what each test is doing +and why it is needed.

+

Finally, let’s also modify our bug to something that will affect +durations with a non-zero minute component like “10:20” but not those +that are whole hours e.g. “10:00”.

+

Let’s change int(hours)/60 + int(minutes) to +int(hours) + int(minutes)/6 - a simple typo.

+
+

PYTHON +

+
def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/6 # Divide by 6 instead of 60
+    return duration_hours
+
+

Finally, let’s run our tests:

+
+

BASH +

+
(venv_spacewalks) $ python -m pytest 
+
+
+

OUTPUT +

+
========================================================== test session starts
+platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
+rootdir: /Users/AnnResearcher/Desktop/Spacewalks
+plugins: cov-5.0.0
+collected 2 items
+
+tests/test_eva_data_analysis.py .F                                                                                                 [100%]
+
+================================================================ FAILURES
+______________________________________________________ test_text_to_duration_float
+
+    def test_text_to_duration_float():
+        """
+        Test that text_to_duration returns expected ground truth values
+        for typical durations with a non-zero minute component
+        """
+        actual_result = text_to_duration("10:20")
+        expected_result = 10.33333333
+>       assert actual_result == pytest.approx(expected_result)
+E       assert 13.333333333333334 == 10.33333333 ± 1.0e-05
+E
+E         comparison failed
+E         Obtained: 13.333333333333334
+E         Expected: 10.33333333 ± 1.0e-05
+
+tests/test_eva_data_analysis.py:23: AssertionError
+======================================================== short test summary info
+FAILED tests/test_eva_data_analysis.py::test_text_to_duration_float - assert 13.333333333333334 == 10.33333333 ± 1.0e-05
+====================================================== 1 failed, 1 passed in 0.32s
+
+
  • Notice how if the test function finishes without an assertion +failing, the test is considered successful and is marked with a dot +(‘.’).
  • +
  • If an assertion fails or an error occurs, the test is marked as a +failure with an ‘F’, and the output includes details about the error to +help identify what went wrong.
  • +
+
+ +
+
+

Interpreting pytest output

+
+

A colleague has asked you to conduct a pre-publication review of +their code Spacetravel which analyses time spent in space +by various individual astronauts.

+

Inspect the pytest output provided and answer the questions +below.

+
+

pytest output for Spacetravel +

+
+

OUTPUT +

+
============================================================ test session starts
+platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
+rootdir: /Users/Desktop/AnneResearcher/projects/Spacetravel
+collected 9 items
+
+tests/test_analyse.py FF....                                              [ 66%]
+tests/test_prepare.py s..                                                 [100%]
+
+====================================================================== FAILURES
+____________________________________________________________ test_total_duration
+
+    def test_total_duration():
+
+      durations = [10, 15, 20, 5]
+      expected  = 50/60
+      actual  = calculate_total_duration(durations)
+>     assert actual == pytest.approx(expected)
+E     assert 8.333333333333334 == 0.8333333333333334 ± 8.3e-07
+E
+E       comparison failed
+E       Obtained: 8.333333333333334
+E       Expected: 0.8333333333333334 ± 8.3e-07
+
+tests/test_analyse.py:9: AssertionError
+______________________________________________________________________________ test_mean_duration
+
+    def test_mean_duration():
+       durations = [10, 15, 20, 5]
+
+       expected = 12.5/60
+>      actual  = calculate_mean_duration(durations)
+
+tests/test_analyse.py:15:
+_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
+
+durations = [10, 15, 20, 5]
+
+    def calculate_mean_duration(durations):
+        """
+        Calculate the mean of a list of durations.
+        """
+        total_duration = sum(durations)/60
+>       mean_duration = total_duration / length(durations)
+E       NameError: name 'length' is not defined
+
+Spacetravel.py:45: NameError
+=========================================================================== short test summary info
+FAILED tests/test_analyse.py::test_total_duration - assert 8.333333333333334 == 0.8333333333333334 ± 8.3e-07
+FAILED tests/test_analyse.py::test_mean_duration - NameError: name 'length' is not defined
+============================================================== 2 failed, 6 passed, 1 skipped in 0.02s 
+
+
  1. How many tests has our colleague included in the test suite?
  2. +
  3. The first test in test_prepare.py has a status of s; what does this +mean?
  4. +
  5. How many tests failed?
  6. +
  7. Why did “test_total_duration” fail?
  8. +
  9. Why did “test_mean_duration” fail?
  10. +
+
+
+
+
+
+ +
+
+
  1. 9 tests were detected in the test suite
  2. +
  3. s - stands for “skipped”,
  4. +
  5. 2 tests failed: the first and second tests in test file +test_analyse.py +
  6. +
  7. +test_total_duration failed because the calculated total +duration differs from the expected value by a factor of 10 i.e. the +assertion actual == pytest.approx(expected) evaluated to +False +
  8. +
  9. +test_mean_duration failed because there is a syntax +error in calculate_mean_duration. Our colleague has used +the command length (not a python command) instead of +len. As a result, running the function returns a +NameError rather than a calculated value and the test +assertion evaluates to False.
  10. +
+
+
+
+

Test Suite Design

+

Now that we have tooling in place to automatically run our test +suite. What makes a good test suite?

+
+

Good Tests

+

We should aim to test that our function behaves as expected with the +full range of inputs that it might encounter. It is helpful to consider +each argument of a function in turn and identify the range of typical +values it can take. Once we have identified this typical range or ranges +(where a function takes more than one argument), we should:

+
  • Test at least one interior point
  • +
  • Test all values at the edge of the range
  • +
  • Test invalid values
  • +
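As a sketch of this strategy, consider a hypothetical `clamp` helper (not part of the Spacewalks code) that restricts a value to the range 0 to 10. Its tests should probe an interior point, both edges of the range, and invalid input:

```python
def clamp(x, lo=0, hi=10):
    """Hypothetical helper: restrict x to the range [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    return max(lo, min(x, hi))

def test_clamp_interior():
    assert clamp(5) == 5          # interior point of the valid range

def test_clamp_edges():
    assert clamp(0) == 0          # lower edge of the range
    assert clamp(10) == 10        # upper edge of the range

def test_clamp_invalid_range():
    # an invalid argument combination should raise an error
    try:
        clamp(1, lo=5, hi=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for lo > hi")

test_clamp_interior()
test_clamp_edges()
test_clamp_invalid_range()
print("all clamp tests passed")
```

Out-of-range inputs like `clamp(-3)` or `clamp(99)` are clipped back to the nearest edge, so they also make useful test values.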

Let’s revisit the crew_size functions from our +colleague’s codebase. First let’s add the additional functions to +eva_data_analysis.py:

+
+

PYTHON +

+
import pandas as pd
+import matplotlib.pyplot as plt
+import sys
+import re
+
+...
+
+def calculate_crew_size(crew):
+    """
+    Calculate crew_size for a single crew entry
+
+    Args:
+        crew (str): The text entry in the crew column
+
+    Returns:
+        int: The crew size
+    """
+    if crew.split() == []:
+        return None
+    else:
+        return len(re.split(r';', crew))-1
+
+
+def add_crew_size_variable(df_):
+    """
+    Add crew size (crew_size) variable to the dataset
+
+    Args:
+        df_ (pd.DataFrame): The input data frame.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df_ with the new crew_size variable added
+    """
+    print('Adding crew size variable (crew_size) to dataset')
+    df_copy = df_.copy()
+    df_copy["crew_size"] = df_copy["crew"].apply(
+        calculate_crew_size
+    )
+    return df_copy
+
+if __name__ == '__main__':
+
+    if len(sys.argv) < 3:
+        input_file = './eva-data.json'
+        output_file = './eva-data.csv'
+        print(f'Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = './cumulative_eva_graph.png'
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    eva_data_prepared = add_crew_size_variable(eva_data)  # Add this line
+
+    write_dataframe_to_csv(eva_data_prepared, output_file)  # Modify this line
+
+    plot_cumulative_time_in_space(eva_data_prepared, graph_file) # Modify this line
+
+    print("--END--")
+
+

Now, let’s write some tests for calculate_crew_size.

+
+
+ +
+
+

Unit Tests for calculate_crew_size

+
+

Implement unit tests for the calculate_crew_size +function. Cover typical cases and edge cases.

+

Hint: use the following template:

+
def test_MYFUNCTION (): # FIXME
+    """
+    Test that ...   #FIXME
+    """
+
+    # Typical value 1
+    actual_result =  _______________ #FIXME
+    expected_result = ______________ #FIXME
+    assert actual_result == expected_result
+
+    # Typical value 2
+    actual_result =  _______________ #FIXME
+    expected_result = ______________ #FIXME
+    assert actual_result == expected_result
+    
+
+
+
+
+
+ +
+
+
+

PYTHON +

+
import pytest
+from eva_data_analysis import (
+    text_to_duration,
+    calculate_crew_size
+)
+
+def test_text_to_duration_integer():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical whole hour durations
+    """
+    actual_result =  text_to_duration("10:00")
+    expected_result = 10
+    assert actual_result == expected_result
+
+def test_text_to_duration_float():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical durations with a non-zero minute component
+    """
+    actual_result = text_to_duration("10:20")
+    expected_result = 10.33333333
+    assert actual_result == pytest.approx(expected_result)
+
+def test_calculate_crew_size():
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for typical crew values
+    """
+    actual_result = calculate_crew_size("Valentina Tereshkova;")
+    expected_result = 1
+    assert actual_result == expected_result
+
+    actual_result = calculate_crew_size("Judith Resnik; Sally Ride;")
+    expected_result = 2
+    assert actual_result == expected_result
+
+
+# Edge cases
+def test_calculate_crew_size_edge_cases():
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for edge case where crew is an empty string
+    """
+    actual_result = calculate_crew_size("")
+    assert actual_result is None
+
+
+

OUTPUT +

+
========================================================== test session starts
+platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
+rootdir: /Users/AnnResearcher/Desktop/Spacewalks
+plugins: cov-5.0.0
+collected 4 items
+
+tests/test_eva_data_analysis.py .F..                                                                                               [100%]
+
+================================================================ FAILURES
+______________________________________________________ test_text_to_duration_float
+
+    def test_text_to_duration_float():
+        """
+        Test that text_to_duration returns expected ground truth values
+        for typical durations with a non-zero minute component
+        """
+        actual_result = text_to_duration("10:20")
+        expected_result = 10.33333333
+>       assert actual_result == pytest.approx(expected_result)
+E       assert 13.333333333333334 == 10.33333333 ± 1.0e-05
+E
+E         comparison failed
+E         Obtained: 13.333333333333334
+E         Expected: 10.33333333 ± 1.0e-05
+
+tests/test_eva_data_analysis.py:23: AssertionError
+======================================================== short test summary info
+FAILED tests/test_eva_data_analysis.py::test_text_to_duration_float - assert 13.333333333333334 == 10.33333333 ± 1.0e-05
+====================================================== 1 failed, 3 passed in 0.33s 
+
+
+
+
+
+
+
+ +
+
+

Parameterising Tests

+
+

If we revisit our test suite, we can see that some of our tests do not follow the “Don’t Repeat Yourself” (DRY) principle, which prevents software - including testing code - from becoming overly long and confusing. For example, if we examine our test for calculate_crew_size, we can see that a small block of code is repeated twice with different input values:

+
+

PYTHON +

+
def test_calculate_crew_size():
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for typical crew values
+    """
+    actual_result = calculate_crew_size("Valentina Tereshkova;")
+    expected_result = 1
+    assert actual_result == expected_result
+
+    actual_result = calculate_crew_size("Judith Resnik; Sally Ride;")
+    expected_result = 2
+    assert actual_result == expected_result
+
+

Where the repeated code block is:

+
+

PYTHON +

+
actual_result = calculate_crew_size(input_value)
+expected_result = expected_value
+assert actual_result == expected_result
+
+

To avoid repeating ourselves, we can use an approach called test +parameterisation. This allows us to apply our test function to a list of +input / expected output pairs without the need for repetition. To +parameterise the calculate_crew_size test, we rewrite the +test function as follows:

+
+

PYTHON +

+
import pytest
+
+@pytest.mark.parametrize("input_value, expected_result", [
+    ("Valentina Tereshkova;", 1),
+    ("Judith Resnik; Sally Ride;", 2),
+])
+def test_calculate_crew_size(input_value, expected_result):
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for typical crew values
+    """
+    actual_result = calculate_crew_size(input_value)
+    assert actual_result == expected_result
+
+

Notice the following key changes to our code:

+
  • Our unparameterised test function did not have any arguments (test_calculate_crew_size()) and our input / expected values were all defined in the body of our test function.

  • +
  • In the parameterised version, the body of our test function has +been rewritten as a parameterised block of code that uses the variables +input_value and expected_result which are now +arguments of the test function.

  • +
  • A Python decorator, @pytest.mark.parametrize, is placed immediately before the test function and indicates that it should be run once for each set of parameters provided.

  • +

In Python, a decorator is a function that can modify the behaviour of +another function. @pytest.mark.parametrize +is a decorator provided by pytest that modifies the behaviour of our +test function by running it multiple times - once for each set of +inputs. This decorator takes two main arguments:

+
  • Parameter Names: A string with the names of the parameters that the test function will accept, separated by commas – in this case “input_value” and “expected_result”

  • +
  • Parameter Values: A list of tuples, where each tuple contains the +values for the parameters specified in the first argument.

  • +

The final parameterised version of our test is more manageable, readable and easier to maintain!

+
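If decorators are new to you, here is a minimal, self-contained sketch (unrelated to pytest) of how one works; shout and greet are made-up names for illustration only:

```python
def shout(func):
    # A decorator: a function that takes a function and
    # returns a modified version of it
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout  # equivalent to writing: greet = shout(greet)
def greet(name):
    return f"Hello, {name}"

print(greet("Valentina"))  # HELLO, VALENTINA
```

@pytest.mark.parametrize works the same way under the hood: it wraps the test function so that pytest calls it once per parameter set.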
+
+
+
+
+

Enough Tests

+

In this episode, so far we’ve (only) written tests for two individual +functions text_to_duration and +calculate_crew_size.

+

We can quantify the proportion of our code base that is run (also +referred to as “exercised”) by a given test suite using a metric called +code coverage:

+

\[ \text{Line Coverage} = \left( +\frac{\text{Number of Executed Lines}}{\text{Total Number of Executable +Lines}} \right) \times 100 \]

+
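As a quick worked example of the formula, suppose a file contains 56 executable lines, of which 38 are never run by the test suite (these numbers are illustrative):

```python
total_lines = 56      # total number of executable lines
missed_lines = 38     # lines never executed by the test suite
executed_lines = total_lines - missed_lines

line_coverage = executed_lines / total_lines * 100
print(round(line_coverage))  # 32
```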

We can calculate our test coverage using the pytest-cov library. +Before we do so, let’s fix our bug so that our output is cleaner and we +can focus on the code coverage information.

+
+

PYTHON +

+
def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60 # Bug-free line
+    return duration_hours
+
+
+

BASH +

+
(venv_spacewalks) $ python -m pytest --cov 
+
+
+

OUTPUT +

+
========================================================== test session starts
+platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
+rootdir: /Users/AnnResearcher/Desktop/Spacewalks
+plugins: cov-5.0.0
+collected 4 items
+
+tests/test_eva_data_analysis.py ....                                                                                               [100%]
+
+---------- coverage: platform darwin, python 3.12.3-final-0 ----------
+Name                              Stmts   Miss  Cover
+-----------------------------------------------------
+eva_data_analysis.py                 56     38    32%
+tests/test_eva_data_analysis.py      20      0   100%
+-----------------------------------------------------
+TOTAL                                76     38    50%
+
+
+=========================================================== 4 passed in 1.04s
+
+

To get an in-depth report about which parts of our code are tested +and which are not, we can add the option +--cov-report=html.

+
+

BASH +

+
(venv_spacewalks) $ python -m pytest --cov --cov-report=html 
+
+

This option generates a folder htmlcov which contains an HTML code coverage report. This provides structured information about our test coverage including (a) a table showing the proportion of lines in each function that are currently tested and (b) an annotated copy of our code where untested lines are highlighted in red.

+

Ideally, all the lines of code in our code base should be exercised +by at least one test. However, if we lack the time and resources to test +every line of our code we should:

+
  • Avoid testing Python’s built-in functions or functions imported from well-known and well-tested libraries like pandas or NumPy.
  • +
  • Focus on the parts of our code that carry the greatest “reputational risk”, i.e. those that could affect the accuracy of our reported results.
  • +

On the other hand, it is also important to realise that although coverage of less than 100% indicates that more testing may be helpful, test coverage of 100% does not mean that our code is bug-free!

+
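To illustrate that last point, here is a hypothetical function that achieves 100% line coverage from a single passing test and still contains a bug:

```python
def mean(values):
    # The single assertion below executes every line of this function,
    # giving 100% line coverage...
    return sum(values) / len(values)

assert mean([1, 2, 3]) == 2  # passes, and every line has now been run

# ...yet the function is not bug-free: mean([]) raises ZeroDivisionError,
# an edge case that no test has exercised.
```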
+
+ +
+
+

Evaluating Code Coverage

+
+

Generate a code coverage report for the Spacewalks test +suite and extract the following information:

+
  1. What proportion of the code base is currently NOT exercised by the +test suite?
  2. +
  3. Which functions in our code base are currently untested?
  4. +
+
+
+
+
+ +
+
+
+

BASH +

+
(venv_spacewalks) $ python -m pytest --cov --cov-report=html
+
+
  1. The proportion of the code base NOT covered by our tests is 100% - 32% = 68%
  2. +
  3. The following functions in our code base are currently untested: +
    • read_json_to_dataframe
    • +
    • write_dataframe_to_csv
    • +
    • add_duration_hours_variable
    • +
    • plot_cumulative_time_in_space
    • +
    • add_crew_size_variable
    • +
  4. +
+
+
+
+
+
+

Implementing a minimal test suite

+

A member of our research team shares the following code with us to +add to the Spacewalks codebase:

+
+

PYTHON +

+
def summarise_categorical(df_, varname_):
+    """
+    Tabulate the distribution of a categorical variable
+
+    Args:
+        df_ (pd.DataFrame): The input dataframe.
+        varname_ (str): The name of the variable
+
+    Returns:
+        pd.DataFrame: dataframe containing the count and percentage of
+        each unique value of varname_
+        
+    Examples:
+        >>> df_example  = pd.DataFrame({
+            'vehicle': ['Apollo 16', 'Apollo 17', 'Apollo 17'],
+            }, index=[0, 1, 2])
+        >>> summarise_categorical(df_example, "vehicle")
+        Tabulating distribution of categorical variable vehicle
+             vehicle  count  percentage
+        0  Apollo 16      1        33.0
+        1  Apollo 17      2        67.0
+    """
+    print(f'Tabulating distribution of categorical variable {varname_}')
+
+    # Prepare statistical summary
+    count_variable = df_[[varname_]].copy()
+    count_summary = count_variable.value_counts()
+    percentage_summary = round(count_summary / count_variable.size, 2) * 100
+
+    # Combine results into a summary data frame
+    df_summary = pd.concat([count_summary, percentage_summary], axis=1)
+    df_summary.columns = ['count', 'percentage']
+    df_summary.sort_index(inplace=True)
+
+
+    df_summary = df_summary.reset_index()
+    return df_summary
+
+

This looks like a useful tool for creating summary statistics tables, so let’s integrate this into our eva_data_analysis.py code and then write a minimal test suite to check that this code is behaving as expected.

+
+

PYTHON +

+
import pandas as pd
+import matplotlib.pyplot as plt
+import sys
+import re
+
+
+...
+
+def add_crew_size_variable(df_):
+    """
+    Add crew size (crew_size) variable to the dataset
+
+    Args:
+        df_ (pd.DataFrame): The input dataframe.
+
+    Returns:
+        pd.DataFrame: A copy of df_ with the new crew_size variable added
+    """
+    print('Adding crew size variable (crew_size) to dataset')
+    df_copy = df_.copy()
+    df_copy["crew_size"] = df_copy["crew"].apply(
+        calculate_crew_size
+    )
+    return df_copy
+
+
+def summarise_categorical(df_, varname_):
+    """
+    Tabulate the distribution of a categorical variable
+
+    Args:
+        df_ (pd.DataFrame): The input dataframe.
+        varname_ (str): The name of the variable
+
+    Returns:
+        pd.DataFrame: dataframe containing the count and percentage of
+        each unique value of varname_
+    """
+    print(f'Tabulating distribution of categorical variable {varname_}')
+
+    # Prepare statistical summary
+    count_variable = df_[[varname_]].copy()
+    count_summary = count_variable.value_counts() # There is a bug here that we will fix later!
+    percentage_summary = round(count_summary / count_variable.size, 2) * 100
+
+    # Combine results into a summary data frame
+    df_summary = pd.concat([count_summary, percentage_summary], axis=1)
+    df_summary.columns = ['count', 'percentage']
+    df_summary.sort_index(inplace=True)
+
+
+    df_summary = df_summary.reset_index()
+    return df_summary
+
+
+if __name__ == '__main__':
+
+    if len(sys.argv) < 3:
+        input_file = './eva-data.json'
+        output_file = './eva-data.csv'
+        print('Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = './cumulative_eva_graph.png'
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    eva_data_prepared = add_crew_size_variable(eva_data)
+
+    write_dataframe_to_csv(eva_data_prepared, output_file)
+
+    table_crew_size = summarise_categorical(eva_data_prepared, "crew_size")
+
+    write_dataframe_to_csv(table_crew_size, "./table_crew_size.csv")
+
+    plot_cumulative_time_in_space(eva_data_prepared, graph_file)
+
+    print("--END--")
+
+

To write tests for this function, we’ll need to be able to compare +dataframes. The pandas.testing module in the pandas library provides +functions and utilities for testing pandas objects and includes a +function assert_frame_equal that we can use to compare two +dataframes.

+
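Before applying it to summarise_categorical, here is a minimal sketch of how assert_frame_equal behaves, using made-up data:

```python
import pandas as pd
import pandas.testing as pdt

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [1, 2]})

pdt.assert_frame_equal(df1, df2)  # matching frames: passes silently

df3 = pd.DataFrame({"a": [1.0, 2.0]})  # same values, but float dtype

# assert_frame_equal(df1, df3) would raise an AssertionError because the
# dtypes differ; check_dtype=False relaxes that particular check
pdt.assert_frame_equal(df1, df3, check_dtype=False)
```

Unlike a plain `==` comparison, assert_frame_equal produces a detailed report of exactly which values, dtypes or indices differ when a comparison fails.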
+
+ +
+
+

Exercise 1 - Typical Inputs

+
+

First, check that the function behaves as expected with typical input +values. Fill in the gaps in the skeleton test below:

+
+

PYTHON +

+
import pandas.testing as pdt
+
+def test_summarise_categorical_typical():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a ground truth
+    example (typical values)
+    """
+    test_input = pd.DataFrame({
+        'country': _________________________________________, # FIX-ME
+    }, index=[0, 1, 2, 3, 4])
+
+    expected_result = pd.DataFrame({
+        'country': ["Russia", "USA"],
+        'count': [2, 3],
+        'percentage': [40.0, 60.0],
+    }, index=[0, 1])
+
+    actual_result = ____________________________________________ # FIX-ME 
+    
+    pdt.__________________(actual_result, _______________) #FIX-ME
+
+
+
+
+
+
+ +
+
+
+

PYTHON +

+
import pandas.testing as pdt
+
+def test_summarise_categorical():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a simple ground truth
+    example
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
+    }, index=[0, 1, 2, 3, 4])
+
+    expected_result = pd.DataFrame({
+        'country': ["Russia", "USA"],
+        'count': [2, 3],
+        'percentage': [40.0, 60.0],
+    }, index=[0, 1])
+
+    actual_result = summarise_categorical(test_input, "country")
+
+    pdt.assert_frame_equal(actual_result, expected_result)
+
+
+
+
+
+
+
+ +
+
+

Exercise 2 - Edge Cases

+
+

Now let’s check that the function behaves as expected with edge +cases.
+Does the code behave as expected when the column of interest contains +one or more missing values (pd.NA)? (write a new test).

+

Fill in the gaps in the skeleton test below:

+
+

PYTHON +

+
import pandas.testing as pdt
+
+def test_summarise_categorical_missvals():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a ground truth
+    example (edge case where the column contains missing values)
+    """
+    test_input = _______________
+    _______________
+    _______________ # FIX-ME
+    
+    expected_result = _______________
+    _______________
+    _______________ # FIX-ME
+    
+    actual_result = summarise_categorical(test_input, "country")
+
+    pdt.assert_frame_equal(actual_result, expected_result)
+
+
+
+
+
+
+ +
+
+
+

PYTHON +

+
import pandas.testing as pdt
+
+def test_summarise_categorical_missvals():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a ground truth
+    example (edge case where column contains missing values)
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", pd.NA],
+    }, index=[0, 1, 2, 3, 4])
+
+    expected_result = pd.DataFrame({
+        'country': ["Russia", "USA", np.nan], # np.nan because pd.NA is cast to np.nan
+        'count': [1, 3, 1],
+        'percentage': [20.0, 60.0, 20.0],
+    }, index=[0, 1, 2])
+    actual_result = summarise_categorical(test_input, "country")
+
+    pdt.assert_frame_equal(actual_result, expected_result)
+
+
+
+
+
+
+
+ +
+
+

Exercise 3 - Invalid inputs

+
+

Now write a test to check that the summarise_categorical +function raises an appropriate error when asked to tabulate a column +that does not exist in the data frame.

+

Hint: look up pytest.raises in the pytest documentation.

+
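As a starting point, pytest.raises is used as a context manager: the test passes only if the expected exception is raised inside the with block. Here is a minimal sketch using a made-up function:

```python
import pytest

def divide(a, b):
    return a / b

def test_divide_by_zero():
    # Passes because ZeroDivisionError is raised inside the block;
    # if no exception (or a different one) were raised, the test would fail.
    with pytest.raises(ZeroDivisionError):
        divide(1, 0)
```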
+
+
+
+
+ +
+
+
+

PYTHON +

+

+def test_summarise_categorical_invalid():
+    """
+    Test that summarise_categorical raises an
+    error when a non-existent column is input
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
+    }, index=[0, 1, 2, 3, 4])
+
+    with pytest.raises(KeyError):
+        summarise_categorical(test_input, "vehicle")
+
+
+
+
+
+
+

Improving Our Code

+

At the end of this episode, our test suite in tests +should look like this:

+
+

PYTHON +

+
import pytest
+import pandas as pd
+import pandas.testing as pdt
+import numpy as np
+
+from eva_data_analysis import (
+    text_to_duration,
+    calculate_crew_size,
+    summarise_categorical
+)
+
+def test_text_to_duration_integer():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical whole hour durations
+    """
+    actual_result =  text_to_duration("10:00")
+    expected_result = 10
+    assert actual_result == expected_result
+
+def test_text_to_duration_float():
+    """
+    Test that text_to_duration returns expected ground truth values
+    for typical durations with a non-zero minute component
+    """
+    actual_result = text_to_duration("10:20")
+    expected_result = 10.33333333
+    assert actual_result == pytest.approx(expected_result)
+
+def test_calculate_crew_size():
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for typical crew values
+    """
+    actual_result = calculate_crew_size("Valentina Tereshkova;")
+    expected_result = 1
+    assert actual_result == expected_result
+
+    actual_result = calculate_crew_size("Judith Resnik; Sally Ride;")
+    expected_result = 2
+    assert actual_result == expected_result
+
+
+def test_calculate_crew_size_edge_cases():
+    """
+    Test that calculate_crew_size returns expected ground truth values
+    for edge case where crew is an empty string
+    """
+    actual_result = calculate_crew_size("")
+    assert actual_result is None
+
+
+def test_summarise_categorical():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a simple ground truth
+    example
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
+    }, index=[0, 1, 2, 3, 4])
+
+    expected_result = pd.DataFrame({
+        'country': ["Russia", "USA"],
+        'count': [2, 3],
+        'percentage': [40.0, 60.0],
+    }, index=[0, 1])
+
+    actual_result = summarise_categorical(test_input, "country")
+
+    pdt.assert_frame_equal(actual_result, expected_result)
+
+
+def test_summarise_categorical_missvals():
+    """
+    Test that summarise_categorical correctly tabulates
+    distribution of values (counts, percentages) for a ground truth
+    example (edge case where column contains missing values)
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", pd.NA],
+    }, index=[0, 1, 2, 3, 4])
+
+    expected_result = pd.DataFrame({
+        'country': ["Russia", "USA", np.nan],
+        'count': [1, 3, 1],
+        'percentage': [20.0, 60.0, 20.0],
+    }, index=[0, 1, 2])
+    actual_result = summarise_categorical(test_input, "country")
+
+    pdt.assert_frame_equal(actual_result, expected_result)
+    
+
+
+def test_summarise_categorical_invalid():
+    """
+    Test that summarise_categorical raises an
+    error when a non-existent column is input
+    """
+    test_input = pd.DataFrame({
+        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
+    }, index=[0, 1, 2, 3, 4])
+
+    with pytest.raises(KeyError):
+        summarise_categorical(test_input, "vehicle")
+
+

Finally, let’s commit our test suite to our codebase and push the changes to GitHub.

+
+

BASH +

+
(venv_spacewalks) $ git add eva_data_analysis.py 
+(venv_spacewalks) $ git commit -m "Add additional analysis functions"
+(venv_spacewalks) $ git add tests/
+(venv_spacewalks) $ git commit -m "Add test suite"
+(venv_spacewalks) $ git push origin main
+
+

Continuous Integration for automated testing

+

Continuous Integration (CI) services provide the infrastructure to +automatically run the code’s test suite every time changes are pushed to +a remote repository. There is an extra +episode on configuring CI for automated tests on GitHub for some +additional reading.

+
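As a taster, a typical GitHub Actions setup is a single workflow file committed to the repository. The sketch below is illustrative only - the file name, the action versions and the assumption that dependencies are listed in a requirements.txt file are all placeholders, and the extra episode covers the details:

```yaml
# .github/workflows/tests.yml (illustrative sketch)
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt  # assumes dependencies listed here
      - run: python -m pytest --cov
```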

Summary

+

During this episode, we have covered how to use software tests to verify the correctness of our code. We have seen how to write a unit test, how to manage and run our tests using the pytest framework, and how to identify which parts of our code require additional testing using test coverage reports.

+

These skills reduce the probability that there will be a “mistake in +our code” and support reproducible research by giving us the confidence +to engage in open research practices. Tests also document the intended +behaviour of our code for other developers and mean that we can +experiment with changes to our code knowing that our tests will let us +know if we break any existing functionality. In other words, software +testing supports the FAIR software principles by making our code more +accessible and reusable.

+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  1. Code testing supports the FAIR principles by improving the +accessibility and re-usability of research code.
  2. +
  3. Unit testing is crucial as it ensures each function works correctly.
  4. +
  5. Using the pytest framework, you can write basic unit +tests for Python functions to verify their correctness.
  6. +
  7. Identifying and handling edge cases in unit tests is essential to +ensure your code performs correctly under a variety of conditions.
  8. +
  9. Test coverage can help you to identify parts of your code that +require additional testing.
  10. +
+
+
+ +
+
+ + +
+
+ + + diff --git a/09-code-documentation.html b/09-code-documentation.html new file mode 100644 index 00000000..216b72fe --- /dev/null +++ b/09-code-documentation.html @@ -0,0 +1,1569 @@ + +Tools and practices for FAIR research software: Code documentation +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Code documentation

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How should we document our code?
  • +
  • Why are documentation and repository metadata important and how do they support FAIR software?
  • +
  • What are the minimum elements of documentation needed to support +FAIR software?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Use a README file to provide an overview and a +CITATION.cff file to add citation instructions to a code +repository
  • +
  • Describe the main types of software documentation (tutorials, how to +guides, reference and explanation).
  • +
  • Apply a documentation framework to write effective documentation of +any type.
  • +
  • Describe the different formats available for delivering software +documentation (Markdown files, wikis, static webpages).
  • +
  • Implement MkDocs to generate and manage comprehensive project +documentation
  • +
+
+
+
+
+

We have seen how writing inline comments and docstrings within our +code can help with improving its readability. The purpose of software +documentation is to communicate other important information about our +software (its purpose, dependencies, how to install and run it, etc.) to +the people who need it – both users and developers.

+

Why document our software?

+

Software documentation is often perceived as a thankless and time-consuming task with few tangible benefits, and is frequently neglected in research projects. However, like software testing, documenting our software can help us and others conduct better research and produce FAIR software:

+
  • Good documentation captures important methodological details ready +for when we come to publish our research
  • +
  • Good documentation can help us return to a project seamlessly after +time away
  • +
  • Documentation can facilitate collaborations by helping us onboard +new project members quickly and more easily
  • +
  • Good documentation can save us time by answering frequently asked +questions (FAQs) about our code for us
  • +
  • Software documentation supports the FAIR research software +principles by improving the re-usability of our code. +
    • Good documentation can make our software more understandable and +reusable by others, and can bring us some citations and credit
    • +
    • How-to guides and tutorials ensure that users can install our +software independently and make use of its basic features
    • +
    • Reference guides and background information can help developers +understand our code sufficiently to modify/extend/repurpose it.
    • +
  • +

Before we move on with further code modifications, make sure your +virtual development environment is active.

+
+
+ +
+
+

Activate your virtual environment

+
+

If it is not already active, make sure to activate your virtual +environment from the root of your project directory in your command line +terminal (e.g. Bash or GitBash):

+
+

BASH +

+
$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+
+
+
+
+

Software-level documentation

+

In previous episodes we encountered several different forms of in-code documentation, including in-line comments and docstrings.

+

These are an excellent way to improve the readability of our code, +but by themselves are insufficient to ensure that our code is easy to +use, understand and modify - this requires additional software-level +documentation.

+

There are many different types of software-level documentation.

+
+

Technical documentation

+

Software-level technical documentation encompasses:

+
  • Tutorials - lessons that guide learners through a series of exercises to build proficiency in using the code
  • +
  • How-To Guides - step by step instructions on how to accomplish +specific goals using the code.
  • +
  • Reference - a lookup manual to help users find relevant information +about the software e.g. functions and their parameters.
  • +
  • Explanation - conceptual discussion of the code to help users +understand implementation decisions
  • +
+
+

Repository metadata files

+

In addition to software-level technical documentation, it is also +common to see repository metadata files included in a code repository. +Many of these files can be described as “social documentation” i.e. they +indicate how users should “behave” in relation to our software project. +Some common examples of repository metadata files and their role are +tabulated below:

+ + + + + + + + + + + + + + +
FileDescription
READMEProvides an overview of the project, including installation, usage +instructions, dependencies and links to other metadata files and +technical documentation (tutorial/how-to/explanation/reference)
CONTRIBUTINGExplains to developers how to contribute code to the project +including processes and standards that should be followed
CODE_OF_CONDUCTDefines expected standards of conduct when engaging in a software +project
LICENSEDefines the (legal) terms of using, modifying and distributing the +code
CITATIONProvides instructions on how to cite the code
AUTHORSProvides information on who authored the code (can also be included +in CITATION)
+

Just enough documentation

+

For many small projects the following three pieces of documentation +may be sufficient: README, LICENSE and CITATION.

+

Let’s look at each of these files in turn.

+
+

README file

+

A README file acts as a “landing page” for your code repository on GitHub and should provide sufficient information for users and developers to get started using your code.

+
+
+ +
+
+

README and the FAIR principles

+
+

Think about the question below. Your instructors may ask you to share +your answer in a shared notes document and/or discuss them with other +participants.

+

Here are some of the major sections you might find in a typical +README. Which are essential to support the FAIR +principles? Which are optional?

+
  • Purpose of the code
  • +
  • Audience (who the code is intended for)
  • +
  • Installation instructions
  • +
  • Contribution guide
  • +
  • How to get help
  • +
  • License
  • +
  • Software citation
  • +
  • Usage example
  • +
  • Dependencies and their versions
  • +
  • FAQs
  • +
  • Code of Conduct
  • +
+
+
+
+
+ +
+
+

To support the FAIR principles (Findability, Accessibility, +Interoperability, and Reusability), certain sections in a README file +are more important than others. Below is a breakdown of the sections +that are essential or optional in a README to align +with these principles.

+
+

Essential

+
  • +Purpose of the code - clearly explains what the +code does; essential for findability and reusability.
  • +
  • +Installation instructions - provides step-by-step +instructions on how to install the software, ensuring +accessibility.
  • +
  • +Usage Example - provides examples of how to use the +code, helping users understand its functionality and enhancing +reusability.
  • +
  • +License- specifies the terms under which the code +can be used, which is crucial for legal clarity and reusability.
  • +
  • +Dependencies and their versions - lists the +external libraries and tools required to run the code, including their +versions; essential for reproducibility and interoperability.
  • +
  • +Software citation - provides citation information +for academic use, ensuring proper attribution and reusability.
  • +
+
+

Optional

+
  • +Audience (who the code is intended for) - helps +users identify if the code is relevant to them, improving findability +and usability.
  • +
  • +How to get help - informs users where they can get +help, ensuring better accessibility.
  • +
  • +Contribution guide - encourages and guides +contributions from the community, enhancing the code’s development and +reusability.
  • +
  • +FAQs - provide answers to common questions, aiding +in troubleshooting and improving accessibility.
  • +
  • +Code of Conduct - sets expectations for behaviour +in the community, fostering a welcoming environment and enhancing +accessibility.
  • +
+
+
+
+
+

Let’s create a simple README for our repository.

+
+

BASH +

+
$ cd ~/Desktop/Spacewalks
+$ touch README.md
+
+

Let’s start by adding a one-liner that explains the purpose of our +code and who it is for.

+
# Spacewalks
+
+## Overview
+Spacewalks is a Python-based analysis tool for researchers to generate visualisations
+and statistical summaries of NASA's extravehicular activity datasets.
+

Now let’s add a list of Spacewalks’ key features:

+
## Features
+Key features of Spacewalks:
+- Generates a CSV table of summary statistics of extravehicular activity crew sizes
+- Generates a line plot to show the cumulative duration of space walks over time
+

Now let’s tell users about any pre-requisites required to run the +software:

+
## Pre-requisites
+
+Spacewalks was developed using Python version 3.12
+
+To install and run Spacewalks you will need to have Python >=3.12
+installed. You will also need the following libraries (minimum versions in brackets)
+
+- [NumPy](https://www.numpy.org/) >=2.0.0 - Spacewalk's test suite uses NumPy's statistical functions
+- [Matplotlib](https://matplotlib.org/stable/index.html) >=3.0.0  - Spacewalks uses Matplotlib to make plots
+- [pytest](https://docs.pytest.org/en/8.2.x/#) >=8.2.0  - Spacewalks uses pytest for testing
+- [pandas](https://pandas.pydata.org/) >= 2.2.0 - Spacewalks uses pandas for data frame manipulation 
+
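The dependency list above could equally be captured in a `requirements.txt` file so users can install everything in one step (a hypothetical sketch mirroring the minimum versions listed above; file name and pins are illustrative):

```
numpy>=2.0.0
matplotlib>=3.0.0
pytest>=8.2.0
pandas>=2.2.0
```

Users could then install all dependencies with `pip install -r requirements.txt`.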
+
+ +
+
+

Spacewalks README

+
+

Extend the README for Spacewalks by adding: (a) installation +instructions, and (b) a simple usage example.

+
+
+
+
+
+ +
+
+

Installation instructions:

+

NB: In the solution below the back ticks of each code block have been +escaped to avoid rendering issues.

+
# Installation instructions
+
++ Clone the Spacewalks repository to your local machine using Git.
+If you don't have Git installed, you can download it from the official Git website.
+
+\`\`\`bash
+git clone https://github.com/your-repository-url/spacewalks.git
+cd spacewalks
+\`\`\`
+
++ Install the necessary dependencies:
+\`\`\`bash
+pip install pandas==2.2.2 matplotlib==3.8.4 numpy==2.0.0 pytest==8.2.0
+\`\`\`
+
++ To ensure everything is working correctly, run the tests using pytest.
+
+\`\`\`bash
+python -m pytest
+\`\`\`
+

Usage instructions:

+
# Usage Example
+
+To run an analysis using the eva_data_analysis.py script from the command line terminal,
+launch the script using Python as follows:
+
+\`\`\`python
+# Usage Examples
+python eva_data_analysis.py eva-data.json eva-data.csv
+\`\`\`
+
+The first argument is the path to the JSON data file.
+The second argument is the path to the CSV output file.
+
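The usage example above passes two positional arguments to the script. As an illustration of how a script might pick these up, here is a minimal sketch assuming plain `sys.argv` handling (this is not necessarily how `eva_data_analysis.py` actually parses its arguments):

```python
import sys

def parse_args(argv):
    """Return (input_json_path, output_csv_path) from a command line.

    Falls back to the default file names when an argument is omitted.
    """
    input_file = argv[1] if len(argv) > 1 else "eva-data.json"
    output_file = argv[2] if len(argv) > 2 else "eva-data.csv"
    return input_file, output_file

# When run as a script, sys.argv supplies the arguments:
input_file, output_file = parse_args(sys.argv)
```

Documenting the argument order in the README, as done here, saves users from having to read the source to work this out.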
+
+
+
+
+
+

LICENSE file

+

Copyright allows a creator of work (such as written text, +photographs, films, music, software code) to state that they own the +work they have created. Copyright is automatically implied - even if the +creator does not explicitly assert it, copyright of the work exists from +the moment of creation. A licence is a legal document which sets down +the terms under which the creator is releasing what they have created +for others to use, modify, extend or exploit.

+

Because any creative work is copyrighted the moment it is created, +even without any kind of licence agreement, it is important to state the +terms under which software can be reused. The lack of a licence for your +software implies that no one can reuse the software at all - hence it is +imperative you declare it. A common way to declare your copyright of a +piece of software and the license you are distributing it under is to +include a file called LICENSE in the root directory of your code +repository.

+

There is an optional extra episode in this +course on different open source software licences that you can +choose for your code and that we recommend for further reading.

+ +
+
+ +
+
+

Tools to help you choose a licence

+
+

A short intro on different open source +software licences included as extra content to this course.

+

Check out the open +source guide on applying, changing and editing licenses.

+

The website choosealicense.com has some great +resources to help you choose a license that is appropriate for your +needs, and can even automate adding the LICENSE file to your GitHub code +repository.

+
+
+
+
+
+ +
+
+

Select a licence

+
+

Choose a license for your code. Discuss with your neighbour or the +group your choice of license and reason for choosing it.

+
+
+
+
+
+ +
+
+

Add a license to your code

+
+

Add a LICENSE file containing the full text of your chosen license to +your code repository.

+
+
+
+
+
+ +
+
+
  1. Adding a licence can be done in either of the following two ways: +
    1. Create a LICENSE file in the root of your software repository on +your local machine and copy into it the text of your chosen licence (you +can find it online). Push your local changes to your GitHub +repository.
    2. +
    3. In your repository on GitHub, go to Add file option and +start typing file name “LICENSE” - GitHub will recognise that you want +to add a licence and will offer you a choice of difference licences to +choose from. Select one and commit your changes, then do +git pull locally to bring those changes to your +machine.
    4. +
  2. +
  3. Add a copyright statement, the name of the license you are using and +a mention of the LICENSE file to at least one source code file +(e.g. eva_data_analysis.py)
  4. +
  5. Link to your LICENSE file from README to make this information about +your code more prominent.
  6. +

After completing the above, check the “About” section of your +repository’s GitHub landing webpage and see if there is now a license +listed.

+
+
+
+
+
+
+

CITATION file

+

We can add a citation file to our repository to provide instructions +on how and when to cite our code. A citation file can be a plain text +file (CITATION.txt) or a Markdown file (CITATION.md), but there are certain +benefits to using a special file format called the Citation File Format +(CFF), which provides a way to include richer metadata about software or +datasets we want to cite, making it easy for both humans and machines to +use this information.

+
+

Why use CFF?

+

For developers, using a CFF file can help to automate the process of +publishing new releases on Zenodo via GitHub. GitHub also “understands” +CFF, and will display citation information prominently on the landing +page of a repository that contains citation info in CFF.

+

For users, having a CFF file makes it easy to cite the software or +dataset with formatted citation information available for copy-paste and +direct import from GitHub into reference managers like Zotero.

+
+
+

Creating a CFF file

+

A CFF file uses the YAML +key-value pair format. At a minimum a CFF file must contain the title of +the software/data, the type of asset (software or data) and at least one +author:

+
+

YAML +

+
# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+cff-version: 1.2.0
+title: My Software
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Anne
+    family-names: Researcher
+
+

Additional and optional metadata includes an abstract, repository URL +and more.

+
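A quick way to sanity-check that a CFF file contains the minimum required top-level keys is a few lines of Python (an illustrative sketch, not a substitute for a proper validator such as the cffconvert command line tool):

```python
# Minimum top-level keys a CFF file must contain
REQUIRED_KEYS = {"cff-version", "title", "message", "type", "authors"}

def missing_cff_keys(cff_text):
    """Return the required CFF keys missing from the top level of the file text."""
    present = {
        line.split(":", 1)[0].strip()
        for line in cff_text.splitlines()
        # top-level keys only: skip comments, indented lines and list items
        if ":" in line and not line.startswith(("#", " ", "-"))
    }
    return REQUIRED_KEYS - present
```

For example, passing the text of the minimal CFF file shown above should report no missing keys, while a file containing only `cff-version` would report the other four.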
+
+

Steps to make your software citable

+

We can create (and update) a CFF file for our software using an +online application called cffinit.

+

Let’s create a dummy citation file for a project called “Spacetravel” +with Author “Max Hypothesis” by following these steps:

+
  1. First, head to cffinit online at cffinit.
  2. +
  3. Then, let’s work through the metadata input form to complete the +minimum information needed to generate a CFF. We’ll also add the +following abstract: +"Spacetravel - a simple python script to calculate time spent in Space by individual NASA astronauts" +
  4. +
  5. At the end of the process, download the CFF file and inspect it. It +should look like this:
  6. +
+

YAML +

+
# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+
+cff-version: 1.2.0
+title: Spacetravel
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Max
+    family-names: Hypothesis
+abstract: >-
+    A simple python script to calculate time spent in Space by individual NASA astronauts
+
+
+
+

Updating and citing

+

CFF files can also be updated using the cffinit online +tool.

+

To cite our software (or dataset), once a CFF file has been pushed to +our remote repository, GitHub’s “Cite this repository” button can be +used to generate a citation in various formats (APA, BibTeX).

+
+
+

Tools

+

Command line tools are also available for creating, validating, and +converting CFF files. Further information is available from the Turing +Way’s guide to software citation.

+
+
+ +
+
+

Spacewalks software citation

+
+

Write a software citation file for our Spacewalks software and add it +to the root folder of our project.

+
  • Add the URL of the code repository as a “Related Resources”
  • +
  • Add a one-line description of the code under the “Abstract” +section
  • +
  • Add at least two key words under the “Keywords” section
  • +
+
+
+
+
+ +
+
+

Use cffinit, +a web application to create your citation file using a series of online +forms.

+
+

YAML +

+
# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+
+cff-version: 1.2.0
+title: Spacewalks
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Sarah
+    family-names: Jaffa
+  - given-names: Aleksandra
+    family-names: Nenadic
+  - given-names: Kamilla
+    family-names: Kopec-Harding
+repository-code: >-
+  https://github.com/YOUR-REPOSITORY-URL/spacewalks.git
+abstract: >-
+  A Python script to analyse NASA extravehicular activity
+  data
+keywords:
+  - NASA
+  - Extravehicular activity
+
+
+
+
+
+
+
+

Documentation tools

+

Once our project reaches a certain size or level of complexity, we may +want to add additional documentation, such as a standalone tutorial or a +“Background” section explaining our methodological choices.

+

Once we move beyond using a README as our primary source of +documentation, we need to consider how we will distribute our +documentation to our users.

+

Options include:

+
  • A docs/ folder of Markdown files.
  • +
  • Adding a Wiki to our repository.
  • +
  • Creating a set of web pages for our documentation using a static +site generator for our documentation such as Sphinx or MkDocs
  • +

Creating a static site is a popular solution as it has the key +benefit of being able to automatically generate a reference manual from any +docstrings we have added to our code.

+
+

MkDocs

+

Let’s set up the static documentation site generator tool MkDocs.

+
+

BASH +

+
python -m pip install mkdocs
+python -m pip install "mkdocstrings[python]"
+python -m pip install mkdocs-material
+
+

Let’s check that MkDocs has been setup correctly:

+
+

BASH +

+
python -m pip list
+
+

Let’s create a new MkDocs project in the current directory:

+
+

BASH +

+
# In ~/Desktop/spacewalks
+mkdocs new .    
+
+
+

OUTPUT +

+
INFO    -  Writing config file: ./mkdocs.yml
+INFO    -  Writing initial docs: ./docs/index.md
+
+

This command creates a new MkDocs project in the current directory +with a docs folder containing an index.md file +and a mkdocs.yml configuration file.

+

Now, let’s fill in the configuration file for our project.

+
+

YAML +

+
site_name: Spacewalks Documentation
+
+theme:
+  name: "material"
+  font: false
+nav:
+  - Spacewalks Documentation: index.md
+  - Tutorials: tutorials.md
+  - How-To Guides: how-to-guides.md
+  - Reference: reference.md
+  - Background: explanation.md
+
+

Note: the font: false setting stops the Material theme from loading +fonts from Google’s servers, which helps with GDPR compliance.

+

Let’s add support for mkdocstrings - this will allow us +to automatically insert our docstrings into our documentation using a simple +tag.

+
+

YAML +

+
site_name: Spacewalks Documentation
+use_directory_urls: false
+
+theme:
+  name: "material"
+
+nav:
+  - Spacewalks Documentation: index.md
+  - Tutorials: tutorials.md
+  - How-To Guides: how-to-guides.md
+  - Reference: reference.md
+  - Background: explanation.md
+
+plugins:
+  - mkdocstrings
+
+

Let’s populate our docs/ folder to match our +configuration file.

+
+

BASH +

+
touch docs/tutorials.md
+touch docs/how-to-guides.md
+touch docs/reference.md
+touch docs/explanation.md
+
+

Let’s populate our reference file with some preamble to include +before the reference manual that will be generated from the docstrings +we created.

+
+

MARKDOWN +

+
This file documents the key functions in the Spacewalks tool.
+It is provided as a reference manual.
+
+::: eva_data_analysis
+
+
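For the `::: eva_data_analysis` directive to produce useful output, the functions in the module need well-formed docstrings. Here is a hypothetical example of the docstring style that mkdocstrings can render (this function is illustrative, not the script’s actual code):

```python
def crew_size(crew):
    """Calculate the number of crew members from a crew string.

    Args:
        crew (str): Semicolon-terminated list of names,
            e.g. "Armstrong;Aldrin;".

    Returns:
        int: The number of named crew members.
    """
    # Split on ";" and ignore empty entries left by the trailing separator
    return len([name for name in crew.split(";") if name.strip()])
```

The Args and Returns sections are picked up by mkdocstrings and rendered as structured parameter and return-value tables in the reference page.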

Finally, let’s build our documentation.

+
+

BASH +

+
mkdocs build
+
+
+

OUTPUT +

+
INFO    -  Cleaning site directory
+INFO    -  Building documentation to directory: /Users/AnnResearcher/Desktop/Spacewalks/site
+WARNING -  griffe: eva_data_analysis.py:105: No type or annotation for returned value 'int'
+WARNING -  griffe: eva_data_analysis.py:84: No type or annotation for returned value 1
+WARNING -  griffe: eva_data_analysis.py:33: No type or annotation for returned value 1
+INFO    -  Documentation built in 0.31 seconds
+
+

Once the build step is completed, our documentation site is saved to +a site folder in the root of our project folder.

+

These files will be distributed with our code. We can either direct +users to read these files locally on their own device using their +browser, or we can choose to host our documentation as a website that +our users can navigate to.

+

Note that we used the setting use_directory_urls: false +in the mkdocs.yml file. This setting ensures that the +documentation site is generated with URLs that are easy to navigate +locally on a user’s device.

+

Finally, let us commit our documentation to the main branch of our git +repository and push the changes to GitHub:

+
+

BASH +

+
git add mkdocs.yml 
+git add docs/
+git add site/
+git commit -m "Add project-level documentation"
+git push origin main
+
+
+
+ +
+
+

Hosting documentation

+
+

In the previous section, we saw how MkDocs documentation can be +distributed with our repository and viewed “offline” using a +browser.

+

We can also make our documentation available as a live website by +deploying our documentation to a hosting service.

+
+
+ +
+
+
+

GitHub Pages

+

As our repository is hosted in GitHub, we can use GitHub Pages - a +service that allows GitHub users to host websites directly from their +GitHub repositories.

+

There are two types of GitHub Pages: project pages and +user/organization pages. While similar, they have different deployment +workflows, and we will only discuss project pages here. For information +about deploying to user/organisational pages, see the MkDocs +deployment pages.

+

Project Pages deploy site files to a branch within the project +repository (default is gh-pages). To deploy our documentation:

+
+

Warning! Before we proceed to the next step, we MUST +ensure that there are no uncommitted changes or untracked files in our +repository.

+

If there are, the commands used in the upcoming steps will include +them in our documentation!

+
+
  1. (If not done already), let us commit our documentation to the main +branch of our git repository and push the changes to GitHub
  2. +
+

BASH +

+
git add mkdocs.yml 
+git add docs/
+git add site/
+git commit -m "Add project-level documentation"
+git push origin main
+
+
  1. Once we are on the main branch and all our changes are up to date, +run the following command to deploy our documentation to GitHub.
  2. +
+

BASH +

+
# Important: 
+# - This command will push the documentation to the gh-pages branch of your repository
+# - It will ALSO include uncommitted changes and untracked files (read the warning above!!) <- VERY IMPORTANT!!
+mkdocs gh-deploy
+
+
+

OUTPUT +

+
INFO    -  Cleaning site directory
+INFO    -  Building documentation to directory: /Users/AnnResearch/Desktop/Spacewalks/site
+WARNING -  griffe: eva_data_analysis.py:105: No type or annotation for returned value 'int'
+WARNING -  griffe: eva_data_analysis.py:84: No type or annotation for returned value 1
+WARNING -  griffe: eva_data_analysis.py:33: No type or annotation for returned value 1
+INFO    -  Documentation built in 0.37 seconds
+WARNING -  Version check skipped: No version specified in previous deployment.
+INFO    -  Copying '/Users/AnnResearcher/Desktop/Spacewalks/site' to 'gh-pages' branch and pushing to
+           GitHub.
+Enumerating objects: 63, done.
+Counting objects: 100% (63/63), done.
+Delta compression using up to 11 threads
+Compressing objects: 100% (60/60), done.
+Writing objects: 100% (63/63), 578.91 KiB | 7.93 MiB/s, done.
+Total 63 (delta 7), reused 0 (delta 0), pack-reused 0
+remote: Resolving deltas: 100% (7/7), done.
+remote:
+remote: Create a pull request for 'gh-pages' on GitHub by visiting:
+remote:      https://github.com/kkh451/spacewalks/pull/new/gh-pages
+remote:
+To https://github.com/kkh451/spacewalks-dev.git
+ * [new branch]      gh-pages -> gh-pages
+INFO    -  Your documentation should shortly be available at: https://kkh451.github.io/spacewalks/
+
+

This command will build our documentation with MkDocs, then commit +and push the files to the gh-pages branch using the ghp-import +tool, which is installed as a dependency of MkDocs.

+

For more options, use:

+
+

BASH +

+
mkdocs gh-deploy --help
+
+

Notice that the deploy command did not allow us to preview the site +before it was pushed to GitHub; so, it is a good idea to check changes +locally with the build commands before deploying.

+
+
+

Other options

+

You can find out about other deployment options, including the free +documentation hosting service ReadTheDocs, on the MkDocs +deployment pages.

+
+
+
+
+
+
+
+
+
+

Documentation guides

+

Once we start to consider other forms of documentation beyond the +README, we can also increase the re-usability of our code by ensuring +that the content and style of our documentation matches its purpose.

+

Documentation guides such as Write the Docs, The Good Docs Project and +the Diataxis framework provide a +range of resources, including documentation templates, to help us +do this.

+
+
+ +
+
+

Spacewalks how-to guide

+
+
  1. Review the Diataxis guidance page on writing a How-to guide. +Identify three features of an effective how-to guide.

  2. +
  3. Following the Diataxis guidelines, add a how-to guide to the +docs folder that show users how to change the destination +filename for the output dataset generated by Spacewalks.

  4. +
+
+ +
+
+

An effective how-to guide should:

+
  • be goal oriented and focus on action.
  • +
  • avoid teaching or explanation
  • +
  • use appropriate language e.g. conditional imperatives
  • +
  • have an informative title
  • +

An example how-to guide:

+
# How to change the file path of Spacewalk's output dataset
+
+This guide shows you how to set the file path for Spacewalk's output
+data set to a location of your choice.
+
+By default, the cleaned data set generated by Spacewalk is saved to the current
+working directory with file name `eva-data.csv`.
+
+If you would like to modify the name or location of the output dataset, set the
+second command line argument to your chosen file path.
+
+`python eva_data_analysis.py eva-data.json data/clean/eva-data-clean.csv`
+
+The specified destination folder must exist before running the Spacewalks analysis script.
+
+
+
+
+
+
+
+

The Diataxis framework provides guidance for developing technical +documentation for different purposes. Tutorials differ in purpose and +scope to how-to guides, and as a result, differ in content and +style.

+
+
+ +
+
+

Spacewalks tutorial

+
+

Let’s adapt the how-to guide from the previous challenge into a +tutorial that explains how to change the file path for the output +dataset when running the analysis script.

+
+
+ +
+
+

Here is what an example tutorial may look like.

+
+

Introduction

+

In this tutorial, we will learn how to change the file path for the +output dataset generated by Spacewalk. By the end of this tutorial, you +will be able to specify a custom file path for the cleaned dataset.

+
+
+

Prerequisites

+

Before you start, ensure you have the following:

+
  • Python installed on your system
  • +
  • The Spacewalk script (eva_data_analysis.py)
  • +
  • An input dataset (eva-data.json)
  • +
+
+

Prepare the destination directory

+

First, let us decide where we want to save the cleaned dataset and +make sure the directory exists.

+

For this tutorial, we will use data/clean as the destination +folder.

+

Let’s create the directory if it does not exist:

+
+

BASH +

+
mkdir -p data/clean
+
+
+
+

Run the analysis script with custom path

+

Next, execute the Spacewalk script and specify the custom file path +for the output dataset:

+
+

BASH +

+
python eva_data_analysis.py <input-file> <output-file>
+
+

Replace <input-file> with your input dataset (eva-data.json) and +<output-file> with your desired output path +(data/clean/eva-data-clean.csv).

+

Here is the complete command:

+
+

BASH +

+
python eva_data_analysis.py eva-data.json data/clean/eva-data-clean.csv
+
+

Notice how the output to the command line clearly indicates that we +are using a custom output file path.

+
+

OUTPUT +

+
Using custom input and output filenames
+Reading JSON file eva-data.json
+Saving to CSV file data/clean/eva-data-clean.csv
+Adding crew size variable (crew_size) to dataset
+Saving to CSV file data/clean/eva-data-clean.csv
+Plotting cumulative spacewalk duration and saving to ./cumulative_eva_graph.png
+
+

After running the script, let us check the data/clean directory to +ensure the cleaned dataset has been saved correctly.

+
+

BASH +

+
ls data/clean
+
+

You should see eva-data-clean.csv listed in the data/clean folder

+
+
+

Exercise: custom output path

+
  • Create a new directory named output/data in your working +directory.
  • +
  • Run the Spacewalk script to save the cleaned dataset in the newly +created output/data directory with the filename +cleaned-eva-data.csv.
  • +
  • Verify that the dataset has been saved correctly.
  • +
+
Solution
+
+

BASH +

+
# Create the directory:
+mkdir -p output/data
+
+# Run the script:
+python eva_data_analysis.py eva-data.json output/data/cleaned-eva-data.csv
+
+# Verify the output:
+ls output/data
+
+# You should see cleaned-eva-data.csv listed
+
+

Congratulations! You have successfully changed the file path for +Spacewalks output dataset and completed an exercise to practice the +process. You can now customize the output location and filename +according to your needs.

+
+
+
+
+
+
+
+
+
+

Now that we have seen examples of both a how-to guide and a tutorial, +let’s compare the two.

+
+
+ +
+
+

Tutorial vs. how-to guide

+
+

How does the content and language of our example tutorial differ from +our example how-to guide?

+
+
+ +
+
+
+

Content

+
  • The tutorial clearly signposts what will be covered
  • +
  • The tutorial includes a narrative of each step and the expected +output
  • +
  • The tutorial highlights important behaviour the learner should +notice
  • +
  • The tutorial includes an exercise to practice skills
  • +
+
+

Language

+
  • The tutorial uses “we” language
  • +
  • The tutorial uses imperative to provide clear instructions, +e.g. “First do x, then do y”
  • +
+
+
+
+
+
+
+
+
+
+ +
+
+

Commit and push your changes

+
+

Do not forget to commit any uncommitted changes you may have and then +push your work to GitHub.

+
+

BASH +

+
git add <your_changed_files>
+git commit -m "Your commit message"
+git push origin main
+
+
+
+
+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  • Documentation allows users to run and understand software without +having to work things out for themselves directly from the source +code.
  • +
  • Software documentation supports the FAIR principles by improving the +reusability of research code.
  • +
  • A (good) README, CITATION file and LICENSE file are the minimum +documentation elements required to support FAIR research code.
  • +
  • Documentation can be provided to users in a variety of formats +including a docs folder of Markdown files, a repository +Wiki and static webpages.
  • +
  • A static documentation site can be created using the tool +MkDocs.
  • +
  • Documentation frameworks such as Diataxis provide content and style +guidelines that can help us write high-quality documentation.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/10-open-collaboration.html b/10-open-collaboration.html new file mode 100644 index 00000000..39e0fdc8 --- /dev/null +++ b/10-open-collaboration.html @@ -0,0 +1,1087 @@ + +Tools and practices for FAIR research software: Open collaboration on code +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Open collaboration on code

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How do I ensure my code is citable?
  • +
  • How do we track issues with code in GitHub?
  • +
  • How can we ensure that multiple developers can work on the same +files simultaneously?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to:

+
  • Understand how to archive code to Zenodo and create a digital object +identifier (DOI) for a software project (and include that info in +CITATION.cff).
  • +
  • Understand how to track issues with code in GitHub.
  • +
  • Understand how to use Git branches for working on code in parallel +and how to merge code back using pull requests.
  • +
  • Apply issue tracking, branching and pull requests together to fix +bugs while allowing other developers to work on the same code.
  • +
+
+
+
+
+

In addition to adding a license and other metadata to our code +(covered in the previous episode), there are several other important steps to +consider before sharing the code publicly.

+

Before we move on with further code modifications, make sure your +virtual development environment is active.

+
+
+ +
+
+

Activate your virtual environment

+
+

If it is not already active, make sure to activate your virtual +environment from the root of your project directory in your command line +terminal (e.g. Bash or GitBash):

+
+

BASH +

+
$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+
+
+
+
+

Sharing code to encourage collaboration

+
+

Making the code public

+

By default repositories created on Github are private and only their +creator can see them. If we’re going to be adding an open source license +to our repository we probably want to make sure people can actually +access it too!

+

Go to the Github webpage of your repository +(https://github.com/<yourusername>/<yourrepositoryname>) +and click on the Settings link near the top right corner. Then scroll +down to the bottom of the page and the “Danger Zone” settings. Click on +“Change Visibility” and you should see a message saying “Change to +public”; if it says “Change to private” then the repository is already +public. You’ll then be asked to confirm that you want to make the +repository public and agree to the warning that the code will now be +publicly visible. As a security measure you’ll then have to put in your +Github password.

+
+
+

Transferring to an organisation

+

Currently our repository is under the Github “namespace” of our +individual user. This is ok for individual projects where we are the +sole or at least main author, but for bigger and more complex projects +it is common to use a Github organisation named after our project. If we +are a member of an organisation and have the appropriate permissions +then we can transfer a repository from our personal namespace to the +organisation’s. This can be done with another option in the “Danger +Zone” settings, the “Transfer ownership” button. Pressing this will then +prompt us as to which organisation we want to transfer the repository +to.

+
+
+

Archiving to Zenodo and obtaining a DOI

+

Zenodo is a data archive run by CERN. Anybody can upload datasets up +to 50GB to it and receive a Digital Object Identifier (DOI). Zenodo’s +definition of a dataset is quite broad and can include code. This gives +us a way to obtain a DOI for our code. We can archive our Github +repository to Zenodo by doing the following:

+
  1. Go to the Zenodo Login page +and choose to login with Github.
  2. +
  3. Authorise Zenodo to connect to Github.
  4. +
  5. Go to the Github page in +your Zenodo account. This can be found in the pull down menu with your +user name in the top right corner of the screen.
  6. +
  7. You’ll now have a list of all of your Github repositories. Next to +each will be an “On” button. If you have created a new repository you +might need to press the “Sync” button to update the list of repositories +Zenodo knows about.
  8. +
  9. Press the “On” button for the repository you want to archive. If +this was successful you’ll be told to refresh the page.
  10. +
  11. The repository should now appear in the list of Enabled repositories +at the top of the screen. But it doesn’t yet have a DOI. To get one we +have to make a release on Github. Click on the repository and then press +the green button to create a release. This will take you to Github’s +release page where you’ll be asked to give a title and description of +the release. You will also have to create a tag, this is a way of having +a friendly name for the version of some code in Git instead of using a +long hash code. Often we’ll create a sequential version number for each +release of the software and have the tag name match this, for example +v1.0 or just 1.0.
  12. +
  13. If we now refresh the Zenodo page for this repository we will see +that it has been assigned a DOI.
  14. +

The DOI doesn’t just link to Github; Zenodo will have taken a copy of +our repository at the point where we tagged the release. This means that +even if we delete it from Github, or even if Github were ever to go away +or remove it, there will still be a copy on Zenodo. Zenodo will allow +people to download the entire repository as a single Zip file.

+

Zenodo will have actually created two DOIs for you. One represents +the latest version of the software and will always represent the latest +if you make more releases. The other is specific to the release you made +and will always point to that version. We can see both of these by +clicking on the DOI link in the Zenodo page for the repository. One of +the things which is displayed on this page is a badge image that you can +copy the link for and add to the README file in your Git repository so +that people can find the Zenodo version of the repository. If you click +on the DOI image in the Details section of the Zenodo page then you will +be shown instructions for obtaining a link to the DOI badge in various +formats including Markdown. Here is the badge for this repository and +the corresponding Markdown:

+

DOI

+
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11869450.svg)](https://doi.org/10.5281/zenodo.11869450)
+
+
+ +
+
+

Archive your repository to Zenodo

+
+
  • Create an account on Zenodo that is linked to your Github +account.
  • +
  • Use Zenodo to create a release for your repository and obtain a DOI +for it.
  • +
  • Get the link to the DOI badge for your repository and add the badge image to your README file in Markdown format. Check that this is the DOI for the latest version and not for a specific version; otherwise you will have to update it every time you make a release.
  • +
+
+
+
+
+ +
+
+

Problems with Github and Zenodo integration

+
+

The integration between Github and Zenodo does not interact well with some browser privacy features and extensions. Firefox can be particularly problematic and might open new tabs to log in to Github and then give an error saying: Your browser did something unexpected. Please try again. If the error continues, try disabling all browser extensions. If this happens, try disabling the extra privacy features/extensions or using another browser such as Chrome.

+
+
+
+
+
+

Adding a DOI and ORCID to the citation file

+

Now that we have our DOI, it is good practice to include this information in our citation file. In the previous part of this lesson we created a CITATION.cff file with information about how to cite our code. There are a few fields we can add which are related to the DOI. One is the version field, which records the version number of the software. We can add the DOI itself in the identifiers section, with a type of doi and a value containing the DOI. Optionally, we can also add a date-released field indicating the date this version was released. Here is an updated version of our CITATION.cff from the previous episode with a version number, DOI and release date added.

+
+

YAML +

+
# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+cff-version: 1.2.0
+title: My Software
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Anne
+    family-names: Researcher
+version: 1.0.1
+identifiers:
+  - type: doi
+    value: 10.5281/zenodo.1234
+date-released: 2024-06-01
+
+
+
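Before committing the updated file, we can run a quick local check that the new fields are present. This is only a sketch, assuming a POSIX shell with `grep` available; it recreates the example file above in a throwaway directory.

```shell
# Recreate the example CITATION.cff from above (illustrative content only)
# in a temporary directory, then confirm the DOI and release date fields
# are present before committing.
tmp=$(mktemp -d)
cd "$tmp"
cat > CITATION.cff <<'EOF'
cff-version: 1.2.0
title: My Software
type: software
authors:
  - given-names: Anne
    family-names: Researcher
version: 1.0.1
identifiers:
  - type: doi
    value: 10.5281/zenodo.1234
date-released: 2024-06-01
EOF
grep -q "type: doi" CITATION.cff && grep -q "date-released:" CITATION.cff \
  && echo "DOI and release date recorded"
```

For a full schema check, tools such as the cffconvert command line utility can validate a CITATION.cff file, though a simple `grep` is enough to catch a forgotten field.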
+ +
+
+

Add a DOI to your citation file

+
+

Add the DOI you were allocated in the previous exercise to your CITATION.cff file and commit/push the updated version to your Github repository. You can remove the commit field from the CITATION.cff file, as the DOI is a better way to point to a given version of the code.

+
+
+
+
+
+ +
+
+

Going further with publishing code

+
+

We now have our code published online, licensed as open source, archived with Zenodo, accessible via a DOI and accompanied by a citation file to encourage people to cite it. What else might we want to do to improve how findable, accessible or reusable it is? One further step we could take is to publish the code with a peer-reviewed journal. Some traditional journals will accept software submissions, although usually as supplementary material for a paper. There are also journals which specialise in research software, such as the Journal of Open Research Software, The Journal of Open Source Software or SoftwareX. With these venues the submission is the software itself, not a paper, although a short abstract or description of the software is often required.

+
+
+
+
+

Working with collaborators

+

The strength of online collaboration tools such as Github does not lie only in the ability to share code. They also allow us to track problems with that code, let multiple developers work on it independently and bring their changes together, and support reviewing those changes before they are accepted.

+
+

Tracking issues with code

+

A key feature of Github (as opposed to Git itself) is the issue tracker. This provides us with a place to keep track of any problems or bugs in the code and to discuss them with other developers. Sometimes advanced users will also use the issue trackers of public projects to report problems they are having (and sometimes this is misused by people seeking help with documented features of the program).

+

The code from the testing chapter earlier has a bug: an extra bracket in eva_data_analysis.py (and, if you have fixed that, a missing import of summarise_categorical in the test). Let's go ahead and create a new issue in our Github repository to describe this problem. We can find the issue tracker on the "Issues" tab near the top left of the Github page for the repository. Click on this and then click the green "New Issue" button on the right-hand side of the screen. We can then enter a title and description of our issue.

+

A good issue description should include:

+
  • What the problem is, including any error messages that are +displayed.
  • +
  • What version of the software it occurred with.
  • +
  • Any relevant information about the system running it, for example +the operating system being used.
  • +
  • Versions of any dependent libraries.
  • +
  • How to reproduce it.
  • +

After the issue is created it will be assigned a sequential ID +number.

+
+
+ +
+
+

Write an issue to describe our bug

+
+

Create a new issue in your repository’s issue tracker by doing the +following:

+
  • Go to the Github webpage for your code
  • +
  • Click on the Issues tab
  • +
  • Click on the “New issue” button
  • +
  • Enter a title and description for the issue
  • +
  • Click the “Submit Issue” button to create the issue.
  • +
+
+
+
+
+

Discussing an issue

+

Once the issue is created, further discussion can take place in additional comments. These can include code snippets and file attachments such as screenshots or log files. We can also reference other issues by writing a # symbol followed by the number of the other issue. This is often used to identify related issues or to mark an issue as a duplicate.

+
+
+

Closing an issue

+

Once an issue is solved then it can be closed. This can be done +either by pressing the “Close” button in the Github web interface or by +making a commit which includes the word “fixes”, “fixed”, “close”, +“closed” or “closes” followed by a # symbol and the issue number.
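The commit-message convention can be rehearsed locally. This is a sketch in a throwaway repository; the file contents and issue number are illustrative. The "fixes #1" keyword is interpreted by Github once the commit reaches the default branch, while Git itself treats it as ordinary message text.

```shell
set -e
repo=$(mktemp -d)          # throwaway repository for the demonstration
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo User"
echo "print('hello')" > eva_data_analysis.py
git add eva_data_analysis.py
# Github would close issue #1 when this commit reaches the default branch.
git commit -q -m "Remove extra bracket in eva_data_analysis.py, fixes #1"
git log --oneline -1
```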

+
+
+

Working in parallel with Git branches

+

Branching is a feature of Git that allows two or more parallel streams of work. Commits can be made to one branch without interfering with another. Branches are commonly used as a way for one developer to work on a new feature or a bug fix while other developers work on other features. When a new feature or bug fix is complete, the branch is merged back into the main (sometimes called master) branch.

+
+

Creating a new branch

+

New Git branches are created with the git branch command, followed by the name of the branch to create. When the bug we are fixing has a corresponding issue, it is common practice to name the branch after the issue number and a short description. For example, we might call the branch 01-extra-bracket-bug instead of something less descriptive like bugfix.

+

For example, the command:

+
+

BASH +

+
git branch 01-extra-bracket-bug
+
+

will create a new branch called 01-extra-bracket-bug. We can view the names of all the branches by running git branch with no parameters; by default there should be one branch called main (or perhaps master) as well as our new 01-extra-bracket-bug branch. A * is shown next to the currently active branch.

+
+

BASH +

+
git branch
+
+
+

OUTPUT +

+
  01-extra-bracket-bug
+* main
+
+

We can see that creating a new branch has not activated that branch. +To switch branches we can either use the git switch or +git checkout command followed by the branch name. For +example:

+
+

BASH +

+
git switch 01-extra-bracket-bug
+
+

To create a branch and switch to it in a single command we can use git switch with the -c option (or git checkout with the -b option; git switch is only available in more recent versions of Git).

+
+

BASH +

+
git switch -c 02-another-bug
+
+
+
+

Committing to a branch

+

Once we have switched to a branch, any further commits that are made will go to that branch. When we run a git commit command we will see the name of the branch we are committing to in the output of git commit. Let's edit our code and fix the bug that we entered into the issue tracker earlier on.

+

Change your code from

+
+

PYTHON +

+
<call to pandas without checks identified in testing section>
+
+

to:

+
+

PYTHON +

+
<call to pandas with checks identified in testing section>
+
+

and now commit it.

+
+

BASH +

+
git commit -m "fixed bug" eva_data_analysis.py
+
+

In the output of git commit -m the first part of the +output line will show the name of the branch we just made the commit +to.

+
+

OUTPUT +

+
[01-extra-bracket-bug 330a2b1] fixed bug
+
+

If we now switch back to the main branch our new commit +will no longer be there in the source file or the output of +git log.

+
+

BASH +

+
git switch main
+
+

And if we go back to the 01-extra-brakcet-bug branch it +will re-appear.

+
+

BASH +

+
git switch 01-extra-bracket-bug
+
+

If we want to push our changes to a remote such as GitHub we have to +tell the git push command which branch to push to. If the +branch doesn’t exist on the remote (as it currently won’t) then it will +be created.

+
+

BASH +

+
git push origin 01-extra-bracket-bug
+
+

If we now refresh the Github webpage for this repository we should see the 01-extra-bracket-bug branch has appeared in the list of branches.

+

If we needed to pull changes from a branch on a remote (for example +if we’ve made changes on another computer or via Github’s web based +editor), then we can specify a branch on a git pull +command.

+
+

BASH +

+
git pull origin 01-extra-brakcet-bug
+
+
+
+
+

Merging branches

+

When we have completed working on a branch (for example fixing a bug) +then we can merge our branch back into the main one (or any other +branch). This is done with the git merge command.

+

This must be run on the TARGET branch of the merge, so we’ll +have to use a git switch command to set this.

+
+

BASH +

+
git switch main
+
+

Now we’re back on the main branch we can go ahead and merge the +changes from the bugfix branch:

+
+

BASH +

+
git merge 01-extra-bracket-bug
+
+
+
+
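The whole branch-and-merge workflow above can be rehearsed end-to-end in a throwaway repository. This is a sketch assuming Git 2.28 or later (for `git init -b`); the branch and file names follow the examples in this episode.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.email demo@example.com
git config user.name "Demo User"
git commit -q --allow-empty -m "initial commit"
git branch 01-extra-bracket-bug      # create the bugfix branch...
git switch -q 01-extra-bracket-bug   # ...and move onto it
echo "print('fixed')" > eva_data_analysis.py
git add eva_data_analysis.py
git commit -q -m "fixed bug"
git switch -q main                   # switch to the TARGET branch of the merge
git merge -q 01-extra-bracket-bug    # bring the fix into main
git log --oneline                    # the fix commit now appears on main
```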

Pull requests

+

On larger projects we might need a code review process before changes are merged, especially before they reach the main branch, which may be what is released as the public version of the software. Github's process for this is called a "Pull Request"; other Git services use different names, for example GitLab calls them "Merge Requests". A pull request is where one developer requests that another merge code from a branch (or "pull" it from another copy of the repository). The person receiving the request then has the chance to review the code, write comments suggesting changes, or even change the code themselves before merging it. It is also very common for automated checks to be run on a pull request to ensure the code is of good quality and passes the automated tests.

+

As a simple example of a pull request, we can now create a pull +request for the changes we made on our 01-extra-bracket-bug +branch and pushed to Github earlier on. The Github webpage for our +repository will now be saying something like “bugfix had recent pushes n +minutes ago - Compare & Pull request”. Click on this button and +create a new pull request.

+

Give the pull request a title and write a brief description of it, +then click the green “Create pull request” button. Github will then +check if we can merge this pull request without any problems. We’ll look +at what to do when this isn’t possible later on.

+

There should be a green “Merge pull request” button, but if we click +on the down arrow inside this button there are three options on how to +handle this request:

+
  1. Create a merge commit
  2. +
  3. Squash and merge
  4. +
  5. Rebase and merge
  6. +

The default is option 1, which keeps all of the commits made on our branch intact. This can be useful for seeing the whole history of our work, but if we made many minor edits or attempts at fixing a single bug it can be excessive to keep all of that history. This is where the second option comes in: it combines all of our changes from the branch into a single commit, which may be much clearer to other developers, who will now see our bugfix as one commit in the history. The third option merges the branch histories together in a different way that does not make the merge obvious; this can make the history easier to read, but it effectively rewrites the commit history and will change the commit hash IDs. Some projects that you contribute to might have their own rules about which kind of merge they prefer. For the purposes of this exercise we will stick with the default merge commit.
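The effect of "Squash and merge" can be reproduced locally with `git merge --squash`, which stages the combined changes from a branch without committing them. This is only a sketch in a throwaway repository, assuming Git 2.28 or later; branch and file names are illustrative.

```shell
set -e
d=$(mktemp -d)
cd "$d"
git init -q -b main
git config user.email demo@example.com
git config user.name "Demo User"
git commit -q --allow-empty -m "initial commit"
git switch -q -c fix-branch
echo "attempt one" > fix.txt
git add fix.txt
git commit -q -m "first attempt at fix"
echo "attempt two" > fix.txt
git commit -q -am "second attempt at fix"
git switch -q main
git merge --squash fix-branch            # stage combined changes, no commit yet
git commit -q -m "fix bug (squashed)"    # both attempts become one commit
git log --oneline                        # main shows a single commit for the fix
```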

+

Go ahead and click on "Merge pull request", then "Confirm merge". The changes will now be merged together. Github gives us the option to delete the branch we were working on; since its history is preserved in the main branch there is no reason to keep it.

+
+

Using forks instead of branches

+

A fork is similar to a branch, but instead of being part of the same repository it is an entirely new copy of the repository. Forks are commonly used by Github users who wish to work on a project that they are not a member of. Typically forking copies the repository to our own namespace (e.g. github.com/username/reponame instead of github.com/projectname/reponame).

+

To create a fork on Github use the "Fork" button to the right of the repository name. After we create our fork we can make some changes, and these can even be on the main branch inside our forked repository. Github tracks that a fork has been made and displays a "Contribute" button to create a pull request back to the original repository. Using this we can request that the changes on our fork are incorporated by the upstream project.
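The relationship between a fork and the original ("upstream") repository can be sketched with two local repositories. On Github the Fork button does the copying server-side; all paths and names below are illustrative, and Git 2.28 or later is assumed for `git init -b`.

```shell
set -e
base=$(mktemp -d)
cd "$base"
git init -q -b main upstream-repo     # stand-in for the original project
cd upstream-repo
git config user.email author@example.com
git config user.name "Original Author"
echo "hello" > README.md
git add README.md
git commit -q -m "initial commit"
cd "$base"
git clone -q upstream-repo my-fork    # our "fork" of the project
cd my-fork
git remote add upstream "$base/upstream-repo"
git fetch -q upstream                 # pick up any new upstream commits
git remote -v                         # origin (the fork) and upstream (original)
```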

+
+
+ +
+
+

Pull request exercise

+
+

Work in pairs for this exercise. Share the Github link of your repository with your partner. If you have set your repository to private, you will need to add them as a collaborator. Go to the settings page on your Github repository's webpage, click on Collaborators in the left-hand menu, then click the green "Add People" button and enter the Github username or email address of your partner. They will get an email and an alert within Github to accept your invitation to work on this repository; without doing this they will not be able to access it.

+
  • Now make a fork of your partner's repository.
  • +
  • Edit the CITATION.cff file and add your name to it.
  • +
  • Commit these changes to your fork
  • +
  • Create a pull request back to the original repository
  • +
  • Your partner will now receive your pull request and can review your changes.
  • +
+
+
+
+
+ +
+
+

Commit and push your changes

+
+

Do not forget to commit any uncommitted changes you may have, and then push your work to GitHub.

+
+

BASH +

+
git add <your_changed_files>
+git commit -m "Your commit message"
+git push origin main
+
+
+
+
+
+
+

Further reading

+

We recommend the following resources for some additional reading on +the topic of this episode:

+

Also check the full reference set +for the course.

+
+
+ +
+
+

Key Points

+
+
  • Open source applies copyright licenses that permit others to reuse and adapt your code or data.
  • +
  • Permissive licenses allow code to be used in other products +providing the copyright statement is displayed.
  • +
  • Copyleft licenses require the source code of any modifications to be +released under a copyleft license.
  • +
  • Creative commons licenses are suitable for non-code files such as +documentation and images.
  • +
  • Open source software can be sold, but you must supply the source code, and the people you sell it to may give it away to somebody else.
  • +
  • Add a license file to your repository, and add a license notice to each source file in case it becomes detached from the repository.
  • +
  • Zenodo can be used to archive a Github repository and obtain a DOI +for it.
  • +
  • We can include a CITATION file to tell people how to cite our +code.
  • +
  • Github can track bugs or issues with a program.
  • +
  • Git branches can be used to allow multiple developers to work on the +same part of a program in parallel.
  • +
  • The git branch command shows the list of branches and +can create new branches.
  • +
  • The git switch command changes which branch we are +working on.
  • +
  • The git merge command merges another branch into the +current one.
  • +
  • Pull requests allow developers to work on their own branch/fork and +then request other developers review their changes before they are +merged.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/11-wrap-up.html b/11-wrap-up.html new file mode 100644 index 00000000..199cb11e --- /dev/null +++ b/11-wrap-up.html @@ -0,0 +1,531 @@ + +Tools and practices for FAIR research software: Wrap-up +
+ Tools and practices for FAIR research software +
+ +
+
+ + + + + +
+
+

Wrap-up

+

Last updated on 2024-09-17 | + + Edit this page

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What are the wider Research Software Development +Principles and where does FAIR fit into them?
  • +
+
+
+
+
+
+

Objectives

+
  • Reflect on the Research Software Development +Principles and their relevance to own research.
  • +
+
+
+
+
+

In this course we have explored the significance of reproducible research and how following the FAIR principles in our own work can help us and others do better research. Reproducible research often requires that researchers implement new practices and learn new tools - in this course we taught you some of these as a starting point, but you will discover what works best for yourself, your group, community and domain. Some of these practices may take a while to implement and may require perseverance; others you can start practising today.

+
An image of a Chinese proverb "The best time to plant a tree was 20 years ago. The second best time is now
An image of a Chinese proverb “The best time to +plant a tree was 20 years ago. The second best time is now” by CCNULL, +used under a CC-BY 2.0 licence
+

Research software development principles

+

Software and the people who develop it have significance within the research environment and a broader impact on society and the planet. FAIR research software principles cover some of these aspects and operate within the wider Research Software Development Principles recommended by the Software Sustainability Institute's Director Neil Chue Hong during his keynote at RSECon23. These principles can help us explore and reflect on our own work and guide us on a path to better research.

+
+

Helping your team

+
Help the team principles of writing FAIR, secure and maintainable code
Helping your team, image from RSECon2024, used +under CC BY 4.0
+
+
+

Helping your peers

+
Help the peers principles of making your work reproducible, inclusive and credit everyone involved
Helping your peers, image from RSECon2024, used +under CC BY 4.0
+
+
+

Helping the world

+
Help the world principles of being responsible, open and global, and humanist when developing research software
Helping the world, image from RSECon2024, used +under CC BY 4.0
+
+

Further reading

+

Please check out the following +resources for some additional reading on the topic of this course +and the full reference set.

+
+
+ +
+
+

Key Points

+
+
  • When developing software for your research, think about how it will +help you and your team, your peers and domain/community and the +world.
  • +
+
+
+ +
+
+ + +
+
+ + + diff --git a/404.html b/404.html index 115369ae..bba7215b 100644 --- a/404.html +++ b/404.html @@ -99,7 +99,9 @@ -
@@ -252,7 +254,7 @@

@@ -263,7 +265,7 @@

@@ -274,7 +276,7 @@

@@ -285,7 +287,7 @@

@@ -242,7 +242,7 @@

@@ -251,7 +251,7 @@

@@ -242,7 +242,7 @@

@@ -251,7 +251,7 @@

@@ -242,7 +242,7 @@

@@ -251,7 +251,7 @@

@@ -260,7 +260,7 @@

@@ -269,7 +269,7 @@

diff --git a/aio.html b/aio.html index 86eb6ebb..231449a5 100644 --- a/aio.html +++ b/aio.html @@ -122,7 +122,9 @@ More @@ -267,7 +269,7 @@

@@ -291,7 +293,7 @@

@@ -303,7 +305,7 @@

@@ -315,7 +317,7 @@

@@ -327,7 +329,7 @@

@@ -339,7 +341,7 @@

@@ -351,7 +353,19 @@

+ + + @@ -977,7 +993,7 @@

Discussion

-
+

Here are some questions to help you assess where on the FAIR spectrum the code is:

@@ -1043,7 +1059,7 @@

Give me a hint

-
+

I would give the following scores:

F - 1/5

@@ -1140,9 +1156,9 @@

Key Points

Place links that you need to refer to multiple times across pages here. Delete any links that you are not going to use. --> -

Content from Tools and practices for research software development

+

Content from Tools and practices for FAIR research software development


-

Last updated on 2024-07-04 | +

Last updated on 2024-09-17 | Edit this page

@@ -1185,90 +1201,9 @@

Objectives

overview of the tools, how they help you achieve the aims of FAIR research software and how they work together. In later episodes we will describe some of these tools in more detail.

-

The table below summarises some tools and practices that can help -with each of the FAIR software principles.

- ------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tools and practicesFindableAccessibleInteroperableReusable
Integrated development environments (e.g. VS Code) - development -environments (run, test, debug)x
Command line terminal (e.g. Bash)- reproducible -workflows/pipelinesxx
Version control toolsx
Testingxx
Coding conventions and documentationxxx
Licensexx
Citationxx
Software repositories (e.g. GitHub)xx
-

Writing your code +

Development environment


-
-

Development environment -

One of the first choices we make when writing code is what tool to use to do the writing. You can use a simple text editor such as Notepad, a terminal based editor with syntax highlighting such as Vim or Emacs, @@ -1302,10 +1237,9 @@

Development environment

Use VS Code to open the Python script and the data file from our project.

-

-

Command line tool/shell +

+

In VS Code and similar IDEs you can often run the code by clicking a button or pressing some keyboard shortcut. If you gave your code to a colleague or collaborator they might use the same IDE or something @@ -1333,10 +1267,9 @@

Command line tool/shell -

-

Version control +

+

Version control means knowing what changes were made to your code and when. Many people who have worked on large documents such as essays start doing this by saving files called essay_draft, @@ -1357,10 +1290,13 @@

Version controlWe will be using the Git version control system, which can be used through the command line terminal, in a browser or in a desktop application.

-

-
-

Testing -

+

Code structure and style guidelines +

+
+

TODO

+

Code correctness +

+

Testing ensures that your code is correct and does what it is set out to do. When you write code you often feel very confident that it is perfect, but when writing bigger codes or code that is meant to do @@ -1372,10 +1308,9 @@

Testing -

-
-

Documentation -

+

Documentation +

+

Documentation comes in many forms - from the names of variables and functions in your code, additional comments that explain some lines, up to a whole website full of documentation with function definitions, @@ -1383,10 +1318,9 @@

Documentation -

Licences and citation -

+

Licences and citation +

+

A licence states what people can legally do with your code, and what restrictions you have placed on it. Whenever you want to use someone else’s code you should check what license they have and make sure your @@ -1402,10 +1336,9 @@

Licences and citation -

-
-

Code repositories and registries -

+

Code repositories and registries +

+

Having somewhere to share your code is fundamental to making it Findable. Your institution might have a code repository, your research field may have a practice of sharing code via a specific website or @@ -1415,50 +1348,127 @@

Code repositories and registries

We will discuss later how to share your code on GitHub and make it easy for others to find and use.

-

-

Summary +

Summary of tools & practices


-
-

Findable -

-
    -
  • Describe your software - README
  • -
  • Software repository/registry - GitHub, registries
  • -
  • Unique persistent identifier - GitHub commits/tags/releases, -Zenodo
  • -
-
-
-

Accessible -

-
    -
  • Software repository/registry
  • -
  • License
  • -
  • Language and dependencies
  • -
-
-
-

Interoperable -

-
    -
  • Explain functionality - readme, inline comments and -documentation
  • -
  • Standard formats
  • -
  • Communication protocols - CLI/API
  • -
-
-
-

Reusable -

-
    -
  • Document functionality/installation/running
  • -
  • Follow best practices where appropriate
  • -
  • License
  • -
  • Citation
  • -
-
-

Checking your setup +

The table below summarises some tools and practices that can help +with each of the FAIR software principles.

+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Tools and practicesFindableAccessibleInteroperableReusable
Virtual development environments, programming language and +dependencies - run, test, debug, sharexx
Integrated development environments/IDEs (e.g. VS Code, PyCharm) - +run, test, debugx
Command line terminal (e.g. Bash, GitBash) - reproducible +workflows/pipelinesxx
Version control toolsx
Testing - code correctness and reproducibilityxx
Coding conventions and documentationxxx
Explaining functionality/installation/running - README, inline +comments and documentationxxx
Standard formats - e.g. for data exchange (CSV, YAML)xxx
Communication protocols - Command Line Interface (CLI) or +Application Programming Interface (API)xxx
Licensexx
Citationxx
Software repositories (e.g. GitHub, PyPi) or registries +(e.g. BioTools)xx
Unique persistent identifier (e.g. DOIs, commits/tags/releases) - +Zenodo, FigShare GitHubxx

Checking your setup


@@ -1508,7 +1518,7 @@

Challenge

-
+

The prompt is the $ character and any text that comes before it, that is shown on every new line before you type in commands. @@ -1523,7 +1533,7 @@

Give me a hint

-
+

The expected output of each command is:

    @@ -1616,7 +1626,7 @@

    Key Points

    -->

Content from Version control


-

Last updated on 2024-09-16 | +

Last updated on 2024-09-17 | Edit this page

@@ -1660,19 +1670,18 @@

Objectives

from our existing code, make some changes to it and track them with version control, and then push those changes to a remote server for safe-keeping.

-
-

What is a version control system? -

+

What is a version control system? +

+

Version control is the practice of tracking and managing changes to files. Version control systems are software tools that assist in the management of these file changes over time. They keep track of every modification to the files in a special database that allows users to “travel through time” and compare earlier versions of the files with the current state.

-
-
-

Motivation for using a version control system -

+

Why use a version control system? +

+

The main motivation as scientists to use version control in our projects is for reproducibility purposes. As hinted to above, by tracking and storing every change we make, we can more effectively @@ -1688,10 +1697,9 @@

Motivation for using a ve

Later on in this workshop, we will also see how using a version control system allows many people to collaborate on the same project without a lot of manual effort to combine different items of work.

-

-
-

Git version control system -

+

Git version control system +

+

Git is one of many version control systems and the one we will be using in this course. It is primarily used for source code management in software development but it can be used to track changes in files in @@ -1752,7 +1760,6 @@

Git version control system -

Create a new repository

@@ -1983,7 +1990,7 @@

Add and commit the changed file

-
+

To save the changes to the renamed Python file, use the following Git commands:

@@ -2155,7 +2162,7 @@

OUTPUT