diff --git a/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json b/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json index 7254ae1..c9076ac 100644 --- a/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json +++ b/_freeze/materials/02-basic-objects-and-data-types/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "5d0ab2af681a0e7b1e4d5a9e499c771f", + "hash": "81fa97bd14762d3c6bee5547652f61b0", "result": { "engine": "knitr", - "markdown": "---\ntitle: Data types & structures\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- \n:::\n\n\n## Context\n\nWe’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We'll also briefly mention other data types, such as matrices, arrays.\n\n## Explained: Data types & structures\n\n### Data types\n\nProgramming languages are able to deal with different data types - and they need to. For example, it makes little sense to perform mathematical operations on text! To ensure that your data is viewed in the appropriate way, you need to be aware of some of the different **data types**.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nR has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| numeric | Represents numbers; can be whole (integers) or decimals \\\n(e.g., `19`or `2.73`).|\n| integer | Specific type of numeric data; can only be an integer \\\n(e.g., `7L` where `L` indicates an integer). 
|\n| character | Also called *text* or *string* \\\n(e.g., `\"Rabbits are great!\"`).|\n| logical | Also called *boolean values*; takes either `TRUE` or `FALSE`.|\n| factor | A type of categorical data that can have inherent ordering \\\n(e.g., `low`, `medium`, `high`).|\n\n\n## Python\n\nPython has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| int | Specific type of numeric data; can only be an integer \\\n(e.g., `7` or `56`).|\n| float | Decimal numbers \\\n(e.g., `3.92` or `9.824`).|\n| str | *Text* or *string* data \\\n(e.g., `\"Rabbits are great!\"`).|\n| bool | *Logical* or *boolean* values; takes either `True` or `False`.|\n\n:::\n\n### Data structures\n\nIn the section on [running code](#running-code) we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We've just seen that we can have different *types* of data. We can save these into different *data structures*. Which data structure you need is often determined by the type of data and the complexity.\n\nIn the following sections we look at simple data structures.\n\n## Objects\n\nWe can store values into *objects*. To do this, we *assign* values to them. An object acts as a container for that value.\n\nTo create an object, we need to give it a name followed by the\nassignment operator and the value we want to give it, for example:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`<-`) to the object `temperature`. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).\n\nWhen assigning a value to an object, R does not print anything on the console. 
You can print the value by typing the object name on the console or within your script and running that line of code.\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`=`) to the object `temperature`.\n\nWhen assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n:::\n\n::: {.callout-important}\n## The assignment operator\n\nWe use an assignment operator to assign values on the right to objects on the left.\n\n::: {.panel-tabset group=\"language\"}\n## R\nIn R we use `<-` as the assignment operator.\n\nIn RStudio, typing Alt + - (push Alt at the same time as the - key) will write ` <- ` in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

\n\n## Python\nIn Python we use `=` as the assignment operator.

\n\n:::\n\\\n:::\n\nObjects can be given almost any name such as `x`, `current_temperature`, or\n`subject_id`. You want the object names to be explicit and short. There are some exceptions / considerations (see below).\n\n::: {.callout-warning}\n## Restrictions on object names\n\nObject names can contain letters, numbers, underscores and periods. They *cannot start with a number nor contain spaces*. Different people use different conventions for long variable names, two common ones being:\n\nUnderscore: my_long_named_object\n\nCamel case: myLongNamedObject\n\nWhat you use is up to you, but be consistent. Programming languages are **case-sensitive** so `temperature` is different from `Temperature.`\n\n* Some names are reserved words or keywords, because they are the names of fundamental functions (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list).\n* Avoid using function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`), even if allowed. If in doubt, check the help to see if the name is already in use.\n* Avoid full-stops (`.`) within an object name as in `my.data`. Full-stops often have meaning in programming languages, so it's best to avoid them.\n* Use consistent styling. In R, popular style guides are:\n * [R's tidyverse's](http://style.tidyverse.org/).\n * [Google's](https://google.github.io/styleguide/Rguide.xml)\n\n**Whatever style you use, be consistent!**\n:::\n\n### Using objects\n\nNow that we have the `temperature` in memory, we can use it to perform operations. 
For example, this might the temperature in Celsius and we might want to calculate it to Kelvin.\n\nTo do this, we need to add `273.15`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 296.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n296.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nWe can change an object's value by assigning a new one:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nFinally, assigning a value to one object does not change the values of other objects. 
For example, let’s store the outcome in Kelvin into a new object `temp_K`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_K <- temperature + 273.15\n```\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_K = temperature + 273.15\n```\n:::\n\n\n\n:::\n\nChanging the value of `temperature` does not change the value of `temp_K`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\n### Updating objects\n\n> LO: update objects in R\n> LO: update objects in Python & demonstrate lack of updates in tuples\n\n## Collections of data\n\nIn the examples above we have stored single values into an object. Of course we often have to deal with more than tat. Generally speaking, we can create **collections** of data. This enables us to organise our data, for example by creating a collection of numbers or text values.\n\n### Creating collections\n\nCreating a collection of data is pretty straightforward, particularly if you are doing it manually.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe simplest collection of data in R is called a **vector**. This really is the workhorse of R.\n\nA vector is composed by a series of values, which can numbers, text or any of the data types described.\n\nWe can assign a series of values to a vector using the `c()` function. 
For example, we can create a vector of temperatures and assign it to a new object `temp_c`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_c <- c(23, 24, 31, 27, 18, 21)\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21\n```\n\n\n:::\n:::\n\n\n\n\nA vector can also contain text. For example, let's create a vector that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather <- c(\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\")\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nThe simplest collection of data in Python is either a **list** or a **tuple**. Both can hold items of the same of different types. Whereas a tuple *cannot* be changed after it's created, a *list* can.\n\nWe can assign a collection of numbers to a list:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c = [23, 24, 31, 27, 18, 21]\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21]\n```\n\n\n:::\n:::\n\n\n\n\n\nA list can also contain text. For example, let's create a list that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather = [\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\"]\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\nWe can also create a *tuple*. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use `( )` round brackets instead of `[ ]` square brackets:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c_tuple = (23, 24, 31, 27, 18, 21)\n```\n:::\n\n\n\n\n:::\n\nNote that when we define text (e.g. 
`\"cloudy\"` or `\"sunny\"`), we need to use quotes.\n\nWhen we deal with numbers - whole or decimal (e.g. `23`, `18.5`) - we do not use quotes.\n\n\n::: {.callout-important}\n## Having a type\n\nDifferent data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `class()` function to find out how R views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nWe can use the `type()` function to find out how Python views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c_tuple)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n\n\n\n\n:::\n:::\n\n### Making changes\n\nQuite often we would want to make some changes to a collection of data. 
There are different ways we can do this.\n\nLet's say we gathered some new temperature data and wanted to add this to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'd use the `c()` function to combine the new data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, 22, 34)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21 22 34\n```\n\n\n:::\n:::\n\n\n\n\n\n## Python\n\nWe take the original `temp_c` list and add the new values:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + [22, 34]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 22, 34]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nLet's consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called `temp_new` and wanted to add these to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_new <- c(5, 16, 8, 12)\n```\n:::\n\n\n\n\nNext, we wanted to combine these new data with the original data, which we stored in `temp_c`.\n\nAgain, we can use the `c()` function:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, temp_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 23 24 31 27 18 21 5 16 8 12\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_new = [5, 16, 8, 12]\n```\n:::\n\n\n\n\nWe can use the `+` operator to add the two lists together:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + temp_new\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n### Number sequences\n\nWe often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. 
Run the following code to see the output.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1:10 # integers from 1 to 10\n10:1 # integers from 10 to 1\nseq(1, 10, by = 2) # from 1 to 10 by steps of 2\nseq(10, 1, by = -0.5) # from 10 to 1 by steps of -0.5\nseq(1, 10, length.out = 20) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n## Python\n\nPython has some built-in functionality to deal with number sequences, but the `numpy` library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n```\n:::\n\n\n\n\nNext, we can create several different number sequences:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlist(range(1, 11)) # integers from 1 to 10\nlist(range(10, 0, -1)) # integers from 10 to 1\nlist(range(1, 11, 2)) # from 1 to 10 by steps of 2\nlist(np.arange(10, 1, -0.5)) # from 10 to 1 by steps of -0.5\nlist(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n:::\n\n### Subsetting\n\nSometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we'll see how to do this on the simple data structures we've covered so far.\n\n::: {.callout-warning collapse=\"true\"}\n## Technical: Differences in indexing between R and Python\n\nIn the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it's important to be aware about some key differences. One of them is **indexing**.\n\nEach item in a collection of data has a number, called an *index*. Now, it would be great if this was consistent across all programming languages, but it's not.\n\nR uses **1-based indexing** whereas Python uses **zero-based indexing**. What does this mean? 
Compare the following:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplants <- c(\"tree\", \"shrub\", \"grass\") # the index of \"tree\" is 1, \"shrub\" is 2 etc.\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nplants = [\"tree\", \"shrub\", \"grass\"] # the index of \"tree\" is 0, \"shrub\" is 1 etc. \n```\n:::\n\n\n\n\n\nBehind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I'd ask:\n\n\"Hey, R - give me the items with index 1 and 2 in `plants`\" then I'd get `tree` and `shrub`. \n\nIf I'd ask that question in Python, then I'd get `shrub` and `grass`. Fun times.\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can use square brackets `[ ]` to extract values. Let's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\"\n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2:4] # extract the second to fourth value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[c(3, 1)] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"partial_cloud\" \"sunny\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[-1] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \"sunny\" \n[5] \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nLet's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'cloudy'\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:4] # extract the second to fourth value (end index is exclusive)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[2], weather[0] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n('partial_cloud', 'sunny')\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n\n\n## Dealing with missing data\n\n* LO: why is missing data important?\n* LO: good practices of dealing with missing data\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n:::\n", + "markdown": "---\ntitle: Data types & structures\n---\n\n\n\n\n::: {.callout-tip}\n#### Learning objectives\n\n- Create familiarity with the most common data types\n- Know about basic data structures\n- Create, use and make changes to objects\n- Create, use and make changes to collections of data\n- Deal with missing data\n:::\n\n\n## Context\n\nWe’ve seen examples where we entered data directly into a function. Most of the time we have data from elsewhere, such as a spreadsheet. In the previous section we created single objects. We’ll build up from this and introduce vectors and tabular data. We'll also briefly mention other data types, such as matrices, arrays.\n\n## Explained: Data types & structures\n\n### Data types\n\nProgramming languages are able to deal with different data types - and they need to. For example, it makes little sense to perform mathematical operations on text! 
To ensure that your data is viewed in the appropriate way, you need to be aware of some of the different **data types**.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nR has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| numeric | Represents numbers; can be whole (integers) or decimals \\\n(e.g., `19`or `2.73`).|\n| integer | Specific type of numeric data; can only be an integer \\\n(e.g., `7L` where `L` indicates an integer). |\n| character | Also called *text* or *string* \\\n(e.g., `\"Rabbits are great!\"`).|\n| logical | Also called *boolean values*; takes either `TRUE` or `FALSE`.|\n| factor | A type of categorical data that can have inherent ordering \\\n(e.g., `low`, `medium`, `high`).|\n\n\n## Python\n\nPython has the following main data types:\n\n| Data type | Description|\n|-----------|--------------------------------------------------------------|\n| int | Specific type of numeric data; can only be an integer \\\n(e.g., `7` or `56`).|\n| float | Decimal numbers \\\n(e.g., `3.92` or `9.824`).|\n| str | *Text* or *string* data \\\n(e.g., `\"Rabbits are great!\"`).|\n| bool | *Logical* or *boolean* values; takes either `True` or `False`.|\n\n:::\n\n### Data structures\n\nIn the section on [running code](#running-code) we saw how we can run code interactively. However, we frequently need to save values so we can work with them. We've just seen that we can have different *types* of data. We can save these into different *data structures*. Which data structure you need is often determined by the type of data and the complexity.\n\nIn the following sections we look at simple data structures.\n\n## Objects\n\nWe can store values into *objects*. To do this, we *assign* values to them. 
An object acts as a container for that value.\n\nTo create an object, we need to give it a name followed by the\nassignment operator and the value we want to give it, for example:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`<-`) to the object `temperature`. Note that when you run this line of code the object you just created appears on your environment tab (top-right panel).\n\nWhen assigning a value to an object, R does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 23\n```\n:::\n\n\n\n\nWe can read the code as: the value `23` is assigned (`=`) to the object `temperature`.\n\nWhen assigning a value to an object, Python does not print anything on the console. You can print the value by typing the object name on the console or within your script and running that line of code.\n\n:::\n\n::: {.callout-important}\n## The assignment operator\n\nWe use an assignment operator to assign values on the right to objects on the left.\n\n::: {.panel-tabset group=\"language\"}\n## R\nIn R we use `<-` as the assignment operator.\n\nIn RStudio, typing Alt + - (push Alt at the same time as the - key) will write ` <- ` in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

\n\n## Python\nIn Python we use `=` as the assignment operator.

\n\n:::\n\\\n:::\n\nObjects can be given almost any name such as `x`, `current_temperature`, or\n`subject_id`. You want the object names to be explicit and short. There are some exceptions / considerations (see below).\n\n::: {.callout-warning}\n## Restrictions on object names\n\nObject names can contain letters, numbers, underscores and periods.\n\nThey *cannot start with a number nor contain spaces*.\nDifferent people use different conventions for long variable names, two common ones being:\n\nUnderscore: `my_long_named_object`\n\nCamel case: `myLongNamedObject`\n\nWhat you use is up to you, but be consistent. Programming languages are **case-sensitive** so `temperature` is different from `Temperature.`\n\n* Some names are reserved words or keywords, because they are the names of core functions (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list).\n* Avoid using function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`), even if allowed. If in doubt, check the help to see if the name is already in use.\n* Avoid full-stops (`.`) within an object name as in `my.data`. Full-stops often have meaning in programming languages, so it's best to avoid them.\n* Use consistent styling. In R, popular style guides are:\n * [R's tidyverse's](http://style.tidyverse.org/).\n * [Google's](https://google.github.io/styleguide/Rguide.xml)\n\n**Whatever style you use, be consistent!**\n:::\n\n### Using objects\n\nNow that we have the `temperature` in memory, we can use it to perform operations. 
For example, this might be the temperature in Celsius and we might want to convert it to Kelvin.\n\nTo do this, we need to add `273.15`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 296.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n296.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nWe can change an object's value by assigning a new one:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 36\ntemperature + 273.15\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\nFinally, assigning a value to one object does not change the values of other objects. 
For example, let’s store the outcome in Kelvin into a new object `temp_K`:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_K <- temperature + 273.15\n```\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_K = temperature + 273.15\n```\n:::\n\n\n\n:::\n\nChanging the value of `temperature` does not change the value of `temp_K`.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemperature <- 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 309.15\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemperature = 14\ntemp_K\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n309.15\n```\n\n\n:::\n:::\n\n\n\n:::\n\n### Updating objects\n\n> LO: update objects in R\n> LO: update objects in Python & demonstrate lack of updates in tuples\n\n## Collections of data\n\nIn the examples above we have stored single values into an object. Of course we often have to deal with more than that. Generally speaking, we can create **collections** of data. This enables us to organise our data, for example by creating a collection of numbers or text values.\n\n### Creating collections\n\nCreating a collection of data is pretty straightforward, particularly if you are doing it manually.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThe simplest collection of data in R is called a **vector**. This really is the workhorse of R.\n\nA vector is composed of a series of values, which can be numbers, text or any of the data types described.\n\nWe can assign a series of values to a vector using the `c()` function. 
For example, we can create a vector of temperatures and assign it to a new object `temp_c`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_c <- c(23, 24, 31, 27, 18, 21)\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21\n```\n\n\n:::\n:::\n\n\n\n\nA vector can also contain text. For example, let's create a vector that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather <- c(\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\")\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nThe simplest collection of data in Python is either a **list** or a **tuple**. Both can hold items of the same or different types. Whereas a tuple *cannot* be changed after it's created, a *list* can.\n\nWe can assign a collection of numbers to a list:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c = [23, 24, 31, 27, 18, 21]\n\ntemp_c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21]\n```\n\n\n:::\n:::\n\n\n\n\n\nA list can also contain text. For example, let's create a list that contains weather descriptions:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather = [\"sunny\", \"cloudy\", \"partial_cloud\", \"cloudy\", \"sunny\", \"rainy\"]\n\nweather\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\nWe can also create a *tuple*. Remember, this is like a list, but it cannot be altered after creating it. Note the difference in the type of brackets, where we use `( )` round brackets instead of `[ ]` square brackets:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c_tuple = (23, 24, 31, 27, 18, 21)\n```\n:::\n\n\n\n\n:::\n\nNote that when we define text (e.g. 
`\"cloudy\"` or `\"sunny\"`), we need to use quotes.\n\nWhen we deal with numbers - whole or decimal (e.g. `23`, `18.5`) - we do not use quotes.\n\n\n::: {.callout-important}\n## Having a type\n\nDifferent data types result in slightly different types of objects. It can be quite useful to check how your data is viewed by the computer.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `class()` function to find out how R views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nWe can use the `type()` function to find out how Python views our data. This function also works for more complex data structures.\n\nLet's do this for our examples:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(weather)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(temp_c_tuple)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n\n\n\n\n:::\n:::\n\n### Making changes\n\nQuite often we would want to make some changes to a collection of data. 
There are different ways we can do this.\n\nLet's say we gathered some new temperature data and wanted to add this to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'd use the `c()` function to combine the new data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, 22, 34)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23 24 31 27 18 21 22 34\n```\n\n\n:::\n:::\n\n\n\n\n\n## Python\n\nWe take the original `temp_c` list and add the new values:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + [22, 34]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 22, 34]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\nLet's consider another scenario. Again, we went out to gather some new temperature data, but this time we stored the measurements into an object called `temp_new` and wanted to add these to the original `temp_c` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntemp_new <- c(5, 16, 8, 12)\n```\n:::\n\n\n\n\nNext, we wanted to combine these new data with the original data, which we stored in `temp_c`.\n\nAgain, we can use the `c()` function:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nc(temp_c, temp_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 23 24 31 27 18 21 5 16 8 12\n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_new = [5, 16, 8, 12]\n```\n:::\n\n\n\n\nWe can use the `+` operator to add the two lists together:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntemp_c + temp_new\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[23, 24, 31, 27, 18, 21, 5, 16, 8, 12]\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n### Number sequences\n\nWe often need to create sequences of numbers when analysing data. There are some useful shortcuts available to do this, which can be used in different situations. 
Run the following code to see the output.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1:10 # integers from 1 to 10\n10:1 # integers from 10 to 1\nseq(1, 10, by = 2) # from 1 to 10 by steps of 2\nseq(10, 1, by = -0.5) # from 10 to 1 by steps of -0.5\nseq(1, 10, length.out = 20) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n## Python\n\nPython has some built-in functionality to deal with number sequences, but the `numpy` library is particularly helpful. We installed and loaded it previously, but if needed, re-run the following:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport numpy as np\n```\n:::\n\n\n\n\nNext, we can create several different number sequences:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nlist(range(1, 11)) # integers from 1 to 10\nlist(range(10, 0, -1)) # integers from 10 to 1\nlist(range(1, 11, 2)) # from 1 to 10 by steps of 2\nlist(np.arange(10, 0.5, -0.5)) # from 10 to 1 by steps of -0.5 (the stop value is exclusive)\nlist(np.linspace(1, 10, num = 20)) # 20 equally spaced values from 1 to 10\n```\n:::\n\n\n\n\n\n:::\n\n### Subsetting\n\nSometimes we want to extract one or more values from a collection of data. We will go into more detail later, but for now we'll see how to do this on the simple data structures we've covered so far.\n\n::: {.callout-warning collapse=\"true\"}\n## Technical: Differences in indexing between R and Python\n\nIn the course materials we keep R and Python separate in most cases. However, if you end up using both languages at some point then it's important to be aware of some key differences. One of them is **indexing**.\n\nEach item in a collection of data has a number, called an *index*. Now, it would be great if this was consistent across all programming languages, but it's not.\n\nR uses **1-based indexing** whereas Python uses **zero-based indexing**. What does this mean? 
Compare the following:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplants <- c(\"tree\", \"shrub\", \"grass\") # the index of \"tree\" is 1, \"shrub\" is 2 etc.\n```\n:::\n\n::: {.cell}\n\n```{.python .cell-code}\nplants = [\"tree\", \"shrub\", \"grass\"] # the index of \"tree\" is 0, \"shrub\" is 1 etc. \n```\n:::\n\n\n\n\n\nBehind the scenes of any programming language there is a lot of counting going on. So, it matters if you count starting at zero or one. So, if I'd ask:\n\n\"Hey, R - give me the items with index 1 and 2 in `plants`\" then I'd get `tree` and `shrub`. \n\nIf I'd ask that question in Python, then I'd get `shrub` and `grass`. Fun times.\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can use square brackets `[ ]` to extract values. Let's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"sunny\" \"cloudy\" \"partial_cloud\" \"cloudy\" \n[5] \"sunny\" \"rainy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\"\n```\n\n\n:::\n\n```{.r .cell-code}\nweather[2:4] # extract the second to fourth value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[c(3, 1)] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"partial_cloud\" \"sunny\" \n```\n\n\n:::\n\n```{.r .cell-code}\nweather[-1] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cloudy\" \"partial_cloud\" \"cloudy\" \"sunny\" \n[5] \"rainy\" \n```\n\n\n:::\n:::\n\n\n\n\n## Python\n\nLet's explore this using our `weather` object.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nweather # remind ourselves of the data\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n['sunny', 'cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1] # extract the second value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'cloudy'\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:4] # extract the second to fourth value (end index is exclusive)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy']\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[2], weather[0] # extract the third and first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n('partial_cloud', 'sunny')\n```\n\n\n:::\n\n```{.python .cell-code}\nweather[1:] # extract all apart from the first value\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['cloudy', 'partial_cloud', 'cloudy', 'sunny', 'rainy']\n```\n\n\n:::\n:::\n\n\n\n\n:::\n\n\n## Dealing with missing data\n\nIt may seem weird that you have to consider what isn't there, but that's exactly what we do when we have missing data. Ideally, when we're collecting data we would have entries for every single thing we measure. But, alas, life is messy. A patient may have missed an appointment, an Eppendorf tube may have been dropped, and so on.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nR includes the concept of missing data, meaning we can specify that a data point is missing. Missing data are represented as `NA`.\n\nWhen doing operations on numbers, most functions will return `NA` if the data you are working with include missing values. This makes it harder to overlook the cases where you are dealing with missing data. This is a good thing!\n\nFor example, let's look at the following data, where we have measured six different patients and recorded their systolic blood pressure.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsystolic_pressure <- c(125, 134, NA, 145, NA, 141)\n```\n:::\n\n\n\n\nWe can see that we're missing measurements for two of them. 
If we want to calculate the average systolic blood pressure across these patients, then we could use the `mean()` function. However, this results in `NA`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(systolic_pressure)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA\n```\n\n\n:::\n:::\n\n\n\n\nYou can add the argument `na.rm = TRUE` to various functions - including `mean()` - to calculate the result while ignoring the missing values. This stands for \"remove missing values\".\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(systolic_pressure, na.rm = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 136.25\n```\n\n\n:::\n:::\n\n\n\n\nThere are quite a few ways that you can deal with missing data and we'll discuss more of them in later sessions.\n\n## Python\n\nThe built-in functionality of Python is not very good at dealing with missing data. This means that you normally need to deal with them manually.\n\nOne of the ways you can denote missing data in Python is with `None`. Let's look at the following data, where we have measured six different patients and recorded their systolic blood pressure.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nsystolic_pressure = [125, 134, None, 145, None, 141]\n```\n:::\n\n\n\n\nNext, we'd have to filter out the missing values (don't worry about the exact meaning of the code at this point):\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfiltered_data = [x for x in systolic_pressure if x is not None]\n```\n:::\n\n\n\n\nAnd lastly we would be able to calculate the mean value:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nsum(filtered_data) / len(filtered_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n136.25\n```\n\n\n:::\n:::\n\n\n\n\nThere are quite a few (easier!) 
ways that you can deal with missing data and we'll discuss more of them in later sessions, once we start dealing with tabular data.\n:::\n\n::: {.callout-note}\n## To exclude or not exclude?\n\nIt may be tempting to simply remove all observations that contain missing data. It often makes the analysis easier! However, there is good reason to be more careful: removing whole observations can throw away good data.\n\nLet's look at the following hypothetical data set, where we use `NA` to denote missing values. We are interested in the average weight and age across the patients.\n\n```\npatient_id weight_kg age\nN982 72 47\nN821 68 49\nN082 NA 63\nN651 78 NA\n```\n\nWe could remove all the rows that contain *any* missing data, thereby getting rid of the last two observations. However, that would mean we'd lose data on `age` from the penultimate row, and data on `weight_kg` from the last row.\n\nInstead, it would be better to tell the computer to ignore missing values on a variable-by-variable basis and calculate the averages on the data that *is* there.\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- The most common data types include numerical, text and logical data.\n- We can store data in single objects, enabling us to use the data\n- Multiple data points and types can be stored as different collections of data\n- We can make changes to objects and collections of data\n- We need to be explicit about missing data\n:::\n", "supporting": [ "02-basic-objects-and-data-types_files" ], diff --git a/materials/02-basic-objects-and-data-types.qmd b/materials/02-basic-objects-and-data-types.qmd index 0f0538c..8885432 100644 --- a/materials/02-basic-objects-and-data-types.qmd +++ b/materials/02-basic-objects-and-data-types.qmd @@ -5,7 +5,11 @@ title: Data types & structures ::: {.callout-tip} #### Learning objectives -- +- Create familiarity with the most common data types +- Know about basic data structures +- Create, use and make changes to objects +- Create, use and make changes to 
collections of data +- Deal with missing data ::: @@ -113,15 +117,18 @@ Objects can be given almost any name such as `x`, `current_temperature`, or ::: {.callout-warning} ## Restrictions on object names -Object names can contain letters, numbers, underscores and periods. They *cannot start with a number nor contain spaces*. Different people use different conventions for long variable names, two common ones being: +Object names can contain letters, numbers, underscores and periods. -Underscore: my_long_named_object +They *cannot start with a number nor contain spaces*. +Different people use different conventions for long variable names, two common ones being: -Camel case: myLongNamedObject +Underscore: `my_long_named_object` + +Camel case: `myLongNamedObject` What you use is up to you, but be consistent. Programming languages are **case-sensitive** so `temperature` is different from `Temperature.` -* Some names are reserved words or keywords, because they are the names of fundamental functions (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list). +* Some names are reserved words or keywords, because they are the names of core functions (e.g., `if`, `else`, `for`, see [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) or [Python](https://docs.python.org/3/reference/lexical_analysis.html#keywords) for a complete list). * Avoid using function names (e.g., `c`, `T`, `mean`, `data`, `df`, `weights`), even if allowed. If in doubt, check the help to see if the name is already in use. * Avoid full-stops (`.`) within an object name as in `my.data`. Full-stops often have meaning in programming languages, so it's best to avoid them. * Use consistent styling. 
In R, popular style guides are: @@ -474,16 +481,90 @@ weather[1:] # extract all apart from the first value ::: - ## Dealing with missing data -* LO: why is missing data important? -* LO: good practices of dealing with missing data +It may seem weird that you have to consider what isn't there, but that's exactly what we do when we have missing data. Ideally, when we're collecting data we would have entries for every single thing we measure. But, alas, life is messy. A patient may have missed an appointment, an Eppendorf tube may have been dropped, and so on. + +::: {.panel-tabset group="language"} +## R + +R includes the concept of missing data, meaning we can specify that a data point is missing. Missing data are represented as `NA`. + +When doing operations on numbers, most functions will return `NA` if the data you are working with include missing values. This makes it harder to overlook the cases where you are dealing with missing data. This is a good thing! + +For example, let's look at the following data, where we have measured six different patients and recorded their systolic blood pressure. + +```{r} +systolic_pressure <- c(125, 134, NA, 145, NA, 141) +``` + +We can see that we're missing measurements for two of them. If we want to calculate the average systolic blood pressure across these patients, then we could use the `mean()` function. However, this results in `NA`. + +```{r} +mean(systolic_pressure) +``` + +You can add the argument `na.rm = TRUE` to various functions - including `mean()` - to calculate the result while ignoring the missing values. This stands for "remove missing values". + +```{r} +mean(systolic_pressure, na.rm = TRUE) +``` + +There are quite a few ways that you can deal with missing data and we'll discuss more of them in later sessions. + +## Python + +The built-in functionality of Python is not very good at dealing with missing data. This means that you normally need to deal with them manually. 
+ +One of the ways you can denote missing data in Python is with `None`. Let's look at the following data, where we have measured six different patients and recorded their systolic blood pressure. + +```{python} +systolic_pressure = [125, 134, None, 145, None, 141] +``` + +Next, we'd have to filter out the missing values (don't worry about the exact meaning of the code at this point): + +```{python} +filtered_data = [x for x in systolic_pressure if x is not None] +``` + +And lastly we would be able to calculate the mean value: + +```{python} +sum(filtered_data) / len(filtered_data) +``` + +There are quite a few (easier!) ways that you can deal with missing data and we'll discuss more of them in later sessions, once we start dealing with tabular data. +::: + +::: {.callout-note} +## To exclude or not exclude? + +It may be tempting to simply remove all observations that contain missing data. It often makes the analysis easier! However, there is good reason to be more careful: removing whole observations can throw away good data. + +Let's look at the following hypothetical data set, where we use `NA` to denote missing values. We are interested in the average weight and age across the patients. + +``` +patient_id weight_kg age +N982 72 47 +N821 68 49 +N082 NA 63 +N651 78 NA +``` + +We could remove all the rows that contain *any* missing data, thereby getting rid of the last two observations. However, that would mean we'd lose data on `age` from the penultimate row, and data on `weight_kg` from the last row. + +Instead, it would be better to tell the computer to ignore missing values on a variable-by-variable basis and calculate the averages on the data that *is* there. +::: ## Summary ::: {.callout-tip} #### Key points -- +- The most common data types include numerical, text and logical data. 
+- We can store data in single objects, enabling us to use the data +- Multiple data points and types can be stored as different collections of data +- We can make changes to objects and collections of data +- We need to be explicit about missing data :::