
# Exploring data #1

The video lectures for this chapter are embedded at relevant places in the text, 
with links to download a pdf of the associated slides for each video. 
You can also access [a full playlist for the videos for this chapter](https://www.youtube.com/playlist?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk).

## Objectives

After this chapter, you should (know / understand / be able to):

- Be able to load and use datasets from R packages
- Be able to describe and use logical vectors
- Understand how logical vectors check logical statements against other R vector(s) and, at a deeper level, store `TRUE` / `FALSE` values as 1 / 0
- Be able to use the `dplyr` function `mutate` to create a logical vector as a new column in a dataframe and the `dplyr` function `filter` with that new column to filter a dataframe to a subset of rows
- Be able to use the bang operator (!) to reverse a logical vector
- Know what the "tidyverse" is and name some of its packages
- Be able to use some simple statistical functions (e.g., `min`, `max`, `mean`, `median`, `cor`, `summary`), including how to handle missing values when using these
- Be able to use the `dplyr` function `summarize` to summarize data, with and without grouping using `group_by`, including with special functions `n`, `n_distinct`, `first`, and `last`
- Understand the three basic elements of `ggplot` plots: data, aesthetics, and geoms
- Be able to create a `ggplot` object, set its data using `data = ...` and its aesthetics using `mapping = aes(...)`, and add on layers (including `geoms`) with `+`
- Be able to create some basic plots (e.g., scatterplots, boxplots, histograms) using `ggplot2` functions
- Understand the difference between setting an aesthetic by mapping it to a column of the dataframe versus setting it to a constant value
- Understand the difference between "statistical" geoms (e.g., histograms, boxplots) and geoms that add one geom element per dataframe observation (row)

<iframe width="768" height="480" src="https://www.youtube.com/embed/ntsCRNizqw4?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_1.pdf) 
a pdf of the lecture slides for this video.

## Simple statistics functions

### Summary statistics

<iframe width="768" height="480" src="https://www.youtube.com/embed/Y5G9nYQr4c8?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_2.pdf) 
a pdf of the lecture slides for this video.

To explore your data, you'll need to be able to calculate some simple statistics for vectors, including calculating the mean and range of continuous variables and counting the number of values in each category of a factor or logical vector. 

Here are some simple statistics functions you will likely use often:

Function  | Description
--------- | -----------------
`range()` | Range (minimum and maximum) of vector 
`min()`, `max()` | Minimum or maximum of vector
`mean()`, `median()` | Mean or median of vector
`sd()` | Standard deviation of vector
`table()` | Number of observations per level for a factor vector
`cor()` | Determine correlation(s) between two or more vectors
`summary()` | Summary statistics, depends on class

All of these functions take, as the main argument, the vector or vectors for which you want the statistic. If there are missing values in the vector, you'll typically need to add an argument to say what to do with the missing values. The parameter name for this varies by function, but for many of these functions it's `na.rm = TRUE` or `use="complete.obs"`.

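```{r echo = FALSE}
library(tidyverse)
library(faraway)
data("nepali")
nepali <- nepali %>%
  mutate(id = factor(id), sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  distinct(id, .keep_all = TRUE)
```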

```{r}
mean(nepali$wt, na.rm = TRUE)
range(nepali$ht, na.rm = TRUE)
sd(nepali$ht, na.rm = TRUE)
table(nepali$sex)
```
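
To see why the missing-value argument matters, here is a minimal sketch with a small made-up vector (`example_wts` is a hypothetical object, not part of the `nepali` data). Without `na.rm = TRUE`, a single missing value makes the mean itself missing:

```{r}
# Hypothetical weights (kg) with one missing value
example_wts <- c(12.8, NA, 14.9, 13.5)

mean(example_wts)                # NA, because one value is missing
mean(example_wts, na.rm = TRUE)  # mean of the three non-missing values
```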

Most of these functions take a single vector as the input. The `cor` function, however, calculates the correlation between vectors and so takes two or more vectors. If you give it a dataframe or matrix with several columns, it will give the correlation matrix for all of those columns.

```{r}
cor(nepali$wt, nepali$ht, use = "complete.obs")
cor((nepali %>% select(wt, ht, age)), use = "complete.obs")
```

R supports object-oriented programming. Your first taste of this shows up with the `summary` function. For the `summary` function, R does not run the same code every time. Instead, R first checks what type of object was input to `summary`, and then it runs a function (*method*) specific to that type of object. For example, if you input a continuous vector, like the `wt` column in `nepali`, to `summary`, the function will return the mean, median, range, and 25th and 75th percentile values: 

```{r}
summary(nepali$wt)
```

However, if you submit a factor vector, like the `sex` column in `nepali`, the `summary` function will return a count of how many elements of the vector are in each factor level (as a note, you could do the same thing with the `table` function):

```{r}
summary(nepali$sex)
```

The `summary` function can also take other data structures as input, including dataframes, lists, and special object types, like regression model objects. In each case, it performs different actions specific to the object type. Later in this section, we'll cover regression models, and see what the `summary` function returns when it is used with regression model objects.

### `summarize` function

You will often want to use these functions in conjunction with the `summarize` function in `dplyr`. For example, to create a new dataframe with the mean weight of children in the `nepali` dataset, you can use `mean` inside a `summarize` function: 

```{r}
library(dplyr)
nepali %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE))
```

There are also some special functions that are particularly useful with `summarize` and other `dplyr` functions. For example, the `n` function will calculate the number of observations and the `first` function will return the first value of a column: 

```{r}
nepali %>%
  summarize(n_children = n(), 
            first_id = first(id))
```

See the "summary function" section of [the RStudio Data Wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for more examples of these special functions. 
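
As a rough sketch of two more of these special functions, `n_distinct` counts the unique values in a column and `last` returns its final value. The dataframe `toy_visits` is made up for illustration:

```{r}
library(dplyr)

# Made-up data: three visits for child "a", one for child "b"
toy_visits <- tibble(id = c("a", "a", "b", "a"),
                     wt = c(10.1, 10.6, 12.3, 11.0))

toy_visits %>%
  summarize(n_visits = n(),
            n_children = n_distinct(id),
            last_wt = last(wt))
```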

Often, you will be more interested in summaries within certain groupings of your data, rather than overall summaries. For example, you may be interested in mean height and weight by sex, rather than across all children, for the `nepali` data. It is very easy to calculate these grouped summaries using `dplyr`---you just need to group data using the `group_by` function (also a `dplyr` function) before you run the `summarize` function:

```{r}
nepali %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE),
            n_children = n(), 
            first_id = first(id))
```

```{block, type = "rmdnote"}
Don't forget that you need to save the output to a new object if you want to use it later. The above code, which creates a dataframe with summaries for Nepali children by sex, will only be printed out to your console if run as-is. If you'd like to save this output as an object to use later (for example, for a plot or table), you need to assign it to an R object. 
```
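
For instance, here is a minimal sketch of saving a grouped summary and then reusing it. Both `toy_children` and `toy_summary` are hypothetical names used only for this illustration:

```{r}
library(dplyr)

toy_children <- tibble(sex = c("Male", "Female", "Female", "Male"),
                       wt = c(12.8, 11.5, NA, 14.2))

# Assign the summarized dataframe to an object instead of just printing it
toy_summary <- toy_children %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE))

# The saved object can now be used in later code
toy_summary
```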

## Factor vectors

<iframe width="768" height="480" src="https://www.youtube.com/embed/o7rqBnvpYjU?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_3.pdf) 
a pdf of the lecture slides for this video.

## Data from a package

<iframe width="768" height="480" src="https://www.youtube.com/embed/o7rqBnvpYjU?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_4.pdf) 
a pdf of the lecture slides for this video.

So far we've covered two ways to get data into R:

1. From flat files (either on your computer or online)
2. From binary file formats like SAS and Excel. 

Many R packages come with their own data, which is very easy to load and use. For example, the `faraway` package, which complements Julian Faraway's book *Linear Models with R*, has a dataset called `worldcup` that I'll use for some examples and that you'll use for part of this week's in-course exercise. To load this dataset, first load the package that contains the data (`faraway`) and then call the `data()` function with the dataset name (`"worldcup"`) as its argument:

```{r}
library(faraway)
data("worldcup")
```

Unlike most data objects you'll work with, datasets that are part of an R package will often have their own help files. You can access this help file for a dataset using the `?` operator with the dataset's name:

```{r, eval = FALSE}
?worldcup
```

This help file will usually include information about the size of the dataset, as well as definitions for each of the columns.

To get a list of all of the datasets that are available in the packages you currently have loaded, run `data()` with nothing inside the parentheses:

```{r, eval = FALSE}
data()
```

```{block, type = "rmdnote"}
If you run the `library` function without any arguments---`library()`---it works in a similar way. R will open a list of all the R packages that you have installed on your computer and can open with a `library` call. 
```
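
If you only want the datasets from one package, `data` also takes a `package` argument. As a small sketch, the following lists the names of the first few datasets in the `datasets` package that ships with R (the object name `base_datasets` is arbitrary):

```{r, eval = FALSE}
# List datasets in a single package rather than in all loaded packages
base_datasets <- data(package = "datasets")

# The `results` element is a matrix; its "Item" column holds dataset names
head(base_datasets$results[, "Item"])
```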

For this chapter, we'll be working with a modified version of the `nepali` dataset from the `faraway` package. This gives data from a study of the health of a group of Nepalese children. Each observation is a single measurement for a child; there can be multiple observations per child. We'll use a modified version of this dataframe that limits it to the columns with the child's id, sex, weight, height, and age, and limited to each child's first measurement. To create this modified dataset, run the following code: 

```{r}
library(dplyr)
library(faraway)
data(nepali)
nepali <- nepali %>%
  # Limit to certain columns
  select(id, sex, wt, ht, age) %>%
  # Convert id and sex to factors
  mutate(id = factor(id),
         sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female"))) %>%
  # Limit to first obs. per child
  distinct(id, .keep_all = TRUE)
```

The first few rows of the data should now look like:

```{r}
nepali %>% 
  slice(1:4)
```


## Dates in R

<iframe width="768" height="480" src="https://www.youtube.com/embed/1ksBUmcXP0g?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_5.pdf) 
a pdf of the lecture slides for this video.

As part of the data cleaning process, you may want to change the class of some
of the columns in the dataframe. For example, you may want to change a column
from a character to a date.

Here are some of the most common vector classes in R:

Class        | Example
------------ | -----------------------------------
`character`  | "Chemistry", "Physics", "Mathematics"
`numeric`    | 10, 20, 30, 40
`factor`     | Male [underlying number: 1], Female [2]
`Date`       | "2010-01-01" [underlying number: 14,610]
`logical`    | TRUE, FALSE

To find out the class of a vector (including a column in a dataframe -- remember
each column can be thought of as a vector), you can use `class()`:

```{r}
class(daily_show$date)
```
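
As a quick check of the classes in the table above, you can build small example vectors and pass them to `class()`. The vectors here are made up for illustration; `as.numeric()` reveals the number stored underneath a `Date`:

```{r}
class(c("Chemistry", "Physics"))
class(c(10, 20, 30))
class(factor(c("Male", "Female")))
class(c(TRUE, FALSE))

example_date <- as.Date("2010-01-01")
class(example_date)
as.numeric(example_date)  # days since 1970-01-01
```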

It is especially common to need to convert dates during the data cleaning
process, since date columns will usually be read into R as characters or
factors---you can do some interesting things with vectors that are in a Date
class that you cannot do with a vector in a character class.

If you'd like to use only base R, you can convert a vector to the `Date` class
with the `as.Date` function, which you will often see in older R code. In your
own code, however, I recommend that you instead use the `lubridate` package,
which the rest of this section covers.
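
As a brief sketch of the base R approach: `as.Date` reads dates in year-month-day order by default, and for other layouts you describe the layout with the `format` argument (here `%m`, `%d`, and `%y` stand for month, day, and two-digit year). The input strings are made up for illustration:

```{r}
as.Date("2008-10-13")                      # default year-month-day order
as.Date("12/31/99", format = "%m/%d/%y")   # month/day/two-digit year
```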

To convert a vector to the `Date` class, you can use functions in the
`lubridate` package. This package has a series of functions based on the order
that date elements are given in the incoming character with date information.
For example, in "12/31/99", the date elements are given in the order of month
(**m**), day (**d**), year (**y**), so this character string could be converted
to the date class with the function `mdy`. As another example, the `ymd`
function from lubridate can be used to parse a column into a Date class,
regardless of the original format of the date, as long as the date elements are
in the order: year, month, day. For example:

```{r message = FALSE}
library("lubridate")
ymd("2008-10-13")
ymd("'08 Oct 13")
```
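
The other ordering functions work the same way. As a short sketch with made-up date strings, `mdy` and `dmy` parse month-day-year and day-month-year input:

```{r message = FALSE}
library(lubridate)

mdy("12/31/99")
dmy("31 December 1999")
```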

To convert the `date` column in the `daily_show` data into a Date
class, then, you can run:

```{r}
library(package = "lubridate")

class(x = daily_show$date) # Check the class of the 'date' column before mutating it

daily_show <- mutate(.data = daily_show,
                     date = mdy(date))
head(x = daily_show, n = 3)
class(x = daily_show$date) # Check the class of the 'date' column after mutating it
```

Once you have an object in the `Date` class, you can do things like plot by
date, calculate the range of dates, and calculate the total number of days the
dataset covers:

```{r eval = FALSE}
range(daily_show$date)
diff(x = range(daily_show$date))
```
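
As a small self-contained sketch of the same idea, using three made-up dates rather than the `daily_show` data:

```{r message = FALSE}
library(lubridate)

example_dates <- ymd(c("1999-01-11", "1999-01-12", "1999-01-18"))

range(example_dates)        # earliest and latest date
diff(range(example_dates))  # time between them, in days
```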

We could have used these `lubridate` functions to convert the date column in `daily_show` when first reading in the data, using the following pipe chain: 

```{r message = FALSE}
daily_show <- read_csv(file = "data/daily_show_guests.csv",
                       skip = 4) %>%
  rename(job = GoogleKnowlege_Occupation, 
         date = Show,
         category = Group,
         guest_name = Raw_Guest_List) %>%
  select(-YEAR) %>%
  mutate(date = mdy(date)) %>%
  filter(category == "Science")
head(x = daily_show, n = 2)
```

The `lubridate` package also includes functions to pull out certain elements of a date, including: 

- `wday`
- `mday`
- `yday`
- `month`
- `quarter`
- `year`

For example, we could use `wday` to create a new column with the weekday of each show: 

```{r}
mutate(.data = daily_show,
       show_day = wday(x = date, label = TRUE)) %>%
  select(date, show_day, guest_name) %>%
  slice(1:5)
```
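
The other functions in this list follow the same pattern. A minimal sketch with a single made-up date:

```{r message = FALSE}
library(lubridate)

example_day <- ymd("2008-10-13")

year(example_day)
month(example_day, label = TRUE)  # month as a labeled factor
mday(example_day)                 # day of the month
yday(example_day)                 # day of the year
```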

```{block, type = 'rmdwarning'}
By default, R functions tend to use either the time zone of **your** computer's
operating system or UTC (also called GMT). When working with dates and times,
be careful to either specify the time zone or convince yourself that the default
behavior works for your application.
```
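
As a hedged illustration, `lubridate`'s date-time parsers such as `ymd_hms` assume UTC unless you supply a `tz` argument, while `Sys.timezone()` reports the time zone your own computer is set to. The date-time string and the `"America/Denver"` time zone are arbitrary choices for this sketch:

```{r message = FALSE}
library(lubridate)

Sys.timezone()  # time zone of your operating system (varies by computer)

utc_time <- ymd_hms("2008-10-13 12:00:00")  # parsed as UTC by default
denver_time <- ymd_hms("2008-10-13 12:00:00", tz = "America/Denver")

tz(utc_time)
tz(denver_time)
denver_time - utc_time  # same clock time, but different instants
```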


## Logical vectors

<iframe width="576" height="360" src="https://www.youtube.com/embed/2t8gDG8croo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_6.pdf) 
a pdf of the lecture slides for this video.

Last week, you learned a lot about logical statements and how to use them with the `filter` function from the `dplyr` package. You can also use logical vectors, created with these logical statements, for a lot of other things. For example, you can use them directly in the square bracket indexing (`[..., ...]`) to pull out just the rows of a dataframe that meet a certain condition. For using logical statements in either context, it is helpful to understand a bit more about logical vectors. 

When you run a logical statement on a vector, you create a logical vector the same length as the original vector:

```{r}
length(nepali$sex)
length(nepali$sex == "Male")
```

The logical vector (`nepali$sex == "Male"` in this example) will have the value `TRUE` at any position where the original vector (`nepali$sex` in this example) met the logical condition you tested, and `FALSE` anywhere else:

```{r}
head(nepali$sex)
head(nepali$sex == "Male")
```

You can "flip" this logical vector (i.e., change every `TRUE` to `FALSE` and vice-versa) using the *bang operator*, `!`:

```{r}
is_male <- nepali$sex == "Male" # Save this logical vector as the object named `is_male`

head(is_male)
head(!is_male)
```

The bang operator turns out to be very useful. You will often find cases where it's difficult to write a logical vector to get what you want, but fairly easy to write the inverse (find everything you don't want). One example is filtering down to non-missing values---the `is.na` function will return `TRUE` for any value that is `NA`, so you can use `!is.na()` to identify any non-missing values. 
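
A minimal sketch of this pattern, using a made-up vector (`example_ht`) with missing values:

```{r}
example_ht <- c(91.2, NA, 84.1, NA, 98.0)

is.na(example_ht)                # TRUE where a value is missing
!is.na(example_ht)               # TRUE where a value is present
example_ht[!is.na(example_ht)]   # keep only the non-missing values
```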

You can do a few cool things with a logical vector. For example, you can use it inside a `filter` function to pull out just the rows of a dataframe where `is_male` is `TRUE`:

```{r}
nepali %>% 
  filter(is_male) %>% 
  head()
```

Or, with `!`, just the rows where `is_male` is `FALSE`:

```{r}
nepali %>% 
  filter(!is_male) %>% 
  head()
```
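
The same kind of subsetting can be done with square bracket indexing by putting a logical vector in the row position. As a self-contained sketch, this uses a small made-up dataframe (`toy_df`) rather than `nepali`:

```{r}
toy_df <- data.frame(id = c("a", "b", "c"),
                     sex = c("Male", "Female", "Male"))
toy_is_male <- toy_df$sex == "Male"

toy_df[toy_is_male, ]    # rows where the logical vector is TRUE
toy_df[!toy_is_male, ]   # rows where it is FALSE
```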

You can also use `sum()` and `table()` with a logical vector to find out how many of the values in the vector are `TRUE` and how many are `FALSE`. You can use `sum` because R saves logical vectors at a basic level as `0` for `FALSE` and `1` for `TRUE`. Therefore, if you add up all the values in a logical vector, you're adding up the number of observations with the value `TRUE`.

In the example, you can use these functions to find out how many males and females are in the dataset:

```{r}
sum(is_male)
sum(!is_male)
table(is_male)
```
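
To see this underlying storage directly, you can convert a logical vector to numbers. A tiny sketch with a made-up vector:

```{r}
example_lgl <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(example_lgl)  # TRUE becomes 1 and FALSE becomes 0
sum(example_lgl)         # so the sum counts the TRUE values
mean(example_lgl)        # and the mean is the proportion that are TRUE
```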

Note that you could also achieve the same thing with `dplyr` functions. For example, you could use `mutate` with a logical statement to create an `is_male` column in the `nepali` dataframe, then group by the new `is_male` column and count the number of observations in each group using `count`:

```{r}
library(dplyr)
nepali %>%
  mutate(is_male = sex == "Male") %>%
  group_by(is_male) %>%
  count()
```

We covered using `summarize`, including with data that has been grouped with `group_by`, earlier in this chapter. 

<iframe width="576" height="360" src="https://www.youtube.com/embed/0_EpZQKWsow?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_7.pdf) 
a pdf of the lecture slides for this video.

## Plots to explore data

<iframe width="576" height="360" src="https://www.youtube.com/embed/2E0MlcsfBmg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_8.pdf) 
a pdf of the lecture slides for this video.

Exploratory data analysis is a key step in data analysis and plotting your data in different ways is an important part of this process. In this section, I will focus on the basics of `ggplot2` plotting, to get you started creating some plots to explore your data. 
This section will focus on making **useful**, rather than **attractive** graphs, since at this stage we are focusing on exploring data for yourself rather than presenting results to others. Next week, I will explain more about how you can customize ggplot objects, to help you make plots to communicate with others.  


All of the plots we'll make today will use the `ggplot2` package (another member of the tidyverse!). If you don't already have that installed, you'll need to install it. You then need to load the package in your current session of R:

```{r}
# install.packages("ggplot2")  ## Uncomment and run if you don't have `ggplot2` installed
library(ggplot2)
```

The process of creating a plot using `ggplot2` follows conventions that are a bit different than most of the code you've seen so far in R (although it is somewhat similar to the idea of piping I introduced in the last chapter). The basic steps behind creating a plot with `ggplot2` are:

1. Create an object of the `ggplot` class, typically specifying the **data** and some or all of the **aesthetics**; 
2. Add on **geoms** and other elements to create and customize the plot, using `+`.

You can add on one or many geoms and other elements to create plots that range from very simple to very customized. This week, we'll focus on simple geoms and added elements, and then explore more detailed customization next week. 

```{block type = "rmdwarning"}
If R gets to the end of a line and there is no indication that the call is unfinished (e.g., `%>%` for piping or `+` for `ggplot2` plots), R interprets that as a message to run the call without reading in further code. A common error when writing `ggplot2` code is to put the `+` to add a geom or element at the beginning of a line rather than the end of a previous line; in this case, R will try to execute the call too soon. To avoid errors, be sure to end lines with `+` rather than starting lines with it. 
```
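
As an illustration of where the `+` belongs, here is a sketch using the `mtcars` dataset that comes with R (chosen only so the example runs on its own):

```{r}
library(ggplot2)

# Works: each unfinished line ends with `+`
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Does not work as intended: R runs the first line as a complete call
# ggplot(mtcars, aes(x = wt, y = mpg))
#   + geom_point()
```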

### Initializing a ggplot object

The first step in creating a plot using `ggplot2` is to create a ggplot object. This object will not, by itself, create a plot with anything in it. Instead, it typically specifies the data frame you want to use and which aesthetics will be mapped to certain columns of that data frame (aesthetics are explained more in the next subsection). 

Use the following conventions to initialize a ggplot object:

```{r, eval = FALSE}
## Generic code
object <- ggplot(dataframe, aes(x = column_1, y = column_2))
```

The data frame is the first parameter in a `ggplot` call and, if you like, you can use the parameter definition with that call (e.g., `data = dataframe`). Aesthetics are defined within an `aes` function call that typically is used within the `ggplot` call. 

```{block type = "rmdnote"}
While the `ggplot` call is the place where you will most often see an `aes` call, `aes` can also be used within the calls to add specific geoms. This can be particularly useful if you want to map aesthetics differently for different geoms in your plot. We'll see some examples of this use of `aes` more in later sections, when we talk about customizing plots. The `data` parameter can also be used in geom calls, to use a different data frame from the one defined when creating the original ggplot object, although this tends to be less common. 
```
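
As a sketch of these two placements (using the built-in `mtcars` data as a stand-in), both calls below draw the same scatterplot; in the second, the aesthetics are mapped inside the geom call:

```{r}
library(ggplot2)

# Aesthetics set when initializing the ggplot object
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Aesthetics set within the geom call
ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))
```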

### Plot aesthetics

**Aesthetics** are properties of the plot that can show certain elements of the data. For example, in Figure \@ref(fig:aesmapex), color shows (is mapped to) gender, x-position shows height, and y-position shows weight in a sample data set of measurements of children in Nepal. 

```{r aesmapex, echo = FALSE, warning = FALSE, fig.width = 6, fig.height = 4, fig.align = "center", message = FALSE, fig.cap = "Example of how different properties of a plot can show different elements to the data. Here, color indicates gender, position along the x-axis shows height, and position along the y-axis shows weight. This example is a subset of data from the `nepali` dataset in the `faraway` package."}
library(dplyr)
data("nepali") 
nepali %>%
  as_tibble() %>% 
  distinct(id, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  ggplot(aes(x = ht, y = wt, color = sex)) + 
  geom_point() + 
  xlab("Height (cm)") + ylab("Weight (kg)")
```

```{block type = "rmdnote"}
Any of these aesthetics could also be given a constant value, instead of being mapped to an element of the data. For example, all the points could be red, instead of showing gender.
```

Which aesthetics are required for a plot depends on which geoms (more on those in a second) you're adding to the plot. You can find out the aesthetics you can use for a geom in the "Aesthetics" section of the geom's help file (e.g., `?geom_point`). Required aesthetics are in bold in this section of the help file and optional ones are not. Common plot aesthetics you might want to specify include: 

```{r echo = FALSE}
aes_vals <- data.frame(aes = c("`x`", "`y`", "`shape`",
                               "`color`", "`fill`", "`size`",
                               "`alpha`", "`linetype`"),
                       desc = c("Position on x-axis", 
                                "Position on y-axis", 
                                "Shape",
                                "Color of border of elements", 
                                "Color of inside of elements",
                                "Size", 
                                "Transparency (1: opaque; 0: transparent)",
                                "Type of line (e.g., solid, dashed)"))
knitr::kable(aes_vals, col.names = c("Code", "Description"))
```

### Adding geoms

Next, you'll want to add one or more `geoms` to create the plot. You can add these with `+` after the `ggplot` statement to initialize the ggplot object. Some of the most common geoms are:

```{r echo = FALSE}
plot_funcs <- data.frame(type = c("Histogram (1 numeric variable)",
                                  "Scatterplot (2 numeric variables)",
                                  "Boxplot (1 numeric variable, possibly 1 factor variable)",
                                  "Line graph (2 numeric variables)"), 
                         ggplot2_func = c("`geom_histogram`",
                                          "`geom_point`",
                                          "`geom_boxplot`",
                                          "`geom_line`"))
knitr::kable(plot_funcs, col.names = c("Plot type",
                                       "ggplot2 function"))
```

### Constant aesthetics

<iframe width="576" height="360" src="https://www.youtube.com/embed/qiTPGzqYiOI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_9.pdf) 
a pdf of the lecture slides for this video.

Instead of mapping an aesthetic to an element of your data, you can use a constant value for it. For example, you may want to make all the points green, rather than having color map to gender: 

```{r echo = FALSE, warning = FALSE, fig.align = "center", out.width = "0.6\\textwidth", message = FALSE, fig.width = 5, fig.height = 3}
nepali %>%
  as_tibble() %>% 
  distinct(id, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  ggplot(aes(x = ht, y = wt)) + 
  geom_point(color = "darkgreen") + 
  xlab("Height (cm)") + ylab("Weight (kg)")
```

In this case, you'll define that aesthetic when you add the geom, outside of an `aes` statement. In R, you can specify the shape of points with a number. Figure \@ref(fig:shapeexamples) shows the shapes that correspond to the numbers 1 to 25 in the `shape` aesthetic. This figure also provides an example of the difference between color (black for all these example points) and fill (red for these examples). You can see that some point shapes include a fill (21 for example), while some are either empty (1) or solid (19).

```{r shapeexamples, echo = FALSE, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Examples of the shapes corresponding to different numeric choices for the `shape` aesthetic. For all examples, `color` is set to black and `fill` to red."}
x <- rep(1:5, 5)
y <- rep(1:5, each = 5)
shape <- 1:25
to_plot <- tibble(x = x, y = y, shape = shape)
ggplot(to_plot, aes(x = x, y = y)) + 
  geom_point(shape = shape, size = 4, color = "black", fill = "red") + 
  geom_text(label = shape, nudge_x = -0.25) +
  xlim(c(0.5, 5.5)) + 
  theme_void() + 
  scale_y_reverse()
```

If you want to set color to be a constant value, you can do that in R using character strings for different colors. Figure \@ref(fig:colorexamples) gives an example of some of the different blues available in R. To find links to listings of different R colors, google "R colors" and search by "Images".

```{r colorexamples, echo = FALSE, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Example of available shades of blue in R."}
x <- rep(0, 6)
y <- 1:6
color <- c("blue", "blue4", "darkorchid", "deepskyblue2", 
           "steelblue1", "dodgerblue3")
to_plot <- tibble(x = x, y = y, color = color)
ggplot(to_plot, aes(x = x, y = y)) + 
  geom_point(color = color, size = 2) + 
  geom_text(label = color, hjust = 0, nudge_x = 0.05) + 
  theme_void() + 
  xlim(c(-1, 1.5)) +
  scale_y_reverse()
```
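
To make the contrast between mapping and setting an aesthetic concrete, here is a sketch with the built-in `mtcars` data (a stand-in for this example). In the first plot, color is mapped to a column inside `aes`; in the second, color, shape, and fill are set to constants outside of `aes`:

```{r}
library(ggplot2)

# Color mapped to a column: one color per number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Constant aesthetics: every point gets the same shape, border, and fill
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 21, color = "black", fill = "red")
```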

### Useful plot additions

There are also a number of elements that you can add onto a `ggplot` object using `+`. A few that are used very frequently are: 

```{r echo = FALSE}
plot_adds <- data.frame(add = c("`ggtitle`",
                                "`xlab`, `ylab`",
                                "`xlim`, `ylim`"),
                        descrip = c("Plot title",
                                    "x- and y-axis labels",
                                    "Limits of x- and y-axis"))
knitr::kable(plot_adds, col.names = c("Element", "Description"))
```

### Example dataset

For the example plots, I'll use a dataset in the `faraway` package called `nepali`. This gives data from a study of the health of a group of Nepalese children. 

```{r}
library(faraway)
data(nepali)
```

I'll be using functions from `dplyr` and `ggplot2`, so those need to be loaded:

```{r message = FALSE, warning = FALSE}
library(dplyr)
library(ggplot2)
```

Each observation is a single measurement for a child; there can be multiple observations per child. I used the following code to select only the columns for child id, sex, weight, height, and age. I also used `distinct` to limit the dataset to only include one measurement for each child, the child's first measurement in the dataset. 

```{r message = FALSE}
nepali <- nepali %>%
  select(id, sex, wt, ht, age) %>%
  mutate(id = factor(id),
         sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female"))) %>%
  distinct(id, .keep_all = TRUE)
```

After this cleaning, the data looks like this:

```{r}
head(nepali)
```

### Histograms

<iframe width="576" height="360" src="https://www.youtube.com/embed/qz5SmXkOj_k?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_10.pdf) 
a pdf of the lecture slides for this video.

Histograms show the distribution of a single variable. Therefore, `geom_histogram()` requires only one main aesthetic, `x`, the (numeric) vector for which you want to create a histogram. For example, to create a histogram of children's heights for the Nepali dataset (Figure \@ref(fig:nepalihist1)), run: 

```{r, nepalihist1, fig.width = 4, fig.height = 3, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Basic example of plotting a histogram with `ggplot2`. This histogram shows the distribution of heights for the first recorded measurements of each child in the `nepali` dataset."}
ggplot(nepali, aes(x = ht)) + 
  geom_histogram()
```

```{block type = "rmdnote"}
If you run the code with no arguments for `binwidth` or `bins` in `geom_histogram`, you will get a message saying "`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.". This message is just saying that a default number of bins was used to create the histogram. You can use arguments to change the number of bins used, but often this default is fine. You may also get a message that observations with missing values were removed. 
```

You can now add some elements to the histogram to customize it a bit. For example (Figure \@ref(fig:nepalihist2)), you can add a figure title (`ggtitle`) and a clearer label for the x-axis (`xlab`). You can also change the range of values shown on the x-axis (`xlim`).

```{r, nepalihist2, fig.width = 4, fig.height = 3, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of adding ggplot elements to customize a histogram."}
ggplot(nepali, aes(x = ht)) + 
  geom_histogram(fill = "lightblue", color = "black") + 
  ggtitle("Height of children") + 
  xlab("Height (cm)") + xlim(c(0, 120))
```

The geom `geom_histogram` also has special arguments for setting the number or width of the bins used in the histogram. Figure \@ref(fig:nepalihist3) shows an example of how you can use the `bins` argument to change the number of bins that are used to make the histogram of height for the `nepali` dataset.  

```{r, nepalihist3, fig.width = 4, fig.height = 3, fig.align = "center", warning = FALSE, message = FALSE, fig.cap = "Example of using the `bins` argument to change the number of bins used in a histogram."}
ggplot(nepali, aes(x = ht)) + 
  geom_histogram(fill = "lightblue", color = "black",
                 bins = 40) 
```

Similarly, the `binwidth` argument can be used to set the width of bins. Figure \@ref(fig:nepalihist4) shows an example of using this argument to create a histogram of the Nepali children's heights with bin widths of 10 centimeters (note that this argument is set in the same units as the x variable).

```{r, nepalihist4, fig.width = 4, fig.height = 3, fig.align = "center", warning = FALSE, message = FALSE, fig.cap = "Example of using the `binwidth` argument to set the width of each bin used in a histogram."}
ggplot(nepali, aes(x = ht)) + 
  geom_histogram(fill = "lightblue", color = "black",
                 binwidth = 10) 
```

### Scatterplots

A scatterplot shows how one variable changes as another changes. You can use the `geom_point` geom to create a scatterplot. For example, to create a scatterplot of weight versus height for the Nepali data (Figure \@ref(fig:nepaliscatter1)), you can run the following code: 

```{r nepaliscatter1, fig.width = 5, fig.height = 4, warning = FALSE, fig.align = "center", fig.cap = "Example of creating a scatterplot. This scatterplot shows the relationship between children's heights and weights within the nepali dataset."}
ggplot(nepali, aes(x = ht, y = wt)) + 
  geom_point()
```

Again, you can use some of the options and additions to change the plot appearance. For example, to add a title, change the x- and y-axis labels, and change the color and size of the points on the scatterplot (Figure \@ref(fig:nepaliscatter2)), you can run:

```{r nepaliscatter2, fig.width = 5, fig.height = 4, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of adding ggplot elements to customize a scatterplot."}
ggplot(nepali, aes(x = ht, y = wt)) + 
  geom_point(color = "blue", size = 0.5) + 
  ggtitle("Weight versus Height") + 
  xlab("Height (cm)") + ylab("Weight (kg)")
```

You can also try mapping another variable in the dataset to the `color` aesthetic. For example, to use color to show the sex of each child in the scatterplot (Figure \@ref(fig:nepaliscatter3)), you can run:

```{r nepaliscatter3, fig.width = 5, fig.height = 4, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of mapping color to an element of the data in a scatterplot."}
ggplot(nepali, aes(x = ht, y = wt, color = sex)) + 
  geom_point(size = 0.5) + 
  ggtitle("Weight versus Height") + 
  xlab("Height (cm)") + ylab("Weight (kg)")
```

### Boxplots 

Boxplots can be used to show the distribution of a continuous variable. To create a boxplot, you can use the `geom_boxplot` geom. To plot a boxplot for a single, continuous variable, you can map that variable to `y` in the `aes` call, and map `x` to the constant `1`. For example, to create a boxplot of the heights of children in the Nepali dataset (Figure \@ref(fig:nepaliboxplot1)), you can run:

```{r nepaliboxplot1, fig.height = 4, fig.width = 4, warning = FALSE, fig.align="center", fig.cap = "Example of creating a boxplot. The example shows the distribution of height data for children in the nepali dataset."}
ggplot(nepali, aes(x = 1, y = ht)) + 
  geom_boxplot() + 
  xlab("") + ylab("Height (cm)")
```

You can also create separate boxplots, one for each level of a factor (Figure \@ref(fig:nepaliboxplot2)). In this case, you'll need to include two aesthetics (`x` and `y`) when you initialize the ggplot object. The `y` variable is the variable whose distribution will be shown, and the `x` variable should be a discrete (categorical or TRUE/FALSE) variable that is used to split the data into groups. You can also specify this `x` variable as the grouping variable, using `group` within the aesthetic call; when `x` is already a factor, as `sex` is here, `ggplot2` groups by it automatically, so `group` is optional.

```{r nepaliboxplot2, fig.height = 4, fig.width = 5, fig.align = "center", warning = FALSE, fig.cap = "Example of creating separate boxplots, divided by a categorical grouping variable in the data."}
ggplot(nepali, aes(x = sex, y = ht, group = sex)) + 
  geom_boxplot() + 
  xlab("Sex") + ylab("Height (cm)") 
```


## In-course Exercise Chapter 3

### Loading data from an R package

Pick one person to start sharing their screen.

The data we'll be using today is from a dataset called `worldcup` in the package
`faraway`. Load that data so you can use it on your computer (note: you will
need to install and load the `faraway` package to do this). Use the help file
for the data to find out more about the dataset. Use some basic functions, like
`head`, `tail`, `slice`, `colnames`, `str`, and `summary` to check out the data a bit 
(if you haven't seen some of these before, remember you can always check their 
help files!). See if you can figure out:

- What variables are included in this dataset? (Check the column names.)
- What class is each column currently? In particular, which are numbers and
which are factors?

#### Example R code:

Load the `faraway` package using `library()` and then load the data using `data()`:

```{r}
## Uncomment the next line if you need to install the package
# install.packages("faraway")
library(faraway)
data("worldcup")
```

Check out the help file for the `worldcup` dataset to find out more about the
data. (Note: Only datasets that are parts of packages will have help files.)

```{r, eval = FALSE}
?worldcup
```

Check out the data a bit:

```{r}
str(worldcup)
head(worldcup)
tail(worldcup)
colnames(worldcup)
summary(worldcup)
```
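
The exercise also mentions `slice`, and asks which class each column has. As a minimal sketch (the object names `first_rows` and `column_classes` are just illustrative), you could pull a few rows by position with `slice` from `dplyr` and check the class of every column at once with base R's `sapply`:

```{r}
library(dplyr)
library(faraway)
data("worldcup")

# Pull the first three rows by position
first_rows <- slice(worldcup, 1:3)
first_rows

# Apply `class` to each column to see which are factors and which are numbers
column_classes <- sapply(worldcup, class)
column_classes
```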


### Exploring the data using simple statistics and `summarize`

Rotate to someone else to share their screen.

Then, try checking out the data using some basic commands for simple statistics,
like `mean()`, `range()`, `max()`, and `min()`, as well as the `summarize` and
`group_by` functions from the `dplyr` package. Try to answer the following
questions:

- What is the mean number of saves that players made? 
- What is the mean number of saves just among the goalkeepers? 
- Did players from any position other than goalkeeper make a save?
- How many players were there in each position? 
- How many forwards were there on each team? Which team had the most shots in total among all its forwards?
- Which team(s) had the defender with the most tackles?

If you have extra time, continue using the ["Data Wrangling"
cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
to come up with some other ideas for how you can explore this data, and write up
and test code to do that.

#### Example R code:

To calculate the mean number of saves among all the players, use the `mean`
function, either by itself or within a `summarize` call:

```{r}
mean(worldcup$Saves)

worldcup %>%
  summarize(mean_saves = mean(Saves))
```

There are a few ways to figure out the mean number of saves just among the
goalkeepers. One way is to filter the dataset to only goalies and then use
`summarize` to calculate the mean number of saves in this filtered subset of the
data:

```{r}
worldcup %>%
  filter(Position == "Goalkeeper") %>% 
  summarize(mean_saves = mean(Saves))
```

The next question is if players from any position other than goalkeeper made a
save. One way to figure this out is to group the data by position and then
summarize the maximum number of saves. Based on this, it looks like there were
no saves from players in any position except goalie:

```{r}
worldcup %>%
  group_by(Position) %>% 
  summarize(max_saves = max(Saves))
```
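
Another possible approach, sketched here as one option among several, is to check this with a logical statement: filter to players who are not goalkeepers and then ask whether any of them has more than zero saves. The column names `any_saves` and `n_with_saves` are names chosen for this example.

```{r}
library(dplyr)
library(faraway)
data("worldcup")

non_goalie_saves <- worldcup %>%
  filter(Position != "Goalkeeper") %>%
  summarize(any_saves = any(Saves > 0),      # TRUE if at least one player had a save
            n_with_saves = sum(Saves > 0))   # number of such players
non_goalie_saves
```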

To figure out how many players there were in each position, you can group
the data by position and then use the `count` function from `dplyr` to count the
number of observations in each group:

```{r}
worldcup %>%
  group_by(Position) %>% 
  count()
```
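
If you'd like to compare a few equivalent ways of writing this, the sketch below shows two: passing the grouping column directly to `count`, and using `summarize` with `n()` after `group_by`. The object names are illustrative only.

```{r}
library(dplyr)
library(faraway)
data("worldcup")

# Shorthand: give `count` the column to count by
position_counts <- worldcup %>%
  count(Position)
position_counts

# Longer form: group, then count rows per group with `n()`
position_counts_alt <- worldcup %>%
  group_by(Position) %>%
  summarize(n = n())
position_counts_alt
```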

For the next set of questions, you can filter the data to only Forwards, then
group by team to use `summarize` to count up the number of Forwards on each
team. You can also use the same `summarize` call to figure out the total number
of shots by all Forwards on each team. To figure out which team had the most
shots in total among all its forwards, you can use the `arrange` function to
re-order the data from the team with the most total shots to the least. It turns
out that Uruguay had the most shots by forwards on its team, with a total of 46
shots.

```{r}
worldcup %>% 
  filter(Position == "Forward") %>% 
  group_by(Team) %>% 
  summarize(n_forwards = n(),
            total_forward_shots = sum(Shots)) %>% 
  arrange(desc(total_forward_shots))
```

To figure out which team(s) had the defender with the most tackles, you can
filter to only defenders and then use the `top_n` function to identify the
players with the top number of tackles. It turns out these players were on the
England, Germany, and Chile teams.

```{r}
worldcup %>% 
  filter(Position == "Defender") %>% 
  top_n(n = 1, wt = Tackles)
```
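
In recent versions of `dplyr` (1.0.0 and later), `top_n` is marked as superseded in favor of `slice_max`. Assuming you have one of those versions installed, a sketch of the same task looks like this (`top_tacklers` is a name chosen for this example). Like `top_n`, `slice_max` keeps tied rows by default.

```{r}
library(dplyr)
library(faraway)
data("worldcup")

# Assumes dplyr >= 1.0.0, where `slice_max` is available
top_tacklers <- worldcup %>%
  filter(Position == "Defender") %>%
  slice_max(order_by = Tackles, n = 1)
top_tacklers
```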

### Exploring the data using logical statements

Rotate to someone else to share their screen.

Then, try checking out the data using logical statements and some of the `dplyr`
code we covered in the last chapter (`filter` and `arrange`, for example), to help you
answer the following questions:

- What is the range of time that players spent in the game? 
- Which player or players played the most time in this World Cup? 
- How many players are goalies in this dataset?
- Create a new R object named `brazil_players` that is limited to the players in
this dataset that are (1) on the Brazil team and (2) not goalies.

If you have additional time, look over the "Data Manipulation" cheatsheet
available in RStudio's Help section. Make a list of questions you would like to
figure out from this example data, and start to plan out how you might be able
to answer those questions using functions from `dplyr`. Write the related code
and see if it works.

#### Example R code:

To figure out the range of time, you could use `arrange` twice, once with `desc`
and once without, to figure out the maximum and minimum values:

```{r}
# Minimum time
arrange(worldcup, Time) %>% 
  select(Time) %>% 
  slice(1)

# Maximum time
arrange(worldcup, desc(Time)) %>% 
  select(Time) %>% 
  slice(1)
```


Later, we will learn about the `n()` function, which you can use within piped
code to represent the total number of rows in the dataframe. If you'd like to
get the full range of the `Time` column in one pipeline of code, you can use
`n()` as a reference within `slice`, to pull both the first and last rows of the
dataframe:

```{r}
arrange(worldcup, Time) %>% 
  select(Time) %>% 
  slice(c(1, n()))
```

Finally, you could also use `min()` and `max()` functions to get the minimum and
maximum values of the `Time` column in the `worldcup` dataframe (remember that
you can use the `dataframe$column_name` notation to pull a column from a
dataframe). Similarly, there is a function called `range()` you could use to
find out the range of time these players played in the World Cup.

```{r}
range(worldcup$Time)
```
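
For completeness, here is a short sketch of the `min()` and `max()` route mentioned above, both with the `dataframe$column_name` notation and inside a `summarize` call. The names `min_time`, `max_time`, and `time_summary` are just chosen for this example.

```{r}
library(dplyr)
library(faraway)
data("worldcup")

# Using `$` to pull the column as a vector
min_time <- min(worldcup$Time)
max_time <- max(worldcup$Time)
c(min_time, max_time)

# The same values within a `dplyr` pipeline
time_summary <- worldcup %>%
  summarize(min_time = min(Time),
            max_time = max(Time))
time_summary
```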

To figure out which player or players played for the most time, there are a few
approaches you can take. Here I'm showing two: (1) using `filter` from the
`dplyr` package to filter down to rows where the `Time` for that row
equals the maximum play time that you determined from an earlier task (570
minutes); and (2) using the `top_n` function from `dplyr` to pick out the rows
with the maximum value (`n = 1`) of the `Time` column (see the help file for
`top_n` if you are unfamiliar with this function; we have not covered it in
class yet).

```{r}
worldcup %>%
  filter(Time == 570)

worldcup %>% 
  top_n(n = 1, wt = Time)
```

*Note*: You may have noticed that you lost the players' names when you did this
using the `dplyr` pipe chain. That's because `dplyr` functions convert the data
to a dataframe format that does not include rownames. If you want to keep
players' names, you can use a function from the `tibble` package called
`rownames_to_column` to move those names from the rownames of the data into a
column in the dataframe. Use the `var` parameter of this function to specify
what you want the new column to be named. For example:

```{r}
library(tibble)
worldcup %>%
  rownames_to_column(var = "Name") %>% 
  filter(Time == 570)
```

There are a few ways to figure out how many players are goalies in this dataset.
One way is to use `sum()` on a logical vector of whether the player's position
is "Goalkeeper":

```{r}
is_goalie <- worldcup$Position == "Goalkeeper"
sum(is_goalie)
```
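
This works because R treats `TRUE` as 1 and `FALSE` as 0 when you do arithmetic on a logical vector. The sketch below illustrates the idea with a small made-up vector (`toy_logical` is hypothetical) and then uses `mean()` in the same way to get the proportion of players who are goalies:

```{r}
library(faraway)
data("worldcup")

# A made-up logical vector with three TRUE values out of four
toy_logical <- c(TRUE, FALSE, TRUE, TRUE)
sum(toy_logical)    # counts the TRUE values
mean(toy_logical)   # proportion of values that are TRUE

# The same idea gives the proportion of players who are goalies
prop_goalies <- mean(worldcup$Position == "Goalkeeper")
prop_goalies
```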

Another way is to use `filter` from `dplyr`, along with a logical statement, to
filter the data to only players with the position of "Goalkeeper", and then pipe
that filtered subset into the `nrow` function to count the number of rows in the
filtered dataframe:

```{r}
worldcup %>% 
  filter(Position == "Goalkeeper") %>% 
  nrow()
```

Next, create a new R object named `brazil_players` that is limited to the
players in this dataset that are (1) on the Brazil team and (2) not goalies. You
can use a logical statement to filter to rows that meet both these conditions by
joining two logical statements in the `filter` function with an `&`:

```{r}
brazil_players <- worldcup %>% 
  filter(Team == "Brazil" & Position != "Goalkeeper") 
head(brazil_players)
```
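
As an alternative sketch, you could write the "not goalies" condition by reversing a logical statement with the bang operator (`!`). This version also separates the two conditions with a comma, which `filter` treats the same way as joining them with `&`. The name `brazil_players_alt` is used here only so the original object isn't overwritten.

```{r}
library(dplyr)
library(faraway)
data("worldcup")

brazil_players_alt <- worldcup %>%
  filter(Team == "Brazil",                  # condition 1: on the Brazil team
         !(Position == "Goalkeeper"))       # condition 2: reverse "is a goalie"
head(brazil_players_alt)
```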


### Exploring the data using basic plots #1

Use some basic plots to check out this data. Try the following:

- Create a scatterplot of the `worldcup` data, where each point is a player, the x-axis shows the amount of time the player played in the World Cup, and the y-axis shows the number of passes the player had. Try writing the code both with and without "piping in" the data you want to plot into the `ggplot` function.
- Create the same scatterplot, but have each point in the scatterplot show that player's position using some aesthetic besides the x or y position (e.g., color, point shape). Add "rug plots" to the margins.
- Create a scatterplot of number of shots (x-axis) versus number of tackles (y-axis) for **just** players on one of the four teams that made the semi-finals (Spain, Netherlands, Germany, Uruguay). Use color to show player's position and shape to show player's team. (Hint: you will want to use some `dplyr` code to clean the data before plotting to do this.)
- Create a scatterplot of player time versus passes. Use color to show whether the player was on one of the top 4 teams or not. (Hint: Again, you'll want to use some `dplyr` code before plotting to do this.) For an extra challenge, also try adding each player's name on top of each point. (Hint: check out the `rownames_to_column` function from the `tibble` package to help with this.)
- Did you notice any interesting features of the data when you did any of the graphs in this section?

#### Example R code:

Create a scatterplot of `Time` versus `Passes`. 

```{r, fig.align = "center", fig.width = 5, fig.height = 3}
# Without piping
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes))

# With piping
worldcup %>% 
  ggplot() +
  geom_point(mapping = aes(x = Time, y = Passes)) 
```

Create the same scatterplot, but have each point in the scatterplot show that player's position. 

```{r}
ggplot(worldcup,
       mapping = aes(x = Time, y = Passes, color = Position)) + 
  geom_point() + 
  geom_rug()
```

Create a scatterplot of number of shots (x-axis) versus number of tackles (y-axis) for **just** players on one of the four teams that made the semi-finals (Spain, Netherlands, Germany, Uruguay). Use color to show player's position and shape to show player's team. For an extra challenge, also try adding each player's name on top of each point.

```{r}
worldcup %>% 
  rownames_to_column(var = "Name") %>% 
  filter(Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) %>% 
  ggplot() + 
  geom_point(aes(x = Shots, y = Tackles, color = Position, shape = Team)) + 
  geom_text(mapping = aes(x = Shots, y = Tackles, 
                          color = Position, label = Name), 
            size = 2.5)
```

Create a scatterplot of player time versus passes. Use color to show whether the player was on one of the top 4 teams or not.

```{r}
worldcup %>% 
  mutate(top_4 = Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) %>% 
  ggplot() + 
  geom_point(aes(x = Time, y = Passes, color = top_4))
```


### Exploring the data using basic plots #2

Go back to the code you used in the previous section to create a scatterplot of the `worldcup` data, where each point is a player, the x-axis shows the amount of time the player played in the World Cup, and the y-axis shows the number of passes the player had. Try the following modifications: 

- Make all the points blue. 
- Google "R colors" to find a list of color names in R. Pick your favorite and make all the points in the scatterplot that color.
- Change the size of the points to make them smaller (hint: check out the `size` aesthetic).
- Make it so the color of the points shows the player's position and all the points are slightly transparent.
- Change the title of the x-axis to "Time (minutes)" and the y-axis to "Number of passes".
- Add the title "World Cup statistics" and the subtitle "2010 World Cup".

#### Example R code:

Make all the points blue.

```{r}
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes),
             color = "blue")
```

Google "R colors" to find a list of color names in R. Pick your favorite and make all the points in the scatterplot that color.

```{r}
# Make the points "darkseagreen4"
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes),
             color = "darkseagreen4")
```

Change the size of the points to make them smaller (hint: check out the `size` aesthetic).

```{r}
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes),
             size = 0.8)
```

Make it so the color of the points shows the player's position and all the points are slightly transparent.

```{r}
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes, color = Position),
             alpha = 0.3)
```

Change the title of the x-axis to "Time (minutes)" and the y-axis to "Number of passes".

```{r}
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes)) + 
  labs(x = "Time (minutes)", y = "Number of passes")
```

Add the title "World Cup statistics" and the subtitle "2010 World Cup".

```{r}
ggplot(worldcup) + 
  geom_point(mapping = aes(x = Time, y = Passes)) + 
  ggtitle("World Cup statistics",
          subtitle = "2010 World Cup")
```
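
These changes can also be combined in a single plot. The following is one possible sketch that puts several of the modifications together, using `labs` to set the axis labels, title, and subtitle in one call. The specific values for `size` and `alpha` and the object name `worldcup_plot` are arbitrary choices for this example.

```{r, fig.align = "center", fig.width = 5, fig.height = 3}
library(ggplot2)
library(faraway)
data("worldcup")

worldcup_plot <- ggplot(worldcup) + 
  # Color is mapped to a column (inside `aes`); size and alpha are constants
  geom_point(mapping = aes(x = Time, y = Passes, color = Position),
             size = 0.8, alpha = 0.3) + 
  labs(x = "Time (minutes)", y = "Number of passes",
       title = "World Cup statistics",
       subtitle = "2010 World Cup")
worldcup_plot
```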


### Exploring the data using basic plots #3

Try out creating some plots using the "statistical" geoms to check out this data. Try the following:

- Plot histograms of all the numeric variables (`Time`, `Shots`, `Passes`, `Tackles`, `Saves`) in the dataset. 
- Try customizing the number of bins used for one of the histograms plotted in the previous step.
- Try using constant values for some of the aesthetics (e.g., customize the color and the fill) of the histogram created in the previous step.
- Create a boxplot of `Shots` by position.
- Create a `top_teams` subset with just the four teams that made the semi-finals in the 2010 World Cup (Spain, the Netherlands, Germany, and Uruguay). Plot boxplots of `Shots` and `Saves` by team for just these teams.
- Create a histogram using data only from the four top teams for the amount of time each player played. Use color to show team (hint: for histogram bars, the `fill` aesthetic sets the interior color, while `color` sets the outline). 

#### Example R code:

Use histograms to explore the distribution of different variables. If you want to change the number of bins in the histogram, try playing around with the `bins` and `binwidth` arguments. You can use the `bins` argument to say how many bins you want (e.g., `bins = 50`). You can use the `binwidth` argument to say how wide you want the bins to be (e.g., `binwidth = 10` if you wanted bins to be 10 units wide, in the units of the variable mapped to the `x` aesthetic). Try using `fill` and `color` to change the appearance of the plot. Google "R colors" and search the images to find links to listings of different R colors.

```{r, message = FALSE, fig.align = "center", fig.width = 5, fig.height = 3}
ggplot(worldcup, aes(x = Time)) + 
  geom_histogram()

ggplot(worldcup, aes(x = Time)) + 
  geom_histogram(bins = 50)

ggplot(worldcup, aes(x = Time)) + 
  geom_histogram(binwidth = 100)

ggplot(worldcup, aes(x = Time)) + 
  geom_histogram(binwidth = 50, color = "white", fill = "cyan4")
```


To create a boxplot of `Shots` by `Position`, you can use `geom_boxplot`:

```{r, fig.width=5, fig.height=3, fig.align = "center"}
ggplot(worldcup, aes(x = Position, y = Shots)) + 
  geom_boxplot()
```

The top four teams in this World Cup were Spain, the Netherlands, Germany, and Uruguay. Create a subset with just the data for these four teams:

```{r}
top_teams <- worldcup %>%
  filter(Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) 
```

Now, you can plot the boxplots, mapping `Team` to the `x` aesthetic and `Shots` to the `y` aesthetic:

```{r, fig.width = 6, fig.height = 3, fig.align = "center"}
ggplot(top_teams, aes(x = Team, y = Shots)) + 
  geom_boxplot() + 
  ggtitle("Shots per player in World Cup 2010")

```
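
The exercise also asks for boxplots of `Saves` by team. A minimal sketch, following the same pattern as the `Shots` plot, is below; it recreates `top_teams` so the chunk can run on its own, and `saves_plot` and the plot title are names chosen for this example.

```{r, fig.width = 6, fig.height = 3, fig.align = "center"}
library(dplyr)
library(ggplot2)
library(faraway)
data("worldcup")

top_teams <- worldcup %>%
  filter(Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) 

# Same structure as the `Shots` plot, with `Saves` mapped to `y`
saves_plot <- ggplot(top_teams, aes(x = Team, y = Saves)) + 
  geom_boxplot() + 
  ggtitle("Saves per player in World Cup 2010")
saves_plot
```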

Create a histogram using data only from the four top teams for the amount of time each player played. Map `Team` to the `fill` aesthetic so the color of the bars shows team (for histograms, `color` only sets the outline of the bars). 

```{r}
ggplot(data = top_teams) + 
  geom_histogram(aes(x = Time, fill = Team))
```

Note that you can also explore other values for `geom_histogram` arguments. For example, you could change the binwidths to be 90 minutes (since games are 90 minutes). 

```{r}
ggplot(data = top_teams) + 
  geom_histogram(aes(x = Time, fill = Team), binwidth = 90)
```