-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-visualization.qmd
394 lines (290 loc) · 11.8 KB
/
03-visualization.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
---
title: "Data Visualization and Descriptive Stats"
author: "Bella Ratmelia"
format: revealjs
---
## Today's Outline
1. Descriptive Statistics
2. Data Visualization in ggplot using Chile voting data, including guidelines on how to choose the appropriate visualization.
## Checklist when you start RStudio
- Load the project we created last session and open the R script file.
- Make sure that `Environment` panel is empty (click on broom icon to clean it up)
- Clear the `Console` and `Plots` too.
- Re-run the `library(tidyverse)` and `read_csv` portion in the previous session
## Refresher: Loading from CSV into a dataframe
Use `read_csv` from `readr` package (part of `tidyverse`) to load our data into a dataframe
```{r}
#| echo: true
#| label: load-data
#| message: false
#| output: false
# import tidyverse library
library(tidyverse)
# read the CSV with Chile voting data
chile_data <- read_csv("data-output/chile_voting_processed.csv")
chile_data <- chile_data |>
mutate(across(c("region", "sex", "education", "vote", "age_group", "support_level"), as.factor))
#reordering
chile_data <- chile_data |>
mutate(education = factor(education,
levels = c("P", "S", "PS"),
ordered = TRUE))
# peek at the data, pay attention to the data types!
glimpse(chile_data)
```
## Basic R Functions for Descriptive Statistics
Descriptive statistics provide summaries about the sample and observations made. These summaries can be quantitative (summary statistics) or visual (graphs).
Let's explore some basic R functions for descriptive statistics using our Chile voting data.
- `mean()`: arithmetic average
- `median()`: middle value
- `sd()`: standard deviation
- `var()`: variance
- `range()`: range of values
- `IQR()`: interquartile range
- `summary()`: provides a summary of descriptive statistics
- `Mode()` function from `DescTools` package: the most frequently occuring value ^[R doesn't have built-in functions for mode, so we need `DescTools` package to get this function]
## Exploring Age Distribution
Let's start by examining the `age` variable in our dataset.
```{r}
#| echo: true
#| output-location: slide
library(DescTools)
# Basic statistics
mean_age <- mean(chile_data$age, na.rm = TRUE)
median_age <- median(chile_data$age, na.rm = TRUE)
sd_age <- sd(chile_data$age, na.rm = TRUE)
range_age <- range(chile_data$age, na.rm = TRUE)
iqr_age <- IQR(chile_data$age, na.rm = TRUE)
mode_age <- DescTools::Mode(chile_data$age, na.rm = TRUE)
# Print results
cat("Mean age:", mean_age, "\n")
cat("Median age:", median_age, "\n")
cat("Standard deviation of age:", sd_age, "\n")
cat("Range of age:", range_age[1], "to", range_age[2], "\n")
cat("Interquartile range of age:", iqr_age, "\n")
cat("Most frequently occuring age:", mode_age, "\n")
# Summary statistics
summary(chile_data$age)
```
## Interpreting Age Statistics
Some possible interpretation:
- Median is lower than mean, suggesting a slight skew towards older ages
- Mode is much lower than both mean and median, indicating a cluster of young adults
- With a standard deviation of about 14.67 years, we can expect roughly two-thirds of the participants to fall between 23.62 and 52.96 years old
- Middle 50% of participants fall between 25 and 49 years old
It's much easier if we have a graph to visualize these interpretations!
## Visualizing data with ggplot
- `ggplot` is plotting package that is included inside `tidyverse` package
- works best with data in the long format, i.e., a column for all the dimensions/measures and another column for the value for each dimension/measure.
```{r}
#| echo: true
#| output-location: slide
chile_data |>
ggplot(aes(x = age)) +
geom_bar(fill = "lightblue") +
labs(title = "Age distribution of respondents",
x = "Age",
y = "Number of Respondents") +
theme_minimal()
```
## Anatomy of ggplot code
Charts built with ggplot must include the following:
```{.r code-line-numbers="|1|2|3|4-6|7"}
chile_data |> # <1>
ggplot(aes(x = age)) + # <2>
geom_bar(fill = "lightblue") + # <3>
labs(title = "Age distribution of respondents", # <4>
x = "Age", # <4>
y = "Count") + # <4>
theme_minimal() # <5>
```
1. **Data** - the dataframe/tibble to visualize.
2. **Aesthetic mappings (aes)** - describes which variables are mapped to the x, y axes, alpha (transparency) and other visual aesthetics.
3. **Geometric objects (geom)** - describes how values are rendered; as bars, scatterplot, lines, etc.
4. Provide titles and labels to your graph
5. (Optional) apply a theme/look to your graph
## Going back to our data
Our PI has asked us to generate visualizations to address these questions about the Chile voting data. The PI also has asked us to write down our interpretation of the visualizations.
1. What's the distribution of respondents' `statusquo` support?
2. What's the distribution of `votes`?
3. What's the percentage of the voting intentions on each region?
4. Is there any major divide in terms of support for the statusquo across age? Visualize the distribution of `age` and `statusquo`.
5. What does the voting intention look like across different education level? Visualize the distribution of `vote` intention on various `education` level.
6. Do people's voting intention matches their level of support to the status quo? Compare the distribution of `statusquo` across different `vote` intentions.
## Tip: open the ggplot cheatsheet
::: callout-tip
**A strategy I'd like to recommend:** briefly read over the `ggplot2` documentation and have them open on a separate tab. Figure out the type of variables you need to visualize (discrete or continuous) to quickly identify which visualization would make sense.
:::
![](images/ggplot-cheatsheet.jpg){width="60%"}
[ggplot documentation link](https://rstudio.github.io/cheatsheets/html/data-visualization.html)
## Task #1
What's the distribution of our respondents' `statusquo`?
```{r}
#| echo: true
#| output-location: slide
# One continuous variable
chile_data |>
ggplot(aes(x = statusquo)) +
geom_histogram(binwidth = 0.1, fill = "pink", color = "maroon") +
labs(title = "Distribution of Status Quo Scores",
x = "Status Quo Score",
y = "Count")
```
## Task #2
What's the distribution of `votes`?
```{r}
#| echo: true
#| output-location: slide
# One discrete/categorical variable
chile_data |>
ggplot(aes(x = vote)) +
geom_bar(fill = "skyblue") +
labs(title = "Distribution of Votes",
x = "Vote",
y = "Count")
```
## Task #3
What's the percentage of the voting intentions on each `region`?
```{r}
#| echo: true
#| output-location: slide
# Showing proportions / percentage as part of whole
chile_data %>%
ggplot(aes(x = vote, fill = region)) +
geom_bar(position = "fill") +
labs(title = "Distribution of Votes Across Regions",
x = "Vote",
y = "Proportion",
fill = "Region") +
theme_minimal()
```
## Group exercise #1 (solo attempts also ok)
Visualize the `vote` intention by `sex`. Make sure it is shown as proportion.
```{r}
#| echo: true
#| output-location: slide
#| code-fold: true
chile_data %>%
ggplot(aes(x = vote, fill = sex)) +
geom_bar(position = "fill") +
labs(title = "Proportion of Sex for each voting intention",
x = "vote intention",
y = "percentage") +
theme_minimal()
```
## Task #4
Is there any major divide in terms of support for the statusquo across age? Visualize the distribution of `age` and `statusquo`.
```{r}
#| echo: true
#| output-location: slide
# Two variables - both continuous
chile_data |>
ggplot(aes(x = statusquo, y = age, color=support_level)) +
geom_point(alpha = 0.5) +
labs(title = "Age vs Status Quo Score",
x = "Status Quo Score",
y = "Age")
```
## Task #5
What does the voting intention look like across different education level? Visualize the distribution of `vote` intention on various `education` level.
We can achieve this in two ways!
```{r}
#| echo: true
#| output-location: slide
# Two variables - both discrete/categorical
chile_data |>
ggplot(aes(x = education, y = vote)) +
geom_jitter(color="tomato", alpha=0.5) +
labs(title = "Voting intention Distribution by Education level",
x = "Education Level",
y = "Voting intention")
```
## Alternative way
```{r}
#| echo: true
#| output-location: slide
# Two variables - both discrete/categorical
# Alternative way
chile_data |>
ggplot(aes(x = education, fill = vote)) +
geom_bar(position = "dodge") +
labs(title = "Voting intention Distribution by Education Level",
x = "Education Level",
y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## Facets for multiple variables
```{r}
#| echo: true
#| output-location: slide
# Faceting and multiple variables
chile_data |>
ggplot(aes(x = age, y = statusquo, color = vote)) +
geom_point(alpha = 0.5) +
facet_wrap(~ education) +
labs(title = "Income vs Status Quo Score by Education Level",
x = "Age",
y = "Status Quo Score")
```
## Task #6
Do people's voting intention matches their level of support to the status quo? Compare the distribution of `statusquo` across different `vote` intentions.
Let's layer two kinds of visualization here:
```{r}
#| echo: true
#| output-location: slide
# Two variables - one categorical, one continuous
chile_data |>
ggplot(aes(x = vote, y = statusquo)) +
geom_boxplot(fill = "pink") +
geom_jitter(alpha = 0.4, color = "maroon") +
labs(title = "Distribution Status Quo Score by voting intention",
x = "Voting intention",
y = "Status Quo Support level")
```
## Group Exercise #2 (solo attempts ok)
Update the visualization from task #6 to the following:
- Replace the boxplot with violin plot (check out ggplot documentation)
- Change the fill color for both the violin plot and jitters to other color of your choice.
```{r}
#| echo: true
#| output-location: slide
#| code-fold: true
# Two variables - one categorical, one continuous
chile_data |>
ggplot(aes(x = vote, y = statusquo)) +
geom_violin(fill = "lightblue") +
geom_jitter(alpha = 0.4, color = "navy") +
labs(title = "Distribution Status Quo Score by voting intention",
x = "Voting intention",
y = "Status Quo Support level")
```
## Group Exercise #3 (solo attempts ok)
Visualize the distribution of `age_group` for each `vote` intention. To show the difference between sex, show the visualization with `sex` facet.
```{r}
#| echo: true
#| output-location: slide
#| code-fold: true
chile_data |>
ggplot(aes(x = vote, fill = age_group)) +
geom_bar(alpha = 0.7, position = "fill") +
facet_wrap( ~ sex) +
labs(title = "Proportion of age groups for each voting intentions by sex",
x = "Voting intention",
y = "Proportion",
fill = "Age Group") +
theme_minimal()
```
## Is fancier = better?
Fancier, more complicated visualization does not necessarily mean better!
Take a look at this award-winning visualization by [Simon Scarr](http://www.simonscarr.com/iraqs-bloody-toll)
```{=html}
<iframe width="1500" height="600" src="http://www.simonscarr.com/iraqs-bloody-toll" title="Simon Scarr's visualization"></iframe>
```
## Strategy for data visualization with ggplot
1. Have the ggplot documentation/cheatsheet open
2. Decide on how many variables are involved. Is it just one? two? more than two?
3. Determine whether the variables are categorical or continuous. If you have more than one, are they both categorical? one categorical + one continuous?
4. Refer to the documentation to see which type of visualization would make sense for your variables.
# End of Session 3!
Check out the [R Graph gallery](https://r-graph-gallery.com/) for inspiration and code samples!
Next session: statistical tests in R using Chile voting data