-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-visualization.qmd
549 lines (362 loc) · 16.8 KB
/
03-visualization.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
---
title: "Data Visualization and Descriptive Stats"
author: "Bella Ratmelia"
format: revealjs
---
## Today's Outline
1. Descriptive Statistics
2. Data Visualization in ggplot using World Values Survey data, including guidelines on how to choose the appropriate visualization.
## Checklist when you start RStudio
- Load the project we created last session and open the R script file.
- Make sure that `Environment` panel is empty (click on broom icon to clean it up)
- Clear the `Console` and `Plots` too.
- Re-run the `library(tidyverse)` and `read_csv` portion in the previous session
## Refresher: Loading from CSV into a dataframe
Use `read_csv` from `readr` package (part of `tidyverse`) to load our data into a dataframe
```{r}
#| echo: true
#| label: load-data
#| message: false
#| output: false
# import tidyverse library
library(tidyverse)
# read the CSV with WVS data
wvs_cleaned <- read_csv("data-output/wvs_cleaned_v1.csv")
# Convert categorical variables to factors
columns_to_convert <- c("country", "religiousity", "sex", "marital_status", "employment")
wvs_cleaned <- wvs_cleaned |>
mutate(across(all_of(columns_to_convert), as_factor))
# peek at the data, pay attention to the data types!
glimpse(wvs_cleaned)
```
## Recap: Descriptive Statistics
::: incremental
- Univariate (i.e. single variable) Descriptive Stats
- Measures of central tendency: `mean()`, `median()`, `Mode()`
- Measures of Variability: `min()`, `max()`, `range()`, `IQR()`, `sd()` (standard deviation), `var()` (variance)
- Distribution shape: `skewness()` and `kurtosis()` from `moments` library. This is easier to see with histogram
- Bivariate (i.e. two variables) Descriptive Stats
- Contingency table / cross tab (for categorical data)
- Covariance - describe how two variables vary together
- Correlation - describe relationship strength and direction **in a sample**. (If we want to use this to infer about a population from a sample, it would fall under inferential stats)
- Visualizations e.g. Scatterplots, side-by-side boxplots, stacked bar charts, etc.
:::
## From last week: Basic R Functions for Descriptive Stats
Last week, we explored some basic R functions for descriptive statistics.
- `mean()`: arithmetic average
- `median()`: middle value
- `sd()`: standard deviation
- `var()`: variance
- `range()`: range of values
- `IQR()`: interquartile range
- `summary()`: provides a summary of descriptive statistics
- `Mode()` function from `DescTools` package: the most frequently occuring value [^1]
[^1]: R doesn't have built-in functions for mode, so we need `DescTools` package to get this function
# Univariate Descriptive Stats
## Measures of Central Tendency
Let's start by examining the `age` variable in our dataset.
```{r}
#| echo: true
#| output-location: fragment
library(DescTools)
# Basic statistics
mean_age <- mean(wvs_cleaned$age, na.rm = TRUE)
median_age <- median(wvs_cleaned$age, na.rm = TRUE)
mode_age <- DescTools::Mode(wvs_cleaned$age, na.rm = TRUE)
# Print results
cat("Mean age:", mean_age, "\n")
cat("Median age:", median_age, "\n")
cat("Most frequently occuring age:", mode_age, "\n") # this is here just for demo purposes
```
------------------------------------------------------------------------
How we can interpret / report this:
*"The age distribution of this sample is fairly symmetrical, as indicated by the very close mean (48 years) and median (48 years) values. The mode of 54 years suggests a slight right-skew in the age distribution, with a cluster of participants in their mid-50s."*
## Measures of Variability or Dispersion
```{r}
#| echo: true
#| output-location: fragment
var_age <- var(wvs_cleaned$age, na.rm = TRUE)
sd_age <- sd(wvs_cleaned$age, na.rm = TRUE)
range_age <- range(wvs_cleaned$age, na.rm = TRUE)
iqr_age <- IQR(wvs_cleaned$age, na.rm = TRUE)
cat("Variance of age:", var_age, "\n")
cat("Standard deviation of age:", sd_age, "\n")
cat("Range of age:", range_age[1], "to", range_age[2], "\n")
cat("Interquartile range of age:", iqr_age, "\n")
```
------------------------------------------------------------------------
How we can interpret / report this:
*"The age distribution of this sample is fairly wide spread. With a standard deviation of 16.72144, suggesting that most individuals' ages deviate from the mean by approximately 16.72 years. The range of ages spans from 18 to 93 years, which covers a wide range of age groups within the sample. The interquartile range (IQR) of 28 years, which represents the middle 50% of the data, indicates a moderately wide distribution of ages in the central portion of the dataset."*
## Distribution Shape
The function `skewness()` and `kurtosis()` is available through R package called `moments`. You may need to install it first before calling the library and its functions like in this code below.
```{r}
#| echo: true
#| output-location: fragment
library(moments)
skew_age <- skewness(wvs_cleaned$age, na.rm = TRUE)
kurtosis_age <- kurtosis(wvs_cleaned$age, na.rm = TRUE)
cat("Skewness of age:", skew_age, "\n")
cat("Kurtosis of age:", kurtosis_age, "\n")
```
------------------------------------------------------------------------
How we can interpret / report this:
*"The age distribution has a very slight right skew (skewness = 0.10), meaning there are slightly more outliers toward older ages, but the skew is minimal since values between -0.5 and 0.5 are considered approximately symmetric. The kurtosis of 2.02 is lower than a normal distribution's kurtosis of 3, indicating this distribution is platykurtic - it has lighter tails and is more uniform or "flatter" than a normal distribution."*
## Visualizing with ggplot
- Describing the spread and shape of distribution with just words is not very productive, so typically it is accompanied with visualization.
- `ggplot` is plotting package that is included inside `tidyverse` package
- works best with data in the long format, i.e., a column for all the dimensions/measures and another column for the value for each dimension/measure.
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |>
ggplot(aes(x = age)) +
geom_histogram(binwidth = 1, fill = "lightblue", color = "navy") +
labs(title = "Age distribution of respondents",
x = "Age",
y = "Count") +
theme_minimal()
```
## Anatomy of ggplot code
Charts built with ggplot must include the following:
``` {.r code-line-numbers="|1|2|3|4-6|7"}
wvs_cleaned |> # <1>
ggplot(aes(x = age)) + # <2>
geom_histogram(binwidth = 1, fill = "lightblue", color = "navy") + # <3>
labs(title = "Age distribution of respondents", # <4>
x = "Age", # <4>
y = "Count") + # <4>
theme_minimal() # <5>
```
1. **Data** - the dataframe/tibble to visualize.
2. **Aesthetic mappings (aes)** - describes which variables are mapped to the x, y axes, alpha (transparency) and other visual aesthetics.
3. **Geometric objects (geom)** - describes how values are rendered; as bars, scatterplot, lines, etc.
4. Provide titles and labels to your graph
5. (Optional) apply a theme/look to your graph
## Tip: open the ggplot cheatsheet
::: callout-tip
**A strategy I'd like to recommend:** briefly read over the `ggplot2` documentation and have them open on a separate tab. Figure out the type of variables you need to visualize (discrete or continuous) to quickly identify which visualization would make sense.
:::
{width="60%"}
[ggplot documentation link](https://rstudio.github.io/cheatsheets/html/data-visualization.html)
## Going back to our univariate descriptive stats on `age` variable
Let's visualize the variability with boxplot to get a better sense of the spread.
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |>
ggplot(aes(x = age)) +
geom_boxplot(fill = "lightblue", color = "navy") +
labs(title = "Age distribution of respondents",
x = "Age") +
theme_minimal()
```
## Categorical Data - Frequency Distribution
- The `age` variable is a numerical / continuous data. We can't apply `mean()`, `median()` and other central tendency measures to categorical data such as `age_group` or `employment_status`. We can, however, visualize them.
- When dealing with categorical data, first take note on whether you want to visualize the **proportion** or the **frequency distribution**.
- Let's visualize the frequency distribution of survey participants by `country`:
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = country, fill = country)) +
geom_bar() +
labs(title = "Participants by Country",
x = "Country",
y = "Participants") +
theme_minimal()
```
## Categorical Data - Proportion
When we want to show proportion (i.e. in terms of "parts of whole"), we must first quickly calculate the proportion with `count()`
Let's create a new dataframe called `wvs_country_proportion` to hold this data.
```{r}
#| echo: true
wvs_country_proportion <- wvs_cleaned |>
group_by(country) |>
summarize(n = n()) |> # count the number of participants each country
mutate(proportion = n/sum(n)) # calculate proportion
print(wvs_country_proportion)
```
## Categorical Data - Proportion (cont'd)
And then, we use this proportion table to create a pie chart by adding `coord_polar()` layer after `geom_bar()` and some changes in `aes()` and `geom_bar()`
```{r}
#| echo: true
#| output-location: slide
wvs_country_proportion |> ggplot(aes(x = "", y = proportion, fill = country)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Proportion of Participants by Country") +
theme_minimal()
```
::: aside
Note: if you have lots of categories, pie chart is not always the best option. The code for this pie chart is from [R Graph Gallery](https://r-graph-gallery.com/piechart-ggplot2.html)
:::
## Learning Check 1A
Using the `wvs_cleaned` dataset:
Create a histogram that visualizes the distribution of `financial_satisfaction`
```{r}
#| echo: true
#| output-location: slide
#| code-fold: true
#| code-summary: "Show answer"
wvs_cleaned |> ggplot(aes(x = financial_satisfaction)) +
geom_histogram(fill = "steelblue", color = "white", binwidth = 1) +
labs(title = "Distribution of Financial Satisfaction",
x = "Financial Satisfaction",
y = "Count") +
theme_minimal()
```
## Learning Check 1B
Create a barchart that visualizes the frequency of `religiousity`
```{r}
#| echo: true
#| output-location: slide
#| code-fold: true
#| code-summary: "Show answer"
wvs_cleaned |> ggplot(aes(x = religiousity, fill = religiousity)) +
geom_bar() +
labs(title = "Frequency of Religiosity",
x = "Religiosity",
y = "Count") +
theme_minimal()
```
# Bivariate Descriptive Stats
## Three Combinations in Bivariate Descriptive Stats
Bivariate descriptive statistics describe and summarize relationships between two variables in your dataset **without making inferences about a larger population**. They include numeric measures like correlation or covariance, and visualizations like scatterplots, side-by-side boxplots, or contingency tables.
Think of them as taking a snapshot of how two variables relate to each other in your current data.
Since data can be continuous or categorical, there can be three combinations when we deal with bivariate descriptive stats:
1. Both categorical (e.g. `age_group` and `country`)
2. Both continuous (e.g. `financial_satisfaction` and `life_satisfaction`)
3. One continuous, one categorical (e.g. `country` and `life_satisfaction`)
## Both categorical
- Examine relationships between categorical variables
- Look at joint distributions and proportions
- Compare group compositions
First, let's create a contingency table of `age_group` and `country`!
```{r}
#| echo: true
table(wvs_cleaned$age_group, wvs_cleaned$country)
```
## Both categorical (cont'd)
We can also create a proportion table just like we did earlier
```{r}
#| echo: true
wvs_cleaned |>
group_by(country, age_group) |>
summarise(n = n()) |> # count the frequency of participants by age group and country
mutate(prop = n/sum(n)) # calculate proportion
```
## Both categorical (cont'd)
For categorical data like this, we can use barchart to visualize the frequency distribution. Stacked bar chart can be used to visualize proportion.
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = country, fill = age_group)) +
geom_bar(position = "dodge") +
labs(y = "Count", title = "Age Groups by Country") +
theme_minimal()
```
Change `position = "dodge"` to `position = "stack"` to stack the bar chart
## Both categorical (cont'd)
To get a better sense of the proportion for each country, we can use percent stacked bar chart.
The code is similar to previous bar charts; we just have to change the `position` argument to `position = "fill"`
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = country, fill = age_group)) +
geom_bar(position = "fill") +
labs(y = "Proportion", title = "Age Groups by Country") +
theme_minimal()
```
## Both continuous
- Examine linear relationships
- Look for patterns and trends
- Identify potential outliers
Let's first examine the correlation between `financial_satisfaction` and `life_satisfaction`
```{r}
#| echo: true
cor(wvs_cleaned$financial_satisfaction, wvs_cleaned$life_satisfaction)
```
## Both continuous (cont'd)
Let's visualize the two variables together with a jitter / scatterplot!
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = financial_satisfaction, y = life_satisfaction)) +
geom_jitter(alpha = 0.3) +
geom_smooth(method = "lm") + # layer with geom_smooth
labs(title = "Financial vs Life Satisfaction") +
theme_minimal()
```
## One continuous, one categorical
- Compare distributions across groups
- Identify group differences
- Examine spread within groups
Let's do a recap from last week and get the summary stats for `life_satisfaction` for each `country`
```{r}
#| echo: true
wvs_cleaned |>
group_by(country) |>
summarise(
mean_satisfaction = mean(life_satisfaction, na.rm = TRUE),
median_satisfaction = median(life_satisfaction, na.rm = TRUE),
sd_satisfaction = sd(life_satisfaction, na.rm = TRUE)
)
```
## One continuous, one categorical (cont'd)
To get a better sense of how the data is varied and spread, let's visualize them with a side-by-side boxplot
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = country, y = life_satisfaction)) +
geom_boxplot() +
labs(title = "Life Satisfaction by Country") +
theme_minimal()
```
## One continuous, one categorical (cont'd)
We could also layer our boxplots with violin plots to get a better sense of the distribution of each group.
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = country, y = life_satisfaction)) +
geom_violin(fill = "lightblue", alpha = 0.5) +
geom_boxplot(width = 0.1, fill = "white") +
labs(title = "Life Satisfaction Distribution by Country") +
theme_minimal()
```
## Learning Check #2
Create a side-by-side boxplots that visualizes `political_scale` for each `sex`.
```{r}
#| echo: true
#| output-location: true
#| code-fold: true
#| code-summary: "Show answer"
wvs_cleaned |> ggplot(aes(x = political_scale, y = sex)) +
geom_violin(fill = "lightblue", alpha = 0.5) +
geom_boxplot(width = 0.1, fill = "white") +
labs(title = "Political scale Distribution by Sex") +
theme_minimal()
```
## Using Facets for more complex visual
- Compare patterns across multiple subgroups
- Identify interaction effects
- Maintain visual clarity with complex relationships
Facet grids are useful when we have more than two variables to visualize. However, if used excessively they may become too complex
```{r}
#| echo: true
#| output-location: slide
wvs_cleaned |> ggplot(aes(x = financial_satisfaction, y = life_satisfaction)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
facet_grid(country ~ religiousity)
```
## Is fancier = better?
Fancier, more complicated visualization does not necessarily mean better!
Take a look at this award-winning visualization by [Simon Scarr](http://www.simonscarr.com/iraqs-bloody-toll)

# End of Session 3!
Remember the strategy:
1. Have the ggplot documentation/cheatsheet open
2. Decide on how many variables are involved. Is it just one? two? more than two?
3. Determine whether the variables are categorical or continuous. If you have more than one, are they both categorical? one categorical + one continuous?
4. Refer to the documentation to see which type of visualization would make sense for your variables.
Check out the [R Graph gallery](https://r-graph-gallery.com/) for inspiration and code samples!
Next session: inferential stats in R using WVS data