-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy path90-ccl.Rmd
397 lines (306 loc) · 13.7 KB
/
90-ccl.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
# Conclusions {#sec-ccl}
In this course, we have seen the importance of structured data. Good
data structure starts with simple, tidy tabular data, whether it is
manually encoded in spreadsheet, or handled in R as dataframes or
tibbles. More complex data, that doesn't fit in tabular data, can be
modelled into dedicated objects that display specialised
behaviour. Structured data allows us to reason on that data, without
having to look it at. Reasoning on and generalisation of data in turn
allows to manipulate and visualise it, i.e. to explore, analyse and
understand it. The cherry on top of the data analysis cake is to be
able to reproduce an analysis, either oneself or share it in a way
that others can.
As mentioned in the preamble, the goal of this course is obviously not
for students that take it to qualify as bioinformaticians at the
end. However, what is important is to appreciate the importance of
data and their analysis, and to become fluent in exploring, discussing
and communicating around data. A shared appreciation of data and their
complexity will hopefully reduce the distinction between
bioinformaticians and experimental scientists. Indeed, at the end of
the day, it's useful to remember that
> We are all biologists, in that we study biology. Some use wet lab
> experiments, others dry lab techniques.
## Next steps
- Statistics and machine learning (see your statistics courses and the
follow course
[WSBIM1322](https://github.com/UCLouvain-CBIO/WSBIM1322)).
- Getting better at programming and data analysis. See
[@r4ds:2017] and [@advancedR].
- Evolving scripts into tools/packages [@rpkgs:2015].
- Other tools: unix command line and git/GitHub [@Perez-Riverol:2016].
See also this [short
tutorial](https://lgatto.github.io/github-intro/).
- Omics data analysis (see upcoming
[WSBIM2122](https://github.com/UCLouvain-CBIO/WSBIM2122) course).
## Additional exercises
To answer the following exercises, you'll need to resort to what you
have learnt in various chapters.
`r msmbstyle::question_begin()`
Make sure you have `rWSBIM1207` version >= 0.1.16 and load the 2022
Belgian road accidents statistics and the associated metadata,
describing the variables. The path to the former as an `rds` file is
available with `road_accidents_be_2022.rds()`. The
`road_accidents_be_meta.csv()` returns the path to the metadata csv
file.
The data provides the Number of killed, seriously injured, slightly
injured and uninjured victims of road accidents, by age group, type of
user, sex and various characteristics of the accident in Belgium in
20222.
- Using the appropriate functions, load both files into R and
familiarise yourself with the data.
- Visualise the numbers for men and women over the hours of the day
for all age classes. Ignore any unknown information. Do you see a
difference between man and women?
- Visualise the number of victims in the different provinces. Do this
comparison for the different type of victims. Ignore any unknown
information. Use lines and points for this visualisation.
- Come up with additional visualisations that you could produce with
these data. Use bar plots for this visualisation.
`r msmbstyle::question_end()`
```{r accidents, echo=FALSE, include=FALSE}
library(rWSBIM1207)
library(tidyverse)
x <- readRDS(road_accidents_be_2022.rds())
## Compare the numbers for man and women over the hours of the day for
## all age classes.
dplyr::count(x, TX_SEX_DESCR_FR, DT_HOUR, TX_AGE_CLS_DESCR_FR) |>
filter(TX_SEX_DESCR_FR != "Inconnu",
TX_AGE_CLS_DESCR_FR != "Inconnu") |>
ggplot(aes(x = DT_HOUR, y = n, colour = TX_SEX_DESCR_FR)) +
geom_point() +
geom_line() +
facet_wrap(~ TX_AGE_CLS_DESCR_FR)
## Compare the number of victims in the different provinces. Do this
## comparison for the different type of victims.
dplyr::count(x, TX_VICT_TYPE_DESCR_FR, TX_PROV_DESCR_FR) |>
filter(TX_VICT_TYPE_DESCR_FR != "Inconnu",
TX_PROV_DESCR_FR != " ") |>
ggplot(aes(x = TX_VICT_TYPE_DESCR_FR, y = n)) +
geom_bar(stat = "identity") +
facet_wrap(~ TX_PROV_DESCR_FR) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
res <- filter(x, TX_VICT_TYPE_DESCR_FR != "Inconnu", TX_PROV_DESCR_FR != " ") |>
group_by(TX_VICT_TYPE_DESCR_FR, TX_PROV_DESCR_FR) |>
summarise(m = sum(MS_VICT), n = n())
## Dans mon code, j'ai réalisé un group_by et un summarize (sum())
## parce que si on regarde le nombre de victimes, on peut observer
## qu'il y a parfois 2 victimes pour le même "type de victime" pour la
## même date. Je suppose qu'avec un summarize(n()), on risque de
## perdre ces informations? Ou alors, j'ai mal compris les données, ou
## la question?
pm <- res |>
ggplot(aes(x = TX_PROV_DESCR_FR, y = m)) +
geom_point() + geom_line(aes(group = TX_VICT_TYPE_DESCR_FR)) +
facet_wrap(~ TX_VICT_TYPE_DESCR_FR, scale = "free_y") +
ggtitle("sum()") +
theme(axis.text.x = element_text(angle = 90))
pn <- res |>
ggplot(aes(x = TX_PROV_DESCR_FR, y = n)) +
geom_point() + geom_line(aes(group = TX_VICT_TYPE_DESCR_FR)) +
facet_wrap(~ TX_VICT_TYPE_DESCR_FR, scale = "free_y") +
ggtitle("sum()") +
theme(axis.text.x = element_text(angle = 90))
pn + pm
```
`r msmbstyle::question_begin()`
For this exercise, make sure you have `rWSBIM1207` version >=
0.1.17. Using the `population_be.csv()` function, get the path to
`r length(rWSBIM1207::population_be.csv())` files with the population
numbers across multiple regions of Belgium up to 2023.
**Tip:** given that all files contain equivalent data (i.e. for the
same variables), you can use `read_csv()` to load all files as once
into a long table.
- How many regions have been survey over the years?
- What differences are there in terms of region?
Focusing on Belgium and the Brussels, Walloon and Flemish regions, and
from 1991 on, generate one figure (possibly with multiple facets) that
allows to answer the following questions:
- Has the population increased since 1991?
- Are there more women or men living in these regions? Have this
changed since 1991?
- Have the changes in population been driven by men, women or both
equally?
- What region has the biggest (lowest) population?
`r msmbstyle::question_end()`
```{r pop, echo=FALSE, include=FALSE}
library(tidyverse)
library(rWSBIM1207)
x <- read_csv(population_be.csv())
## How many regions have been survey over the years?
dplyr::count(x, annee)
## What differences are there in terms of region?
dplyr::count(x, annee) |>
pull(n) |>
unique()
x2023 <- filter(x, annee == 2023) |>
select(lieu_de_residence, total) |>
dplyr::rename("total2023" = "total")
x2018 <- filter(x, annee == 2018) |>
select(lieu_de_residence, total) |>
dplyr::rename("total2018" = "total")
## unique 2023
full_join(x2023, x2018) |> filter(is.na(total2018))
## unique 2018
full_join(x2023, x2018) |> filter(is.na(total2023))
## Focusing on Belgium and the Brussels, Walloon and Flemish regions, and
## from 1991 on, generate one figure (possibly with multiple facets) that
## allows to answer the following questions:
lieux <- x[[2]][1:4]
x |>
pivot_longer(names_to = "variable",
values_to = "value",
3:5) |>
filter(lieu_de_residence %in% lieux) |>
filter(annee > 1990) |>
ggplot(aes(x = annee, y = value,
colour = variable)) +
geom_line() +
facet_grid(. ~ lieu_de_residence)
x |>
pivot_longer(names_to = "variable",
values_to = "value",
3:5) |>
filter(lieu_de_residence %in% lieux) |>
filter(annee > 1990) |>
ggplot(aes(x = annee, y = value,
colour = variable)) +
geom_line() +
facet_wrap(~ lieu_de_residence, scale = "free_y")
```
`r msmbstyle::question_begin()`
Here, we are going to examining the evolution of bankruptcies in
different region in Belgium between 2005 and 2023. The [original
data](https://statbel.fgov.be/fr/open-data/evolution-mensuelle-des-faillites-par-nace)
are available on the Belgian office for statistics, statbel, webpage.
Two files are needed:
- The fill dataset, `TF_BANKRUPTCIES.zip`, contains over 175000
records and is relatively large. Instead, you can use
`TF_BANKRUPTCIES_subset.txt` that contains just over 10000 records,
and made available through the `rWSBIM1207` package (see below).
- The `Method_BANKRUPTCIES.xlsx` file describes the variables in the
bankruptcies dataset. It can be either downloaded from the [statbel
page](https://statbel.fgov.be/fr/open-data/evolution-mensuelle-des-faillites-par-nace),
or read from the `rWSBIM1207` package (see below). Note that you
are not asked to load the xlsx file in R (although you can with
`readxl::read_xlsx()`); feel free to open it with your favourite
spreadsheet programme.
You will see that there are some minor discrepancies between the
metadata file and the data variables, such as the `CD_WEEK` variable,
that is documented in the metadata file, but absent from the actual
data. The other, non-documented, time-related variables are
self-explanatory.
The (compressed) `TF_BANKRUPTCIES_subset.txt` and
`Method_BANKRUPTCIES.xlsx` are available in the `rWSBIM1207` package
(version >= 0.1.19). Their paths is available with the
`faillites_be()` function.
You are asked to:
- Visualise the number of bankruptcies over time in the different
regions. Make sure to remove observations that don't have any
region. Note that the dates are recorded in a different format as
what we have seen so far, but the usual conversion functions still
apply.
- The improve the readability of these busy plots, computer the number
of bankruptcies per year and repeat the above visualisation.
- We expect that bankruptcies of large companies will lead to more
(full-time) unemployment. Check this hypothesis by (1) compute and
display a table of average job losses by company size, and (2)
produce a figure that uses the distribution of job losses (as
opposed as only the averages). To facilitate the interpretation,
order the companies by their size, in the table and figure.
`r msmbstyle::question_end()`
```{r faillites, include=FALSE, echo=FALSE}
suppressPackageStartupMessages({
library(tidyverse)
library(rWSBIM1207)
})
x <- read_delim(faillites_be()[1], delim = "|") |>
mutate(time = dmy(CD_WEEK_START))
## Q1 fig
filter(x, !is.na(TX_PROV_DESCR_FR)) |>
ggplot(aes(x = time,
y = MS_COUNTOF_BANKRUPTCIES)) +
geom_point() +
facet_wrap(~ TX_PROV_DESCR_FR)
## Q2 fig
filter(x, !is.na(TX_PROV_DESCR_FR)) |>
group_by(CD_YEAR, TX_PROV_DESCR_FR) |>
summarise(faillites = sum(MS_COUNTOF_BANKRUPTCIES)) |>
ggplot(aes(x = CD_YEAR,
y = faillites)) +
geom_point() +
facet_wrap(~ TX_PROV_DESCR_FR)
## Q3 table
group_by(x, TX_EMPLOYMENT_CLASS_DESCR_FR) |>
summarise(n = mean(MS_COUNTOF_FULL_TIME_WORKERS)) |>
arrange(n)
## Q3 fig
ggplot(x, aes(x = MS_COUNTOF_FULL_TIME_WORKERS,
y = TX_EMPLOYMENT_CLASS_DESCR_FR)) +
geom_boxplot()
ggplot(x, aes(x = MS_COUNTOF_FULL_TIME_WORKERS,
y = factor(CD_EMPLOYMENT_CLASS))) +
geom_boxplot()
k <- select(x, CD_EMPLOYMENT_CLASS,
TX_EMPLOYMENT_CLASS_DESCR_FR) |>
unique() |>
arrange(CD_EMPLOYMENT_CLASS)
k
x[["TX_EMPLOYMENT_CLASS_DESCR_FR"]] <-
factor(x[["TX_EMPLOYMENT_CLASS_DESCR_FR"]],
levels = k[[2]])
ggplot(x, aes(y = TX_EMPLOYMENT_CLASS_DESCR_FR,
x = MS_COUNTOF_FULL_TIME_WORKERS)) +
geom_boxplot()
```
`r msmbstyle::question_begin()`
This question was contributed by Mr William Vanloo in August 2024.
We will explore the road accidents that occurred in Belgium in 2023,
as available from the
[Statbel
website](https://statbel.fgov.be/fr/open-data/accidents-de-la-circulation-2023): the `.txt` file contains the dataset and the `.xlsx` file
contains the metadata describing the different variables.
- Among this data set, visualise the number of accidents occurring in
different light conditions by severity of injury in the province of
Liège over time. Be sure to remove the 'not available' data.
- Visualise the number of accidents in the different provinces during
the month of January according to the different light conditions and
only for fatal accidents.
`r msmbstyle::question_end()`
```{r accidents2023, include=FALSE, echo=FALSE}
tdir <- tempdir()
url <- "https://statbel.fgov.be/sites/default/files/files/opendata/Verkeersongevallen/TF_ACCIDENTS_2023.zip"
dest <- file.path(tdir, "TF_ACCIDENTS_2023.zip")
download.file(url, dest)
acc <- read_delim(dest)
acc |>
select(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR,MS_ACCT_WITH_MORY_INJ,
MS_ACCT_WITH_SERLY_INJ,MS_ACCT_WITH_SLY_INJ) |>
filter(TX_PROV_DESCR_FR== "Province de Liège",
!TX_LIGHT_COND_DESCR_FR == "Non disponible" ) |>
pivot_longer(names_to = "injury_type",
values_to = "ncase",
-c(1:3)) |>
group_by(DT_DAY,TX_LIGHT_COND_DESCR_FR,injury_type) |>
summarise(sumcase=sum(ncase)) |>
arrange(sumcase) |>
ggplot(aes(x = DT_DAY,
y = sumcase,
color= injury_type)) +
geom_line() +
facet_wrap(~TX_LIGHT_COND_DESCR_FR, scales = "free_y") +
theme(axis.text.x = element_text(angle = 90,hjust = 0.5, vjust = 0.5))
acc |>
select(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR,
MS_ACCT_WITH_DEAD) |>
filter(!TX_LIGHT_COND_DESCR_FR == "Non disponible") |>
group_by(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR) |>
summarise(sumcase=sum(MS_ACCT_WITH_DEAD)) |>
arrange(sumcase) |>
filter(grepl("2023-01",DT_DAY)) |>
ggplot(aes( x= DT_DAY,
y = sumcase,
color = TX_LIGHT_COND_DESCR_FR)) +
geom_line()+
facet_wrap(~TX_PROV_DESCR_FR, scales = "free_y") +
theme(axis.text.x = element_text(angle = 90,hjust = 0.5, vjust = 0.5))
```