-
Notifications
You must be signed in to change notification settings - Fork 11
/
wrangling.Rmd
149 lines (97 loc) · 4.44 KB
/
wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "bupaR Docs | Mutate logs"
---
```{r echo = F, out.width="25%", fig.align = "right"}
knitr::include_graphics("images/icons/manipulate.PNG")
```
***
# Wrangling
```{r eval = F}
library(bupaverse)
library(dplyr)
```
```{r include = F}
library(bupaverse)
library(dplyr)
```
In order to easily manipulate logs, well-known [dplyr](https://dplyr.tidyverse.org/)-verbs have been adapted. This page serves as a general introduction of the wrangling verbs. Their usage is illustrated throughout the documentation in _Manipulation_, _Analysis_ and _Visualization_.
## group_by
Using the `group_by()` function, event logs can be grouped according to (a set of) variables, such that all further computations happen for each of these different groups.
In the next example, the number of cases are computed for each value of _vehicleclass_.
```{r}
traffic_fines %>%
group_by(vehicleclass) %>%
n_cases()
```
### Predefined groupings
For specific groupings, some auxiliary functions are available.
* `group_by_case` - group by cases
* `group_by_activity` - group by activity types
* `group_by_resource` - group by resources
* `group_by_activity_resource` - group by activity resource pair
* `group_by_activity_instance` - group by activity instances.
For example, the number of cases in which a specific resource occurs, can be computed as follows:
```{r}
sepsis %>%
group_by_resource %>%
n_cases()
```
Note that each of the descriptive metrics discussed [here](http://www.bupar.net/exploring.html) can be rewritten using these lower-level functions. The example above is equal to the `resource_involvement` metric at case-level.
### Grouping on id
When you want to group on a combination of mapping variables, for example, for each combination of _case_ and _activity_, you can use `group_by_ids()`. The following examples counts the number of events per case and per activity:
```{r}
patients %>%
group_by_ids(case_id, activity_id) %>%
n_events()
```
Note that the arguments of `group_by_ids()` are not the variable names of case (_patient_) and activity (_handling_) columns, but unquoted mapping id-functions. You can thus use this function while being agnostic of the precise variable names.
### Remove grouping
When a grouping is no longer needed, it can be removed using `ungroup_eventlog()`.
## mutate
You can use `mutate()` to add new variables to an event log, possibly by using existing variables. In the next example, the total amount of _lacticacid_ is computed for each case. [Read more.](mutate.html)
```{r echo = F}
sepsis %>%
mutate(lacticacid = as.numeric(lacticacid)) -> sepsis
```
```{r}
sepsis %>%
group_by_case() %>%
mutate(total_lacticacid = sum(lacticacid, na.rm = T))
```
## filter
Generic filtering of events can be done using `filter()`, which takes an event log and any number of logical conditions. The example below filters events where "C" is the vehicle class and an amount greater than 300. [Read more. ](generic_filtering.html).
```{r}
traffic_fines %>%
filter(vehicleclass == "C", amount > 300)
```
## select
Variables on a event log can be _selected_ using `select()`. By default, `select()` will always make sure that the mapping-variables are retained. Otherwise, it would no longer function as an `eventlog` object.
```{r}
traffic_fines %>%
select(vehicleclass)
```
By setting the argument `force_df = TRUE`, the mapping-variables will not be retained, and the output will be a data.frame, and not an `eventlog` object. Note that doing so will hold even in the case that all mapping variables are selected.
```{r}
traffic_fines %>%
select(case_id, vehicleclass, amount, force_df = TRUE)
```
### Selecting id
Similar to `group_by_ids()`, `select_ids()` can be used to select the mapping variables.
```{r}
patients %>%
select_ids(case_id, activity_id)
```
Note again how the arguments are unquoted id-functions instead of raw variable names. The result of `select_ids()` will __always__ result in a `data.frame` object, as typically not all id's in the mapping will be selected.
## arrange
Event data can be sorted using the `arrange()`. `desc()` argument can be used to sort descending on an attribute.
```{r}
#sort descending on time
patients %>%
arrange(desc(time))
```
```{r footer, results = "asis", echo = F}
CURRENT_PAGE <- stringr::str_replace(knitr::current_input(), ".Rmd",".html")
res <- knitr::knit_expand("_button_footer.Rmd", quiet = TRUE)
res <- knitr::knit_child(text = unlist(res), quiet = TRUE)
cat(res, sep = '\n')
```