---
title: "bupaR Docs | processpredictR workflow"
---
```{r, include = FALSE}
knitr::opts_chunk$set(
eval = FALSE
)
```
```{r echo = F, out.width="25%", eval = T, fig.align = "right"}
knitr::include_graphics("images/icons/predict.PNG")
```
***
# Prediction Workflow
```{r setup, message = F, eval = T, collapse = TRUE, warning = F}
library(processpredictR)
library(bupaverse)
library(dplyr)
```
The goal of `processpredictR` is to perform prediction tasks on processes using event logs and Transformer models.
The 6 process monitoring tasks available are defined as follows:
* __outcome__: predict the case outcome, which can be the last activity, or a manually defined variable
* __next activity__: predict the next activity instance
* __remaining trace__: predict the sequence of all next activity instances, where each entire sequence is regarded as a separate output class
* __remaining trace s2s__: predict the sequence of all next activity instances using __encoder-decoder__ architecture
* __next time__: predict the start time of the next activity instance
* __remaining time__: predict the remaining time till the end of the case
The overall approach using `processpredictR` is shown in the figure below. `prepare_examples()` transforms logs into a dataset that can be used for training and prediction, which is then split into a train and a test set. Subsequently, a model is created, compiled and fitted. Finally, the model can be used to make predictions and can be evaluated.
```{r echo = F, eval = T, out.width = "100%", fig.align = "center", fig.cap="processpredictR workflow"}
knitr::include_graphics("images/processpredictR.jpg")
```
Different levels of customization are offered. Using `create_model()`, a standard off-the-shelf model can be created for each of the supported tasks, including standard features.
A first customization is to include additional features, such as case or event attributes. These can be configured in the `prepare_examples()` step, and they will be processed automatically (numerical features are normalized, categorical features are one-hot encoded). Furthermore, the dimensions of the model can be modified.
A further way to customize your model is to generate only the input layer of the model with `create_model()`, and define the remainder of the model yourself by adding `keras` layers using the provided `stack_layers()` function. More information about customization can be found [here](predict_adapt.html).
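To give an idea of what this automatic processing amounts to, here is a minimal base R sketch of normalizing a numerical feature and one-hot encoding a categorical one. The data frame and its column names are hypothetical, and z-score scaling is only one common choice of normalization; the exact transformation applied by `processpredictR` may differ.

```{r}
# Hypothetical feature table (not taken from traffic_fines)
features <- data.frame(
  amount   = c(35, 70, 35, 140),
  resource = factor(c("r1", "r2", "r1", "r3"))
)

# Normalize the numerical feature (z-score: mean 0, standard deviation 1)
amount_scaled <- as.numeric(scale(features$amount))

# One-hot encode the categorical feature: one 0/1 column per level
resource_onehot <- model.matrix(~ resource - 1, data = features)

amount_scaled
resource_onehot
```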
Going beyond that, you can also create the model entirely yourself using `keras`, including the preprocessing of the data. Auxiliary functions are provided to help you with, e.g., tokenizing activity sequences. More information on this approach can be found [here](predict_keras.html).
In the remainder of this tutorial, the general workflow will be described in more detail.
## Preprocessing
As a first step in the process prediction workflow we use `prepare_examples()` to obtain a dataset, where:
* each row/observation is a unique activity instance id,
* the prefix(_list) column stores the sequence of activities already executed in the case,
* necessary features and target variables are calculated and/or added
The returned object is of class `ppred_examples_df`, which inherits from `tbl_df`.
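As an illustration of the prefix idea, the following base R sketch builds prefixes by hand for a small, hypothetical event log. It is not the implementation used by `prepare_examples()`; the object names (`toy_log`, `build_prefixes()`, `toy_examples`) are made up for this example, and the outcome is assumed to be the last activity of the case.

```{r}
# Toy event log (hypothetical data, not taken from traffic_fines)
toy_log <- data.frame(
  case_id   = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("Create Fine", "Send Fine", "Payment", "Create Fine", "Payment"),
  timestamp = as.POSIXct("2023-01-01", tz = "UTC") + c(0, 3600, 7200, 100, 5000),
  stringsAsFactors = FALSE
)

# Order events chronologically within each case
toy_log <- toy_log[order(toy_log$case_id, toy_log$timestamp), ]

# For every event, collect the activities executed so far in its case
build_prefixes <- function(acts) {
  lapply(seq_along(acts), function(i) acts[seq_len(i)])
}
prefix_list <- unlist(
  lapply(split(toy_log$activity, toy_log$case_id), build_prefixes),
  recursive = FALSE, use.names = FALSE
)

toy_examples <- toy_log
toy_examples$k <- lengths(prefix_list) - 1  # number of preceding activities
toy_examples$prefix <- vapply(prefix_list, paste, character(1), collapse = " - ")
# Outcome assumed to be the last activity of the case
toy_examples$outcome <- ave(toy_log$activity, toy_log$case_id,
                            FUN = function(a) a[length(a)])
toy_examples
```

Each row corresponds to one activity instance, and the target (`outcome`) is repeated for every prefix of the same case.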
In this tutorial we will use the `traffic_fines` event log from `eventdataR`. Note that both `eventlog` and `activitylog` objects, as defined by `bupaR` are supported.
```{r, eval = T}
df <- prepare_examples(traffic_fines, task = "outcome")
df
```
We split the transformed dataset `df` into train and test sets for later use in `fit()` and `predict()`, respectively. The proportion of the train set is configured with the `split` argument.
```{r, eval = T}
split <- df %>% split_train_test(split = 0.8)
split$train_df %>% head(5)
split$test_df %>% head(5)
```
It's important to note that the split is done at case level (a case is fully part of either the train data or the test data). Furthermore, the split is done chronologically, meaning that the train set contains the split\% first cases, and the test set contains the (1-split)\% last cases.
Note that because the split is done at case level, the percentage of all examples in the train set can be slightly different, as cases differ with respect to their length.
```{r, eval = T}
nrow(split$train_df) / nrow(df)
n_distinct(split$train_df$case_id) / n_distinct(df$case_id)
```
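The following base R sketch mimics a chronological, case-level split on hypothetical data to show why the two proportions can differ. It is not the code behind `split_train_test()`; in particular, ordering cases by a `case_start` column is an assumption made for this example.

```{r}
# Hypothetical examples: 5 cases with a different number of prefixes each
n_prefixes <- c(2, 5, 3, 2, 4)
examples <- data.frame(
  case_id    = rep(c("c1", "c2", "c3", "c4", "c5"), times = n_prefixes),
  case_start = rep(as.Date("2023-01-01") + 0:4, times = n_prefixes)
)
split_ratio <- 0.8

# Order the cases chronologically and keep the first 80% for training
case_order  <- unique(examples[order(examples$case_start), "case_id"])
n_train     <- floor(split_ratio * length(case_order))
train_cases <- case_order[seq_len(n_train)]

train_df <- examples[examples$case_id %in% train_cases, ]
test_df  <- examples[!examples$case_id %in% train_cases, ]

nrow(train_df) / nrow(examples)           # share of examples in the train set
length(train_cases) / length(case_order)  # share of cases in the train set
```

Here 80% of the cases end up in the train set, but only 75% of the examples, because the last case is longer than average.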
## Define model
The next step in the workflow is to build a model. `processpredictR` provides a default set of functions that are wrappers of generics provided by `keras`. For ease of use, the preprocessing steps, such as tokenizing of sequences, normalizing numerical features, etc. happen within the `create_model()` function and are abstracted from the user.
Based on the train set we define the default transformer model, using `create_model()`.
```{r}
model <- split$train_df %>% create_model(name = "my_model")
# pass arguments as ... that are applicable to keras::keras_model()
model # is a list
```
```
#> Model: "my_model"
#> ________________________________________________________________________________
#>  Layer (type)                       Output Shape                    Param #
#> ================================================================================
#>  input_1 (InputLayer)               [(None, 9)]                     0
#>  token_and_position_embedding (Toke (None, 9, 36)                   792
#>  nAndPositionEmbedding)
#>  transformer_block (TransformerBloc (None, 9, 36)                   26056
#>  k)
#>  global_average_pooling1d (GlobalAv (None, 36)                      0
#>  eragePooling1D)
#>  dropout_3 (Dropout)                (None, 36)                      0
#>  dense_3 (Dense)                    (None, 64)                      2368
#>  dropout_2 (Dropout)                (None, 64)                      0
#>  dense_2 (Dense)                    (None, 6)                       390
#> ================================================================================
#> Total params: 29,606
#> Trainable params: 29,606
#> Non-trainable params: 0
#> ________________________________________________________________________________
```
Some useful information and metrics are stored for traceability and easy extraction if needed.
```{r}
model %>% names() # objects from a returned list
```
```
#> $names
#> [1] "model" "max_case_length" "number_features" "task"
#> [5] "num_outputs" "vocabulary"
```
Note that `create_model()` returns a list, in which the actual keras model is stored under the element name `model`. Thus, we can use functions from the `keras` package as follows:
```{r}
model$model$name # get the name of a model
```
```
#> [1] "my_model"
```
```{r}
model$model$non_trainable_variables # list of non-trainable parameters of a model
```
```
#> list()
```
The result of `create_model()` is assigned its own class (`ppred_model`), for which `processpredictR` provides the methods _compile()_, _fit()_, _predict()_ and _evaluate()_.
## Compilation
The next step is to compile the model. By default, the loss function is log-cosh for regression tasks (next time and remaining time) and categorical cross-entropy for classification tasks. Naturally, it is possible to override these defaults.
```{r}
model %>% compile() # model compilation
```
```
#> ✔ Compilation complete!
```
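For intuition about the two default losses, they can be written out in a few lines of base R. This is only a sketch of the standard textbook definitions, not the `keras` implementation; the function names `log_cosh()` and `categorical_crossentropy()` and the example values are made up for illustration.

```{r}
# Log-cosh loss for regression: mean of log(cosh(prediction error))
log_cosh <- function(y_true, y_pred) {
  mean(log(cosh(y_pred - y_true)))
}

# Categorical cross-entropy for one-hot targets and predicted probabilities
categorical_crossentropy <- function(y_true, y_prob, eps = 1e-7) {
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps)  # avoid log(0)
  mean(-rowSums(y_true * log(y_prob)))
}

# Hypothetical regression example (e.g. remaining time in days)
log_cosh(y_true = c(2, 5, 1), y_pred = c(2.5, 4, 1))

# Hypothetical classification example with three classes
y_true <- rbind(c(1, 0, 0), c(0, 1, 0))
y_prob <- rbind(c(0.5, 0.3, 0.2), c(0.1, 0.8, 0.1))
categorical_crossentropy(y_true, y_prob)
```

Log-cosh behaves like a squared error for small errors and like an absolute error for large ones, which makes it less sensitive to outliers than the mean squared error.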
## Training
Training of the model is done with the `fit()` function. During training, a visualization window will open in the Viewer-pane to show the progress in terms of loss. Optionally, the result of `fit()` can be assigned to an object to access the training metrics specified in _compile()_. The number of epochs to train for can be configured using the `epochs` argument.
```{r}
hist <- fit(object = model, train_data = split$train_df, epochs = 5)
```
```{r}
hist$params
```
```
#> $verbose
#> [1] 1
#>
#> $epochs
#> [1] 5
#>
#> $steps
#> [1] 2227
```
```{r}
hist$metrics
```
```
#> $loss
#> [1] 0.7875332 0.7410239 0.7388409 0.7385073 0.7363014
#>
#> $sparse_categorical_accuracy
#> [1] 0.6539739 0.6713067 0.6730579 0.6735967 0.6747193
#>
#> $val_loss
#> [1] 0.7307042 0.7261314 0.7407018 0.7326428 0.7317348
#>
#> $val_sparse_categorical_accuracy
#> [1] 0.6725934 0.6727730 0.6725934 0.6725934 0.6722342
```
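These metrics can be used, for instance, to find the epoch with the lowest validation loss. The sketch below copies the values printed above into plain vectors so that it runs without a fitted model; with a real history object, the same vectors would presumably be taken from `hist$metrics`.

```{r}
# Values copied from the output above
loss     <- c(0.7875332, 0.7410239, 0.7388409, 0.7385073, 0.7363014)
val_loss <- c(0.7307042, 0.7261314, 0.7407018, 0.7326428, 0.7317348)

# Epoch with the lowest validation loss
best_epoch <- which.min(val_loss)
best_epoch

# Difference between validation and training loss per epoch
round(val_loss - loss, 4)
```

In this run the validation loss is lowest after the second epoch, while the training loss keeps decreasing slightly.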
### Make predictions
The method `predict()` can return 3 types of output, by setting the argument `output` to "append", "y_pred" or "raw".
Test dataset with appended predicted values (`output = "append"`):
```{r}
# make predictions on the test set
predictions <- model %>% predict(test_data = split$test_df,
output = "append") # default
predictions %>% head(5)
```
```
#> # A tibble: 5 × 13
#>   ith_case case_id prefix                prefix_…¹ outcome     k activ…² resou…³
#>      <int> <chr>   <chr>                 <list>    <fct>   <dbl> <chr>   <fct>
#> 1     8001 A24869  Create Fine           <chr [1]> Payment     0 Create… 559
#> 2     8001 A24869  Create Fine - Payment <chr [2]> Payment     1 Payment <NA>
#> 3     8002 A24871  Create Fine           <chr [1]> Payment     0 Create… 559
#> 4     8002 A24871  Create Fine - Payment <chr [2]> Payment     1 Payment <NA>
#> 5     8003 A24872  Create Fine           <chr [1]> Send f…     0 Create… 559
#> # … with 5 more variables: start_time <dttm>, end_time <dttm>,
#> #   remaining_trace_list <list>, y_pred <dbl>, pred_outcome <chr>, and
#> #   abbreviated variable names ¹prefix_list, ²activity, ³resource
```
<details>
<summary>raw predicted values (`output = "raw"`)</summary>
<p>
```
#>            Payment Send for Credit Collection    Send Fine
#>  [1,] 4.966056e-01                0.344094276 1.423686e-01
#>  [2,] 9.984029e-01                0.001501600 8.890528e-05
#>  [3,] 4.966056e-01                0.344094276 1.423686e-01
#>  [4,] 9.984029e-01                0.001501600 8.890528e-05
#>  [5,] 4.966056e-01                0.344094276 1.423686e-01
#>  [6,] 1.556145e-01                0.518976271 2.884890e-01
#>  [7,] 2.345311e-01                0.715000629 5.147375e-06
#>  [8,] 2.627363e-01                0.726804197 5.480492e-06
#>  [9,] 3.347774e-05                0.999961376 2.501280e-08
#> [10,] 4.966056e-01                0.344094276 1.423686e-01
```
</p>
</details>
<details>
<summary>predicted values with postprocessing (`output = "y_pred"`)</summary>
```
#> [1] "Payment" "Payment"
#> [3] "Payment" "Payment"
#> [5] "Payment" "Send for Credit Collection"
#> [7] "Send for Credit Collection" "Send for Credit Collection"
#> [9] "Send for Credit Collection" "Payment"
#> [11] "Send for Credit Collection" "Payment"
#> [13] "Send for Credit Collection" "Payment"
#> [15] "Send for Credit Collection" "Send for Credit Collection"
#> [17] "Send for Credit Collection" "Send for Credit Collection"
#> [19] "Payment" "Send for Credit Collection"
```
</details>
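For a classification task, going from the raw output to predicted labels essentially means picking the class with the highest probability in each row. The following base R sketch shows this on three rows copied from the raw output above (rows 1, 2 and 6, restricted to the three columns shown). It is an illustration of the general idea, not the postprocessing code used by `predict()`.

```{r}
# Three rows of raw class probabilities, copied from the output above
raw <- rbind(
  c(4.966056e-01, 0.344094276, 1.423686e-01),
  c(9.984029e-01, 0.001501600, 8.890528e-05),
  c(1.556145e-01, 0.518976271, 2.884890e-01)
)
colnames(raw) <- c("Payment", "Send for Credit Collection", "Send Fine")

# Pick the column with the highest probability for every row
y_pred <- colnames(raw)[max.col(raw, ties.method = "first")]
y_pred
```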
### Visualize predictions
For the classification tasks outcome and next activity a `confusion_matrix()` function is provided to visualize the results.
```{r}
predictions %>% class
```
```
#> [1] "ppred_predictions" "ppred_examples_df" "ppred_examples_df"
#> [4] "ppred_examples_df" "tbl_df" "tbl"
#> [7] "data.frame"
```
```{r}
# print confusion matrix
confusion_matrix(predictions)
```
```
#>
#>                                    Payment Send Appeal to Prefecture
#>   Appeal to Judge                        2                         6
#>   Notify Result Appeal to Offender       0                         0
#>   Payment                             1903                         7
#>   Send Appeal to Prefecture             34                        90
#>   Send Fine                            387                         0
#>   Send for Credit Collection           688                        22
#>
#>                                    Send for Credit Collection
#>   Appeal to Judge                                          10
#>   Notify Result Appeal to Offender                          0
#>   Payment                                                 617
#>   Send Appeal to Prefecture                                89
#>   Send Fine                                               387
#>   Send for Credit Collection                             2644
```
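A confusion matrix can also be summarized numerically. As a sketch, the counts printed above are re-entered by hand below and the overall accuracy is computed as the share of examples on the diagonal, i.e. where the row and column labels coincide. The object names are hypothetical; with a real result, the matrix returned by `confusion_matrix()` could presumably be used directly.

```{r}
classes_rows <- c("Appeal to Judge", "Notify Result Appeal to Offender", "Payment",
                  "Send Appeal to Prefecture", "Send Fine",
                  "Send for Credit Collection")
classes_cols <- c("Payment", "Send Appeal to Prefecture",
                  "Send for Credit Collection")

# Counts copied column by column from the output above
cm <- matrix(
  c(2, 0, 1903, 34, 387, 688,
    6, 0,    7, 90,   0,  22,
    10, 0, 617, 89, 387, 2644),
  ncol = 3,
  dimnames = list(classes_rows, classes_cols)
)

# Cells where the row label equals the column label are correct predictions
shared   <- intersect(rownames(cm), colnames(cm))
correct  <- sum(cm[cbind(shared, shared)])
accuracy <- correct / sum(cm)
accuracy
```

Classes that never appear as a column, such as Send Fine in this output, contribute no correct predictions.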
The `plot()` method draws the confusion matrix for classification tasks, or a scatter plot for regression tasks.
```{r, out.width="100%", fig.width = 7}
# plot confusion matrix in a bupaR style
plot(predictions) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90)) # rotate x-axis labels
```
```{r, out.width="80%", fig.width = 7, eval = T, echo = F}
knitr::include_graphics("images/confusion_matrix.PNG")
```
## Evaluate model
The `evaluate()` method returns the loss and the metrics specified in _compile()_, computed on the test set.
```{r}
model %>% evaluate(split$test_df)
```
```
#> loss sparse_categorical_accuracy
#> 0.7779053 0.6716526
```
***
Read more:
```{r footer, results = "asis", echo = F, eval = T, collapse = F}
source("htmlbuttons.R")
create_buttons(df, "predict_workflow.html")
```