multi_dimensional_analysis.Rmd

---
title: "bupaR Docs | Multi-dimensional analysis"
output: 
  html_document: 
    toc: false
---

```{r echo = F, out.width="25%", fig.align = "right"}
knitr::include_graphics("images/icons/multi.PNG")
```

***

# Multi-dimensional analysis

```{r include = F}
library(bupaverse)
knitr::opts_chunk$set(fig.height=3)
```

```{r eval = F}
library(bupaverse)
```

By combining metrics with [`group_by()`](wrangling.html#group_by) and [`augment()`](augment.html), you can perform analysis that combine multiple perspectives. 

For example, let's say we want to compare throughput time (performance) with trace length (control-flow). We start by computing the throughput time per case. 

```{r}
traffic_fines %>%
	throughput_time(level = "case", units = "weeks") %>%
	head()
```
Subsequently, we add this information to the log using `augment()`, and store the result as `tmp`. 

```{r}
traffic_fines %>%
	throughput_time(level = "case", units = "weeks") %>%
	augment(traffic_fines) -> tmp
```

Now, we have 2 options. Option 1 is to calculate the `trace_length()` as well, and adding it again to the log. 

```{r}
tmp %>%
	trace_length(level = "case") %>%
	augment(tmp, prefix = "length") -> tmp2
```

We then have both the performance and coverage information in the log, and can visualize their relationship. 

```{r}
library(ggplot2)
tmp2 %>%
	ggplot(aes(length_absolute, throughput_time)) +
	geom_boxplot(aes(group = length_absolute))
```
However, this requires that we have some knowledge about `ggplot2`. Alternatively, we can use `group_by()` and the default `plot()` method provided by `bupaR`.

Going back to `tmp`, recall we have the continuous _throughput_time_ variable. Let's use `cut()` to create multiple categories of throughput time. We start by looking at the throughput time distribution.

```{r}
tmp %>%
	throughput_time(units = "weeks")
```

Let's say use Q1, Q3 and the median to create 4 groups. 

```{r}
tmp %>%
	mutate(throughput_time_bin = cut(as.numeric(throughput_time), breaks = c(-Inf, 1, 17.85, 85.14, Inf)))
```

Observe that we have now 4 groups under _throughput_time_bin_ with the roughly same number of cases. 

```{r}
tmp %>%
	mutate(throughput_time_bin = cut(as.numeric(throughput_time), breaks = c(-Inf, 1, 17.85, 85.14, Inf))) %>%
	group_by(throughput_time_bin) %>%
	n_cases()
```

Now, instead of using `group_by()` and `n_cases()`, we can use `group_by()` followed by `trace_length()`. This gives us the distribution of the trace length, for each of the groups. 

```{r}
tmp %>%
	mutate(throughput_time_bin = cut(as.numeric(throughput_time), breaks = c(-Inf, 1, 17.85, 85.14, Inf))) %>%
	group_by(throughput_time_bin) %>%
	trace_length()
```

We plot this new information:

```{r}
tmp %>%
	mutate(throughput_time_bin = cut(as.numeric(throughput_time), breaks = c(-Inf, 1, 17.85, 85.14, Inf))) %>%
	group_by(throughput_time_bin) %>%
	trace_length() %>%
	plot()
```
While the resulting default plot might not be ideal for your situation (as here, it doesn't work well with trace lengths discrete characteristic), you can get a first insight without needing any additional visualization expertise. 

Note that, if we turn the analysis the other way around (e.g. what is the impact of trace length on throughput time?), things get even easier as the discrete trace length variable can be directly fed to `group_by()`. 

```{r}
traffic_fines %>%
	trace_length(level = "case") %>%
	augment(traffic_fines, prefix = "length") %>%
	group_by(length_absolute) %>%
	throughput_time() %>%
	plot()
```

Finally, note that you are not restricted to combining calculated metrics. You can also combine metrics with data attributes. 

```{r}
eventdataR::hospital %>%
	group_by(group) %>%
	trace_length()
```