---
title: "Introduction"
output:
  html_document:
    toc: true
    toc_depth: 4
    toc_float: true
    number_sections: true
    code_folding: show
    collapse: false
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
```
# Load
```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(simaerep))
```
# Introduction
Simulate adverse event reporting in clinical trials with the goal of detecting under-reporting sites.
Monitoring of Adverse Event (AE) reporting in clinical trials is important for patient safety. We
use bootstrap-based simulation to assign an AE under-reporting probability to each site in a clinical trial.
The method is inspired by the 'infer' R package and Allen Downey's blog article: ["There is only one test!"](http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html).
## Adverse Events
An adverse event (AE) is any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have a causal relationship with this treatment. It is important for patient safety that AEs are reported back to the sponsor. An important part of quality monitoring is therefore to detect clinical trial sites that do not propagate all of the AEs reported to them by their patients to the sponsor. In a clinical trial, patients follow a strict visit schedule at which treatments are given and exams are performed. Typically, AEs are reported by a patient during an on-site visit at the clinic. The total number of AEs reported by a site therefore depends on the number of patients enrolled at the site and the total number of visits.
## Algorithm
In a nutshell, we perform the following steps:

1) Record visit_med75, the number of patients that have reached visit_med75, and the mean cumulative AE count at visit_med75 for a clinical trial site.
2) Create a patient pool from all patients in the study that have reached the visit_med75 determined in 1), together with their cumulative AE count at visit_med75.
3) Draw with replacement as many patients from 2) as determined in 1) and calculate the mean cumulative AE count of the draw (figure 1b).
4) Repeat 3) 1000 times.
5) Calculate the probability of obtaining a mean cumulative AE count equal to or lower than the one recorded in 1), based on the results of 4).
6) Repeat 1)-5) for all sites in the trial.
7) Adjust the probabilities using the [Benjamini-Hochberg procedure](https://www.statisticshowto.com/benjamini-hochberg-procedure/) to correct the alpha error.
## Sample Data
### Patient
Patient-level AE data is characterized by the number of consecutive visits and the number of AEs reported at each visit. We sample the maximum number of consecutive visits from a normal distribution and the number of AEs reported at each visit from a Poisson distribution.
Here we simulate the AEs generated by 3 patients:
```{r}
set.seed(1)
replicate(
3,
sim_test_data_patient(
.f_sample_max_visit = function() rnorm(1, mean = 20, sd = 4),
.f_sample_ae_per_visit = function(max_visit) rpois(max_visit, 0.5)
)
)
```
### Study
To simulate patient data for an entire study, we make the simplifying assumption that all sites have the same number of patients. Further, we specify a fraction of sites that under-report AEs.
```{r}
df_visit <- sim_test_data_study(
n_pat = 120,
n_sites = 6,
frac_site_with_ur = 0.4,
ur_rate = 0.6,
max_visit_mean = 20,
max_visit_sd = 4,
ae_per_visit_mean = 0.5
)
df_visit$study_id <- "A"
df_visit %>%
head(10) %>%
knitr::kable()
df_visit %>%
select(site_number, is_ur) %>%
distinct() %>%
knitr::kable()
```
In our sample data, 2 sites (S0001 and S0002) are under-reporting AEs.
## Algorithm Execution
### S3 interface
We will describe the internal details of the algorithm next. However, it is recommended to use the S3 interface, which manages the execution of the internal functions and stores the intermediate results.
```{r}
aerep <- simaerep(df_visit)
aerep
```
### Specifying the Evaluation Point visit_med75
In an ongoing trial, patients will have reached different numbers of consecutive visits. To find a cut-off visit that normalizes the data, we specify a single evaluation point for each site based on the number of visits of its patient population.
To determine the visit number at which we evaluate AE reporting, we take the maximum visit of each patient at a site and calculate the median, which we then multiply by 0.75. This gives us a cut-off point that determines which patients will be evaluated. Of those patients, we take the minimum of all maximum visits, thus ensuring that we use the highest visit number possible without excluding more patients from the analysis. To ensure that the sampling pool for that visit is large enough, we require that at least 20% of all patients in the study are available for sampling, i.e. we limit visit_med75 to the 80% quantile of the maximum visits of all patients in the entire study.
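As a rough illustration, the calculation for a single site could be sketched as below. This is a hypothetical helper, not the package internals; it assumes the column names `patnum`, `site_number` and `visit` from the simulated data above, and details such as rounding may differ.
```{r}
# Hypothetical sketch of the visit_med75 logic, not the simaerep implementation.
visit_med75_sketch <- function(df_visit, site) {
  # maximum visit per patient, study-wide and for the site of interest
  max_visits_study <- df_visit %>%
    group_by(patnum) %>%
    summarise(max_visit = max(visit), .groups = "drop")
  max_visits_site <- max_visits_study %>%
    semi_join(filter(df_visit, site_number == site), by = "patnum")
  # median of the site's maximum visits, scaled by 0.75, as patient cut-off
  cutoff <- floor(median(max_visits_site$max_visit) * 0.75)
  # highest visit that keeps all patients reaching the cut-off in the analysis
  visit_med75_adj <- min(max_visits_site$max_visit[max_visits_site$max_visit >= cutoff])
  # cap at the 80% quantile of study-wide maximum visits so that at least
  # 20% of the study's patients remain available for sampling
  min(visit_med75_adj, floor(quantile(max_visits_study$max_visit, 0.8)))
}
# e.g. visit_med75_sketch(df_visit, "S0001")
```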
In practice, we use `site_aggr()` to aggregate the simulated visit-level data to site level. Adjusting `min_pat_pool` changes the minimum ratio of patients that must remain available for sampling and thus the maximum value visit_med75 can take.
```{r warning= FALSE}
df_site <- site_aggr(df_visit, method = "med75_adj", min_pat_pool = 0.2)
df_site %>%
kable()
plot_visit_med75(df_visit, df_site, study_id_str = "A", n_site = 6)
```
Using the adjusted visit_med75 will drop a few patients (dashed lines) but will by definition include at least 50% of all patients, favoring patients with a higher number of visits. By looking at the mean AE development (purple line) we can already easily spot the two under-reporting sites. The same goes for the `mean_ae_site_med75` column in `df_site`. Next we will use bootstrap simulations to determine the probability of obtaining each mean AE value or lower at visit_med75 by chance.
### Bootstrap Simulations
#### Advantage Over Classic Statistical Tests
We could use a classical parametric test and calculate the probability with which we can reject the null hypothesis for the AE counts that we observe at visit_med75.
As we sample the AEs from a Poisson distribution, `poisson.test()` from base R would be appropriate.
However, there are four major problems that we often encounter when trying to describe real-life count data with the Poisson distribution, which is defined by a single parameter:
- over- and underdispersion
- skewness (long right tail)
- inflated zeros
- the variance-mean relationship might not be fixed

(see [Distributions for Modelling Location Scale and Shape, section 5.1.2](https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf))
The true distribution of the AE counts will vary from study to study, and the degree to which these four problems influence it is unknown unless thoroughly investigated.
With the non-parametric approach that we propose, we do not need to worry about these statistical assumptions, since the distribution of the AE counts in the patient pool we draw from will be very close to the unknown true distribution of the underlying AE-generating process.
#### Disadvantages Over Classic Statistical Tests
- **Upper Limit for Uncompliant Sites**. We simulate an underlying compliant patient population using the data we are given. If the fraction of uncompliant sites becomes too high, detection rates decrease. We find that detection rates start decreasing when 30-50% of sites are under-reporting, and the method is no longer usable when a majority (> 50%) of all sites are under-reporting [see article on usability limits](https://openpharma.github.io/simaerep/articles/usability_limits.html).
- **Lower Probability Limit**. The number of repeats determines the smallest probability greater than zero; for example, for r = 1000 the smallest value greater than zero is 0.001.
- **Computationally Expensive**
#### Methodology
*How likely is it to obtain a mean AE value equal to or lower than the one we observe at site C with the same number of patients?*
For illustration purposes we start by simulating 10 hypothetical patient groups for site C, drawing (with replacement) from all patients in the study and marking those groups for which we obtain a mean AE value equal to or lower than the one initially observed (middle).
Instead of 10 times we then simulate 1000 times, count how often we obtain a mean AE value equal to or lower than the one initially observed, and convert that count to a percentage (right).
To illustrate the effect of under-reporting we repeat the same process after removing 2 AEs per patient from site C. We see how the probability of obtaining a mean AE value equal to or lower than the one initially observed decreases (bottom).
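The resampling step can be sketched in a few lines of R. The function below is a hypothetical illustration of the principle, not the `simaerep` implementation; `ae_site` and `ae_pool` stand for vectors of cumulative AE counts at visit_med75 for the site under evaluation and for the study-wide patient pool.
```{r}
# Hypothetical sketch of the bootstrap step, not the simaerep implementation.
# ae_site: cumulative AE counts at visit_med75 for the patients of one site
# ae_pool: cumulative AE counts at visit_med75 for all eligible study patients
prob_lower_sketch <- function(ae_site, ae_pool, r = 1000) {
  mean_obs <- mean(ae_site)
  # draw as many patients as the site has, r times, with replacement
  means_sim <- replicate(
    r,
    mean(sample(ae_pool, size = length(ae_site), replace = TRUE))
  )
  # share of simulated means that are equal to or lower than the observed mean
  mean(means_sim <= mean_obs)
}
```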
```{r fig.width=11, fig.height=9}
plot_sim_examples(size_dot = 4, size_raster_label = 10)
```
#### Application
We can run the simulation described above and, as a benchmark, also perform a Poisson test.
```{r}
df_sim_sites <- sim_sites(df_site, df_visit, r = 1000, poisson_test = TRUE)
df_sim_sites %>%
kable()
```
We find that the probability of obtaining a mean AE count equal to or lower than the one observed at visit_med75 for sites S0001 and S0002 is zero or near zero.
## Alpha Error Correction
Our simulated test data set consists of just 6 sites. However, it is not uncommon for 100 sites or more to participate in the same clinical trial. This would mean that we need to perform 100 statistical tests, and applying a 5% significance threshold would lead to an average of 5 false positives (FP). We therefore need to adjust the calculated p-values and bootstrapped probabilities using `stats::p.adjust(p, method = "BH")`, which applies the [Benjamini-Hochberg procedure](https://www.statisticshowto.com/benjamini-hochberg-procedure/). `eval_sites()` uses the inverted adjusted values to calculate the final bootstrapped AE under-reporting probability (prob_low_prob_ur) and includes the Poisson-test-derived under-reporting probability as a reference (pval_prob_ur).
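As a rough illustration of this adjustment, using made-up probabilities rather than the values computed above:
```{r}
# made-up bootstrapped probabilities, one per site
p_low <- c(0.000, 0.001, 0.35, 0.48, 0.62, 0.81)
p_adj <- p.adjust(p_low, method = "BH")
1 - p_adj  # inverted adjusted values as under-reporting probabilities
```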
```{r}
df_eval <- eval_sites(df_sim_sites, method = "BH")
df_eval %>%
kable()
```
## Plot Results
`plot_study()` plots the mean AE development of all sites and the visit-level data of every site flagged with an AE under-reporting probability above the 95% threshold.
```{r fig.width = 10}
plot_study(df_visit, df_site, df_eval, study = "A")
```
# Applying {simaerep}
Instead of executing all of the previous steps manually, we can use the S3 interface, which manages all intermediate results and provides more convenient plotting via the `plot()` generic function.
```{r}
aerep <- simaerep(df_visit)
aerep
str(aerep)
plot(aerep)
```
To modify the default parameters, please check the `simaerep()` documentation.