feature: Adds imputation and improves missing data summary
Closes #28

**IMPORTANT** Currently there is a [bug in the stable release of
Quarto](quarto-dev/quarto-cli#10196) which prevents rendering of the missing data figures. It
is fixed in development version [`v1.6.1`](https://github.com/quarto-dev/quarto-cli/releases/tag/v1.6.1) (currently
available as pre-release, so if things don't render upgrade to this version).

+ Uses the [mice](https://amices.org/mice/index.html) package to summarise missing data graphically and undertake three
  different methods of multiple imputation. Functions are defined to aid with the plotting of imputed data for
  comparison to the original dataset. Includes notes on tasks that could be done to augment this, such as tabulation.
  This is done via the `sections/_imputation.qmd` file. Includes citation for the mice R package.
+ Moves data dictionary to Appendix.
+ Tidies up tables, adding missing captions and removing `print()`.
+ Moves tables to [panel-tabset](https://quarto.org/docs/interactive/layout.html#tabset-panel) as the document was getting
  long and cluttered. This makes it shorter and easier to navigate. Also used for plots that summarise imputation.
+ Introduces caching to the document so that computationally expensive sections of code are not re-run on every render.
+ Some housekeeping, wrapping lines to 120 characters.
+ Moves summary of missing data patterns to `sections/_missing.qmd`.
+ Removes `dark_theme_minimal()` from plot of final lasso.
+ Tidies up `sections/_logistic.qmd` to explicitly use `family = binomial(link = "logit")` (**NB** Previous work ensured
  the `train` data frame is used in all logistic regression rather than the raw `df`, which includes individuals with missing
  `final_pathology`).
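
A minimal sketch of the explicit `family = binomial(link = "logit")` call from the last bullet, fitted to simulated data. The data frame `train_demo`, its simulated values and the `Benign`/`Cancer` levels are assumptions made for illustration; this is not the code in `sections/_logistic.qmd`.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for the `train` data frame (simulated values only)
train_demo <- data.frame(
  age_at_scan = round(rnorm(n, mean = 55, sd = 15)),
  size_nodule_mm = round(runif(n, min = 5, max = 60))
)

# Simulate a binary outcome whose log-odds rise with nodule size
log_odds <- -3 + 0.06 * train_demo$size_nodule_mm
train_demo$final_pathology <- factor(
  ifelse(runif(n) < plogis(log_odds), "Cancer", "Benign"),
  levels = c("Benign", "Cancer")
)

# Logistic regression with the family and link stated explicitly
fit <- glm(
  final_pathology ~ age_at_scan + size_nodule_mm,
  data = train_demo,
  family = binomial(link = "logit")
)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
print(odds_ratios)
```

Stating the link explicitly does not change the fit, since `"logit"` is the default link for `binomial()`, but it makes the modelling choice visible to the reader.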
ns-rse committed Jul 11, 2024
1 parent e333f02 commit 3b97677
Showing 6 changed files with 115 additions and 89 deletions.
175 changes: 94 additions & 81 deletions index.qmd
@@ -49,6 +49,8 @@ citation:
number-sections: true
notebook-preview-options:
preview: false
execute:
cache: true
---

{{< include sections/_setup.qmd >}}
@@ -64,30 +66,37 @@ biochemical factors have been shown to be
associated with thyroid cancer in patients with thyroid nodules. This has been utilised in studies evaluating
predictors of thyroid cancer with a view of creating a model to aid prediction.
Standard practice on the management of thyroid nodules does not utilise these non ultrasound and non cytological
factors. Combination of these variables considered to be significant with ultrasound and cytological characteristics may improve management of patients with thyroid nodules.
Thyroid nodules are increasingly being detected incidentally with the increased use of imaging in the evaluation of non-thyroid-related
pathologies. This has led to increased investigation of thyroid nodules and a subsequent increase in the number of thyroid
operations in non-diagnostic cases.
factors. Combination of these variables considered to be significant with ultrasound and cytological characteristics may
improve management of patients with thyroid nodules.
Thyroid nodules are increasingly being detected incidentally with the increased use of imaging in the evaluation of
non-thyroid-related pathologies. This has led to increased investigation of thyroid nodules and a subsequent increase
in the number of thyroid operations in non-diagnostic cases.
There are morbidities associated with thyroid surgery including scar, recurrent laryngeal nerve injury,
hypothyroidism and hypoparathyroidism.
We performed a systematic review to evaluate for predictors of thyroid cancer specifically in patients presenting
with thyroid nodules.
The systematic review identified a number of potentially important variables that may be useful in the prediction of thyroid cancer in patients with thyroid nodules. The aim of this study was to evaluate the predictors of thyroid cancer with a view to
The systematic review identified a number of potentially important variables that may be useful in the prediction of thyroid cancer
in patients with thyroid nodules. The aim of this study was to evaluate the predictors of thyroid cancer with a view to
improving prediction of thyroid cancer using machine learning techniques.


## Methods

This study was reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.

### Study design

This was a retrospective cohort study.

### Setting

The study was conducted at the Sheffield Teaching Hospitals NHS Foundation Trust. This is a tertiary referral centre
for the management of thyroid cancer.

### Participants
We included all consecutive patients who presented with thyroid nodule(s) or who were found to have thyroid nodule(s) on ultrasound done for thyroid pathology or for other non-thyroid-related pathologies.

We included all consecutive patients who presented with thyroid nodule(s) or who were found to have thyroid nodule(s)
on ultrasound done for thyroid pathology or for other non-thyroid-related pathologies.

### Variables
The variables evaluated were based on findings from a systematic review evaluating predictors of thyroid cancer in patients
@@ -118,6 +127,16 @@ exists.
Data were cleaned and analysed using the R Statistical Software @r_citation and the Tidyverse (@tidyverse) and Tidymodels
(@tidymodels) collections of packages.

### Imputation

The dataset is incomplete, with missing observations across all variables to varying degrees. In order to
maximise the sample available for analysis, imputation was used to infer missing values. Multivariate Imputation by
Chained Equations ([MICE][mice], implemented in the eponymous R package @vanBuuren2011Dec) was employed, which assumes data are
missing at random (a difficult assumption to formally test). The approach takes each variable with missing data and
attempts to predict it using statistical modelling based on the observed values. In essence it is the same approach as
the statistical methods being employed to try to predict thyroid cancer, and there is a range of statistical techniques
available which include

### Modelling

We used a selection of statistical modelling techniques to evaluate association between variables and thyroid cancer in
@@ -156,14 +175,19 @@ variables @steyerberg2001
#### Random Forest

To add reference
The random forest is an extension of the decision tree methodology that reduces variance. Decision trees are very sensitive to the training data set and can have high variance, which can cause problems when generalising the model. The random forest draws random samples of observations from the dataset to create multiple decision trees. A random subset of variables is selected for each tree during training. The aggregated output of the generated decision trees is then used to create an estimate.
The random forest is an extension of the decision tree methodology that reduces variance. Decision trees are very
sensitive to the training data set and can have high variance, which can cause problems when generalising the
model. The random forest draws random samples of observations from the dataset to create multiple decision trees. A
random subset of variables is selected for each tree during training. The aggregated output of the generated decision
trees is then used to create an estimate.
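
The description above can be sketched with bootstrap samples and a majority vote. This is an illustrative toy built on the `rpart` package and simulated data, not the Tidymodels workflow used in this analysis; `fit_forest()`, `predict_forest()` and the data frame `sim` are hypothetical names. It follows the text in drawing one random subset of variables per tree, whereas standard random forest implementations typically resample the candidate variables at each split.

```r
library(rpart)

set.seed(2024)
n <- 300

# Simulated data: the outcome depends on `a` and `b`; `c` and `d` are noise
sim <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
sim$y <- factor(ifelse(sim$a + sim$b + rnorm(n, sd = 0.5) > 0, "yes", "no"))

fit_forest <- function(data, outcome, n_trees = 25, n_vars = 2) {
  predictors <- setdiff(names(data), outcome)
  lapply(seq_len(n_trees), function(i) {
    rows <- sample(nrow(data), replace = TRUE)  # bootstrap sample of observations
    vars <- sample(predictors, n_vars)          # random subset of variables for this tree
    rpart(reformulate(vars, response = outcome), data = data[rows, ], method = "class")
  })
}

predict_forest <- function(trees, newdata) {
  # One column of predicted classes per tree
  votes <- sapply(trees, function(tree) {
    as.character(predict(tree, newdata, type = "class"))
  })
  # Aggregate by majority vote across trees for each row
  apply(votes, 1, function(v) names(which.max(table(v))))
}

forest <- fit_forest(sim, "y")
pred <- predict_forest(forest, sim)
mean(pred == sim$y)  # training accuracy, which is optimistic
```

Because each tree sees a different resample, the errors of individual trees partly cancel when their votes are combined, which is where the variance reduction comes from.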


#### Gradient Boosting

Gradient boosting is a machine learning algorithm that uses a decision tree as its base model. The data are initially used to train
this decision tree, but its initial prediction is weak, hence the term weak base model. In gradient boosting the process
is iterative; a sequence of decision trees is added to the initial tree. Each tree learns from the prior tree(s) to improve the model, increasing strength and minimising error.
is iterative; a sequence of decision trees is added to the initial tree. Each tree learns from the prior tree(s) to
improve the model, increasing strength and minimising error.
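
The iterative idea can be sketched for a continuous outcome with squared-error loss, where each new tree is fitted to the residuals left by the trees before it. This is a toy on simulated data using `rpart`, not the boosting engine used in this analysis; `boost()`, the learning rate and the tree depth are assumptions chosen for illustration.

```r
library(rpart)

set.seed(7)
n <- 200

# Simulated data with a non-linear signal
sim <- data.frame(x = runif(n, -3, 3))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.2)

boost <- function(data, n_rounds = 50, learning_rate = 0.1) {
  # Start from a deliberately weak prediction: the overall mean
  prediction <- rep(mean(data$y), nrow(data))
  training_error <- numeric(n_rounds)
  for (m in seq_len(n_rounds)) {
    # Each round fits a shallow tree to what the current model still gets wrong
    work <- data.frame(x = data$x, residual = data$y - prediction)
    small_tree <- rpart(residual ~ x, data = work,
                        control = rpart.control(maxdepth = 2))
    # Add a damped version of the new tree's prediction to the running total
    prediction <- prediction + learning_rate * predict(small_tree, work)
    training_error[m] <- mean((data$y - prediction)^2)
  }
  list(prediction = prediction, training_error = training_error)
}

fit <- boost(sim)
baseline_error <- mean((sim$y - mean(sim$y))^2)
c(baseline = baseline_error, boosted = fit$training_error[50])
```

The learning rate controls how much each tree contributes; smaller values need more rounds but tend to reduce overfitting.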

#### SVM

@@ -188,6 +212,10 @@ n_obs <- nrow(df)
```


::: {.panel-tabset}

## Demographics

```{r}
#| label: tbl-patient-demographics
#| eval: true
@@ -197,14 +225,13 @@ n_obs <- nrow(df)
patient_demo <- df |>
dplyr::ungroup() |>
dplyr::select(c("age_at_scan", "gender", "ethnicity")) |>
gtsummary::tbl_summary() |>
gtsummary::modify_caption("Demographics of study population")
patient_demo
print(colnames(patient_demo))
```

@tbl-patient-demographics shows the demographics of patients included in this study. A total of `r n_obs` patients were included in
this study with a median (IQR) age of `r gtsummary::inline_text(patient_demo, variable="age_at_scan")`.
## Clinical Characteristics

```{r}
#| label: tbl-clinical-characteristics
@@ -228,17 +255,16 @@ clinical_charac <- df |>
"exposure_radiation",
"final_pathology",
)) |>
gtsummary::tbl_summary(by = final_pathology) |> gtsummary::add_p() |>
gtsummary::modify_caption("Clinical characteristics between benign and malignant thyroid nodules")
clinical_charac
print(colnames(clinical_charac))
```

@tbl-clinical-characteristics shows the distribution of clinical variables evaluated between benign and malignant thyroid nodules.
## Biomarkers

```{r}
#| label: tbl-biochem-variables
#| label: tbl-biomarkers-variables
#| eval: true
#| echo: false
#| warning: false
@@ -250,12 +276,12 @@ biochem_vars <- df |>
"lymphocytes",
"monocyte",
"final_pathology")) |>
gtsummary::tbl_summary(by = final_pathology) |>
gtsummary::modify_caption("Biochemical variables evaluated between benign and malignant thyroid nodules")
biochem_vars
print(colnames(biochem_vars))
```
## Ultrasound

```{r}
#| label: tbl-ultrasound-characteristics
@@ -271,12 +297,13 @@ ultrasound_char <- df |>
"consistency_nodule",
"cervical_lymphadenopathy",
"final_pathology")) |>
gtsummary::tbl_summary(by = final_pathology) |>
gtsummary::modify_caption("Ultrasound characteristics of benign and malignant nodules")
ultrasound_char
print(colnames(ultrasound_char))
```

## BTA U
```{r}
#| label: u-class-final-path
#| eval: true
@@ -286,11 +313,14 @@ print(colnames(ultrasound_char))
ultrasound_final_path <- df |>
dplyr::ungroup() |>
dplyr::select(c("bta_u_classification", "final_pathology")) |>
gtsummary::tbl_summary(by = bta_u_classification)
gtsummary::tbl_summary(by = bta_u_classification) |>
gtsummary::modify_caption("BTA U classification by final pathology.")
ultrasound_final_path
print(colnames(ultrasound_final_path))
```

## Thyroid Classification

```{r}
#| label: thy-class-final-path
#| eval: true
@@ -300,12 +330,13 @@ print(colnames(ultrasound_final_path))
thy_class_final_path <- df |>
dplyr::ungroup() |>
dplyr::select(c("thy_classification", "final_pathology")) |>
gtsummary::tbl_summary(by = thy_classification)
gtsummary::tbl_summary(by = thy_classification) |>
gtsummary::modify_caption("Thyroid classification by final pathology.")
thy_class_final_path
print(colnames(thy_class_final_path))
```

## Cytology

```{r}
#| label: tbl-cytology-characteristics
@@ -317,29 +348,24 @@ cytology_char <- df |>
dplyr::ungroup() |>
dplyr::select(c("thy_classification",
"final_pathology")) |>
gtsummary::tbl_summary(by = final_pathology) |>
gtsummary::modify_caption("Cytological characteristics of benign and malignant nodules")
cytology_char
print(colnames(cytology_char))
```

:::

@tbl-patient-demographics shows the demographics of patients included in this study. A total of `r n_obs` patients were
included in this study with a median (IQR) age of `r df$age_at_scan |> stats::quantile(probs=c(0.5))` (
`r df$age_at_scan |> stats::quantile(probs=c(0.25))`-`r df$age_at_scan |> stats::quantile(probs=c(0.75))`).
@tbl-clinical-characteristics shows the distribution of clinical variables evaluated between benign and malignant
thyroid nodules.
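
As a small illustration of the inline median (IQR) calculation, the three quantiles can be computed in one call and formatted once. The ages below are made-up values, not study data, and `age_summary` is a hypothetical name. Note that `stats::quantile()` stops with an error when the input contains missing values unless `na.rm = TRUE` is set, which matters for a dataset with missing observations.

```r
# Made-up ages, including one missing value
age_at_scan <- c(31, 40, 45, 52, 58, 63, 70, 74, 80, NA)

# Compute the quartiles in a single call, dropping missing values
age_quartiles <- stats::quantile(age_at_scan, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Format as "median (lower quartile-upper quartile)"
age_summary <- sprintf(
  "%.0f (%.0f-%.0f)",
  age_quartiles[["50%"]], age_quartiles[["25%"]], age_quartiles[["75%"]]
)
age_summary
# → "58 (45-70)"
```

Computing the summary once in a chunk and referencing the result inline keeps the prose line short and avoids repeating the same expression three times.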

### Data Description

Details of data completeness and other descriptive aspects go here.


```{r}
#| label: tbl-variables
#| purl: true
#| eval: true
#| echo: false
#| warning: false
#| tbl-caption: "Description of variables in the Sheffield Thyroid dataset."
var_labels |>
as.data.frame() |>
kable(col.names = c("Description"),
caption="Description of variables in the Sheffield Thyroid dataset.")
```

A summary of the variables that are available in this data set can be found in @tbl-variables.

@@ -375,7 +401,7 @@ The completeness of the data is shown in @tbl-data-completeness . Where
variables are continuous (e.g. `age` or `size_nodule_mm`) basic summary statistics in the form of mean, standard deviation,
median and inter-quartile range are given. For categorical variables that are logical `TRUE`/`FALSE`
(e.g. `palpable_nodule`) the number of `TRUE` observations and the percentage (of those with observed data for that
variable) are shown along with the number that are _Unknown_. For categorical variables such as `gender` and percentages
variable) are shown along with the number that are _Unknown_. For categorical variables such as `gender` percentages
in each category are reported. For all variables an indication of the number of missing observations is also given and
it is worth noting that there are `r gtsummary::inline_text(df_summary, variable="final_pathology", level="Unknown")`
instances where the `final_pathology` is not known which reduces the sample size to
@@ -384,50 +410,16 @@ instances where the `final_pathology` is not known which reduces the sample size

#### Missing Data

More detailed tabulations of missing data by variable are shown in @tbl-naniar-miss-var-summary which shows the number
and percentage of missing data for each variable and by case in @tbl-naniar-miss-case-table which shows how much missing
data each case has. A visualisation of this is shown in @fig-visdat-vis-missing .

```{r}
#| label: tbl-naniar-miss-var-summary
#| tbl-caption: Summary of missing data by variable.
#| purl: true
#| eval: true
#| echo: false
#| output: true
naniar::miss_var_summary(df_complete) |>
knitr::kable(col.names=c("Variable", "N", "%"),
caption="Summary of missing data by variable.")
```

```{r}
#| label: tbl-naniar-miss-case-table
#| tbl-caption: Summary of missing data by case, how much missing data is there per person?
#| purl: true
#| eval: true
#| echo: false
#| output: true
naniar::miss_case_table(df_complete) |>
knitr::kable(col.names=c("Missing Variables", "N", "%"),
caption="Summary of missing data by case, how much missing data is there per person?")
```

```{r}
#| label: fig-visdat-vis-missing
#| purl: true
#| eval: true
#| echo: true
#| output: true
## This prevents the document from preview/rendering for some reason???
## visdat::vis_miss(df_complete)
```


{{< include sections/_missing.qmd >}}

#### Imputation

{{< include sections/_imputation.qmd >}}
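
The chained-equations idea described in the Methods (predict each incomplete variable from the others, cycling until the fills settle) can be sketched in base R. This is a deliberately simplified, deterministic single imputation on simulated data and is not what the mice package does internally: mice adds random draws and produces several imputed datasets that are analysed and pooled. `impute_chained()` and the data frame `demo` are hypothetical names.

```r
set.seed(1)
n <- 100

# Simulated, correlated variables
demo <- data.frame(x1 = rnorm(n))
demo$x2 <- 0.8 * demo$x1 + rnorm(n, sd = 0.5)
demo$x3 <- -0.5 * demo$x1 + 0.3 * demo$x2 + rnorm(n, sd = 0.5)

# Remove values at random from two of the variables
demo$x2[sample(n, 15)] <- NA
demo$x3[sample(n, 20)] <- NA

impute_chained <- function(data, n_iter = 5) {
  missing_mask <- is.na(data)
  filled <- data
  # Step 1: crude starting fill using the observed mean of each variable
  for (v in names(filled)) {
    filled[[v]][missing_mask[, v]] <- mean(data[[v]], na.rm = TRUE)
  }
  # Step 2: cycle through the incomplete variables, re-predicting each
  # from all other variables using the rows where it was observed
  incomplete <- names(filled)[colSums(missing_mask) > 0]
  for (i in seq_len(n_iter)) {
    for (v in incomplete) {
      observed <- !missing_mask[, v]
      model <- lm(
        reformulate(setdiff(names(filled), v), response = v),
        data = filled[observed, ]
      )
      filled[[v]][!observed] <- predict(model, newdata = filled[!observed, ])
    }
  }
  filled
}

imputed <- impute_chained(demo)
colSums(is.na(imputed))  # no missing values remain
```

Observed values are never altered; only the cells that were missing are replaced, which is the property the comparison plots of imputed versus original data rely on.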

### Modelling

**TODO** - This table feels like duplication of @tbl-data-completeness, perhaps have just one? (`@ns-rse` 2024-07-11).

The predictor variables selected to predict `final_pathology` are shown in @tbl-predictors.

```{r}
@@ -544,3 +536,24 @@ Comparing the sensitivity of the different models goes here.
## Conclusion

The take-away message is....these things are hard!


## Appendix

### Data Dictionary

```{r}
#| label: tbl-variables
#| purl: true
#| eval: true
#| echo: false
#| warning: false
#| tbl-caption: "Description of variables in the Sheffield Thyroid dataset."
var_labels |>
as.data.frame() |>
kable(col.names = c("Description"),
caption="Description of variables in the Sheffield Thyroid dataset.")
```


[mice]: https://amices.org/mice/
15 changes: 14 additions & 1 deletion references.bib
@@ -1,4 +1,17 @@
@article{alcaraz2022,
@article{vanBuuren2011Dec,
author = {van Buuren, Stef and Groothuis-Oudshoorn, Karin},
title = {{mice: Multivariate Imputation by Chained Equations in R}},
journal = {J. Stat. Soft.},
volume = {45},
pages = {1--67},
year = {2011},
month = dec,
issn = {1548-7660},
doi = {10.18637/jss.v045.i03},
abstract = {{The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.}}
}

@article{alcaraz2022,
author = {Alcaraz, Javier and Anton-Sanchez, Laura and Monge, Juan Francisco},
title = {{The Concordance Test, an Alternative to Kruskal-Wallis Based on the Kendall-$\tau$ Distance: An R Package}},
journal = {R Journal},
5 changes: 2 additions & 3 deletions sections/_lasso.qmd
@@ -64,8 +64,7 @@ final_lasso_kfold |>
Variable = fct_reorder(Variable, Importance)
) |>
ggplot(mapping = aes(x = Importance, y = Variable, fill = Sign)) +
geom_col() +
dark_theme_minimal()
geom_col()
```

**NB** - We may wish to inspect the coefficients at each step of tuning. A related example of how to do this can be found in
@@ -78,7 +77,7 @@ Tidymodels framework the model `fit` is wrapped up inside (hence the above artic


``` {r}
#| label: lasso-save
#| purl: true
#| eval: true
#| echo: true