README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: inline
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# offsetreg

<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/offsetreg)](https://CRAN.R-project.org/package=offsetreg)
[![R-CMD-check](https://github.com/mattheaphy/offsetreg/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mattheaphy/offsetreg/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

This package extends the [tidymodels](https://www.tidymodels.org) ecosystem to enable usage of predictive models with offset terms. Offset terms are predictors in a linear model that with a known *a priori* value. In other words, these terms do not have an associated coefficient ($\beta_i$) that needs to be determined. In a generalized linear model (GLM), an offset specification looks like:

$$
\hat{Y} = g^{-1}(offset + \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p)
$$

Models with offsets are most useful when working with count data or when fitting an adjustment model on top of an existing model with a prior expectation. The former situation is common in insurance where data is often aggregated or weighted by exposures. The latter is common in life insurance where industry mortality tables are often used as a starting point for setting assumptions on a particular block of business.

In general, offsetreg functions are named after existing functions from tidymodels or other modeling packages suffixed by `_offset` (or `_exposure`). The modeling engines in this package are wrappers around existing, well-known modeling functions. These engines all include the argument `offset_col` (or `exposure_col`) which is used to specify which column in the data passed to the engine contains offsets.

Currently, the following model specifications and engines are available:

- `poisson_reg_offset()` - create a Poisson GLM spec. Engines:

  - `glm_offset` - a wrapper around `stats::glm()`
  - `glmnet_offset` - a wrapper around `glmnet::glmnet()`

- `boost_tree_offset()` - create an ensemble of boosted Poisson decision trees. Engines:

  - `xgboost_offset` - a wrapper around `xgboost::xgb.train()`

- `decision_tree_exposure()` - create a Poisson decision tree with weighted exposures. Engines:

  - `rpart_exposure` - a wrapper around `rpart::rpart()`

## Installation

The offsetreg package can be installed from CRAN with:

``` r
install.packages("offsetreg")
```

You can install the development version of offsetreg from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("mattheaphy/offsetreg")
```

## Basic usage

The `us_deaths` data set contains United States deaths, population estimates, and crude mortality rates for ages 25+ from the CDC Multiple Causes of Death Files for the years 2011-2020.

```{r data, warning=FALSE, message=FALSE}
library(offsetreg)
library(parsnip)

us_deaths
```

Assume we want to create a poisson model for the number of deaths. First, an offset term is created by taking the natural log of population, which is our exposure basis. A natural log is used because the link function in poisson regression is the exponential function.

```{r add-offset}
us_deaths$offset <- log(us_deaths$population)
```

Create a poisson regression model with an offset, and set the engine to "glm_offset". The engine-specific argument `offset_col` must refer to the name of the column in `us_deaths` that contains offsets.

**Note**: The offset term should always be included in model formulas.

```{r model}
glm_off <- poisson_reg_offset() |>
  # set the modeling engine and specify the offset column
  set_engine("glm_offset", offset_col = "offset") |>
  # always include the offset term in the model formula
  fit(deaths ~ gender + age_group + year + offset, data = us_deaths)

glm_off
```

Verify that coefficients match `stats::glm()`.

```{r coef-compare}
glm_base <- glm(deaths ~ gender + age_group + year, offset = offset,
                data = us_deaths, family = 'poisson')

identical(extract_fit_engine(glm_off) |> coef(), coef(glm_base))
```