This package extends the tidymodels
ecosystem to enable usage of predictive models with offset terms. Offset
terms are predictors in a linear model that with a known a priori
value. In other words, these terms do not have an associated coefficient
(
Models with offsets are most useful when working with count data or when fitting an adjustment model on top of an existing model with a prior expectation. The former situation is common in insurance where data is often aggregated or weighted by exposures. The latter is common in life insurance where industry mortality tables are often used as a starting point for setting assumptions on a particular block of business.
In general, offsetreg functions are named after existing functions from
tidymodels or other modeling packages suffixed by _offset
(or
_exposure
). The modeling engines in this package are wrappers around
existing, well-known modeling functions. These engines all include the
argument offset_col
(or exposure_col
) which is used to specify which
column in the data passed to the engine contains offsets.
Currently, the following model specifications and engines are available:
-
poisson_reg_offset()
- create a Poisson GLM spec. Engines:glm_offset
- a wrapper aroundstats::glm()
glmnet_offset
- a wrapper aroundglmnet::glmnet()
-
boost_tree_offset()
- create an ensemble of boosted Poisson decision trees. Engines:xgboost_offset
- a wrapper aroundxgboost::xgb.train()
-
decision_tree_exposure()
- create a Poisson decision tree with weighted exposures. Engines:rpart_exposure
- a wrapper aroundrpart::rpart()
The offsetreg package can be installed from CRAN with:
install.packages("offsetreg")
You can install the development version of offsetreg from GitHub with:
# install.packages("devtools")
devtools::install_github("mattheaphy/offsetreg")
The us_deaths
data set contains United States deaths, population
estimates, and crude mortality rates for ages 25+ from the CDC Multiple
Causes of Death Files for the years 2011-2020.
library(offsetreg)
library(parsnip)
us_deaths
#> # A tibble: 140 × 6
#> gender age_group year deaths population qx
#> <fct> <fct> <int> <dbl> <dbl> <dbl>
#> 1 Female 25-34 2011 13663 20746335 0.000659
#> 2 Female 25-34 2012 13808 20970529 0.000658
#> 3 Female 25-34 2013 14001 21203096 0.000660
#> 4 Female 25-34 2014 14480 21546290 0.000672
#> 5 Female 25-34 2015 15736 21838064 0.000721
#> 6 Female 25-34 2016 17359 22077505 0.000786
#> 7 Female 25-34 2017 18066 22351311 0.000808
#> 8 Female 25-34 2018 17980 22487065 0.000800
#> 9 Female 25-34 2019 17827 22581141 0.000789
#> 10 Female 25-34 2020 21654 22625267 0.000957
#> # ℹ 130 more rows
Assume we want to create a poisson model for the number of deaths. First, an offset term is created by taking the natural log of population, which is our exposure basis. A natural log is used because the link function in poisson regression is the exponential function.
us_deaths$offset <- log(us_deaths$population)
Create a poisson regression model with an offset, and set the engine to
“glm_offset”. The engine-specific argument offset_col
must refer to
the name of the column in us_deaths
that contains offsets.
Note: The offset term should always be included in model formulas.
glm_off <- poisson_reg_offset() |>
# set the modeling engine and specify the offset column
set_engine("glm_offset", offset_col = "offset") |>
# always include the offset term in the model formula
fit(deaths ~ gender + age_group + year + offset, data = us_deaths)
glm_off
#> parsnip model object
#>
#>
#> Call: stats::glm(formula = formula, family = family, data = data, weights = weights,
#> offset = offset)
#>
#> Coefficients:
#> (Intercept) genderMale age_group35-44 age_group45-54 age_group55-64
#> -18.337940 0.327632 0.442935 1.212463 1.990698
#> age_group65-74 age_group75-84 age_group85+ year
#> 2.713410 3.645763 4.770408 0.005683
#>
#> Degrees of Freedom: 139 Total (i.e. Null); 131 Residual
#> Null Deviance: 51700000
#> Residual Deviance: 237800 AIC: 239800
Verify that coefficients match stats::glm()
.
glm_base <- glm(deaths ~ gender + age_group + year, offset = offset,
data = us_deaths, family = 'poisson')
identical(extract_fit_engine(glm_off) |> coef(), coef(glm_base))
#> [1] TRUE