
Fast estimation of generalized linear models with high dimensional categorical variables in Julia


caibengbu/GLFixedEffectModels.jl

 
 


GLFixedEffectModels.jl


This package estimates generalized linear models with high dimensional categorical variables. It builds on Matthieu Gomez's FixedEffects.jl, Amrei Stammann's Alpaca, and Sergio Correia's ppmlhdfe.

Installation

] add GLFixedEffectModels

Example use

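using GLFixedEffectModels, GLM, Distributions
using RDatasets

# Load the iris data and build a binary outcome: 1.0 if SepalLength > 5.0, else 0.0
df = dataset("datasets", "iris")
df.binary = zeros(Float64, size(df,1))
df[df.SepalLength .> 5.0,:binary] .= 1.0
# String copy of Species and a random three-level label, used below as cluster variables
df.SpeciesStr = string.(df.Species)
idx = rand(1:3,size(df,1),1)
a = ["A","B","C"]
df.Random = vec([a[i] for i in idx])

# Logit model with Species fixed effects; start gives one starting value per coefficient
m = @formula binary ~ SepalWidth + fe(Species)
x = nlreg(df, m, Binomial(), LogitLink(), start = [0.2] )

# Same model with a second regressor and standard errors clustered on two variables
m = @formula binary ~ SepalWidth + PetalLength + fe(Species)
nlreg(df, m, Binomial(), LogitLink(), Vcov.cluster(:SpeciesStr,:Random) , start = [0.2, 0.2] )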
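
As a further sketch, the same kind of call can also pass some of the optional keyword arguments listed in the Documentation section (maxiter, maxiter_center, verbose). The values used here are arbitrary illustrations rather than recommendations, and the covariance estimator is left out, as in the first call above.

```julia
using GLFixedEffectModels, GLM, Distributions
using RDatasets

# Same data preparation as in the example above, written in one step
df = dataset("datasets", "iris")
df.binary = Float64.(df.SepalLength .> 5.0)

m = @formula binary ~ SepalWidth + fe(Species)

# Illustrative values only: tighter iteration caps and per-iteration output
x = nlreg(df, m, Binomial(), LogitLink();
          start = [0.2],
          maxiter = 500,
          maxiter_center = 5000,
          verbose = true)
```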

Documentation

The main function is nlreg(), which returns a GLFixedEffectModel <: RegressionModel.

nlreg(df, formula::FormulaTerm,
    distribution::Distribution,
    link::GLM.Link,
    vcov::CovarianceEstimator; ...)

The required arguments are:

  • df: a Table
  • formula: A formula created using @formula.
  • distribution: A Distribution. See the documentation of GLM.jl for valid distributions.
  • link: A GLM.Link function. See the documentation of GLM.jl for valid link functions.
  • vcov: A CovarianceEstimator used to compute the variance-covariance matrix. It can be omitted, as in the first example above.

The optional arguments are:

  • save::Union{Bool, Symbol} = false: Should residuals and any estimated fixed effects be saved in a dataframe? Use save = :residuals to save only residuals. Use save = :fe to save only fixed effects.
  • method::Symbol: A symbol for the method. Default is :cpu. Alternatively, :gpu requires CuArrays; in that case, set double_precision = false to use Float32. This option is the same as in the FixedEffectModels.jl package.
  • double_precision::Bool = true: Should the demeaning operation use Float64 (64-bit floats) rather than Float32?
  • drop_singletons = true : Drop observations that are perfectly classified.
  • contrasts::Dict = Dict(): An optional Dict of contrast codings for each categorical variable in the formula. Any unspecified variables will have DummyCoding.
  • maxiter::Integer = 1000: Maximum number of iterations in the Newton-Raphson routine.
  • maxiter_center::Integer = 10000: Maximum number of iterations for the centering procedure.
  • dev_tol::Real : Tolerance level for the first stopping condition of the maximization routine.
  • rho_tol::Real : Tolerance level for the step-halving in the maximization routine.
  • step_tol::Real : Tolerance level that accounts for rounding errors inside the step-halving routine.
  • center_tol::Real : Tolerance level for the stopping condition of the centering algorithm. Defaults to 1e-8 if double_precision = true, 1e-6 otherwise.
  • separation::Vector{Symbol} = Symbol[] : Methods used to detect and deal with separation. Supported elements are :mu, :fe, :ReLU, and in the future, :simplex. :mu truncates mu at separation_mu_lbound or separation_mu_ubound. :fe finds categories of the fixed effects that only exist when y is at the separation point. :ReLU detects separation using the ReLU algorithm, with the maximum number of iterations set by separation_ReLU_maxiter and the tolerance set by separation_ReLU_tol.
  • separation_mu_lbound::Real = -Inf : Lower bound for the separation detection/correction heuristic (on mu). What a reasonable value would be depends on the model that you're trying to fit.
  • separation_mu_ubound::Real = Inf : Upper bound for the separation detection/correction heuristic.
  • separation_ReLU_tol::Real = 1e-4 : Tolerance level for the ReLU algorithm.
  • separation_ReLU_maxiter::Integer = 1000 : Maximal number of iterations for the ReLU algorithm.
  • verbose::Bool = false : If true, prints output on each iteration.
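
To make the :fe and :mu descriptions above more concrete, here is a minimal plain-Julia sketch of the two ideas. It is not the package's implementation: separated_groups is a hypothetical helper, and the separation point (y == 0) and the bounds on mu are assumptions chosen for illustration.

```julia
# Hypothetical helper, NOT part of GLFixedEffectModels.jl.
# Returns the fixed-effect categories whose observations all have y equal to
# the separation point `at` (assumed here to be 0).
function separated_groups(y::AbstractVector{<:Real}, group::AbstractVector; at = 0)
    all_at = Dict{eltype(group), Bool}()
    for (yi, gi) in zip(y, group)
        # A category stays flagged only while every y seen so far equals `at`
        all_at[gi] = get(all_at, gi, true) && (yi == at)
    end
    return sort!([g for (g, flag) in all_at if flag])
end

y     = [0, 0, 3, 1, 0, 0]
group = ["a", "a", "b", "b", "c", "c"]

sep  = separated_groups(y, group)        # categories "a" and "c" have only zeros
keep = [!(g in sep) for g in group]      # mask of observations outside those categories

# The :mu idea: truncate fitted means at a lower and an upper bound.
# The bounds are illustrative; reasonable values depend on the model.
lo, hi  = 1e-6, 1 - 1e-6
mu      = [1e-12, 0.5, 1 - 1e-12]
clamped = clamp.(mu, lo, hi)
```

In the package itself these behaviours are selected through the separation keyword together with separation_mu_lbound and separation_mu_ubound, as described in the list above.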

The function returns a GLFixedEffectModel object which supports the StatsBase.RegressionModel abstraction. It can be displayed in table form by using RegressionTables.jl.
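
A minimal sketch of inspecting a fitted model follows. It assumes that the generic accessors coef, vcov and nobs (exported by GLM.jl) are implemented for GLFixedEffectModel. Since this README also lists the StatsBase interface as unfinished, other accessors may be missing.

```julia
using GLFixedEffectModels, GLM, Distributions
using RDatasets

# Same data and model as in the first example
df = dataset("datasets", "iris")
df.binary = Float64.(df.SepalLength .> 5.0)
m = @formula binary ~ SepalWidth + fe(Species)
x = nlreg(df, m, Binomial(), LogitLink(), start = [0.2])

# Assumed to be available through the RegressionModel abstraction
b = coef(x)     # estimated coefficients (here: SepalWidth only)
V = vcov(x)     # variance-covariance matrix of the coefficients
n = nobs(x)     # number of observations used in estimation
```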

Bias correction methods

The package experimentally supports bias correction methods for the following models:

  • Binomial regression, Logit link, Two-way, Classic (Fernández-Val and Weidner (2016, 2018))
  • Binomial regression, Probit link, Two-way, Classic (Fernández-Val and Weidner (2016, 2018))
  • Binomial regression, Logit link, Two-way, Network (Hinz, Stammann and Wanner (2020) & Fernández-Val and Weidner (2016))
  • Binomial regression, Probit link, Two-way, Network (Hinz, Stammann and Wanner (2020) & Fernández-Val and Weidner (2016))
  • Binomial regression, Logit link, Three-way, Network (Hinz, Stammann and Wanner (2020))
  • Binomial regression, Probit link, Three-way, Network (Hinz, Stammann and Wanner (2020))
  • Poisson regression, Log link, Three-way, Network (Weidner and Zylkin (2021))
  • Poisson regression, Log link, Two-way, Network (Weidner and Zylkin (2021))

Things that still need to be implemented

  • Better default starting values
  • Weights
  • Better StatsBase interface & prediction
  • Better benchmarking

Related Julia packages

  • FixedEffectModels.jl estimates linear models with high dimensional categorical variables (and with or without endogenous regressors).
  • FixedEffects.jl is a package for fast pseudo-demeaning operations using LSMR. Both this package and FixedEffectModels.jl build on this.
  • Alpaca.jl is a wrapper to the Alpaca R package, which solves the same tasks as this package.
  • GLM.jl estimates generalized linear models, but without explicit support for categorical regressors.
  • Econometrics.jl provides routines to estimate multinomial logit and other models.
  • RegressionTables.jl supports pretty printing of results from this package.

References

Correia, S., Guimarães, P. and Zylkin, T., 2019. Verifying the existence of maximum likelihood estimates for generalized linear models. Working paper, https://arxiv.org/abs/1903.01633

Fernández-Val, I. and Weidner, M., 2016. Individual and time effects in nonlinear panel models with large N, T. Journal of Econometrics, 192(1), pp.291-312.

Fernández-Val, I. and Weidner, M., 2018. Fixed effects estimation of large-T panel data models. Annual Review of Economics, 10, pp.109-138.

Fong, D.C. and Saunders, M., 2011. LSMR: An Iterative Algorithm for Sparse Least-Squares Problems. SIAM Journal on Scientific Computing.

Hinz, J., Stammann, A. and Wanner, J., 2021. State dependence and unobserved heterogeneity in the extensive margin of trade.

Stammann, A., 2018. Fast and Feasible Estimation of Generalized Linear Models with High-Dimensional k-way Fixed Effects. Mimeo, Heinrich-Heine University Düsseldorf.

Weidner, M. and Zylkin, T., 2021. Bias and consistency in three-way gravity models. Journal of International Economics, 132, p.103513.
