Other methods for PIs #6

Closed
azev77 opened this issue Oct 7, 2022 · 14 comments
Labels
enhancement (New feature or request)

Comments

azev77 (Collaborator) commented Oct 7, 2022

When you recover from COVID, can we discuss implementing other methods for prediction intervals besides the naive method?

I have code for the Jackknife+:
https://www.stat.cmu.edu/~ryantibs/papers/jackknife.pdf

and additional conformal methods that can handle heteroskedasticity...
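
(Sketching the idea for context: one standard way to handle heteroskedasticity is to normalize the residual by an estimated scale, i.e. use the score |y_i - μ̂(x_i)| / σ̂(x_i), so the resulting interval becomes μ̂(x) ± q̂ · σ̂(x).)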

pat-alt (Member) commented Oct 8, 2022

Sure thing! I'll probably turn back to this later next week - happy to have a chat then. Obviously, do also feel free to create a PR in the meantime.

pat-alt added the enhancement label Oct 8, 2022
azev77 (Collaborator, Author) commented Oct 11, 2022

Here is some code for the Jackknife+ (and friends) prediction intervals:

using Distributions, Plots, Random, Tables, GLMNet, MLJ;

n= 1_000; σ=1.0;
X = [ones(n) sin.(1:n) cos.(n:-1:1) exp.(sin.(1:n))]
θ = [9.0; 7.0; 3.0; 21]
Noise = randn(MersenneTwister(49), n) # n rows 
cef(x;θ=θ)  = x*θ;      # Conditional Expectation Function 
skedastic(x;σ=σ) = σ;   # Skedastic Function. Homoskedastic.
y = cef(X;θ=θ) + skedastic(X;σ=σ) .* Noise

train, calibration, test = partition(eachindex(y[:,1]), 0.4, 0.4)
#train, calibration, test = partition(eachindex(y[:,1]), 0.4, 0.4, shuffle=true, rng=444)

i_new = test[1] # index of new data point.
y_train = y[train]; X_train = X[train,:];

scfit(x,y) = GLMNet.glmnetcv(x, y, nlambda=1_000, alpha = 1.0) # Ridge/Lasso alpha = 0/1
scpr(m,x)  = GLMNet.predict(m, x);

α = 0.05 # miscoverage rate α∈[0,1]

# compute in-sample residuals in the training data 
m  = scfit(X_train, y_train);
ŷ = scpr(m, X); 
ŷ_new_is = ŷ[i_new];
res_is = y_train .- ŷ[train] # in-sample residuals.

# compute LOO residuals in the training data 
res_LOO, ŷ_new_LOO = Float64[], Float64[];  # LOO residuals & LOO predictions at the new point
for tt in train 
    println(tt)
    #
    ty = y_train[train .!= tt]    # y_train leave out tt
    tX = X_train[train .!= tt,:]  # X_train leave out tt
    #
    tm = scfit(tX, ty)             # fit on training data minus row==tt
    tŷ = scpr(tm, X)               # predict on dataset
    #
    push!(res_LOO, y[tt] - tŷ[tt]) # LOO residual 
    push!(ŷ_new_LOO,    tŷ[i_new]) # LOO prediction 
end 
res_LOO
ŷ_new_LOO



"Naive PI: use in-sample residuals to estimate oos residuals"
"problem: in-sample residuals usually smaller than oos residuals"
err_Naive = quantile(abs.(res_is), 1 - α)
LB_Naive = ŷ_new_is - err_Naive 
UB_Naive = ŷ_new_is + err_Naive

"Jackknife PI: use training LOO residuals to estimate oos residuals"
"problem: sensitive to instability"
err_LOO = quantile(abs.(res_LOO), 1 - α)
LB_J = ŷ_new_is - err_LOO 
UB_J = ŷ_new_is + err_LOO

"Jackknife+ PI: use sample quantile of each ŷ_new_LOO[tt] adjusted by its own res_LOO[tt]"
LB_Jp = quantile(ŷ_new_LOO - abs.(res_LOO),   α)
UB_Jp = quantile(ŷ_new_LOO + abs.(res_LOO), 1-α) 

"Jackknife-minmax PI"
LB_Jmm = minimum(ŷ_new_LOO) - err_LOO
UB_Jmm = maximum(ŷ_new_LOO) + err_LOO
#
#v = ŷ_new_LOO - abs.(res_LOO);
#quantile(v,α) == -quantile(-v,1-α)  # SANITY CHECK 

LB_Naive, LB_J, LB_Jp, LB_Jmm
UB_Naive, UB_J, UB_Jp, UB_Jmm


#TODO:
"Split Conformal/Full Conformal/K-Fold CV+/K-fold cross-conformal"

@ablaom (I had trouble implementing this in MLJ, so I used GLMNet.jl.)
The idea is pretty simple (as explained in the paper):
[equation screenshot from the paper]

Naive prediction intervals (ignore overfitting):
[equation screenshots]

Properties:
[equation screenshot]

Notation (sample quantile):
[equation screenshots]

Also:
[equation screenshots]

Some theoretical results:
[equation screenshots]
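
For reference, my transcription of the interval definitions in the screenshots above (from memory of the paper, so the notation may be slightly off):

Naive:             \hat{C}^{naive}_{n,\alpha}(X_{n+1}) = \hat{\mu}(X_{n+1}) \pm \hat{q}^{+}_{n,\alpha}\{ |Y_i - \hat{\mu}(X_i)| \}
Jackknife:         \hat{\mu}(X_{n+1}) \pm \hat{q}^{+}_{n,\alpha}\{ R_i^{LOO} \},  where  R_i^{LOO} = |Y_i - \hat{\mu}_{-i}(X_i)|
Jackknife+:        [ \hat{q}^{-}_{n,\alpha}\{ \hat{\mu}_{-i}(X_{n+1}) - R_i^{LOO} \},\  \hat{q}^{+}_{n,\alpha}\{ \hat{\mu}_{-i}(X_{n+1}) + R_i^{LOO} \} ]
Jackknife-minmax:  [ \min_i \hat{\mu}_{-i}(X_{n+1}) - \hat{q}^{+}_{n,\alpha}\{ R_i^{LOO} \},\  \max_i \hat{\mu}_{-i}(X_{n+1}) + \hat{q}^{+}_{n,\alpha}\{ R_i^{LOO} \} ]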

@ryantibs this is probably a dumb question (I may well have misunderstood the paper), but suppose I want a prediction interval that contains Y_{n+1} 90% of the time (α = 0.10).
Does Theorem 1 imply the 90% Jackknife+ PI will contain it at least 80% of the time?
Does Theorem 2 imply the 90% Jackknife-minmax PI will contain it at least 90% of the time?
If so, would that mean Jackknife-minmax PIs are "better" than Jackknife+?

@ryantibs

Yes, that is correct. Of course if you're seeking guaranteed 90% coverage, so \alpha = 0.1, then you could always use the JK+ with \alpha = 0.05.

However, I wouldn't say the interpretation is that the JK-minmax is better. The JK+ at level \alpha often has close to level 1-\alpha coverage in practice, and provably so under stability conditions (as the paper shows). In practice, the JK-minmax is often overly conservative. So practically, I would favor the JK+ in most scenarios.

azev77 (Collaborator, Author) commented Oct 12, 2022

Thanks @ryantibs! I re-read the paper.
If we assume exchangeability, coverage for JK+ is ≥ 1-2α.
If we assume exchangeability & out-of-sample stability, coverage for JK+ is ≥ 1-α (up to a term that vanishes as n grows).

  1. Are there analogous theoretical results that bound the maximum coverage of these prediction intervals?
    I've seen "valid PIs" defined as: [equation screenshot]

  2. IIUC what the paper calls "assumption-free theory" relies on the assumption of exchangeability. If we are predicting time-series data, this can be a very strong assumption...

pat-alt (Member) commented Oct 12, 2022

Nice one @azev77 👍🏽 This should be straightforward to add here. Happy to do that myself, but perhaps you'd prefer creating a PR so you'll show up as a contributor? (I'd like to turn to #5 first anyway and have some work to do this week on some other packages.)

Edit: I've invited you as a collaborator and created a separate branch for this.

@ryantibs

As a general comment, I would say that validity just means that the coverage is at least 1-\alpha. This just means you don't undercover. The upper bound is of course nice, and says that you don't overcover by "too much". Traditional conformal methods have this property.

  1. We don't have an overcoverage bound for JK+.

  2. "Assumption-free" is used to refer to the fact that there are no distributional assumptions on the joint distribution of (X,Y). But yes to your broader point, certainly I agree that i.i.d. (or exchangeable) sampling is key. None of the traditional conformal methods, JK+, or CV+ really apply outside of this setting. For time series you will have to look at extensions or adaptions of these techniques---there are by now several, take a look fro example at our "beyond exchangeability" paper and references therein.

azev77 (Collaborator, Author) commented Oct 12, 2022

@pat-alt thanks for being so open to collaborating.
I have some other work I need to finish before I look into PRs.
In the meantime, here is a table I made to try to summarize some of the PI methods in @ryantibs's JK+ paper
(to do: implement the heteroskedasticity-robust (sorta) PIs from the PNAS paper by @kwuthrich et al.):
[table screenshot]
Notation:
[notation screenshot]

azev77 (Collaborator, Author) commented Oct 13, 2022

I just tried to understand what the package does to compute naive PIs for regression (NaiveConformalRegressor).
I'll point out a small warning regarding terminology.

The JK+ paper (linked above) uses "naive" for methods where the score is computed from in-sample residuals.
- This is naive because we expect residuals outside the training data to be bigger...
The method this package calls naive instead uses out-of-sample residuals, with outcomes from the calibration data.
- Probably different papers in the literature use different definitions of "naive"...

using MLJ 
import Statistics: quantile
EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees

#Make Data 
X, y = randn(1_000, 2), randn(1_000)
train, calibration, test = partition(eachindex(y), 0.4, 0.4)
Xcal = X[calibration,:]; ycal = y[calibration];
Xnew = X[test,:];        ynew = y[test];

## 1: fit model on training data ##
model = EvoTreeRegressor() 
mach = machine(model, X, y)
fit!(mach, rows=train)

#Get PIs manually w/o a pkg (NaiveConformalRegressor)

## 2: non-conformity scores w/ calibration data ##
ŷcal = MLJ.predict(mach, Xcal) # MLJ.predict(mach, rows=calibration)
scores_cal = @.(abs(ŷcal - ycal))
scores_cal = sort(scores_cal, rev=true) # sorted non-conformity scores

## 3: get PIs ##
α=0.05; # miscoverage α ∈ [0,1]
n = length(scores_cal)
p̂ = ceil(((n+1) * (1-α))) / n
p̂ = clamp(p̂, 0.0, 1.0)
q̂ = quantile(scores_cal, p̂)
ŷnew = MLJ.predict(mach, Xnew) # MLJ.predict(mach, rows=test)
PInew = map(x -> ["lower" => x .- q̂, "upper" => x .+ q̂],eachrow(ŷnew))
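
# Quick sanity check (just a sketch, reusing the variables above):
# the empirical coverage of these intervals on the test data should be close to 1 - α.
covered  = (ŷnew .- q̂ .<= ynew) .& (ynew .<= ŷnew .+ q̂)
coverage = sum(covered) / length(covered)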

Some comments (to consider maybe at some point):

  1. It may be useful to allow the user to specify their own score() function as input (see the sketch just below)?
  2. Maybe allow alternative cross-validation techniques (like in my JK+ code above)?
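
For item 1, a purely hypothetical sketch of what I have in mind (not the package's current API; it reuses the fitted machine and data from the code above, and the ŷ ± q̂ interval form only makes sense for symmetric absolute-residual-type scores):

# hypothetical: user-supplied nonconformity score, defaulting to the absolute residual
function naive_pi(mach, Xcal, ycal, Xnew; α = 0.05, score = (ŷ, y) -> abs(ŷ - y))
    ŷcal = MLJ.predict(mach, Xcal)
    s    = sort(score.(ŷcal, ycal), rev = true)         # non-conformity scores on the calibration set
    n    = length(s)
    p̂    = clamp(ceil((n + 1) * (1 - α)) / n, 0.0, 1.0)
    q̂    = quantile(s, p̂)
    ŷnew = MLJ.predict(mach, Xnew)
    [("lower" => ŷ - q̂, "upper" => ŷ + q̂) for ŷ in ŷnew]
end

naive_pi(mach, Xcal, ycal, Xnew)   # same as the manual steps above, with the default score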

PS: before "scope creep" gets me too carried away, my goals are to implement:
- Jackknife/JK+ methods
- the DCP methods in this PNAS paper
- possibly compare with probabilistic prediction methods such as NGBoost.py
- hopefully the package becomes easy enough to use that other authors can add their own methods...

pat-alt (Member) commented Oct 15, 2022

Thanks @azev77

You're right, that was a misnomer. What I had originally referred to as NaiveConformalRegressor is in fact a simple inductive conformal regressor (also referred to as a split conformal regressor, because it relies on splitting the training data). Jackknife and related methods do not use data splitting and therefore fall into the category of what this tutorial refers to as transductive CP.

I've changed the code base (see #10) to incorporate that broad distinction: models of type InductiveConformalModel <: ConformalModel rely on a separate calibration step involving a calibration set, while models of type TransductiveConformalModel <: ConformalModel only need to be fitted on training data. I've made that distinction explicit for now.

I've also implemented Jackknife now (see #15). Adding other approaches is very straightforward from here. Authors/contributors just need to subtype the relevant ConformalModel (either inductive or transductive) and dispatch the scoring logic. If you want to have a look at the way I've now implemented Jackknife [here], you should see what I have in mind in terms of extensibility.
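
Schematically, the extension point looks something like this (just a sketch of the idea, not the literal code in the package):

# schematic sketch, not the package's literal code
abstract type ConformalModel end
abstract type InductiveConformalModel    <: ConformalModel end   # needs a separate calibration set
abstract type TransductiveConformalModel <: ConformalModel end   # fitted on training data only

# a contributor adds a method by subtyping and dispatching the scoring logic, e.g.
struct MySplitConformalRegressor <: InductiveConformalModel
    mach   # a fitted MLJ machine
end

# non-conformity scores on the calibration set (absolute residuals here)
scores(cm::MySplitConformalRegressor, Xcal, ycal) = abs.(MLJ.predict(cm.mach, Xcal) .- ycal)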

I'll close this issue and related branch now, but feel free to continue discussion here.

pat-alt closed this as completed Oct 15, 2022
pat-alt (Member) commented Oct 19, 2022

Added multiple other approaches in #18 @azev77:

Regression:

  • Inductive
  • Naive Transductive
  • Jackknife
  • Jackknife+
  • Jackknife-minmax
  • CV+
  • CV-minmax

Classification:

  • Inductive (LABEL)
  • Adaptive Inductive

azev77 (Collaborator, Author) commented Oct 19, 2022

Hi @pat-alt, I was actually hoping to add JK+ etc. myself. I'm traveling right now (in Vienna), but I want to look into the code when I'm back...

pat-alt (Member) commented Oct 19, 2022

Great - it will be good to have a second pair of eyes glance over it 🙏 safe travels!

azev77 (Collaborator, Author) commented Oct 25, 2022

I have a weird question about prediction interval terminology.

Are there other approaches to PIs besides conformal & JK+? @ryantibs @pat-alt @ablaom

What about packages such as ngboost.py, which return an entire predictive distribution? If we compute the PIs implied by that predictive distribution, what are those PIs called?
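
Just to make that concrete, a toy example of what I mean (assuming the model returns, say, a Normal predictive distribution per point):

using Distributions
d = Normal(2.0, 0.5)                               # a hypothetical predictive distribution for one point
α = 0.10
lb, ub = quantile(d, α/2), quantile(d, 1 - α/2)    # the implied central 90% interval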

I probably should've asked this question sooner, but is there any advantage to naming this package PredictionIntervals.jl and making it flexible enough to incorporate a variety of methods (including methods that haven't been thought of yet)?

pat-alt (Member) commented Oct 27, 2022

There are many other ways to compute prediction intervals, but I think we should limit our focus here to CP. I also don't think I want to change the name of this package. One reason is that it also implements conformal classification, and those predictions are set-valued. That being said, PredictionIntervals.jl sounds like a good name for an umbrella-type package designed to generate prediction intervals (with different approaches beyond CP) for any supervised model.
