Enable model selection for first stage models #808
Conversation
Signed-off-by: AnthonyCampbell208 <[email protected]> Co-authored-by: ShrutiRM97 <[email protected]> Co-authored-by: CooperGibbs <[email protected]>
Signed-off-by: AnthonyCampbell208 <[email protected]>
…d notebook to showcase some of the functionality Signed-off-by: AnthonyCampbell208 <[email protected]>
Signed-off-by: Keith Battocchi <[email protected]>
fit_with_groups(self._model, self._combine(X, W, Target.shape[0]), Target, groups=groups)
return self

class _FirstStageWrapper:
    def __init__(self, model, discrete_target):
I like the refactoring. One consequence, though, is that we can no longer emit model-specific (Y vs. T) warnings (as the binary outcomes PR tries to do in this class).
Let's think about the cleanest way to do that as part of your PR. One thing to keep in mind is that we also use this wrapper for some Z models, so it's not just T vs. Y; that was what we assumed when we first implemented this logic, back when we only had DML and not the IV classes.
class _FirstStageWrapper:
    def __init__(self, model, is_Y, featurizer, linear_first_stages, discrete_treatment):
Is linear_first_stages used anymore?
No; linear_first_stages has always been an awkward hack, and this gives us a good opportunity to get rid of it. Given

T = f(X, W) + ε
Y = θ(X)·T + g(X, W) + η, where θ(X) = ⟨α, φ(X)⟩,

if both f and g are linear, then because T is multiplied by θ(X) in the equation for Y, E[Y | X, W] is not actually a linear function of X and W; we need to include interaction terms. We introduced linear_first_stages because our default y and t models were linear, and we wanted to make sure we could estimate the correct model when f and g were linear. Now that we do model selection, the y and t models are not (necessarily) linear, so the flag makes much less sense.
There's also a potentially big performance cost if users accidentally leave linear_first_stages set to True, because we can end up generating a huge number of interaction terms, and we see this happen fairly often in practice.
Additionally, linear models aren't the only case where specifying the right f and g as your t- and y-models fails to give the correct estimate: if f and g are both degree-n polynomials, you'd need to interact the degree-n interaction terms of [X; W] with the featurized X, for example, and we've never supported that. Users can always supply a more complicated pipelined model that does something like that themselves if they'd like, but I'd argue it doesn't make sense to try to solve it on our end.
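To make the point above concrete, here is a small simulated sketch (all variable names and coefficients are my own illustration, not from the PR): with linear f and g, E[Y | X, W] contains θ(X)·f(X, W), so a purely linear regression of Y on [X; W] is misspecified until the interactions of φ(X) with [X; W] are added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
W = rng.normal(size=n)

# Linear nuisances f and g, and a CATE theta(X) = <alpha, phi(X)> with phi(X) = X
f = X + 0.5 * W             # E[T | X, W]
g = -0.3 * X + 2.0 * W      # baseline outcome term
theta = 0.8 * X

T = f + rng.normal(size=n)
Y = theta * T + g + rng.normal(size=n)

# E[Y | X, W] contains theta(X) * f(X, W) = 0.8*X*(X + 0.5*W), i.e. quadratic
# and interaction terms, so regressing Y linearly on [1, X, W] is misspecified.
A = np.column_stack([np.ones(n), X, W])
rss_linear = np.sum((Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]) ** 2)

# Adding the interactions of phi(X) = X with [X; W] fixes the specification.
B = np.column_stack([A, X * X, X * W])
rss_interacted = np.sum((Y - B @ np.linalg.lstsq(B, Y, rcond=None)[0]) ** 2)

print(rss_interacted < rss_linear)  # the interacted fit is substantially better
```

This is also why leaving the flag on can be expensive: the interacted design matrix grows quickly with the number of columns in [X; W].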
class ModelSelector(metaclass=abc.ABCMeta):
    """
    This class enables a two-stage fitting process, where first a model is selected
    by calling `train` with `is_selecting=True`, and then the selected model is fit (presumably
I am wondering why we opt for a 'train' method over a 'fit' method
At first I used fit, but a distinct method name makes the error messages much easier to understand if a selector is being passed where an estimator is expected, or vice-versa.
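A minimal sketch of the two-stage interface being discussed (the `PolyDegreeSelector` subclass and everything in it are my own hypothetical example, not EconML's implementation):

```python
import abc
import numpy as np

class ModelSelector(metaclass=abc.ABCMeta):
    """Two-stage interface: train(is_selecting=True, ...) picks a model,
    then train(is_selecting=False, ...) fits the chosen model, possibly
    on different data than was used for selection."""

    @abc.abstractmethod
    def train(self, is_selecting, X, y):
        ...

class PolyDegreeSelector(ModelSelector):
    """Hypothetical selector: picks a polynomial degree by holdout error."""

    def __init__(self, degrees=(1, 2, 3)):
        self.degrees = degrees

    def train(self, is_selecting, X, y):
        if is_selecting:
            # selection pass: score each candidate degree on a holdout half
            half = len(X) // 2
            errors = []
            for d in self.degrees:
                coef = np.polyfit(X[:half], y[:half], d)
                errors.append(np.mean((np.polyval(coef, X[half:]) - y[half:]) ** 2))
            self.degree_ = self.degrees[int(np.argmin(errors))]
        else:
            # fitting pass: fit only the selected model on the data passed here
            self.coef_ = np.polyfit(X, y, self.degree_)
        return self

    def predict(self, X):
        return np.polyval(self.coef_, X)
```

Because the method is `train` rather than `fit`, passing a selector where a plain estimator is expected fails fast with a clear "no attribute 'fit'" error, which is the point made above.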
def _fit_with_groups(model, X, y, *, groups, **kwargs): |
marking for discussion
@@ -716,13 +741,19 @@ class LinearDML(StatsModelsCateEstimatorMixin, DML):
    def __init__(self, *,
                 model_y='auto', model_t='auto',
We should probably update our docstrings to say that users can pass something like 'forest' to model_y, right?
Yes, I think we should probably create a whole new documentation section on how to specify model selectors, because there are a number of possibilities now. But I was planning on deferring that until after this PR.
@@ -202,13 +204,17 @@ def predict(self, X, y, W=None):
    """
    model_list = []

kwargs = filter_none_kwargs(**kwargs)
model.train(True, *args, **kwargs)
I think in some earlier conversations we were thinking about giving users the option to do "dirty crossfitting", i.e. picking a good estimator using all the data before crossfitting. Am I correct in my understanding that this PR just does "dirty crossfitting" by default?
Yes, and that's definitely something we could consider making easier for users.
It's possible, though not straightforward, to do non-dirty crossfitting now by wrapping a CV estimator in a FixedModelSelector, which always uses the estimator as-is for both selecting and fitting. However, there are changes we could make to do this more efficiently: in that case the selecting step is unnecessary, so we could simply skip it.
I'd propose tabling that for now and implementing that as one of several future enhancements to the model selection logic.
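For readers following along, here is a sketch of the wrapping idea just described. `FixedModelSelector` here is my own illustration of the concept, not necessarily EconML's actual class: the selecting pass is a no-op, and fitting always uses the wrapped estimator as-is.

```python
import numpy as np

class FixedModelSelector:
    """Sketch of the wrapper idea (not EconML's implementation): the
    'selecting' pass does nothing, and fitting uses the wrapped estimator
    as-is, so a CV estimator reruns its own selection inside each fold."""

    def __init__(self, model):
        self.model = model

    def train(self, is_selecting, X, y):
        if is_selecting:
            # Nothing to select -- the model is fixed, so this pass could be
            # skipped entirely, which is the efficiency point made above.
            return self
        self.model.fit(X, y)
        return self

# Minimal stand-in estimator to show the flow; any sklearn-style CV
# estimator (e.g. LassoCV) would slot in the same way.
class MeanModel:
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

selector = FixedModelSelector(MeanModel())
selector.train(True, None, None)                    # selection pass: no-op
selector.train(False, None, np.array([1.0, 3.0]))   # fitting pass
print(selector.model.mean_)  # 2.0
```

With a CV estimator wrapped this way, each crossfitting fold reruns the inner cross-validation on that fold's training data, which is exactly the "non-dirty" behavior, at the cost of repeating the selection work per fold.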
class _FirstStageWrapper:
    def __init__(self, model, is_Y, featurizer, linear_first_stages, discrete_treatment):
Also, do we no longer featurize our first stage models?
We only ever used featurization in the first stage models to support linear_first_stages, since then we needed to interact phi(X) with [X;W].
Logically, the featurization is only about the final model; the first stage models just get the raw X and W and can fit whatever kind of model they'd like on them, including one that pipelines featurization with any other logic.
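As a hedged illustration of that last point (the data and names here are my own, and sklearn is just one way to build such a pipeline), a first-stage model can do its own featurization internally:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A first-stage model that featurizes internally: it receives the raw
# [X; W] columns and expands them before the linear fit, so the wrapper
# itself needs no featurizer support.
first_stage = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

rng = np.random.default_rng(0)
XW = rng.normal(size=(300, 2))
# Treatment depends nonlinearly on the raw inputs
t = XW[:, 0] * XW[:, 1] + rng.normal(scale=0.1, size=300)

first_stage.fit(XW, t)
print(first_stage.score(XW, t) > 0.9)  # the pipeline captures the interaction
```

Such a pipeline would then be supplied as the t-model (or y-model), while the estimator's own featurizer argument continues to apply only to the final-stage φ(X).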
General comment, not just about this PR but about our process of adding new features: I feel like a demo notebook (even if brief) would help users get up to speed with the new features we add. Maybe we can add demo notebooks before moving from a beta release to an official release?
No description provided.