DoWhy Logistic Regression with Stats Api #296

Open · cleli94 opened this issue Jul 12, 2021 · 1 comment

cleli94 commented Jul 12, 2021

Dear authors,

I am using dowhy for a project, and it is a GREAT tool!

I was comparing the results obtained with the backdoor method using logistic regression via the statsmodels API, as you suggest, against a method I built from scratch with scikit-learn. The results were very different, and mine seemed the more plausible. Moreover, if I am not mistaken, the result should be the same as an S-Learner with logistic regression: mine matched it, while the statsmodels-based estimate was very different.

I think there could be an issue with the GLM methods: when you call .predict on a statsmodels GLM, you do not obtain the class prediction (i.e., 0 or 1) but the probability, whereas scikit-learn's .predict returns the class directly:

  • model_sklearn.predict_proba(X)[:,1] == model_statsmodel.predict(X)
  • model_sklearn.predict(X) == (model_statsmodel.predict(X) > 0.5).astype(int)
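
For concreteness, here is a minimal sketch of that difference on toy data (my own construction, not DoWhy code; the near-equality assumes scikit-learn's regularization is made negligible so both fits approach the same maximum-likelihood solution):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Toy binary-outcome data, just to exercise both APIs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# scikit-learn: .predict returns the 0/1 class, .predict_proba the probability.
# C is set large so regularization is negligible and the fits match.
model_sklearn = LogisticRegression(C=1e6).fit(X, y)

# statsmodels: .predict on a fitted GLM returns the probability P(Y=1 | X).
model_statsmodel = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()

proba_sk = model_sklearn.predict_proba(X)[:, 1]
proba_sm = model_statsmodel.predict(sm.add_constant(X))

print(np.allclose(proba_sk, proba_sm, atol=1e-3))                        # ~True
print((model_sklearn.predict(X) == (proba_sm > 0.5).astype(int)).all())  # True, barring points right at 0.5
```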

So, is it true that you're actually using .predict, which returns the probabilities? In that case, why do you use the probabilities to compute the ATE instead of the class predictions?

Thank you very much in advance!

amit-sharma (Member) commented
For most cases, probabilities are the correct output for computing the causal effect on a binary outcome. The expression is

E[Y|do(T=1)] - E[Y|do(T=0)] = P[Y=1|do(T=1)] - P[Y=1|do(T=0)]

so it makes sense to use the probabilities.

To see an extreme example, suppose T and Y are both binary and there are no confounders. The true generating equation for Y is y = Bernoulli(sigmoid(t*beta + N(0, 0.01))) with beta = 0, so the causal effect of T on Y is zero. A simulation sketch follows the list below.

  • Using logistic regression with the score/probability as the output, the estimated P(Y=1|T=1) and P(Y=1|T=0) will be nearly the same, and the causal estimate will be zero.
  • Using the 0/1 class as the output, the causal estimate can be 1, which is incorrect. This happens whenever one of the estimated P(Y=1|T=1) and P(Y=1|T=0) is below 0.5 and the other is above it: all inputs with T=1 are then predicted as 1, and all inputs with T=0 as 0.
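
Here is a minimal simulation sketch of that scenario (my own code, using statsmodels directly rather than DoWhy; N(0, 0.01) is read as Gaussian noise with variance 0.01, i.e., standard deviation 0.1):

```python
import numpy as np
import statsmodels.api as sm

# Simulate the extreme example: binary T, beta = 0, so the true effect is zero.
rng = np.random.default_rng(1)
n = 10_000
t = rng.integers(0, 2, size=n).astype(float)
noise = rng.normal(0.0, 0.1, size=n)
p_true = 1.0 / (1.0 + np.exp(-(t * 0.0 + noise)))  # sigmoid(t*beta + noise), beta = 0
y = rng.binomial(1, p_true)

# Fit a logistic regression of Y on T (with intercept).
glm = sm.GLM(y, np.column_stack([np.ones(n), t]), family=sm.families.Binomial()).fit()

# Predict for everyone under do(T=1) and do(T=0).
p1 = glm.predict(np.column_stack([np.ones(n), np.ones(n)]))
p0 = glm.predict(np.column_stack([np.ones(n), np.zeros(n)]))

ate_proba = (p1 - p0).mean()  # probability-based estimate: ~0, as it should be
ate_class = ((p1 > 0.5).astype(int) - (p0 > 0.5).astype(int)).mean()
# class-based estimate: -1, 0, or +1, depending on which side of 0.5
# the two (nearly identical) estimated probabilities happen to land
print(ate_proba, ate_class)
```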

Still, it can be useful to have the flexibility to output the class prediction directly, e.g., for comparison with a default logistic metalearner. I've opened PR #386 to add a predict_score argument to the GLM estimator. It can be specified in the method_params of estimate_effect and is True by default.
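
A sketch of how that could look once the PR is merged (the method name and glm_family usage are assumed from DoWhy's existing GLM estimator; `model` and `identified_estimand` come from the usual CausalModel workflow, and the exact keyword handling may differ from the final PR):

```python
import statsmodels.api as sm

# Hypothetical usage of the predict_score argument from PR #386 (sketch only).
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.generalized_linear_model",
    method_params={
        "glm_family": sm.families.Binomial(),  # logistic regression
        "predict_score": False,  # return 0/1 classes instead of probabilities
    },
)
print(estimate.value)
```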
