Randomness in the binning : Getting Different Bins each time #314
See: #310 (comment)
Hi @guillermo-navas-palencia, I'm using the 'cart' method (the same as you suggested in the comment). I thought the subsample default value issue affected only sklearn.preprocessing.KBinsDiscretizer. Does it affect the cart method too? I specified cart in the binning_fit_params parameter of BinningProcess(). Thanks :-)
Hi @priyankamishra31. I was able to replicate this behaviour, thanks for providing the dataset. Findings:
- This error disappears when using 'mip' as the solver, so it looks like a solver issue (well, not necessarily, read below). However, in terms of IV, CP-SAT returns the same value, i.e., the difference is below the solver's tolerance of 1e-6, so in that sense both solutions are equally valid. In other words, there are multiple optimal solutions. Tested with ortools version 9.9.3963 (latest version).
- I understand that from a modelling perspective this is an issue. I will fix the random_seed parameter to enforce reproducibility.
- Lastly, it is worth noting that the Google OR-Tools team does not guarantee the same solution across versions.
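The "equally valid" claim above can be made concrete: two runs count as equally optimal whenever their IVs differ by less than the solver tolerance. A minimal stdlib sketch (the IV values here are invented for illustration, not taken from the dataset):

```python
import math

SOLVER_TOL = 1e-6  # CP-SAT objective tolerance mentioned above

# Hypothetical IVs for the same variable from two different runs
iv_run_1 = 0.4321907
iv_run_2 = 0.4321911

# Both bin sets are "equally optimal" if the IV gap is below tolerance:
# the solver has no reason to prefer one over the other.
equally_optimal = math.isclose(iv_run_1, iv_run_2, abs_tol=SOLVER_TOL)
print(equally_optimal)  # True: the gap (4e-7) is below 1e-6
```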
Thanks @guillermo-navas-palencia, I really appreciate you looking into this. Is there anything I could do from my side (or a workaround) if I still want to use the 'cp' solver and get a consistent result? That would help me until the next version of this package. Thanks again :-)
Unfortunately, I don't think so. If your target is binary and you increase min_prebin_size a bit, the MIP solver should be only slightly slower than CP. In general, keeping a reasonable min_prebin_size (i.e., 0.025 - 0.05) will reduce the number of equally optimal solutions. If I find the time, I will also experiment with other MIP solvers already supported by ortools (HiGHS and SCIP). Another comment about the CP solver: please bear in mind that the CP solver works with integer values, so optbinning rounds to integer after scaling (x 1e6), which incurs rounding errors if the x values are tiny. For reference: https://github.com/guillermo-navas-palencia/optbinning/blob/master/optbinning/binning/cp.py#L53
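The scale-and-round step described above can be illustrated with plain Python. The 1e6 factor matches the linked cp.py line; the helper name and the sample values are made up for this sketch:

```python
SCALE = 1e6  # optbinning scales by 1e6 before rounding (see cp.py#L53)

def to_cp_int(x: float) -> int:
    """Scale and round a float into the integer coefficient a CP solver needs."""
    return round(x * SCALE)

x_normal = 0.0123  # well above the resolution of the scaling
x_tiny = 2.4e-7    # below the resolution: rounds away to nothing

print(to_cp_int(x_normal))  # 12300 -- the value survives intact
print(to_cp_int(x_tiny))    # 0 -- the value vanishes entirely after rounding
```

This is why tiny x values incur rounding errors: anything below about 5e-7 collapses to the same integer, so the CP model cannot tell such values apart.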
This issue is linked to #299.
(Sorry, I didn't find the option to reopen the issue, probably because I'm not a collaborator.)
Hi @guillermo-navas-palencia,
I'm using optbinning.BinningProcess() for automatic binning of around 100-200 features, and have noticed a difference in the bins obtained for some variables on each run. It doesn't affect all the bins, but it's still frequent enough to be a concern.
There is randomness in the binning, even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for the features.)
The dataset used was from Kaggle, linked below:
https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv
I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV of the exported results over email, since I don't see an option to attach them here.)
Binning Process:
```python
binning_process = BinningProcess(variable_names=variable_names,
                                categorical_variables=categorical_variables,
                                min_prebin_size=0.01,
                                **binning_fit_params[0])
binning_process.fit(X_train, y_train, w_train)
```
And these are the binning results from running the Binning Process 3 times, without changing anything.
For example, if you compare the files binning_result.csv and binning_result_2.csv, you'll see the difference in bins for var_14 and var_15:
[screenshot: bin tables for var_14 and var_15 differ between binning_result.csv and binning_result_2.csv]
Similarly, on comparing the 3 files, I got the following differences:
[screenshot: summary of the variables whose bins differ across the three result files]
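A quick way to spot which variables changed between runs is to diff the exported CSVs programmatically. A stdlib sketch; the column names and inline file contents below are hypothetical stand-ins for the real binning_result*.csv files:

```python
import csv
import io

# In practice these would be open("binning_result.csv") and
# open("binning_result_2.csv"); inline content keeps the sketch self-contained.
run_1 = io.StringIO("name,n_bins,iv\nvar_14,5,0.021\nvar_15,4,0.013\n")
run_2 = io.StringIO("name,n_bins,iv\nvar_14,6,0.021\nvar_15,4,0.013\n")

def by_name(f):
    """Index CSV rows by the variable name column."""
    return {row["name"]: row for row in csv.DictReader(f)}

a, b = by_name(run_1), by_name(run_2)

# Variables whose row differs in any column between the two runs
changed = sorted(name for name in a if a[name] != b.get(name))
print(changed)  # ['var_14'] -- only var_14 got a different binning
```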
I'm using optbinning==0.18.0
Can we prevent this from happening and ensure we get the same consistent bins each time?
I hope this helps. I'm also sharing the Jupyter notebook (with output cells) over email for more context.
Thanks for your help with this!