_And is there not also the case where we play and--make up the rules as we go along?_

_-Ludwig Wittgenstein_
![the duck-rabbit](https://github.com/imoscovitz/wittgenstein/blob/master/duck-rabbit.jpg)
## Summary
This package implements two iterative coverage-based ruleset algorithms: IREP and RIPPERk.
Performance is similar to sklearn's DecisionTree CART implementation (see [Performance Tests](https://github.com/imoscovitz/ruleset/blob/master/Performance%20Tests.ipynb)).
For an explanation of the algorithms, see my article in _Towards Data Science_ or the papers listed below under [Useful References](https://github.com/imoscovitz/wittgenstein#useful-references).
## Installation
To install, use

    $ pip install wittgenstein

To uninstall, use

    $ pip uninstall wittgenstein
## Requirements
- pandas
- numpy
- python version>=3.6
## Usage
### Training
Usage syntax is similar to sklearn's.
Once you have loaded and split your data...

    >>> import pandas as pd
    >>> df = pd.read_csv(dataset_filename)
    >>> from sklearn.model_selection import train_test_split # Or any other mechanism you want to use for data partitioning
    >>> train, test = train_test_split(df, test_size=.33)
We can fit a ruleset classifier using RIPPER or IREP.

    >>> import wittgenstein as lw
    >>> ripper_clf = lw.RIPPER() # Or irep_clf = lw.IREP() to build a model using IREP
    >>> ripper_clf.fit(train, class_feat='Party') # Or pass X and y data to .fit
    >>> ripper_clf
    <RIPPER with fit ruleset (k=2, prune_size=0.33, dl_allowance=64)> # Hyperparameter details available in the docstrings and TDS article below
Access the underlying trained model with the .ruleset_ attribute, or output it with .out_model(). A ruleset is a disjunction of conjunctions: 'V' represents 'or'; '^' represents 'and'.
In other words, the model predicts the positive class if, for at least one rule, all of that rule's conditions are true:

    >>> ripper_clf.ruleset_
    <Ruleset [physician-fee-freeze=n] V [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>
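To make the disjunction-of-conjunctions reading concrete, here is a minimal sketch in plain Python. It does not use the package's own classes: the `ruleset` list and the `rule_covers` and `predict_one` helpers are hypothetical names for illustration, and the two voters are made-up examples checked against the ruleset shown above.

```python
# Hypothetical stand-in for a fitted ruleset: a list of rules (joined by OR),
# each rule a dict of feature -> required value (joined by AND).
ruleset = [
    {"physician-fee-freeze": "n"},
    {
        "synfuels-corporation-cutback": "y",
        "adoption-of-the-budget-resolution": "y",
        "anti-satellite-test-ban": "n",
    },
]


def rule_covers(rule, example):
    """Return True if every condition in the rule holds for the example."""
    return all(example.get(feat) == val for feat, val in rule.items())


def predict_one(ruleset, example):
    """Predict positive if at least one rule covers the example."""
    return any(rule_covers(rule, example) for rule in ruleset)


# Made-up examples
voter_a = {"physician-fee-freeze": "n", "synfuels-corporation-cutback": "n"}
voter_b = {
    "physician-fee-freeze": "y",
    "synfuels-corporation-cutback": "y",
    "adoption-of-the-budget-resolution": "y",
    "anti-satellite-test-ban": "y",
}

print(predict_one(ruleset, voter_a))  # True: the first rule is satisfied
print(predict_one(ruleset, voter_b))  # False: no rule has all conditions met
```

An empty ruleset never predicts positive under this reading, since `any` over no rules is `False`.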
### Scoring
To score our fitted model:

    >>> X_test = test.drop('Party', axis=1)
    >>> y_test = test['Party']
    >>> ripper_clf.score(X_test, y_test)
The default scoring metric is accuracy. You can pass in alternate scoring functions, including those available through sklearn:

    >>> from sklearn.metrics import precision_score, recall_score
    >>> precision = ripper_clf.score(X_test, y_test, precision_score)
    >>> recall = ripper_clf.score(X_test, y_test, recall_score)
    >>> print(f'precision: {precision}, recall: {recall}')
    precision: 0.9914..., recall: 0.9953...
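For reference, accuracy, precision, and recall differ only in which prediction outcomes they count. The sketch below computes all three by hand in plain Python. The labels are made up for illustration and are not output from the model above, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    """Share of positive predictions that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)


def recall(y_true, y_pred):
    """Share of truly positive examples that were predicted positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


# Made-up labels, purely for illustration
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, True, False, False, False, True, True, True]

print(accuracy(y_true, y_pred))   # 0.625
print(precision(y_true, y_pred))  # 0.6
print(recall(y_true, y_pred))     # 0.75
```

This sketch has no guard against empty denominators, so it assumes at least one positive prediction and at least one positive label.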
### Model selection
wittgenstein classifiers are also compatible with sklearn model_selection tools such as cross_val_score and GridSearchCV, as well as ensemblers like StackingClassifier.

Cross validation:

    >>> from sklearn.model_selection import cross_val_score
    >>> X_train = train.drop('Party', axis=1)
    >>> y_train = train['Party']
    >>> # First dummify your categorical features to make sklearn happy
    >>> X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes('object').columns)
    >>> y_train = y_train.map(lambda x: 1 if x=='democrat' else 0)
    >>> cross_val_score(ripper_clf, X_train, y_train)
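Grid search:

    >>> from sklearn.model_selection import GridSearchCV
    >>> param_grid = {"prune_size": [0.33, 0.5], "k": [1, 2]}
    >>> grid = GridSearchCV(estimator=ripper_clf, param_grid=param_grid)
    >>> grid.fit(X_train, y_train)

Ensemble:

    >>> from sklearn.ensemble import StackingClassifier
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> tree = DecisionTreeClassifier(random_state=42)
    >>> estimators = [("rip", ripper_clf), ("tree", tree)]
    >>> ensemble_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
    >>> ensemble_clf.fit(X_train, y_train)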
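A grid search tries every combination of the values listed in param_grid, so the grid in the example above yields four candidate settings. The sketch below enumerates them with the standard library only. `expand_grid` is a hypothetical helper written for illustration, not part of sklearn or wittgenstein, and the order in which sklearn visits candidates may differ.

```python
from itertools import product

# Same grid as in the grid search example above
param_grid = {"prune_size": [0.33, 0.5], "k": [1, 2]}


def expand_grid(grid):
    """Return one dict per combination of parameter values in the grid."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
for params in candidates:
    print(params)

print(len(candidates))  # 4
```

Each extra parameter value multiplies the number of candidates, and each candidate is fit once per cross-validation fold, so it helps to keep grids small when training is slow.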
### Prediction
To perform predictions:

    >>> ripper_clf.predict(new_data)[:5]
    [True, True, False, True, False]
Predict class probabilities:

    >>> ripper_clf.predict_proba(test)
    # Pairs of negative and positive class probabilities
    array([[0.01212121, 0.98787879],
           [0.01212121, 0.98787879],
           [0.77777778, 0.22222222],
           [0.2       , 0.8       ],
           ...
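Each row is a (negative, positive) pair that sums to 1, so the second column can be read as the positive-class probability. The sketch below turns the four rows shown above into boolean labels by thresholding that column. The 0.5 default threshold and the `to_labels` helper are assumptions made for illustration, not a description of how the package derives its predictions.

```python
import math

# The four rows shown above: [P(negative), P(positive)]
probas = [
    [0.01212121, 0.98787879],
    [0.01212121, 0.98787879],
    [0.77777778, 0.22222222],
    [0.2, 0.8],
]


def to_labels(probas, threshold=0.5):
    """Label an example positive when its positive-class probability meets the threshold."""
    return [pos >= threshold for _neg, pos in probas]


# Sanity check: each pair of probabilities sums to 1 (up to rounding)
rows_sum_to_one = all(math.isclose(neg + pos, 1.0, abs_tol=1e-6) for neg, pos in probas)

print(rows_sum_to_one)          # True
print(to_labels(probas))        # [True, True, False, True]
print(to_labels(probas, 0.9))   # [True, True, False, False]
```

Raising the threshold trades recall for precision: fewer examples are labeled positive, and only those the model is more confident about.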
We can also ask the model to explain each positive prediction it made:

    >>> ripper_clf.predict(new_data[:5], give_reasons=True)
    ([True, True, False, True, True]
    [<Rule [physician-fee-freeze=n]>],
    [<Rule [physician-fee-freeze=n]>, <Rule [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>], # This example met multiple sufficient conditions for a positive prediction
    [<Rule object: [physician-fee-freeze=n]>],
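The idea behind give_reasons can be sketched in plain Python: alongside each prediction, collect every rule that covers the example. This is an illustration under assumed data structures (rules as dicts, a shortened made-up ruleset, and a hypothetical `predict_with_reasons` helper), not the package's implementation.

```python
def predict_with_reasons(ruleset, examples):
    """Return (predictions, reasons), where reasons[i] lists the rules covering example i."""
    reasons = [
        [
            rule
            for rule in ruleset
            if all(example.get(feat) == val for feat, val in rule.items())
        ]
        for example in examples
    ]
    # An example is predicted positive when at least one rule covers it
    predictions = [len(covering) > 0 for covering in reasons]
    return predictions, reasons


# Shortened, made-up ruleset and examples for illustration
ruleset = [
    {"physician-fee-freeze": "n"},
    {"synfuels-corporation-cutback": "y", "anti-satellite-test-ban": "n"},
]
examples = [
    {"physician-fee-freeze": "n", "synfuels-corporation-cutback": "n", "anti-satellite-test-ban": "y"},
    {"physician-fee-freeze": "n", "synfuels-corporation-cutback": "y", "anti-satellite-test-ban": "n"},
    {"physician-fee-freeze": "y", "synfuels-corporation-cutback": "y", "anti-satellite-test-ban": "y"},
]

predictions, reasons = predict_with_reasons(ruleset, examples)
print(predictions)                      # [True, True, False]
print([len(r) for r in reasons])        # [1, 2, 0]
```

In this sketch a negative prediction always comes with an empty reasons list, and an example that satisfies several rules lists all of them.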
## Changelog
#### v0.2.1: 5/19/2020
- Binning bugfix and optimization

#### v0.2.0: 5/4/2020
- Algorithmic optimizations to improve training speed (~10x - ~100x)
- Support for training on iterable datatypes besides DataFrames, such as numpy arrays and python lists
- Compatibility with sklearn ensembling metalearners and sklearn model_selection
- .predict_proba returns probas in neg, pos order
- Certain parameters (hyperparameters, random_state, etc.) should now be passed into IREP/RIPPER constructors rather than the .fit method.
- Sundry bugfixes
