make leaves to be placeholders when not enough samples to fill them (… #299

Closed

@StepanTita StepanTita commented Jun 6, 2023

Fixes the bug where graphviz errors out because a leaf image file is not found, as reported in #298.

Code to reproduce:

import sys
import pandas as pd
import numpy as np

import dtreeviz
import graphviz

from sklearn.model_selection import train_test_split

import xgboost as xgb

np.random.seed(42)

dataset_url = "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/titanic/titanic.csv"
data = pd.read_csv(dataset_url, index_col=0)

data['Age'] = data['Age'].fillna(data['Age'].median())

cat_features = ['Sex', 'Embarked']

X, y = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']], data['Survived']

X = pd.get_dummies(X, columns=cat_features)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

params = {'max_depth':10, 'eta':0.05, 'objective':'binary:logistic', 'subsample':1}
model_xgb = xgb.XGBClassifier(**params, random_state=42)

model_xgb.fit(X_train, y_train)

# works fine: the tree is visualized with the full training set
viz_model = dtreeviz.model(
    model_xgb, tree_index=5,
    X_train=X_train, y_train=y_train,
    feature_names=list(X_train.columns),
    target_name='Survived', class_names=['perish', 'survive']
)

viz_model.view(fancy=False)

X_sample = X_test.sample(5)
y_sample = y_test.loc[X_sample.index]

viz_model = dtreeviz.model(
    model_xgb, tree_index=4,
    X_train=X_sample, y_train=y_sample,
    feature_names=list(X_sample.columns),
    target_name='Survived', class_names=['perish', 'survive']
)

# fails with a 5-row sample: graphviz cannot find the SVG files for some leaves
viz_model.view(fancy=False)

Error message:

CalledProcessError: Command '['dot', '-Tsvg', '-o', '/tmp/DTreeViz_720.svg', '/tmp/DTreeViz_720']' returned non-zero exit status 1. [stderr: b'Warning: No such file or directory while opening /tmp/leaf33_720.svg\nError: No or improper image file="/tmp/leaf33_720.svg"\nin label of node leaf33\nWarning: No such file or directory while opening /tmp/leaf23_720.svg\nError: No or improper image file="/tmp/leaf23_720.svg"\nin label of node leaf23\nWarning: No such file or directory while opening /tmp/leaf39_720.svg\nError: No or improper image file="/tmp/leaf39_720.svg"\nin label of node leaf39\nWarning: No such file or directory while opening /tmp/leaf49_720.svg\nError: No or improper image file="/tmp/leaf49_720.svg"\nin label of node leaf49\nWarning: No such file or directory while opening /tmp/leaf42_720.svg\nError: No or improper image file="/tmp/leaf42_720.svg"\nin label of node leaf42\nWarning: No such file or directory while opening /tmp/leaf43_720.svg\nError: No or improper image file="/tmp/leaf43_720.svg"\nin label of node leaf43\nWarning: No such file or directory while opening /tmp/leaf53_720.svg\nError: No or improper image file="/tmp/leaf53_720.svg"\nin label of node leaf53\nWarning: No such file or directory while opening /tmp/leaf46_720.svg\nError: No or improper image file="/tmp/leaf46_720.svg"\nin label of node leaf46\n']

Colab example of failing: https://colab.research.google.com/drive/1TTX4m7H-S1y5BMqKy_YcWzmlaqJjYkn9?usp=sharing

Colab example of working after the fix: https://colab.research.google.com/drive/1xxPYYAKNwvkcF4Yj6cLK6fUJGxz-W0j6?usp=sharing

Rendered tree after fix:
Screenshot 2023-06-06 at 19 17 40

It might be worth adding some kind of warning message, but I couldn't find anything like that elsewhere in the package, so I decided not to add one myself.
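For discussion, here is a rough sketch of what such a warning could look like. The function name `warn_on_empty_leaves` and the `leaf_sample_counts` mapping are hypothetical and are not part of dtreeviz; the sketch only assumes that the number of samples reaching each leaf is known before rendering.

```python
import warnings


def warn_on_empty_leaves(leaf_sample_counts):
    """Warn about leaves that receive no samples and return their ids.

    leaf_sample_counts is a hypothetical dict mapping a leaf node id to the
    number of rows from the supplied dataset that end up in that leaf.
    """
    empty = sorted(node_id for node_id, n in leaf_sample_counts.items() if n == 0)
    if empty:
        warnings.warn(
            f"{len(empty)} leaf node(s) received no samples and will be "
            f"drawn as placeholders: {empty}",
            UserWarning,
            stacklevel=2,
        )
    return empty


# Example: leaves 23 and 33 get no rows from a small sample.
empty_leaves = warn_on_empty_leaves({23: 0, 33: 0, 40: 3, 41: 2})
```

Returning the ids as well as warning would let the caller decide which leaves to draw as placeholders.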

…fixes the bug of graphviz erroring out due to file not found)
@tlapusan
Collaborator

tlapusan commented Jun 7, 2023

Thanks @StepanTita for this PR. I managed to reproduce it.
It was a little confusing at first because the tree structures were different, but that was because the two calls use different tree index values.

Do we still want to display the nodes/leaves that receive no samples from the new dataset (the ones drawn as simple ovals)?

I think we should also fix it for regression trees, right?

@StepanTita
Author

Well, I believe we still need to draw the empty nodes because they are part of the tree structure. Otherwise there would just be blank space, right?
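To illustrate the idea, here is a minimal sketch of emitting a placeholder node in DOT when a leaf has no samples. This is not the code from this PR or from dtreeviz: `leaf_node_dot` and its parameters are hypothetical, and the label layout is only an assumption based on the error message above.

```python
def leaf_node_dot(node_id, n_samples, image_path=None):
    """Build a DOT node statement for one leaf (hypothetical helper).

    A leaf with no samples has no generated image, so it is drawn as a
    plain empty ellipse instead of referencing a file that does not exist.
    """
    name = f"leaf{node_id}"
    if n_samples == 0 or image_path is None:
        # Placeholder: keeps the node in the tree without needing an image.
        return f'{name} [shape=ellipse, label=""]'
    # Normal case: embed the per-leaf image through an HTML-like label.
    return (
        f'{name} [shape=box, margin=0, label=<<table border="0">'
        f'<tr><td><img src="{image_path}"/></td></tr></table>>]'
    )


print(leaf_node_dot(33, 0))
print(leaf_node_dot(40, 3, "/tmp/leaf40.svg"))
```

Because the placeholder never references an image file, graphviz has nothing to look up for that node.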

Regarding regression trees, I tried to reproduce the issue there and then double-checked the code. It is not a problem in that case:

y = y[samples]

This would just be an empty array, which later gives nan for the mean, but an empty plot is still drawn and no error is thrown.
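A small standalone check of that behavior, using made-up values rather than the Titanic data (the names `leaf_y` and `leaf_mean` are only for illustration):

```python
import warnings

import numpy as np

y = np.array([3.0, 1.5, 4.2, 0.7])
samples = np.array([], dtype=int)  # no rows from the dataset reach this leaf

leaf_y = y[samples]  # indexing with an empty index array gives an empty array

# NumPy emits a RuntimeWarning for the mean of an empty array and returns nan;
# the warning is silenced here only to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    leaf_mean = np.mean(leaf_y)

print(leaf_y.size, leaf_mean)
```

So the regression path degrades to nan instead of raising, which matches what I saw.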

@tlapusan
Collaborator

@StepanTita sorry for the late response :)

Indeed, I think leaving them as empty nodes is a good option.

I also checked how the tree looks with fancy=True (and with a dataset other than the training set), and it raises an exception when a split node has no data.

@parrt
Owner

parrt commented Sep 23, 2023

looks like a merge conflict?

@parrt
Owner

parrt commented Sep 24, 2023

Close in favor of #307

@parrt parrt closed this Sep 24, 2023