so.Line appears to be plotting some line segments out of order for Numpy arrays #3059

Closed
joshua-ebner opened this issue Oct 7, 2022 · 1 comment · Fixed by #3064


@joshua-ebner

I have been attempting to plot ROC curves with so.Line, and unless I'm missing something, it appears that it's plotting some line segments out of order.

It appears that these problems arise when two consecutive datapoints have the same x value but different y values.
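One way to confirm that ties are involved is to list the positions where consecutive points share an x value but differ in y. This is only a minimal sketch: the `fpr` and `tpr` arrays are small hypothetical stand-ins, not the actual `roc_curve` output from this report.

```python
import numpy as np

# Hypothetical stand-ins for the false/true positive rates from roc_curve
fpr = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 0.3, 1.0])
tpr = np.array([0.0, 0.5, 0.5, 0.7, 0.7, 0.7, 0.9, 1.0])

# Index i is reported when points i and i+1 have the same x but a different y
tie_starts = np.flatnonzero((np.diff(fpr) == 0) & (np.diff(tpr) != 0))
print(tie_starts)
```

If the misdrawn segments line up with the reported indices, that points to tie handling rather than to the data itself.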

(Note: I'm using NumPy data. Perhaps so.Plot() is not yet intended to be used with NumPy data?)

First, here's the code to create the data that I've noticed the problem with.

# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

import seaborn.objects as so
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio

# GENERATE DATASET WITH 2 CLASSES
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)

# SPLIT DATA INTO TRAIN/TEST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# FIT MODEL
my_logistic_reg = LogisticRegression()
my_logistic_reg.fit(X_train, y_train)

# predict probabilities
probabilities_logistic_reg = my_logistic_reg.predict_proba(X_test)

# keep probabilities for the positive outcome only
probabilities_logistic_posclass = probabilities_logistic_reg[:, 1]

# calculate roc curves
falseposrate_logistic, trueposrate_logistic, _ = roc_curve(y_test, probabilities_logistic_posclass)

Next, we plot the data as a line chart, with the Seaborn Objects interface:

# PLOT: SEABORN OBJECTS
# - the line segments between the 11th, 12th, 13th, and 14th datapoints seem to be wrong
(so.Plot()
   .add(so.Line(color = 'red'),x =falseposrate_logistic, y = trueposrate_logistic)
   .add(so.Dot(color = 'red', pointsize = 4),x = falseposrate_logistic, y = trueposrate_logistic)
   .layout(size = (10,7))
)

OUT:

[Image: seaborn-objects-lineplot_ISSUE-PLOT]

You'll notice that there's a zig-zag pattern from the 11th to the 14th points and in a few other places. This appears to be incorrect.

We'd expect it to have a stair-step pattern, like this (plotted with Matplotlib):

plt.figure(figsize = (10,7))
plt.plot(falseposrate_logistic, trueposrate_logistic)

OUT:

[Image: matplotlib-ROC_no-issue]

Additionally, if we subset the data down to try to isolate the issue in the 11th to the 14th points, you'll see that the original mis-plotting resolves, and a new issue appears between points 7 and 10:

# NOW, PLOT A SUBSET
# – with a different subset, 
#   the original issue disappears
#   but a new issue arises between points 7 and 10
subset_size = 17
true_positive_subset = trueposrate_logistic[0:subset_size]
false_positive_subset = falseposrate_logistic[0:subset_size]

(so.Plot()
   .add(so.Line(color = 'red'),x =false_positive_subset , y = true_positive_subset)
   .add(so.Dot(color = 'red'),x =false_positive_subset , y = true_positive_subset)
 )

OUT:

[Image: seaborn-objects-lineplot_ISSUE-PLOT_points-7-to-10]

Unless I'm missing something obvious, it appears to be a bug in the order in which so.Line draws line segments between points.

Additionally, I'll note that if you try to plot the data with Plotly express, or sns.lineplot, the plots look fine, like the Matplotlib plot.

# PLOT: PLOTLY
# works fine!
# pio.renderers.default = 'svg'
px.line(x = falseposrate_logistic, y = trueposrate_logistic)

# PLOT: TRADITIONAL SEABORN
# works fine!
plt.figure(figsize = (10,7))
sns.lineplot(x = falseposrate_logistic
             ,y = trueposrate_logistic
             ,estimator =  None
             )

@mwaskom
Owner

mwaskom commented Oct 7, 2022

Hi, thanks for raising this and for the reproducible example.

What's happening here is that Line sorts observations according to the x (or actually, orient) variable. The return values from roc_curve are already sorted (by definition), but the weird behavior you're seeing happens on ties, i.e., where the FPR is identical but the TPRs differ.

I think a reasonable action item here is for Line to use mergesort, rather than the default quicksort, which is unstable.
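The stability point can be illustrated directly in NumPy. This is a minimal sketch with hypothetical data, not seaborn's internal code: a stable sort (`kind="stable"` or `kind="mergesort"`) keeps tied x values in their original relative order, while the default quicksort makes no such guarantee.

```python
import numpy as np

# Hypothetical ROC-like data: x is already sorted but contains ties
x = np.array([0.0, 0.0, 0.1, 0.1, 0.1, 0.2, 0.2, 0.3])
y = np.array([0.0, 0.4, 0.4, 0.6, 0.7, 0.7, 0.9, 1.0])

# A stable sort leaves tied x values in their original relative order,
# so sorting already-sorted data returns the identity permutation
stable_order = np.argsort(x, kind="stable")
y_after_stable_sort = y[stable_order]

stable_preserves_input = np.array_equal(stable_order, np.arange(len(x)))
print(stable_preserves_input)
```

On small arrays the default sort may happen to preserve tie order as well, so an unstable result is not guaranteed to show up in every example.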

In the meantime, you could use Path, which does not sort but is fine for something like an ROC curve where the data is already sorted. BTW you don't need to pass data multiple times to Plot.add (and you can use a second Dot layer, but you don't need to):

(
    so.Plot(x=falseposrate_logistic, y=trueposrate_logistic)
    .add(so.Path(marker="o", pointsize=2))
)
