scatterplot throws an error when referencing int column by name #2279

Closed
bdpedigo opened this issue Sep 14, 2020 · 3 comments · Fixed by #2386


@bdpedigo

Summary

scatterplot throws an error when referencing a column by name if the column name is an int.

In seaborn 0.10 and lower, I was able to pass column names to scatterplot that were integers (e.g. scatterplot(data=data, x=0, y=1)). After updating to 0.11, I'm getting the error below.

Apologies if already posted but I did not see anything about this in the issue tracker. Thanks for the awesome package!

Environment

seaborn version: 0.11.0 (with 0.10.1, the code below will run fine)
matplotlib version: 3.1.3
matplotlib backend: TkAgg

Code to reproduce

import seaborn as sns
import matplotlib as mpl

print(f"seaborn version: {sns.__version__}")
print(f"matplotlib version: {mpl.__version__}")
print(f"matplotlib backend: {mpl.get_backend()}")

iris_data = sns.load_dataset("iris")

# this one works
sns.scatterplot(data=iris_data, x="sepal_length", y="sepal_width")

# make a new column called int(0)
iris_data[0] = iris_data["sepal_length"].copy()

# this one breaks (used to work in seaborn <0.11)
sns.scatterplot(data=iris_data, x=0, y="sepal_width")

Output

seaborn version: 0.11.0
matplotlib version: 3.1.3
matplotlib backend: TkAgg
Traceback (most recent call last):
  File "/Users/bpedigo/JHU_code/graspy_misc/graspy-misc/localization/sns_bug.py", line 16, in <module>
    sns.scatterplot(data=iris_data, x=0, y="sepal_width")
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_decorators.py", line 46, in inner_f
    return f(**kwargs)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/relational.py", line 798, in scatterplot
    alpha=alpha, x_jitter=x_jitter, y_jitter=y_jitter, legend=legend,
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/relational.py", line 580, in __init__
    super().__init__(data=data, variables=variables)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 604, in __init__
    self.assign_variables(data, variables)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 668, in assign_variables
    data, **variables,
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 895, in _assign_variables_longform
    if val is not None and len(data) != len(val):
TypeError: object of type 'int' has no len()
@mwaskom
Owner

mwaskom commented Sep 15, 2020

Hm I think this is the same story as #2263: the refactoring/standardization of the input data processing appears to have broken a couple of styles of data specification that are in a sort of gray zone where they weren't quite documented as part of the API but did work given how the old code functioned.

As in my answer to #2263, it may be possible to rework the new code to handle this case, although it will make things more complicated/unpredictable than considering x/y/etc as keys if they're strings and data otherwise.
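To make the trade-off concrete, here is a rough sketch in plain Python. It is not seaborn's actual code: the `table` dict and both helper functions are hypothetical stand-ins. The first helper follows the "strings are keys, everything else is data" rule and fails on an int in the same way as the traceback above; the second tries any hashable value as a key first.

```python
# Hypothetical stand-in for a DataFrame: column name -> list of values.
table = {"sepal_width": [3.5, 3.0, 3.2], 0: [5.1, 4.9, 4.7]}


def n_rows(data):
    """Number of rows, taken from the first column."""
    return len(next(iter(data.values())))


def resolve_strings_only(data, val):
    """Hypothetical rule: a string is a key into `data`, anything else is data."""
    if isinstance(val, str):
        return data[val]
    # An int has no len(), so this line raises TypeError for val=0.
    if len(val) != n_rows(data):
        raise ValueError("vector length does not match data")
    return list(val)


def resolve_any_key(data, val):
    """Hypothetical alternative: try any hashable value as a key first."""
    try:
        if val in data:
            return data[val]
    except TypeError:
        # Unhashable values (e.g. lists) cannot be keys, so treat them as data.
        pass
    if len(val) != n_rows(data):
        raise ValueError("vector length does not match data")
    return list(val)
```

With the second helper, `resolve_any_key(table, 0)` returns the column named `0`, but the meaning of a given value now depends on which keys happen to exist in `data`, which is the kind of unpredictability mentioned above.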

How important are numeric names to your use case?

@bdpedigo
Author

> Hm I think this is the same story as #2263: the refactoring/standardization of the input data processing appears to have broken a couple of styles of data specification that are in a sort of gray zone

Interesting!

> they weren't quite documented as part of the API but did work given how the old code functioned.

Just a note on my confusion with the API here: the current docs just say "x, y : vectors or keys in data", which made me think that any valid key works, not just string keys. If you end up not wanting to support arbitrary key types, I think this is worth clarifying.

> How important are numeric names to your use case?

Personally I find this really nice because I can do the following:

import seaborn as sns
import numpy as np
import pandas as pd

array1 = np.random.normal(loc=0, size=(10, 2))
array2 = np.random.normal(loc=1, size=(10, 2))
array = np.concatenate((array1, array2))
plot_df = pd.DataFrame(data=array)
plot_df["labels"] = [0] * 10 + [1] * 10
sns.scatterplot(data=plot_df, x=0, y=1, hue="labels")

It is convenient because my data columns often have no real meaning other than "first dimension", "second dimension", etc. (which is often the case when I'm just starting from an array), and pandas defaults to integer column names if you don't pass any in. Obviously it's not essential (I can work around it by making the column names strings), but it is super convenient! I imagine it could matter even more if you were plotting a dataframe that was constructed by a pivot, as the column names could easily end up being ints or any other datatype.
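For reference, the string-name workaround mentioned above might look like the sketch below. The variable name `plot_df_str` and the random seed are illustrative choices, not anything from the original report; `DataFrame.rename(columns=str)` applies `str` to every column label.

```python
import numpy as np
import pandas as pd

# Build a frame from a bare array, so pandas assigns integer column names.
rng = np.random.default_rng(0)
array = rng.normal(size=(20, 2))
plot_df = pd.DataFrame(data=array)
plot_df["labels"] = [0] * 10 + [1] * 10

# Convert every column label to a string: 0 -> "0", 1 -> "1".
plot_df_str = plot_df.rename(columns=str)

# The string names can then be passed as keys, e.g.:
# sns.scatterplot(data=plot_df_str, x="0", y="1", hue="labels")
```

This leaves `plot_df` untouched and only changes the labels on the copy used for plotting.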

> it may be possible to rework the new code to handle this case, although it will make things more complicated/unpredictable than considering x/y/etc as keys if they're strings and data otherwise.

If you end up deciding to support this, let me know if I can help in some way!

@mwaskom
Owner

mwaskom commented Sep 19, 2020

> Just a note on my confusion with the API here: the current docs just say "x, y : vectors or keys in data", which made me think that any valid key works, not just string keys.

Right, the ambiguity of "key" there is why I say it's in a gray area. I meant that it's not documented in the sense that none of the API examples (to my knowledge) use non-string values for keyed data.

I have personally always considered integer pandas labels to be a bit of an antipattern, because it feels like it muddies the distinction between positional and label-based indexing. But that may be a personal preference.
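A small pandas illustration of that positional-versus-label ambiguity, using a made-up two-column frame (the data and variable names are only for demonstration):

```python
import pandas as pd

# Integer column labels that do not match the column positions.
df = pd.DataFrame({1: ["a", "b"], 0: ["c", "d"]})

# df[0] is a label lookup: it selects the column *named* 0,
# which is the second column positionally.
by_label = df[0].tolist()

# df.iloc[:, 0] is positional: it selects the first column,
# which is the one *named* 1.
by_position = df.iloc[:, 0].tolist()
```

Here `by_label` and `by_position` differ even though both were requested with `0`.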

Additionally, seaborn can interpret scalars as data, e.g.

sns.scatterplot(x=0, y=np.arange(10))

For now you could do

sns.scatterplot(x=plot_df[0], y=plot_df[1], hue=plot_df["labels"])
