scatterplot throws an error when referencing int column by name #2279

bdpedigo · 2020-09-14T19:56:07Z

Summary

scatterplot throws an error when trying to reference column by name if the column name is an int

In seaborn 0.10 and lower, I was able to pass column names to scatterplot that were integers (e.g. scatterplot(data=data, x=0, y=1)). After updating to 0.11, I'm getting the error below.

Apologies if already posted but I did not see anything about this in the issue tracker. Thanks for the awesome package!

Environment

seaborn version: 0.11.0 (with 0.10.1, the code below will run fine)
matplotlib version: 3.1.3
matplotlib backend: TkAgg

Code to reproduce

import seaborn as sns
import matplotlib as mpl

print(f"seaborn version: {sns.__version__}")
print(f"matplotlib version: {mpl.__version__}")
print(f"matplotlib backend: {mpl.get_backend()}")

iris_data = sns.load_dataset("iris")

# this one works
sns.scatterplot(data=iris_data, x="sepal_length", y="sepal_width")

# make a new column called int(0)
iris_data[0] = iris_data["sepal_length"].copy()

# this one breaks (used to work in seaborn <0.11)
sns.scatterplot(data=iris_data, x=0, y="sepal_width")

Output

seaborn version: 0.11.0
matplotlib version: 3.1.3
matplotlib backend: TkAgg
Traceback (most recent call last):
  File "/Users/bpedigo/JHU_code/graspy_misc/graspy-misc/localization/sns_bug.py", line 16, in <module>
    sns.scatterplot(data=iris_data, x=0, y="sepal_width")
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_decorators.py", line 46, in inner_f
    return f(**kwargs)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/relational.py", line 798, in scatterplot
    alpha=alpha, x_jitter=x_jitter, y_jitter=y_jitter, legend=legend,
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/relational.py", line 580, in __init__
    super().__init__(data=data, variables=variables)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 604, in __init__
    self.assign_variables(data, variables)
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 668, in assign_variables
    data, **variables,
  File "/Users/bpedigo/miniconda3/envs/graspy_misc/lib/python3.7/site-packages/seaborn/_core.py", line 895, in _assign_variables_longform
    if val is not None and len(data) != len(val):
TypeError: object of type 'int' has no len()


mwaskom · 2020-09-15T13:17:20Z

Hm I think this is the same story as #2263: the refactoring/standardization of the input data processing appears to have broken a couple of styles of data specification that are in a sort of gray zone where they weren't quite documented as part of the API but did work given how the old code functioned.

As in my answer to #2263, it may be possible to rework the new code to handle this case, although it will make things more complicated/unpredictable than considering x/y/etc as keys if they're strings and data otherwise.
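To make that rule concrete, here is a minimal sketch of "strings are keys, everything else is data". It is not seaborn's actual code: `resolve_variable` is a hypothetical helper written only for illustration, and the small DataFrame is made up.

```python
import pandas as pd


def resolve_variable(data, val):
    """Hypothetical helper: strings are looked up in `data`, anything else is used as data."""
    if isinstance(val, str):
        return data[val]
    return val


# Made-up frame with one string-named column and one integer-named column
df = pd.DataFrame({"a": [1, 2, 3], 0: [4, 5, 6]})

by_name = resolve_variable(df, "a")  # resolved to the column named "a"
as_data = resolve_variable(df, 0)    # passed through as the scalar 0, not column 0
```

Under a rule like this, `x=0` reaches any later length check as a bare integer, which is consistent with the `len(val)` TypeError in the traceback above.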

How important are numeric names to your use case?

bdpedigo · 2020-09-17T14:44:46Z

> Hm I think this is the same story as #2263: the refactoring/standardization of the input data processing appears to have broken a couple of styles of data specification that are in a sort of gray zone

Interesting!

> they weren't quite documented as part of the API but did work given how the old code functioned.

Just a note on my confusion with the API here: the current docs just say "x, y : vectors or keys in data", which made me think that any valid key works, not just string keys. If you end up not wanting to support arbitrary key types, I think this is worth clarifying.

> How important are numeric names to your use case?

Personally I find this really nice because I can do the following:

import seaborn as sns
import numpy as np
import pandas as pd

array1 = np.random.normal(loc=0, size=(10, 2))
array2 = np.random.normal(loc=1, size=(10, 2))
array = np.concatenate((array1, array2))
plot_df = pd.DataFrame(data=array)
plot_df["labels"] = [0] * 10 + [1] * 10
sns.scatterplot(data=plot_df, x=0, y=1, hue="labels")

It is convenient because my data columns often have no real meaning other than "first dimension", "second dimension", etc. (which is often the case when I'm just starting from an array), and the default in pandas is to make the column names integers if you don't pass in column names. Obviously it's not essential (I can work around it by making the column names strings) but it is super convenient! I imagine it could matter even more if you were plotting a dataframe that was constructed by a pivot, as the column names could easily end up being int or any other datatype.
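For reference, the string-conversion workaround mentioned above could look roughly like this. It is only a sketch: the random data and the name `str_df` are placeholders, and it assumes every column label converts cleanly with `str`.

```python
import numpy as np
import pandas as pd

# Placeholder data: 20 points in 2 dimensions, built without column names
rng = np.random.default_rng(0)
plot_df = pd.DataFrame(data=rng.normal(size=(20, 2)))  # columns are the integers 0 and 1
plot_df["labels"] = [0] * 10 + [1] * 10

# Convert every column label to a string; "labels" is left unchanged
str_df = plot_df.rename(columns=str)
```

The plot call then uses string keys: sns.scatterplot(data=str_df, x="0", y="1", hue="labels").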

> it may be possible to rework the new code to handle this case, although it will make things more complicated/unpredictable than considering x/y/etc as keys if they're strings and data otherwise.

If you end up deciding to support, let me know if I can try to help in some way!

mwaskom · 2020-09-19T14:48:16Z

> Just a note on my confusion with the API here - the current docs just say "x, y : vectors or keys in data" which made me think that any valid key works, not just string keys.

Right, the ambiguity of "key" there is why I say it's in a gray area. I meant that it's not documented in the sense that none of the API examples (to my knowledge) use non-string values for keyed data.

I have personally always considered integer pandas labels to be a bit of an antipattern, because it feels like it muddies the distinction between positional and label-based indexing. But that may be a personal preference.

Additionally, seaborn can interpret scalars as data, e.g.

sns.scatterplot(x=0, y=np.arange(10))

For now you could do

sns.scatterplot(x=plot_df[0], y=plot_df[1], hue=plot_df["labels"])

mwaskom added api mod:base labels Sep 19, 2020

mwaskom mentioned this issue Oct 22, 2020

relplot must assign a list or a column name in the "size" parameter #2330

Closed

mwaskom mentioned this issue Nov 13, 2020

A non-string or non-byte (e.g., int) pandas column name raises TypeError #2350

Closed

mwaskom added this to the v0.11.1 milestone Dec 1, 2020

mwaskom mentioned this issue Dec 16, 2020

Add/restore functionality to long-form data processing #2386

Merged

mwaskom closed this as completed in #2386 Dec 17, 2020
