Meaning of color argument in DataFrame.plot.scatter() #16485

Dr-Irv · 2017-05-24T19:58:35Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame.from_records([[1, 1, 'g'], [2, 2, 'r']], columns=['x', 'y', 'Color'])
df.plot.scatter(x='x', y='y', c='Color')

Problem description

The issue here is that it is not clear what the values in the column corresponding to the argument c of scatter should be. The example given at http://pandas.pydata.org/pandas-docs/stable/visualization.html#scatter-plot uses numerical values, but in this example, I just want red and green dots. With matplotlib, you can supply the colors as a vector.

IMHO, the API should be consistent. You should be able to specify the column names corresponding to the value of x, y, and the color. This would be especially useful if you have a pattern such as:

df[df.x>=1].plot.scatter(x='x', y='y', c='Color')

where you produce a scatter plot of selected rows.

The code in the simple example generates an error:

KeyError                                  Traceback (most recent call last)
C:\Anaconda3\envs\py36\lib\site-packages\matplotlib\colors.py in to_rgba(c, alpha)
    140     try:
--> 141         rgba = _colors_full_map.cache[c, alpha]
    142     except (KeyError, TypeError):  # Not in cache, or unhashable.

KeyError: ('o', None)

Expected Output

A plot with 2 points, one red and one green.
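For reference, the expected output can be reproduced by handing the color values to matplotlib as a sequence, which is the "colors as a vector" route mentioned above. This is only a sketch of a workaround, not a statement of how `DataFrame.plot.scatter` is meant to behave; the `Agg` backend line and the `>= 2` filter are assumptions added so the example runs headless and actually drops a row.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame.from_records(
    [[1, 1, "g"], [2, 2, "r"]], columns=["x", "y", "Color"]
)

# Pass the color values themselves rather than the column name.
fig, ax = plt.subplots()
points = ax.scatter(df["x"], df["y"], c=df["Color"].tolist())

# Same idea on a filtered frame: the colors are filtered along with the rows.
subset = df[df.x >= 2]
fig2, ax2 = plt.subplots()
subset_points = ax2.scatter(
    subset["x"], subset["y"], c=subset["Color"].tolist()
)
```

The drawback is the one the issue is about: the color column has to be spelled out separately from `x` and `y`, so the filtered frame must be named before it can be plotted.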

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.0.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


VincentLa · 2017-05-24T23:31:46Z

Oh interesting, I might try to see if I can pick this up

amichaut · 2017-06-26T21:24:45Z

Hi, I think my comment is related (tell me if I'm wrong). Up to version 0.19.2, the color argument of df.plot could be given as an RGBA tuple. Newer versions no longer accept it. Is that intentional?
In more detail, if I call, for instance:

df.plot(color=(0.5,0.5,0.5))
plt.show()

I get the following error in newer versions:

/usr/local/lib/python2.7/dist-packages/matplotlib/colors.pyc in _to_rgba_no_colorcycle(c, alpha)
    192         # float)` and `np.array(...).astype(float)` all convert "0.5" to 0.5.
    193         # Test dimensionality to reject single floats.
--> 194         raise ValueError("Invalid RGBA argument: {!r}".format(orig_c))
    195     # Return a tuple to prevent the cached value from being modified.
    196     c = tuple(c.astype(float))

ValueError: Invalid RGBA argument: 0.5

But it was supported before.
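One possible way around the tuple being misread is to convert it to a single hex string first. This is a sketch only: `to_hex` and `to_rgb` are real matplotlib helpers, but that a string color avoids the error on the affected 0.20.x releases is an assumption I have not verified, and the one-column frame is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_hex, to_rgb

# Hypothetical data, only here so there is something to plot.
df = pd.DataFrame({"a": [1, 2, 3]})

# A single string cannot be mistaken for a sequence of per-column colors.
gray_hex = to_hex((0.5, 0.5, 0.5))
ax = df.plot(color=gray_hex)
```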

jorisvandenbossche · 2017-06-27T07:40:15Z

@amichaut I think this is fixed on master, see #16695 (and PR #16701). This will probably be released in 0.20.3

scfrank · 2018-03-26T14:37:22Z

I just ran into the original bug (color names not being recognised or used, and a cryptic error message) and would like to resurrect this issue. It's frustrating because this kind of example should work according to a lot of Stack Overflow answers (e.g. https://stackoverflow.com/questions/41069676/make-scatter-plot-and-color-points-with-colors-stored-in-data-frame). For a new pandas user, this is going to cause a lot of confusion.

I've traced the problem (I think) to _compute_plot_data in plotting/_core.py:

pandas/pandas/plotting/_core.py, line 340 in 2d491c3:

def _compute_plot_data(self):

AFAICT this function throws out non-numeric columns - this includes the column containing the string color values, so after this function, the dataframe no longer contains the 'color' column.

The minimal example in the original issue still results in:

File "/venv3/lib/python3.6/site-packages/matplotlib/colors.py", line 166, in to_rgba
    rgba = _colors_full_map.cache[c, alpha]
KeyError: ('o', None)

which is again confusing, since one hasn't specified an 'o' color at all (I'm not sure where this default value comes from).
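The effect described above (the string column disappearing once only numeric data is kept) can be illustrated with public API. This is a sketch under an assumption: `select_dtypes` is used here as a stand-in for whatever numeric-only filtering `_compute_plot_data` performs internally, and it is not the code from `_core.py`.

```python
import pandas as pd

df = pd.DataFrame.from_records(
    [[1, 1, "g"], [2, 2, "r"]], columns=["x", "y", "Color"]
)

# Stand-in for the numeric-only filtering (assumption, not the pandas
# internals): keep only the numeric columns of the frame.
numeric_only = df.select_dtypes(include="number")

print(list(df.columns))            # ['x', 'y', 'Color']
print(list(numeric_only.columns))  # ['x', 'y']
```

After a step like this, a later lookup of the 'Color' column by name has nothing to find, which is consistent with the confusing error above.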

sorenwacker · 2022-09-26T22:36:01Z

I think it would make more sense if the API interpreted either the color or the c argument as a category/mappable that determines the coloring. Instead of adding a column with explicit colors, the column should contain either categories or numeric values, which are then used to color the markers, and a legend should be added, similar to how Seaborn does it with the hue argument.

Current behaviour:

import pandas as pd
df = pd.DataFrame({'dataX': [3,79,90], 'dataY': [7,9,13], 'color': ['Shoe', 'Star', 'Shoe']})
df.plot.scatter('dataX', 'dataY', c='color')
> ValueError: 'c' argument must be a color, a sequence of colors, or a sequence of numbers, not ['Shoe' 'Star' 'Shoe']

Should generate something like:

import seaborn as sns
sns.relplot(data=df, x='dataX', y='dataY', hue='color')

This would be way more practical than the current behaviour IMO.
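Without Seaborn, a similar result can be approximated today by grouping on the category column and drawing one scatter per group. This is a rough sketch of the idea, not the proposed API; the colors come from matplotlib's default color cycle, and the `Agg` backend line is an assumption so the example runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {"dataX": [3, 79, 90], "dataY": [7, 9, 13], "color": ["Shoe", "Star", "Shoe"]}
)

fig, ax = plt.subplots()
# One scatter call per category; each call takes the next color in the
# default cycle, and the label feeds the legend.
for category, group in df.groupby("color"):
    ax.scatter(group["dataX"], group["dataY"], label=category)

ax.set_xlabel("dataX")
ax.set_ylabel("dataY")
legend = ax.legend(title="color")
```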

michaelmannino · 2024-07-12T13:10:38Z

take

michaelmannino · 2024-07-12T23:20:42Z

Hi all who are still interested in this topic, I have completed the general functionality, and it is out in my PR if you would like to take a look.

My only question is: what is the best way to choose default colors for strings here? Currently, I pull the largest list of mpl's colors and choose from it at random, because iterating through it in order tends to pick colors that are too similar.
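One alternative to random picks, offered only as a sketch and not as what the PR does: sample a continuous colormap at evenly spaced positions, so that the distinct strings land as far apart as the colormap allows and the assignment is reproducible. The helper name `colors_for_categories` and the choice of `viridis` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def colors_for_categories(values, cmap_name="viridis"):
    """Map each distinct string to an evenly spaced color from a colormap.

    Hypothetical helper for illustration; not part of pandas or matplotlib.
    """
    categories = sorted(set(values))
    cmap = plt.get_cmap(cmap_name)
    # Evenly spaced positions in [0, 1], one per distinct category.
    positions = np.linspace(0.0, 1.0, len(categories))
    return {cat: cmap(pos) for cat, pos in zip(categories, positions)}


labels = ["Shoe", "Star", "Shoe"]
mapping = colors_for_categories(labels)

# Rows with the same string get the same RGBA tuple.
point_colors = [mapping[label] for label in labels]
```

Sorting the categories makes the mapping deterministic across runs, which random selection does not give you unless a seed is fixed.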

VincentLa mentioned this issue May 24, 2017

[Issue: 16485] Making Color Argument More Flexible #16490

Closed

4 tasks

TomAugspurger added Enhancement Visualization plotting labels May 25, 2017

TomAugspurger added this to the Next Major Release milestone May 25, 2017

TomAugspurger mentioned this issue Jul 5, 2017

Scatter plot with colour_by and size_by variables #16827

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

AaronStiff mentioned this issue Feb 23, 2023

Pandas backend color scatter not working plotly/plotly.py#3956

Closed

github-actions bot assigned michaelmannino Jul 12, 2024

michaelmannino mentioned this issue Jul 12, 2024

ENH: DataFrame.plot.scatter argument c now accepts a column of strings, where rows with the same string are colored identically #59239

Merged

6 tasks

WillAyd closed this as completed in #59239 Sep 3, 2024
