ENH/VIS: Pass DataFrame column to size argument in DataFrame.scatter #8885

onesandzeroes · 2014-11-24T10:47:40Z

Originally discussed in #8244. Currently this only uses the column from the dataframe if the s argument is a string. Supporting int column names is ambiguous and liable to break existing code, but if people can think of a good way to do this I'll try to add it.

Example output:

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdabcdabcd', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])

# Numeric column: linear scaling of point sizes
df.plot(kind='scatter', x='x', y='y', s='x')

# Categorical columns: fixed set of sizes
df.plot(kind='scatter', x='x', y='y', s='cat_column')

Point sizes are scaled into a default range of (50, 1000), which
can be adjusted with the size_range arg:

# Smaller range of sizes than default
df.plot(kind='scatter', x='x', y='y', s='x',
        size_range=(200, 800))

Picking good defaults for the range of sizes is pretty hard because these sizes are absolute: matplotlib
sets the size in terms of points. The ability to tweak the range up and down with the size_range arg is
probably going to be needed, but I'm open to suggestions on good defaults.
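The scaling described above could look roughly like the sketch below. This is not the PR's actual code: `scale_sizes` is a hypothetical helper, and it assumes min-max scaling for numeric columns, evenly spaced sizes for categorical columns, and no missing values.

```python
import numpy as np
import pandas as pd


def scale_sizes(col, size_range=(50, 1000)):
    """Hypothetical helper: map a column onto marker sizes within size_range."""
    min_size, max_size = size_range
    if isinstance(col.dtype, pd.CategoricalDtype):
        # One fixed size per category, evenly spaced across the range.
        n_categories = len(col.cat.categories)
        steps = np.linspace(min_size, max_size, n_categories)
        codes = col.cat.codes.to_numpy()  # assumes no missing values (code -1)
        return pd.Series(steps[codes], index=col.index)
    # Numeric column: min-max (linear) scaling onto the range.
    span = col.max() - col.min()
    if span == 0:
        # Constant column: avoid dividing by zero, use the midpoint size.
        return pd.Series((min_size + max_size) / 2.0, index=col.index)
    return min_size + (col - col.min()) / span * (max_size - min_size)
```

With this sketch, the smallest observed value always gets the minimum size and the largest gets the maximum size, whatever the data's units.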

jreback · 2014-11-24T12:40:12Z

Is s really the argument name passed through to matplotlib?

onesandzeroes · 2014-11-24T12:52:40Z

Yeah, that's what the argument is in pyplot. Other parts of the api might accept something different but pyplot doesn't even accept size as an alias.

shoyer · 2014-11-24T16:45:34Z

@onesandzeroes are you scaling the radius as proportional to value or the area as proportional to value? I think the latter is generally more sensible for numeric (non-negative) columns, but perhaps this should be another option...

onesandzeroes · 2014-11-24T22:36:07Z

The s argument to plt.scatter() changes the area, so that's what it's scaling on here. There's a bit of discussion on SO here.
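To make the area-versus-radius point concrete: matplotlib documents s as a marker size in points**2, so the marker's linear extent grows with the square root of s. A small sketch of that arithmetic, treating sqrt(s) as the nominal marker width (the variable names are illustrative only):

```python
import numpy as np

# s is given in points**2, so it behaves like an area.
s = np.array([100.0, 200.0, 400.0])

# Nominal marker width in points grows with the square root of s.
width = np.sqrt(s)

# Relative to the first marker: area doubles each step,
# while width only grows by a factor of sqrt(2).
area_ratio = s / s[0]
width_ratio = width / width[0]
```

So a value mapped to four times the size is drawn about twice as wide, not four times as wide.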

shoyer · 2014-11-24T22:43:04Z

@onesandzeroes Sounds good -- let's document that!

Also, I think the default scaling (at least for strictly positive variables) should probably not be offset by the minimum value, but be absolute (relative to some scale). Right now it looks like that would be hard to achieve.

jreback · 2014-11-24T23:05:45Z

pandas/tools/plotting.py

+        min_size, max_size = self.size_range
+        size_col = self.data[col_name]
+
+        if com.is_categorical_dtype(size_col):


maybe add a comment or 2 to explain what you are doing here

onesandzeroes · 2014-11-25T01:35:01Z

Also, I think the default scaling (at least for strictly positive variables) should probably not be offset by the minimum value, but be absolute (relative to some scale). Right now it looks like that would be hard to achieve.

@shoyer not sure if I understand what you mean here. So, for strictly positive values, have the size of points range between 0 and the max size, with the minimum data value falling at some point within that range? Like if we have data values of [50, 75, 100] and a max size of 200pt, the 50 val will be 100pt in size because it falls halfway between 0 and 100?

That probably works pretty well actually. I guess the reason I originally had a min size set was that I didn't want small values to completely disappear, but as long as the min size can still be bumped up to prevent that, should be OK. I'll try to test that out later today.

I think part of the reason I went with a minimum size, with the minimum observed value assigned that size, is because that's what ggplot seems to do, from what I can make of their source.

jreback · 2014-11-25T10:44:28Z

looks good. pls add a release note and squash.

@TomAugspurger ?

TomAugspurger · 2014-11-25T14:46:14Z

Re referencing int column names, what do people think about making a Column class (or maybe reusing Index) to disambiguate?

df.plot(..., s=pd.Column(0))

then we can pick out the column labeled 0 (which may not be the position 0 column). Doesn't need to go in this PR of course.
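A rough sketch of how such a wrapper might disambiguate labels from literal sizes. Nothing here exists in pandas: `Column` and `resolve_size` are hypothetical names used only for illustration.

```python
import pandas as pd


class Column:
    """Hypothetical wrapper marking a value as a column label (not a pandas API)."""

    def __init__(self, label):
        self.label = label


def resolve_size(df, s):
    """Return the column for a wrapped label, else pass s through unchanged."""
    if isinstance(s, Column):
        return df[s.label]
    return s


# The column labeled 0 sits in the second position here.
df = pd.DataFrame({1: [10, 20], 0: [30, 40]})
by_label = resolve_size(df, Column(0))  # looks up the label 0
literal = resolve_size(df, 0)           # a plain int stays a literal size
```

Because a bare int is never reinterpreted, existing calls that pass a numeric s would keep their current meaning under this approach.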

I'll look at the code tonight hopefully.

I think what @shoyer is saying is that for

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 's1': [10, 20], 's2': [100, 200]})

then df.plot(kind='scatter', x='A', y='B', s='s1'), and df.plot(kind='scatter', x='A', y='B', s='s2')

will look identical. I think this is ok since users can adjust size_range if they want them to look different.
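The "look identical" point can be checked with a small sketch, assuming the sizes come from min-max scaling onto size_range (`minmax_sizes` is a hypothetical stand-in, not the PR's code):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 's1': [10, 20], 's2': [100, 200]})


def minmax_sizes(col, size_range=(50, 1000)):
    """Hypothetical min-max scaling of a numeric column onto size_range."""
    min_size, max_size = size_range
    frac = (col - col.min()) / (col.max() - col.min())
    return min_size + frac * (max_size - min_size)


# Both columns span their own min..max, so they map to the same sizes.
sizes_s1 = minmax_sizes(df['s1'])
sizes_s2 = minmax_sizes(df['s2'])
```

Under that assumption both columns produce the sizes 50 and 1000, even though s2 is ten times larger than s1.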

shoyer · 2014-11-25T17:03:25Z

@onesandzeroes Yes, if we have data values of [50, 75, 100] and a max size of 200pt, they should have corresponding sizes of [100, 150, 200].
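A minimal sketch of this absolute scaling, assuming non-negative values and a hypothetical `absolute_sizes` helper that scales relative to the largest value rather than subtracting the minimum:

```python
import numpy as np


def absolute_sizes(values, max_size):
    """Scale non-negative values so the largest maps to max_size and zero maps to zero."""
    values = np.asarray(values, dtype=float)
    return values / values.max() * max_size


sizes = absolute_sizes([50, 75, 100], max_size=200)
# sizes -> array([100., 150., 200.])
```

Unlike min-max scaling, ratios between values are preserved here: 50 is half of 100, so it gets half the size.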

We could do a similar thing for colormaps as well, using a diverging or ascending colormap based on similar heuristics.

It's worth looking at how Seaborn handles this for heatmap. Actually, I would encourage you to consider getting some version of this patch into Seaborn as well, which could really use a scatterplot function: mwaskom/seaborn#315

jreback · 2014-12-05T13:50:41Z

@onesandzeroes pls add a release note, rebase and squash.

@shoyer @TomAugspurger @jorisvandenbossche ok on this?

jorisvandenbossche · 2014-12-05T14:10:09Z

@onesandzeroes Can you add to the test the cases where:

s is an int
s is a column (but not a string, the column itself as df['col'])
s is a numpy array (this last one broke for c; see PR #8929, "BUG: allow numpy.array as c values to scatterplot", which fixed it). I don't think it will be a problem in this case, as you check whether it is a string, but we should test it.
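The three cases listed above could be exercised against a dispatch along these lines. This is only a sketch of the "string means column name" rule described in the thread; `resolve_s` is a hypothetical function, not the code in pandas/tools/plotting.py.

```python
import numpy as np
import pandas as pd


def resolve_s(df, s):
    """Hypothetical dispatch: only a string naming a column is looked up."""
    # The isinstance check runs first, so arrays and Series are never
    # compared against the column index.
    if isinstance(s, str) and s in df.columns:
        return df[s]
    return s


df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [3.0, 2.0, 1.0]})

as_int = resolve_s(df, 20)                          # int: used as-is
col = df['x']
as_series = resolve_s(df, col)                      # Series: used as-is
arr = np.array([10, 20, 30])
as_array = resolve_s(df, arr)                       # numpy array: used as-is
as_name = resolve_s(df, 'x')                        # string: column lookup
```

Only the last call goes through the column lookup; the other three inputs are returned untouched.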

Further, I was wondering: it seems a bit strange / inconsistent that size_range is not applied when you provide a column / array, but only when s is a string. I know it is because you can't modify the values through the string directly, but still.

jreback · 2014-12-06T17:15:11Z

@onesandzeroes ? trying to get 0.15.2 finished....can you see comments above

shoyer · 2014-12-06T20:37:26Z

I think we should be sure we're happy with the size scaling behavior before merging this, so it may not make it into 0.15.2.

From his earlier comments, it sounds like @onesandzeroes was going to test out my proposal for scaling based on absolute values rather than subtracting the minimum value.

jreback · 2014-12-06T21:25:55Z

bumped

jreback · 2015-05-09T20:11:46Z

closing. if/when updated pls reopen

onesandzeroes added 8 commits November 24, 2014 20:37

Allow size scaling by passing column name

367b648

Allow categorical size column

aab72ba

Add to release notes

cc2415a

Add sizing examples to docs

af83de3

Only use s as column name if string

9a38571

Docs: only string column names supported

69ccae1

Add tests

a103bba

Combine paragraphs at end of scatter docs

a404ad8

jreback changed the title ~~ENH/VIS: Pass DataFrame column to size argument in df.scatter~~ ENH/VIS: Pass DataFrame column to size argument in DataFrame.scatter Nov 24, 2014

jreback added Enhancement Visualization labels Nov 24, 2014

jreback reviewed Nov 24, 2014

jreback added this to the 0.15.2 milestone Nov 24, 2014

jreback modified the milestones: 0.16.0, 0.15.2 Dec 6, 2014

jreback added this to the 0.16.1 milestone Mar 5, 2015

jreback removed this from the 0.16.0 milestone Mar 5, 2015

sinhrks mentioned this pull request Apr 28, 2015

TST: Split graphics_test to main and others #9813

Merged

jreback modified the milestones: 0.17.0, 0.16.1, Next Major Release Apr 29, 2015

jreback closed this May 9, 2015

jorisvandenbossche added the Closed PR label May 14, 2015
