ENH/VIS: Pass DataFrame column to size argument in DataFrame.scatter #8885

Closed
wants to merge 8 commits into from

onesandzeroes
Contributor

Originally discussed in #8244. Currently this only uses the column from the dataframe if the s argument is a string. Supporting int column names is ambiguous and liable to break existing code, since an int is already a valid marker size, but if people can think of a good way to do this I'll try to add it.

Example output:

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdabcdabcd', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])

# Numeric column: linear scaling of point sizes
df.plot(kind='scatter', x='x', y='y', s='x')

image

# Categorical columns: fixed set of sizes
df.plot(kind='scatter', x='x', y='y', s='cat_column')

image

Point sizes are scaled between a minimum and a maximum size. The default range is (50, 1000), and
it can be adjusted with the size_range arg:

# Smaller range of sizes than default
df.plot(kind='scatter', x='x', y='y', s='x',
        size_range=(200, 800))

image

Picking good defaults for the range of sizes is pretty hard, as these sizes are absolute: matplotlib
sets the size in terms of points. The ability to tweak up and down with the size_range arg is
probably going to be needed, but I'm open to suggestions on good defaults.
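For readers following along, the numeric scaling described above can be sketched roughly as below. This is only an illustration, not the code in the PR: `scale_sizes` is a hypothetical helper name, and the handling of a constant column (falling back to the midpoint size) is an assumption.

```python
import numpy as np


def scale_sizes(values, size_range=(50, 1000)):
    """Linearly map values onto [min_size, max_size] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    min_size, max_size = size_range
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Assumption: with no spread in the data, use the midpoint size.
        return np.full(values.shape, (min_size + max_size) / 2.0)
    # Smallest value gets min_size, largest gets max_size.
    return min_size + (values - lo) / (hi - lo) * (max_size - min_size)


x = np.linspace(0, 50, 6)
default_sizes = scale_sizes(x)                         # default (50, 1000)
narrow_sizes = scale_sizes(x, size_range=(200, 800))   # smaller range
```

The resulting array is what would end up being handed to matplotlib as `s`.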

@jreback jreback changed the title ENH/VIS: Pass DataFrame column to size argument in df.scatter ENH/VIS: Pass DataFrame column to size argument in DataFrame.scatter Nov 24, 2014
@jreback
Contributor

jreback commented Nov 24, 2014

is s really the argument name passed thru to matplotlib?

@onesandzeroes
Contributor Author

Yeah, that's what the argument is in pyplot. Other parts of the api might accept something different but pyplot doesn't even accept size as an alias.

@shoyer
Member

shoyer commented Nov 24, 2014

@onesandzeroes are you scaling the radius as proportional to value or the area as proportional to value? I think the latter is generally more sensible for numeric (non-negative) columns, but perhaps this should be another option...

@onesandzeroes
Contributor Author

The s argument to plt.scatter() changes the area, so that's what it's scaling on here. There's a bit of discussion on SO here.
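To make the area-versus-radius distinction concrete, here is a small sketch. matplotlib documents `s` as the marker size in points squared, so values passed straight through scale the area; to make the diameter track the value instead, the sizes have to grow with the square. The `base_area` constant is an arbitrary choice for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0])
base_area = 100.0  # arbitrary: marker area in points**2 for a value of 1.0

# Area proportional to value: doubling the value doubles the marker area.
area_sizes = base_area * values

# Diameter proportional to value: doubling the value quadruples the area.
diameter_sizes = base_area * values ** 2
```

Either array can be passed as `s` to `plt.scatter(x, y, s=...)`; the first corresponds to the behaviour described in this comment.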

@shoyer
Member

shoyer commented Nov 24, 2014

@onesandzeroes Sounds good -- let's document that!

Also, I think the default scaling (at least for strictly positive variables) should probably not be offset by the minimum value, but be absolute (relative to some scale). Right now it looks like that would be hard to achieve.

min_size, max_size = self.size_range
size_col = self.data[col_name]

if com.is_categorical_dtype(size_col):
Contributor

maybe add a comment or 2 to explain what you are doing here
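As a point of reference for what a commented version of the categorical branch might look like, here is a rough sketch. It is not the code under review: `categorical_sizes` is a hypothetical helper, and spacing the category sizes evenly across `size_range` is an assumption about what "fixed set of sizes" means.

```python
import numpy as np
import pandas as pd


def categorical_sizes(col, size_range=(50, 1000)):
    """Give each category one fixed size (hypothetical helper)."""
    cat = pd.Categorical(col)
    min_size, max_size = size_range
    # One size per category, evenly spaced across the range (assumption).
    levels = np.linspace(min_size, max_size, len(cat.categories))
    # cat.codes holds each row's category position, or -1 for missing.
    codes = cat.codes
    sizes = levels[codes]
    # Missing values have no category, so give them no size.
    sizes[codes == -1] = np.nan
    return sizes


sizes = categorical_sizes(['a', 'c', 'b', 'a'], size_range=(100, 300))
```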

@jreback jreback added this to the 0.15.2 milestone Nov 24, 2014
@onesandzeroes
Contributor Author

Also, I think the default scaling (at least for strictly positive variables) should probably not be offset by the minimum value, but be absolute (relative to some scale). Right now it looks like that would be hard to achieve.

@shoyer not sure if I understand what you mean here. So, for strictly positive values, have the size of points range between 0 and the max size, with the minimum data value falling at some point within that range? Like if we have data values of [50, 75, 100] and a max size of 200pt, the 50 val will be 100pt in size because it falls halfway between 0-100?

That probably works pretty well actually. I guess the reason I originally had a min size set was that I didn't want small values to completely disappear, but as long as the min size can still be bumped up to prevent that, should be OK. I'll try to test that out later today.

I think part of the reason I went with a minimum size, with the minimum observed value assigned that size, is because that's what ggplot seems to do, from what I can make of their source.

@jreback
Contributor

jreback commented Nov 25, 2014

looks good. pls add a release note and squash.

@TomAugspurger ?

@TomAugspurger
Contributor

Re referencing int column names, what do people think about making a Column class (or maybe reusing Index) to disambiguate?

df.plot(..., s=pd.Column(0))

then we can pick out the column labeled 0 (which may not be the position 0 column). Doesn't need to go in this PR of course.
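A minimal sketch of how such a wrapper could disambiguate is below. Note that `pd.Column` does not exist in pandas; the `Column` class and the `resolve_size` helper here are purely hypothetical stand-ins for the idea.

```python
from collections.abc import Hashable
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Column:
    """Hypothetical marker meaning 'look this label up in the frame'."""
    label: Hashable


def resolve_size(df, s):
    """Hypothetical helper: only a Column wrapper triggers a lookup."""
    if isinstance(s, Column):
        return df[s.label].to_numpy()
    # A bare int keeps its existing meaning: a scalar marker size.
    return s


# The column labeled 0 is deliberately not the first column.
df = pd.DataFrame({1: [10, 20], 0: [30, 40]})
by_label = resolve_size(df, Column(0))  # values of the column labeled 0
scalar = resolve_size(df, 0)            # still just the number 0
```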

I'll look at the code tonight hopefully.

I think what @shoyer is saying is that for

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 's1': [10, 20], 's2': [100, 200]})

then df.plot(kind='scatter', x='A', y='B', s='s1'), and df.plot(kind='scatter', x='A', y='B', s='s2')

will look identical. I think this is ok since users can adjust size_range if they want them to look different.
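A quick way to check this claim, assuming the sizes are a linear min-max mapping onto the default `size_range` of (50, 1000) (the `minmax` helper is hypothetical, not the PR's function):

```python
import numpy as np


def minmax(values, size_range=(50, 1000)):
    """Map the smallest value to min size and the largest to max size."""
    values = np.asarray(values, dtype=float)
    return np.interp(values, (values.min(), values.max()), size_range)


sizes_s1 = minmax([10, 20])
sizes_s2 = minmax([100, 200])
```

Both calls produce the same two sizes, since only each value's position within its own column's range matters.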

@shoyer
Member

shoyer commented Nov 25, 2014

@onesandzeroes Yes, if we have data values of [50, 75, 100] and a max size of 200pt, they should have corresponding sizes of [100, 150, 200].
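The proposal can be sketched as scaling relative to zero rather than to the observed minimum. `scale_absolute` is a hypothetical name, and rejecting negative values is an assumption based on the "strictly positive" qualifier in the discussion.

```python
import numpy as np


def scale_absolute(values, max_size=200.0):
    """Scale sizes in proportion to the value itself (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if (values < 0).any():
        # Assumption: this scheme is only meant for non-negative data.
        raise ValueError("absolute scaling expects non-negative values")
    vmax = values.max()
    if vmax == 0:
        return np.zeros_like(values)
    # The largest value gets max_size; a value of zero would get size zero.
    return values / vmax * max_size


sizes = scale_absolute([50, 75, 100], max_size=200)
```

With a minimum-offset scheme, the value 50 would instead land on the minimum size, whatever that is set to.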

We could do a similar thing for colormaps as well, using a diverging or ascending colormap based on similar heuristics.

It's worth looking at how Seaborn handles this for heatmap. Actually, I would encourage you to consider getting some version of this patch into Seaborn as well, which could really use a scatterplot function: mwaskom/seaborn#315

@jreback
Contributor

jreback commented Dec 5, 2014

@onesandzeroes pls add a release note, rebase and squash.

@shoyer @TomAugspurger @jorisvandenbossche ok on this?

@jorisvandenbossche
Member

@onesandzeroes Can you add to the test the cases where:

  • s is an int
  • s is a column (but not a string, the column itself as df['col'])
  • s is a numpy array (this last one broke for c, see PR BUG: allow numpy.array as c values to scatterplot #8929 that fixed it). I don't think in this case it will be a problem, as you check whether it is a string, but we should test it.
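The three cases could be exercised along the lines below. This is a sketch against a hypothetical `resolve_s` helper that mirrors the "only strings are column names" rule from the PR description; it leaves out the size_range scaling and does not use the real plotting code or test suite.

```python
import numpy as np
import pandas as pd


def resolve_s(df, s):
    """Hypothetical helper: only a string is looked up as a column name."""
    if isinstance(s, str):
        return df[s].to_numpy()
    return s


df = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                   'y': [3.0, 2.0, 1.0],
                   'w': [10.0, 20.0, 30.0]})
col = df['w']
arr = np.array([5.0, 6.0, 7.0])

assert resolve_s(df, 40) == 40      # int: passed through as a scalar size
assert resolve_s(df, col) is col    # Series: passed through untouched
assert resolve_s(df, arr) is arr    # numpy array: passed through untouched
assert np.array_equal(resolve_s(df, 'w'), [10.0, 20.0, 30.0])  # string: lookup
```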

Further, I was wondering: it seems a bit strange / inconsistent that size_range is applied only when you pass a string, not when you provide a column / array. I know it is because you can't modify the values through the string directly, but still.

@jreback
Contributor

jreback commented Dec 6, 2014

@onesandzeroes ? trying to get 0.15.2 finished....can you see comments above

@shoyer
Member

shoyer commented Dec 6, 2014

I think we should be sure we're happy with the size scaling behavior before merging this, so it may not make it into 0.15.2.

From his earlier comments, it sounds like @onesandzeroes was going to test out my proposal for scaling based on absolute values rather than subtracting the minimum value.

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 6, 2014
@jreback
Contributor

jreback commented Dec 6, 2014

bumped

@jreback jreback added this to the 0.16.1 milestone Mar 5, 2015
@jreback jreback removed this from the 0.16.0 milestone Mar 5, 2015
@jreback jreback modified the milestones: 0.17.0, 0.16.1, Next Major Release Apr 29, 2015
@jreback
Contributor

jreback commented May 9, 2015

closing; if/when updated, pls reopen


5 participants