ENH/VIS: Pass DataFrame column to size argument in DataFrame.scatter #8885
size
argument in df.scatter
is |
Yeah, that's what the argument is in pyplot. Other parts of the api might accept something different but pyplot doesn't even accept |
@onesandzeroes are you scaling the radius as proportional to value or the area as proportional to value? I think the latter is generally more sensible for numeric (non-negative) columns, but perhaps this should be another option...
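To make the radius-versus-area question concrete, here is a small sketch. It relies on one known fact, that matplotlib's `s` argument is the marker area in points squared, so passing values straight through gives area-proportional markers, while squaring them gives radius-proportional markers. The helper names and the `scale` factor are hypothetical and are not part of the PR.

```python
def area_proportional_sizes(values, scale=1.0):
    """Marker area grows linearly with the value (s is already an area)."""
    return [scale * v for v in values]


def radius_proportional_sizes(values, scale=1.0):
    """Marker radius grows linearly with the value, so area grows with value squared."""
    return [scale * v ** 2 for v in values]


values = [1, 2, 3]
print(area_proportional_sizes(values, scale=10))    # [10, 20, 30]
print(radius_proportional_sizes(values, scale=10))  # [10, 40, 90]
```

Either list could be passed as `s=` to `matplotlib.pyplot.scatter`. With radius scaling the largest point covers nine times the area of the smallest, which tends to exaggerate differences visually.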
The |
@onesandzeroes Sounds good -- let's document that! Also, I think the default scaling (at least for strictly positive variables) should probably not be offset by the minimum value, but be absolute (relative to some scale). Right now it looks like that would be hard to achieve.
min_size, max_size = self.size_range
size_col = self.data[col_name]

if com.is_categorical_dtype(size_col):
maybe add a comment or 2 to explain what you are doing here
@shoyer not sure if I understand what you mean here. So, for strictly positive values, have the size of points range between 0 and the max size, with the minimum data value falling at some point within that range? Like if we have data values of That probably works pretty well actually. I guess the reason I originally had a min size set was that I didn't want small values to completely disappear, but as long as the min size can still be bumped up to prevent that, should be OK. I'll try to test that out later today. I think part of the reason I went with a minimum size, with the minimum observed value assigned that size, is because that's what ggplot seems to do, from what I can make of their source. |
looks good. pls add a release note and squash. |
Re referencing int column names, what do people think about making a
then we can pick out the column labeled 0 (which may not be the position 0 column). Doesn't need to go in this PR of course. I'll look at the code tonight hopefully. I think what @shoyer is saying is that for df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 's1': [10, 20], 's2': [100, 200]}) then will look identical. I think this is ok since users can adjust size_range if they want them to look different. |
@onesandzeroes Yes, if we have data values of [50, 75, 100] and a max size of 200pt, they should have corresponding sizes of [100, 150, 200]. We could do a similar thing for colormaps as well, using a diverging or ascending colormap based on similar heuristics. It's worth looking at how Seaborn handles this for heatmap. Actually, I would encourage you to consider getting some version of this patch into Seaborn as well, which could really use a scatterplot function: mwaskom/seaborn#315 |
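A minimal sketch of the absolute scaling proposed here, assuming non-negative data: the largest value maps to the maximum size and every other value keeps its ratio to it, with no offset by the minimum. The function name `scale_absolute` and the error handling are hypothetical, not code from the PR.

```python
def scale_absolute(values, max_size=200.0):
    """Scale non-negative values so that size is proportional to value.

    The largest value maps to max_size. The minimum is not subtracted,
    so a value of zero would map to a size of zero.
    """
    largest = max(values)
    if largest <= 0:
        # Assumption: refuse all-zero or negative input rather than guess.
        raise ValueError("absolute scaling needs at least one positive value")
    return [max_size * v / largest for v in values]


print(scale_absolute([50, 75, 100], max_size=200))  # [100.0, 150.0, 200.0]
```

The output matches the numbers in the comment above. Small values shrink toward zero under this scheme, which is the concern raised earlier about points disappearing.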
@onesandzeroes pls add a release note, rebase and squash. @shoyer @TomAugspurger @jorisvandenbossche ok on this? |
@onesandzeroes Can you add to the test the cases where:
Further, I was wondering, it seems a bit strange / inconsistent that |
@onesandzeroes ? trying to get 0.15.2 finished....can you see comments above |
I think we should be sure we're happy with the size scaling behavior before merging this, so it may not make it into 0.15.2. From his earlier comments, it sounds like @onesandzeroes was going to test out my proposal for scaling based on absolute values rather than subtracting the minimum value. |
bumped |
closing if/when updated pls reopen |
Originally discussed in #8244. Currently this only uses the column from the dataframe if the s argument is a string. Supporting int column names is ambiguous and liable to break existing code, but if people can think of a good way to do this I'll try to add it.

Example output:

The minimum and maximum point sizes we scale to have a default range of (50, 1000), and can be adjusted with the size_range arg:

Picking good defaults for the range of sizes is pretty hard as these sizes are absolute: matplotlib sets the marker size in points squared. The ability to tweak up and down with the size_range arg is probably going to be needed, but I'm open to suggestions on good defaults.
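The min-max behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `scale_min_max` is a hypothetical helper, the default `size_range` of (50, 1000) is taken from the description, and the handling of a constant column (midpoint of the range) is an arbitrary choice made here.

```python
def scale_min_max(values, size_range=(50, 1000)):
    """Linearly map values onto size_range: smallest -> min_size, largest -> max_size."""
    min_size, max_size = size_range
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column gets the midpoint of the range.
        return [(min_size + max_size) / 2 for _ in values]
    span = max_size - min_size
    return [min_size + span * (v - lo) / (hi - lo) for v in values]


print(scale_min_max([10, 20]))                         # [50.0, 1000.0]
print(scale_min_max([0, 5, 10], size_range=(20, 40)))  # [20.0, 30.0, 40.0]
```

The resulting list could be passed as `s=` to `matplotlib.pyplot.scatter`. Because the observed minimum and maximum are always pinned to the ends of the range, columns such as [10, 20] and [100, 200] produce the same sizes.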