Overhaul of categorical distribution plots #410

Merged
merged 44 commits into from
Jan 22, 2015


mwaskom
Owner

@mwaskom mwaskom commented Dec 29, 2014

TLDR: The boxplot and violinplot APIs are changing, for the better, but in a way that will be mildly disruptive. There is also a new function, stripplot.

There are some examples below, but to really see these functions in action, check out the new API docs that take advantage of automated figure collection for docstring examples:

boxplot | violinplot | stripplot

Changes/enhancements to boxplot and violinplot

This PR updates and unifies the API for boxplot and violinplot. Both functions maintain backwards-compatibility in terms of the kind of data they accept, but the syntax has changed. These functions are now invoked with x, y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter. You can still pass wide-form DataFrames or arrays to data, but it is no longer the first positional argument.

In other words, instead of doing

sns.boxplot(tips.total_bill, groupby=tips.day)

You would now do

sns.boxplot("day", "total_bill", data=tips)

seaborn-boxplot-2

Existing code that uses these functions will probably break, but can be easily updated. I don't like this kind of disruptive API change, but in this case the new API has a lot of virtues and creating a smoother upgrade path would have been too complicated to reasonably handle.

The upshot of this is that both functions now work seamlessly in the context of a FacetGrid. Additionally, by using named variables and a data object, it's much easier to apply transformations to the data in the body of the seaborn call. It also just generally decreases the cognitive overhead of remembering that the API for boxplot/violinplot is different from that for regplot and friends.
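To make the long-form idea concrete, here is a minimal plain-Python sketch of reshaping one-column-per-group (wide-form) data into one-row-per-observation (long-form) records, which is the shape that named x and y variables refer to. The numbers are toy values, not the tips dataset, and `wide_to_long` is a hypothetical helper written for this illustration. With pandas, `pandas.melt` performs this kind of reshaping.

```python
# Wide-form: one entry per group, each holding that group's values.
# Toy numbers for illustration only (not taken from the tips dataset).
wide = {
    "Thur": [10.5, 12.0, 8.25],
    "Fri": [15.0, 9.75],
}


def wide_to_long(wide, x_name, y_name):
    """Return one record per observation: {x_name: group, y_name: value}."""
    records = []
    for group, values in wide.items():
        for value in values:
            records.append({x_name: group, y_name: value})
    return records


long_rows = wide_to_long(wide, "day", "total_bill")
for row in long_rows:
    print(row)
```

Each record now carries both the grouping variable ("day") and the measured variable ("total_bill"), so a call in the new style can name them directly, in the same way the later example in this thread passes x=, y= and data= as keywords.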

To sweeten this change, there are a variety of other enhancements (and a few other API breaks):

  • Added a hue argument to boxplot and violinplot, which allows for nested grouping of the plot elements by a third categorical variable. For violinplot, this nesting can also be accomplished by splitting the violins when there are two levels of the hue variable. To make this functionality feasible, the ability to specify where the plots will be drawn in data coordinates has been removed. These plots are now drawn at set positions, like (and identical to) barplot and pointplot.
sns.violinplot("day", "total_bill", "smoker", data=tips, palette="Set1", split=True)

seaborn-violinplot-4

  • These plots now accept ordered categorical-type variables as input, and infer the orientation of the plot from which argument gets the category. Additionally, the order of the categories will determine the order of the plot elements:
sns.violinplot("orbital_period", "method", data=planets.query("orbital_period < 1000"))

seaborn-violinplot-10

  • Added a palette parameter to boxplot/violinplot. The color parameter still exists, but no longer does double-duty in accepting the name of a seaborn palette. palette supersedes color so that it can be used with a FacetGrid.
  • Added the scale and scale_hue parameters to violinplot. These control how the width of the violins is scaled. The default is area, which is different from how the violins used to be drawn. Use scale='width' to get the old behavior. You can also use scale="count" to scale by the number of observations in each bin.
  • Used a different style for the box kind of interior plot in violinplot, which shows the whisker range in addition to the quartiles. Use inner='quartile' to get the old style.
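The three scale options described in the list above can be illustrated with a small plain-Python sketch. This is not seaborn's code: `scale_violins` is a hypothetical helper, the density curves are made-up numbers, and the exact normalization seaborn applies is an assumption here. The sketch only shows the idea that "area" uses one shared scaling factor, "width" gives every violin the same maximum width, and "count" shrinks violins with fewer observations.

```python
def scale_violins(densities, counts, scale="area"):
    """Rescale each group's density curve so its widest point is at most 1.0.

    densities: list of lists, one density curve per group (illustrative values)
    counts:    number of observations in each group
    """
    if scale == "area":
        # One shared divisor for all curves, so curves with equal area
        # before scaling still have equal area afterwards.
        overall_max = max(max(d) for d in densities)
        return [[v / overall_max for v in d] for d in densities]
    if scale == "width":
        # Each curve is divided by its own maximum: every violin is equally wide.
        return [[v / max(d) for v in d] for d in densities]
    if scale == "count":
        # Start from equal widths, then shrink by the share of observations.
        biggest = max(counts)
        return [
            [v / max(d) * n / biggest for v in d]
            for d, n in zip(densities, counts)
        ]
    raise ValueError("scale must be 'area', 'width', or 'count'")


# Two toy curves: a narrow, peaked group and a broad, flat group.
densities = [[0.0, 2.0, 0.0], [0.5, 1.0, 0.5]]
counts = [10, 40]

by_area = scale_violins(densities, counts, "area")
by_width = scale_violins(densities, counts, "width")
by_count = scale_violins(densities, counts, "count")
print(by_area, by_width, by_count, sep="\n")
```

Under "area" the flat group is drawn narrower than the peaked one, under "width" both reach the same maximum, and under "count" the group with 10 observations is a quarter as wide as the group with 40.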

New stripplot function

This PR also introduces the stripplot function, which draws a scatterplot where one of the variables is categorical. This plot has the same API as boxplot and violinplot. It is useful both on its own and when composed with one of these other plot kinds to show both the observations and the underlying distribution.

sns.violinplot("total_bill", "day", data=tips, inner=None)
sns.stripplot("total_bill", "day", data=tips, jitter=True)

seaborn-stripplot-10
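What jitter=True does conceptually can be sketched in a few lines of plain Python: each point keeps its value on the numeric axis and gets a small random offset along the categorical axis so that overlapping observations spread out. The helper name `jitter_positions`, the uniform distribution, and the 0.1 half-width are all assumptions made for this illustration; the thread does not say how seaborn draws its offsets.

```python
import random


def jitter_positions(category_index, n_points, amount=0.1, seed=0):
    """Return n_points positions scattered uniformly around category_index.

    category_index: the fixed slot (0, 1, 2, ...) where the category is drawn
    amount:         maximum offset to either side (assumed value)
    seed:           fixed so the example is reproducible
    """
    rng = random.Random(seed)
    return [
        category_index + rng.uniform(-amount, amount)
        for _ in range(n_points)
    ]


# Five observations that all belong to the category drawn at position 2.
positions = jitter_positions(2, 5)
print(positions)
```

Because the offsets stay well inside the spacing between categories, the points remain visually attached to their own category while no longer sitting on top of each other.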

Backend details

For the aficionados, this PR involves a complete rewrite of the code for these functions. It's much better organized, abstracted, and tested. That means it will be easier to keep these functions on a common API going forward, and to add enhancements with more confidence that they won't lead to regressions.

Next up will probably be to bring the barplot/pointplot code into this framework, which in some ways is better and more robust than what those run on. Also coming soon.... swarmplot.

@mwaskom mwaskom force-pushed the new_boxish_plots branch 2 times, most recently from aa86b4e to 1233017 Compare January 14, 2015 05:50
This was only really a staticmethod for early testing convenience.
This still needs a lot of cleaning up and testing. Most, but not all of it is here.
Trying to handle cases with 0 or 1 observations in a bin, but not every option works currently.
This commit also sucked in the new comments in the violinplot kde estimation method
This was causing issues with the old version of matplotlib so I am just killing it for now.
@mwaskom mwaskom changed the title WIP: Overhaul of boxplot-like plots WIP: Overhaul of categorical distribution plots Jan 19, 2015
@mwaskom mwaskom changed the title WIP: Overhaul of categorical distribution plots Overhaul of categorical distribution plots Jan 19, 2015
@mwaskom
Owner Author

mwaskom commented Jan 19, 2015

This is mostly done from my perspective (modulo #423), but I'm looking for testers to try it out on some real data and find any weird corner-cases before I merge.

@phobson
Contributor

phobson commented Jan 19, 2015

Due to the repeated values, I wouldn't be surprised if this was supposed to fail. If that's the case, should it fail more gracefully? Difficult to debug presently.

from io import StringIO

import numpy as np
import pandas
import matplotlib.pyplot as plt
import seaborn

strfile = """\
category,epazone,parameter,station,qual,res
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.000000094994903
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.000000094994903
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,6,"Lead, Dissolved",inflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",outflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",inflow,=,0.8199999928474426
Wetland Basin,6,"Lead, Dissolved",outflow,=,2.5999999046325684
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5199999809265137
Wetland Basin,6,"Lead, Dissolved",outflow,=,7.449999809265137
Wetland Basin,6,"Lead, Dissolved",inflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",outflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",inflow,=,15.899999618530273
Wetland Basin,6,"Lead, Dissolved",outflow,=,2.190000057220459
Wetland Basin,6,"Lead, Dissolved",inflow,=,3.5899999141693115
Wetland Basin,6,"Lead, Dissolved",outflow,=,6.840000152587891
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5199999809265137
Wetland Basin,6,"Lead, Dissolved",outflow,=,3.9600000381469727
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5399999618530273
Wetland Basin,6,"Lead, Dissolved",outflow,=,3.2200000286102295
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.5600000023841858
Wetland Basin,7,"Lead, Dissolved",outflow,=,2.0899999141693115
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.46000000834465027
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.550000011920929
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.7799999713897705
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.5899999737739563
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.3700000047683716
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.27000001072883606
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.23999999463558197
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.3700000047683716
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.6399999856948853
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.8399999737739563
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.5699999928474426
Wetland Basin,7,"Lead, Dissolved",outflow,=,1.2899999618530273
"""
df = pandas.read_csv(StringIO(strfile))
df['logres'] = np.log(df['res'])
fig, ax = plt.subplots()
seaborn.violinplot(x='epazone', y='logres', hue='station', data=df, ax=ax, split=True)

Again, this is all very awesome.
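One plausible reason repeated values are a problem, sketched in plain Python: a rule-of-thumb kernel bandwidth is proportional to the sample standard deviation, so a group whose values are all identical gets a bandwidth of zero and no density curve can be drawn from it. That this is what went wrong here is an assumption, since the thread does not include the traceback. The helper names are hypothetical, and Scott's rule is used only as a familiar example of such a bandwidth.

```python
import math
import statistics


def scott_bandwidth(values):
    """Scott's rule-of-thumb bandwidth for 1-D data: std * n ** (-1/5)."""
    n = len(values)
    return statistics.stdev(values) * n ** (-1 / 5)


def can_estimate_density(values):
    """True only if there are enough distinct values for a usable bandwidth."""
    # stdev needs at least two points; the length check short-circuits first.
    return len(values) >= 2 and scott_bandwidth(values) > 0


# A group where every observation is the same value, as in the rows
# above where res is exactly 2.0 before the log transform.
constant_group = [math.log(2.0)] * 10
varied_group = [0.1, 0.5, 0.9]

print(can_estimate_density(constant_group))
print(can_estimate_density(varied_group))
```

A plotting function could use a check like this to skip or specially draw a degenerate group, or to raise a clear error message, instead of failing deep inside the density estimation.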

@mwaskom
Owner Author

mwaskom commented Jan 19, 2015

Ding ding, we have a winner. Fixed with bb8af70

@mwaskom
Owner Author

mwaskom commented Jan 19, 2015

Your dataset is also doing some weird things with the area scaling when I remove the hue nesting, but I'm not entirely sure it's a "bug" or what the right way to fix it would be.

@phobson
Contributor

phobson commented Jan 19, 2015

👏 that fixed it all on my end with all 361 datasets just like that one:
total suspended solids_bioretention

mwaskom added a commit that referenced this pull request Jan 22, 2015
Overhaul of categorical distribution plots
@mwaskom mwaskom merged commit ee59253 into master Jan 22, 2015
@mwaskom mwaskom deleted the new_boxish_plots branch January 22, 2015 05:30
@mwaskom mwaskom mentioned this pull request Mar 9, 2015
4 tasks
@sjobeek

sjobeek commented Mar 13, 2015

Ohhhhh man, @mwaskom you are my hero. Can't wait to play around with these.

Categorical, horizontal violinplots from long dataframes, compatible with FacetGrid... yum.

@Phlya

Phlya commented May 21, 2015

Not sure if this is the right place for this comment, but I just want to say that combining e.g. violinplot with stripplot while providing a hue argument causes each hue to be repeated in the legend: once for the violins and once for the points of the stripplot. Not a big deal, but I think a way to keep the points out of the legend would be a good idea. If they overlay the violins and have the same colour, mentioning them in the legend is not really necessary.
An example. Yes, it still looks OK, but with more hues it would be more cluttered, I think.
figure_1

P.S.
It's absolutely awesome that such complex and good-looking plots can be produced with just 2 lines of code! Great job, @mwaskom and everyone else contributing!
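A possible workaround for the duplicated legend entries, sketched in plain Python: keep only the first handle seen for each label before building the legend. `dedupe_legend` is a hypothetical helper, and the strings below are stand-ins for the artist handles that matplotlib's `ax.get_legend_handles_labels()` would return. Which plot's handles come first depends on the drawing order, so treating the violins as first is an assumption.

```python
def dedupe_legend(handles, labels):
    """Keep the first handle for each distinct label, preserving order."""
    seen = set()
    kept_handles, kept_labels = [], []
    for handle, label in zip(handles, labels):
        if label not in seen:
            seen.add(label)
            kept_handles.append(handle)
            kept_labels.append(label)
    return kept_handles, kept_labels


# Stand-in values: two hue levels, each added once by the violins
# and once more by the overlaid points.
handles = ["violin-Yes", "violin-No", "points-Yes", "points-No"]
labels = ["Yes", "No", "Yes", "No"]

kept_handles, kept_labels = dedupe_legend(handles, labels)
print(kept_handles, kept_labels)
```

With a real Axes, the same helper could be fed the result of `ax.get_legend_handles_labels()`, and the filtered lists passed to `ax.legend(kept_handles, kept_labels)`.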


4 participants