Add ecdfplot function #2141

Merged: 11 commits, Jun 17, 2020
1 change: 1 addition & 0 deletions doc/api.rst
@@ -45,6 +45,7 @@ Distribution plots

distplot
histplot
ecdfplot
kdeplot
rugplot

130 changes: 130 additions & 0 deletions doc/docstrings/ecdfplot.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a univariate distribution along the x axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns; sns.set()\n",
"penguins = sns.load_dataset(\"penguins\")\n",
"sns.ecdfplot(data=penguins, x=\"flipper_length_mm\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Flip the plot by assigning the data variable to the y axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, y=\"flipper_length_mm\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If neither `x` nor `y` is assigned, the dataset is treated as wide-form, and an ECDF is drawn for each numeric column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins.filter(like=\"culmen_\", axis=\"columns\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also draw multiple ECDFs from a long-form dataset with hue mapping:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default distribution statistic is normalized to show a proportion, but you can show absolute counts instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", stat=\"count\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also possible to plot the empirical complementary CDF (1 - CDF):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", complementary=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "seaborn-refactor (py38)",
"language": "python",
"name": "seaborn-refactor"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
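To make the `stat` and `complementary` options in the cells above concrete, here is a minimal sketch of what a single point on each curve represents. The helper `ecdf_at` is hypothetical (it is not part of seaborn), the flipper lengths are made-up values, and it assumes unweighted observations.

```python
from bisect import bisect_right


def ecdf_at(sorted_values, q, stat="proportion", complementary=False):
    """Hypothetical helper: ECDF of `sorted_values` evaluated at `q`."""
    n = len(sorted_values)
    # Number of observations less than or equal to q.
    count = bisect_right(sorted_values, q)
    if complementary:
        # Complementary CDF: observations strictly greater than q.
        count = n - count
    return count / n if stat == "proportion" else count


# Made-up flipper lengths, for illustration only.
lengths = sorted([181, 186, 195, 193, 190])

print(ecdf_at(lengths, 190))                      # 0.6
print(ecdf_at(lengths, 190, stat="count"))        # 3
print(ecdf_at(lengths, 190, complementary=True))  # 0.4
```

With `stat="count"` the curve tops out at the number of observations rather than at 1, and the complementary variant mirrors the curve so that it falls from the maximum to 0.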
2 changes: 1 addition & 1 deletion doc/docstrings/histplot.ipynb
@@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also draw multiple histograms from a long-form dataset with hue mapping:"
"You can otherwise draw multiple histograms from a long-form dataset with hue mapping:"
]
},
{
6 changes: 5 additions & 1 deletion doc/releases/v0.11.0.txt
@@ -9,14 +9,18 @@ v0.11.0 (Unreleased)
Modernization of distribution functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, a new function, :func:`histplot` has been added. :func:`histplot` draws univariate or bivariate histograms with a number of features, including:
First, two new functions, :func:`histplot` and :func:`ecdfplot`, have been added.

:func:`histplot` draws univariate or bivariate histograms with a number of features, including:

- mapping multiple distributions with a ``hue`` semantic
- normalization to show density, probability, or frequency statistics
- flexible parameterization of bin size, including proper bins for discrete variables
- adding a KDE fit to show a smoothed distribution over all bin statistics
- experimental support for histograms over categorical and datetime variables. GH2125

:func:`ecdfplot` draws univariate empirical cumulative distribution functions, using a similar interface.

Second, the existing functions :func:`kdeplot` and :func:`rugplot` have been completely overhauled. Two of the oldest functions in the library, these lacked aspects of the otherwise-common seaborn API, such as the ability to assign variables by name from a ``data`` object; they had no capacity for semantic mapping; and they had numerous other inconsistencies and smaller issues.

The overhauled functions now share a common API with the rest of seaborn, they can show conditional distributions by mapping a third variable with a ``hue`` semantic, and have been improved in numerous other ways. The `github pull request (GH2104) <https://github.com/mwaskom/seaborn/pull/2104>`_ has a longer explanation of the changes and the motivation behind them.
3 changes: 3 additions & 0 deletions seaborn/_docstrings.py
@@ -116,6 +116,9 @@ def from_function_params(cls, func):
""",
kdeplot="""
kdeplot : Plot univariate or bivariate distributions using kernel density estimation.
""",
ecdfplot="""
ecdfplot : Plot empirical cumulative distribution functions.
""",
rugplot="""
rugplot : Plot a tick at each observation value along the x and/or y axes.
79 changes: 79 additions & 0 deletions seaborn/_statistics.py
@@ -1,3 +1,29 @@
"""Statistical transformations for visualization.

This module is currently private, but is being written to eventually form part
of the public API.

The classes should behave roughly in the style of scikit-learn.

- All data-independent parameters should be passed to the class constructor.
- Each class should implement a default transformation that is exposed through
  __call__. These are currently written for vector arguments, but I think
  consuming a whole `plot_data` DataFrame and returning it with transformed
  variables would make more sense.
- Some classes have data-dependent preprocessing that should be cached and used
  multiple times (think defining histogram bins off all data and then counting
  observations within each bin multiple times per data subset). These currently
  have unique names, but it would be good to have a common name. Not quite
  `fit`, but something similar.
- Alternatively, the transform interface could take some information about grouping
  variables and do a groupby internally.
- Some classes should define alternate transforms that might make the most sense
  with a different function. For example, KDE usually evaluates the distribution
  on a regular grid, but it would be useful for it to transform at the actual
  datapoints. Then again, this could be controlled by a parameter at the time of
  class instantiation.

"""
from distutils.version import LooseVersion
from numbers import Number
import numpy as np
@@ -345,3 +371,56 @@ def __call__(self, x1, x2=None, weights=None):
            return self._eval_univariate(x1, weights)
        else:
            return self._eval_bivariate(x1, x2, weights)


class ECDF:
    """Univariate empirical cumulative distribution estimator."""
    def __init__(self, stat="proportion", complementary=False):
        """Initialize the class with its parameters.

        Parameters
        ----------
        stat : {{"proportion", "count"}}
            Distribution statistic to compute.
        complementary : bool
            If True, use the complementary CDF (1 - CDF)

        """
        _check_argument("stat", ["count", "proportion"], stat)
        self.stat = stat
        self.complementary = complementary

    def _eval_bivariate(self, x1, x2, weights):
        """Inner function for ECDF of two variables."""
        raise NotImplementedError("Bivariate ECDF is not implemented")

    def _eval_univariate(self, x, weights):
        """Inner function for ECDF of one variable."""
        # Sort the observations and keep the weights aligned with them.
        sorter = x.argsort()
        x = x[sorter]
        weights = weights[sorter]
        y = weights.cumsum()

        if self.stat == "proportion":
            y = y / y.max()

        # Prepend a point at -inf so the curve starts from zero.
        x = np.r_[-np.inf, x]
        y = np.r_[0, y]

        if self.complementary:
            y = y.max() - y

        return y, x

    def __call__(self, x1, x2=None, weights=None):
        """Return proportion or count of observations below each sorted datapoint."""
        x1 = np.asarray(x1)
        if weights is None:
            weights = np.ones_like(x1)
        else:
            weights = np.asarray(weights)

        if x2 is None:
            return self._eval_univariate(x1, weights)
        else:
            return self._eval_bivariate(x1, x2, weights)
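
The steps in `_eval_univariate` can be traced by hand with a dependency-free sketch. The function below is hypothetical (the name `ecdf_sketch` is not part of this PR or of seaborn) and uses plain Python lists instead of NumPy arrays. It also assumes non-negative weights, so that the final cumulative sum is the maximum.

```python
import math


def ecdf_sketch(values, weights=None, stat="proportion", complementary=False):
    """Hypothetical pure-Python mirror of the univariate ECDF steps."""
    if stat not in ("proportion", "count"):
        raise ValueError("stat must be 'proportion' or 'count'")
    if weights is None:
        weights = [1] * len(values)

    # Sort the observations and carry their weights along.
    pairs = sorted(zip(values, weights), key=lambda p: p[0])
    xs = [x for x, _ in pairs]

    # A running total of the weights gives the cumulative count.
    ys = []
    total = 0
    for _, w in pairs:
        total += w
        ys.append(total)

    if stat == "proportion":
        ys = [y / total for y in ys]

    # Prepend a point at -inf so the curve starts at zero.
    xs = [-math.inf] + xs
    ys = [0] + ys

    if complementary:
        top = max(ys)
        ys = [top - y for y in ys]

    # Same return order as the method in the diff: statistic, then positions.
    return ys, xs


y, x = ecdf_sketch([3, 1, 2, 2])
print(x)  # [-inf, 1, 2, 2, 3]
print(y)  # [0, 0.25, 0.5, 0.75, 1.0]
```

Note that tied observations each get their own step here, as in the diff, so a repeated value appears more than once in the returned positions.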