histplot common normalization ignores weights #2655

zerothi · 2021-09-06T08:45:11Z

When using weights in histplot for probability plots it seems seaborn does not take into accounts the weights when using common_norm

See this small example:

import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

print(data)

f, axs = plt.subplots(figsize=(7, 5), nrows=2,ncols=2)

for i, com_norm in enumerate([True, False]):
    sns.histplot(
        data,
        x="price", hue="cut",
        stat='probability',
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[0][i],
    )
    axs[0][i].set_title(f"stat=probability, common_norm={com_norm}")

    sns.histplot(
        data,
        x="price", hue="cut",
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[1][i],
    )
    axs[1][i].set_title(f"stat=count, common_norm={com_norm}")
plt.show()

Which yields.

I would have assumed that weights are taken into account when calculating the probability. From the code:

seaborn/seaborn/distributions.py

Line 480 in 78e9c08

if common_norm:

I can see that it only takes into account the length of the data set which is reflected in the plot. I would have expected that the top left plot would have shown that I has a probability of 20% 1/(1+2+2). While the documentation for weights is

If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.

it does not mention anything about other than stat='count' plots it would make sense to take this into account?

In any case being able to weigh the probability with the weights using a common-norm would be useful if the histplot does as intended.

The text was updated successfully, but these errors were encountered:

mwaskom · 2021-09-06T11:14:41Z

Yes, good catch. seaborn doesn't actually do anything with the weights other than pass them through to matplotlib, but I think you're right that they need to be used by common_norm too.

mwaskom · 2021-09-06T11:29:09Z

BTW I found the stacked bars in your example a little hard to parse, this makes more sense to me as a demonstration of the issue:

f, axs = plt.subplots(2, 2, constrained_layout=True, figsize=(7, 6))
params = dict(x=[1, 2, 3], weights=[1, 2, 2], discrete=True)
sns.histplot(**params, ax=axs.flat[0])
sns.histplot(**params, hue=["a", "b", "b"], ax=axs.flat[1])
sns.histplot(**params, hue=["a", "b", "b"], stat="proportion", ax=axs.flat[2])
sns.histplot(**params, hue=["a", "b", "b"], stat="proportion", common_norm=False, ax=axs.flat[3])

zerothi · 2021-09-06T11:53:42Z

Great, I do something like this:

    # added code:
    sum_weight = 0.
    if common_norm:
        for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
            if "weights" in self.variables:
                sum_weight += sub_data["weights"].sum()

    # First pass through the data to compute the histograms
    for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
    ...

        # Apply scaling to normalize across groups (added code)
        if common_norm and weights is None:
            hist *= len(sub_data) / len(all_data)
        elif common_norm:
            hist *= weights.sum() / sum_weight

this seems to work nicely.

mwaskom · 2021-09-06T12:02:29Z

I think it would be cleaner to have default weights of 1, then you can just proceed from there by summing the weights for each group to compute the relevant numerator/denominator of the scaling factor and don't need to repeat conditionals in multiple places.

zerothi · 2021-09-06T12:05:18Z

I think it would be cleaner to have default weights of 1, then you can just proceed from there by summing the weights for each group to compute the relevant numerator/denominator of the scaling factor and don't need to repeat conditionals in multiple places.

I don't have any preference, this was my quick hack ;)
On the other hand summing for very large data-sets where it isn't needed might be annoying/slow?

zerothi · 2021-09-09T05:45:39Z

The diff for fixing this is:

diff --git a/seaborn/distributions.py b/seaborn/distributions.py
index 5f63289..8329807 100644
--- a/seaborn/distributions.py
+++ b/seaborn/distributions.py
@@ -424,6 +424,12 @@ class _DistributionPlotter(VectorPlotter):
                 warn_singular=False,
             )
 
+        sum_weight = 0.
+        if common_norm:
+            for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
+                if "weights" in self.variables:
+                    sum_weight += sub_data["weights"].sum()
+
         # First pass through the data to compute the histograms
         for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
 
@@ -464,12 +470,21 @@ class _DistributionPlotter(VectorPlotter):
             hist = pd.Series(heights, index=index, name="heights")
 
             # Apply scaling to normalize across groups
-            if common_norm:
+            if common_norm and weights is None:
                 hist *= len(sub_data) / len(all_data)
+            elif common_norm:
+                hist *= weights.sum() / sum_weight
 
             # Store the finalized histogram data for future plotting
             histograms[key] = hist

It can be altered as needed.

mwaskom changed the title ~~histplot with stat=probability, common_norm=True and weights~~ histplot common normalization ignores weights Sep 6, 2021

mwaskom added bug mod:distributions labels Sep 6, 2021

mwaskom added this to the v0.12.0 milestone Sep 6, 2021

zerothi mentioned this issue Sep 8, 2021

histplot with normalization for each x-bin across hue #2656

Closed

mwaskom mentioned this issue May 17, 2022

Fix histplot/kdeplot normalization with weights #2812

Merged

mwaskom closed this as completed in #2812 May 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

histplot common normalization ignores weights #2655

histplot common normalization ignores weights #2655

zerothi commented Sep 6, 2021

mwaskom commented Sep 6, 2021

mwaskom commented Sep 6, 2021

zerothi commented Sep 6, 2021

mwaskom commented Sep 6, 2021

zerothi commented Sep 6, 2021

zerothi commented Sep 9, 2021

histplot common normalization ignores weights #2655

histplot common normalization ignores weights #2655

Comments

zerothi commented Sep 6, 2021

mwaskom commented Sep 6, 2021

mwaskom commented Sep 6, 2021

zerothi commented Sep 6, 2021

mwaskom commented Sep 6, 2021

zerothi commented Sep 6, 2021

zerothi commented Sep 9, 2021