Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

histplot common normalization ignores weights #2655

Closed
zerothi opened this issue Sep 6, 2021 · 6 comments · Fixed by #2812
Closed

histplot common normalization ignores weights #2655

zerothi opened this issue Sep 6, 2021 · 6 comments · Fixed by #2812

Comments

@zerothi
Copy link

zerothi commented Sep 6, 2021

When using weights in histplot for probability plots it seems seaborn does not take into accounts the weights when using common_norm

See this small example:

import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

print(data)

f, axs = plt.subplots(figsize=(7, 5), nrows=2,ncols=2)

for i, com_norm in enumerate([True, False]):
    sns.histplot(
        data,
        x="price", hue="cut",
        stat='probability',
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[0][i],
    )
    axs[0][i].set_title(f"stat=probability, common_norm={com_norm}")

    sns.histplot(
        data,
        x="price", hue="cut",
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[1][i],
    )
    axs[1][i].set_title(f"stat=count, common_norm={com_norm}")
plt.show()

Which yields.
data

I would have assumed that weights are taken into account when calculating the probability. From the code:

if common_norm:
I can see that it only takes into account the length of the data set which is reflected in the plot. I would have expected that the top left plot would have shown that I has a probability of 20% 1/(1+2+2). While the documentation for weights is

If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.

it does not mention anything about other than stat='count' plots it would make sense to take this into account?

In any case being able to weigh the probability with the weights using a common-norm would be useful if the histplot does as intended.

@mwaskom
Copy link
Owner

mwaskom commented Sep 6, 2021

Yes, good catch. seaborn doesn't actually do anything with the weights other than pass them through to matplotlib, but I think you're right that they need to be used by common_norm too.

@mwaskom
Copy link
Owner

mwaskom commented Sep 6, 2021

BTW I found the stacked bars in your example a little hard to parse, this makes more sense to me as a demonstration of the issue:

f, axs = plt.subplots(2, 2, constrained_layout=True, figsize=(7, 6))
params = dict(x=[1, 2, 3], weights=[1, 2, 2], discrete=True)
sns.histplot(**params, ax=axs.flat[0])
sns.histplot(**params, hue=["a", "b", "b"], ax=axs.flat[1])
sns.histplot(**params, hue=["a", "b", "b"], stat="proportion", ax=axs.flat[2])
sns.histplot(**params, hue=["a", "b", "b"], stat="proportion", common_norm=False, ax=axs.flat[3])

image

@zerothi
Copy link
Author

zerothi commented Sep 6, 2021

Great, I do something like this:

    # added code:
    sum_weight = 0.
    if common_norm:
        for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
            if "weights" in self.variables:
                sum_weight += sub_data["weights"].sum()

    # First pass through the data to compute the histograms
    for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
    ...

        # Apply scaling to normalize across groups (added code)
        if common_norm and weights is None:
            hist *= len(sub_data) / len(all_data)
        elif common_norm:
            hist *= weights.sum() / sum_weight

this seems to work nicely.

@mwaskom
Copy link
Owner

mwaskom commented Sep 6, 2021

I think it would be cleaner to have default weights of 1, then you can just proceed from there by summing the weights for each group to compute the relevant numerator/denominator of the scaling factor and don't need to repeat conditionals in multiple places.

@zerothi
Copy link
Author

zerothi commented Sep 6, 2021

I think it would be cleaner to have default weights of 1, then you can just proceed from there by summing the weights for each group to compute the relevant numerator/denominator of the scaling factor and don't need to repeat conditionals in multiple places.

I don't have any preference, this was my quick hack ;)
On the other hand summing for very large data-sets where it isn't needed might be annoying/slow?

@mwaskom mwaskom changed the title histplot with stat=probability, common_norm=True and weights histplot common normalization ignores weights Sep 6, 2021
@mwaskom mwaskom added this to the v0.12.0 milestone Sep 6, 2021
@zerothi
Copy link
Author

zerothi commented Sep 9, 2021

The diff for fixing this is:

diff --git a/seaborn/distributions.py b/seaborn/distributions.py
index 5f63289..8329807 100644
--- a/seaborn/distributions.py
+++ b/seaborn/distributions.py
@@ -424,6 +424,12 @@ class _DistributionPlotter(VectorPlotter):
                 warn_singular=False,
             )
 
+        sum_weight = 0.
+        if common_norm:
+            for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
+                if "weights" in self.variables:
+                    sum_weight += sub_data["weights"].sum()
+
         # First pass through the data to compute the histograms
         for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
 
@@ -464,12 +470,21 @@ class _DistributionPlotter(VectorPlotter):
             hist = pd.Series(heights, index=index, name="heights")
 
             # Apply scaling to normalize across groups
-            if common_norm:
+            if common_norm and weights is None:
                 hist *= len(sub_data) / len(all_data)
+            elif common_norm:
+                hist *= weights.sum() / sum_weight
 
             # Store the finalized histogram data for future plotting
             histograms[key] = hist

It can be altered as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants