fix: issue#915, Error for large integers in Series #1233

Open
wants to merge 10 commits into
base: develop
9 changes: 6 additions & 3 deletions src/pandas_profiling/model/summary_algorithms.py
@@ -36,11 +36,13 @@ def histogram_compute(
     stats = {}
     bins = config.plot.histogram.bins
     bins_arg = "auto" if bins == 0 else min(bins, n_unique)
-    stats[name] = np.histogram(finite_values, bins=bins_arg, weights=weights)
+    bins = np.histogram_bin_edges(finite_values, bins=bins_arg)
+    stats[name] = np.histogram(finite_values, bins=bins, weights=weights)
Comment on lines +39 to +40
Contributor

Interesting solution, but it still seems to behave a bit weirdly with big numbers:

> import numpy as np
> arr = np.array([716277643516076032 + i for i in range(100)])
> bins = np.histogram_bin_edges(arr, bins=5)
> np.histogram(arr, bins=bins)
(array([ 0,  0, 65,  0, 35]),
 array([7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17,
        7.16277644e+17, 7.16277644e+17]))
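
The lopsided counts above are consistent with float64 rounding: at this magnitude adjacent representable doubles are 128 apart, so the 100 consecutive integers collapse onto very few float values once they are converted. A minimal sketch to check this, assuming only NumPy (the variable names are illustrative, not from the PR):

```python
import numpy as np

base = 716277643516076032
arr = np.array([base + i for i in range(100)])

# Distance between adjacent float64 values at this magnitude.
gap = np.spacing(np.float64(base))

# Distinct values that survive the int64 -> float64 conversion.
distinct_as_float = np.unique(arr.astype(np.float64))

print(gap)                     # 128.0
print(len(distinct_as_float))  # 2
```

With only two distinct float values left, no choice of bin edges can spread the data evenly across five bins.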

Contributor Author

True, the values are still not evenly distributed the way they are for smaller numbers. What do you propose here? Without np.histogram_bin_edges, an error is raised for larger numbers. Is it better to raise an error than to have weird behavior?

Contributor

Thinking from a user's perspective, for me it is better to have an error raised than to get an incorrect plot. If I know there was a problem with the large integers, I can preprocess that column and run again, but an incorrect result may lead me to an incorrect interpretation of my data distribution.
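
One possible form of the preprocessing mentioned here, as a sketch only and not part of this PR: subtract the column minimum while the values are still integers, so the histogram operates on small offsets that float64 represents exactly. The helper name `histogram_of_offsets` is hypothetical.

```python
import numpy as np


def histogram_of_offsets(values, bins=5):
    """Histogram integer values relative to their minimum (hypothetical helper)."""
    values = np.asarray(values, dtype=np.int64)
    origin = int(values.min())
    # Exact integer subtraction; the offsets are small enough for float64.
    offsets = values - origin
    counts, edges = np.histogram(offsets, bins=bins)
    # Add `origin` back to the edges (as Python ints) only when labelling a plot.
    return counts, edges, origin


arr = np.array([716277643516076032 + i for i in range(100)])
counts, edges, origin = histogram_of_offsets(arr, bins=5)
print(counts)  # [20 20 20 20 20]
```

The bin edges are returned relative to `origin`, so any axis labels would need the offset added back in integer arithmetic.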

Contributor Author

@Sohaib90, Jan 26, 2023

Yeah, that is what I was thinking as well. I think I should change this so that it raises an error rather than producing an incorrect plot, right?

Also, the Codacy Static Code Analysis check is failing. I think that is a new one.
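
A guard along these lines could look roughly like the sketch below. It is an assumption about one way to raise early, not what the PR implements; `check_histogram_safe` and `FLOAT64_EXACT_INT_LIMIT` are hypothetical names. It relies on the fact that float64 represents every integer exactly only up to a magnitude of 2**53.

```python
import numpy as np

# Every integer with magnitude up to 2**53 is exactly representable in float64.
FLOAT64_EXACT_INT_LIMIT = 2**53


def check_histogram_safe(values):
    """Raise ValueError if integer values may lose precision as float64 (hypothetical)."""
    values = np.asarray(values)
    if values.size == 0 or not np.issubdtype(values.dtype, np.integer):
        return
    # Convert to Python ints so abs() cannot overflow for extreme int64 values.
    largest = max(abs(int(values.min())), abs(int(values.max())))
    if largest > FLOAT64_EXACT_INT_LIMIT:
        raise ValueError(
            "Integer values exceed 2**53; histogram bins would be computed "
            "on rounded float64 values."
        )


arr = np.array([716277643516076032 + i for i in range(100)])
try:
    check_histogram_safe(arr)
except ValueError as exc:
    print(f"refused to plot: {exc}")
```

The check is conservative: it rejects any column containing a large integer, even one whose values happen to be exactly representable.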


     max_bins = config.plot.histogram.max_bins
     if bins_arg == "auto" and len(stats[name][1]) > max_bins:
-        stats[name] = np.histogram(finite_values, bins=max_bins, weights=None)
+        bins = np.histogram_bin_edges(finite_values, bins=max_bins)
+        stats[name] = np.histogram(finite_values, bins=bins, weights=None)

     return stats

@@ -49,7 +51,8 @@ def chi_square(
     values: Optional[np.ndarray] = None, histogram: Optional[np.ndarray] = None
 ) -> dict:
     if histogram is None:
-        histogram, _ = np.histogram(values, bins="auto")
+        bins = np.histogram_bin_edges(values, bins="auto")
+        histogram, _ = np.histogram(values, bins=bins)
     return dict(chisquare(histogram)._asdict())


33 changes: 33 additions & 0 deletions tests/issues/test_issue915.py
@@ -0,0 +1,33 @@
+"""
+Test for issue 915:
+https://github.com/ydataai/pandas-profiling/issues/915
+
+Error for series with large integers.
+"""
+import fnmatch
+
+import pandas as pd
+
+from pandas_profiling import ProfileReport
+
+
+def test_issue915():
+    df = pd.DataFrame({"col": pd.Series([716277643516076032 + i for i in range(100)])})
+    df_profile = ProfileReport(df)
+
+    def test_with_value(n_extreme_obs):
+        """Generate HTML and validate the tabs contain the proper tab titles."""
+        df_profile.config.n_extreme_obs = n_extreme_obs
+        df_profile.invalidate_cache()
+
+        reg_min = f"*<a href=* aria-controls=* role=tab data-toggle=tab>Minimum {n_extreme_obs} values</a>*"
+        reg_max = f"*<a href=* aria-controls=* role=tab data-toggle=tab>Maximum {n_extreme_obs} values</a>*"
+
+        profile_html = df_profile.to_html()
+
+        assert fnmatch.fnmatch(profile_html, reg_min)
+        assert fnmatch.fnmatch(profile_html, reg_max)
+
+    test_with_value(5)
+    test_with_value(100)
+    test_with_value(120)