BUG: Wrong kurtosis outcome due to inadequate fix to previous issues #57972

j7168908jx · 2024-03-23T09:16:28Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import polars as pl
import pandas as pd
import numpy as np
import scipy.stats as st

data = np.array([-2.05191341e-05,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -4.10391103e-05,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00])

print(pl.Series(data).kurtosis())  # polars
print(pd.Series(data).kurt())      # pandas
print(st.kurtosis(data))           # scipy

Issue Description

The output of the pandas kurtosis function is incorrect.

After some simple debugging I found a comment in core/nanops.py (line 1360, in the function nankurt)
saying that, to fix #18044, it manually zeros out values smaller than 1e-14, which is improper in any case.
This affects any input that has little variance but many data points.
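
To illustrate the magnitudes involved, here is a minimal pure-Python sketch. It assumes the numerator and denominator have the form used in the textbook unbiased sample excess kurtosis formula; the names kurt_terms and FPERR_THRESHOLD are hypothetical and this is not copied from the pandas source.

```python
# Sketch only: textbook unbiased excess-kurtosis terms, assumed (not verified
# here) to match what nankurt builds before the zeroing step.
FPERR_THRESHOLD = 1e-14  # cutoff quoted in the nanops.py comment


def kurt_terms(values):
    """Return (numerator, denominator) of the unbiased kurtosis ratio."""
    count = len(values)
    mean = sum(values) / count
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    m4 = sum((v - mean) ** 4 for v in values)  # sum of fourth-power deviations
    numerator = count * (count + 1) * (count - 1) * m4
    denominator = (count - 2) * (count - 3) * m2 ** 2
    return numerator, denominator


# Same 29 values as in the reproducible example above.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23
numerator, denominator = kurt_terms(data)
print(numerator, denominator)
print(denominator < FPERR_THRESHOLD)  # True means the cutoff would zero it
```

Under these assumptions the denominator is nonzero but falls below the 1e-14 cutoff, so treating it as zero discards a ratio that is otherwise well defined.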

Expected Behavior

Output of provided example:

14.916104870028523
0.0
14.916104870028551

Expected output: roughly 14.9161 for unbiased (pandas's default behaviour) is correct.

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-1010-nvidia-lowlatency
Version : #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


dontgoto · 2024-03-23T13:27:41Z

Good point. I can reproduce this with your example. Scaling it up to larger input distributions alleviates the issue, though.

Your example is a sweet spot for this error. When the distribution is rescaled to be larger, the zeroing out stops happening very quickly, because the O(count^2) and O(count^3) terms in the numerator and denominator equations lift the very small m4 and m2^2 above the 1e-14 threshold.

Doing a check of the form (pseudocode)
count < 100 and abs(frexp(denominator) - frexp(numerator)) < 24
before doing the zeroing out should alleviate this issue, but I would like to hear someone else's opinion before putting in a PR.
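
A runnable reading of that pseudocode might look like the sketch below. The function name exponents_close and its parameter names are hypothetical, and taking the binary exponent from math.frexp is one interpretation of the check, not an existing pandas helper.

```python
import math


def exponents_close(numerator, denominator, count,
                    max_count=100, max_exp_gap=24):
    """Return True when the sample is small and numerator and denominator
    have binary exponents within max_exp_gap of each other.

    Hypothetical guard: a True result would mean "skip the zeroing out".
    """
    # math.frexp(x) returns (mantissa, exponent) with x == mantissa * 2**exponent
    num_exp = math.frexp(numerator)[1]
    den_exp = math.frexp(denominator)[1]
    return count < max_count and abs(den_exp - num_exp) < max_exp_gap


# Both terms tiny but of comparable magnitude, small sample: keep them.
print(exponents_close(5.9e-14, 2.7e-15, count=29))
# Denominator negligible relative to the numerator: zeroing still applies.
print(exponents_close(1.0, 1e-20, count=29))
```

The idea is that two terms of similar magnitude are unlikely to be pure floating-point noise, even when both sit below an absolute cutoff.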

dontgoto · 2024-03-23T15:29:06Z

Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up until a distribution size of about 10x your example, using the same shape of your example.

I was not able to iron out that instability, though.

j7168908jx · 2024-03-23T15:38:47Z

Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up until a distribution size of about 10x your example, using the same shape of your example.

I was not able to iron out that instability, though.

Do you mean that the difference of their output is roughly 3? If you have not set bias=False in scipy or polars, the difference here will be roughly 3.

dontgoto · 2024-03-23T15:41:47Z

Do you mean that the difference of their output is roughly 3?

Exactly

If you have not set bias=False in scipy or polars, the difference here will be roughly 3.

I did not, so then that's also explained. Then I see no issues with my solution anymore.
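
The gap between the two conventions can be checked without any library. The sketch below assumes the biased estimator is the population formula m4 / m2^2 - 3 and the unbiased one is the standard bias-corrected sample estimator; both function names are hypothetical.

```python
def biased_excess_kurtosis(values):
    """Population (biased) excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def unbiased_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis derived from the biased value."""
    n = len(values)
    g2 = biased_excess_kurtosis(values)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


# Same 29 values as in the original report.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23
biased = biased_excess_kurtosis(data)
unbiased = unbiased_excess_kurtosis(data)
print(biased, unbiased, unbiased - biased)
```

On this data the two estimators differ by roughly 3, which is consistent with the deviation described above when bias is left at its default in one library and not the other.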

kaixiongg · 2024-09-06T06:44:08Z

Why not apply Welford's method for skew and kurt?
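
For reference, a one-pass, Welford-style update of the higher central moments could be sketched as follows. This is the well-known online moments recurrence, written here as a standalone illustration with a hypothetical function name; it returns the biased excess kurtosis and is not how pandas currently computes it.

```python
def online_excess_kurtosis(values):
    """One-pass biased excess kurtosis via running central-moment sums."""
    n = 0
    mean = 0.0
    m2 = m3 = m4 = 0.0  # running sums of 2nd, 3rd, 4th powers of deviations
    for x in values:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update order matters: m4 uses the old m3 and m2, m3 uses the old m2.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    # Raises ZeroDivisionError for constant input; a real version would guard it.
    return n * m4 / (m2 * m2) - 3.0


# Same 29 values as in the original report.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23
print(online_excess_kurtosis(data))
```

Because the ratio is formed from the running sums directly, no absolute cutoff on the individual terms is needed in this sketch.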

j7168908jx added the Bug and Needs Triage labels Mar 23, 2024

dontgoto mentioned this issue Mar 26, 2024

BUG: #57972 kurtosis small sample size low variance false positive cutoff #58020

Closed

4 tasks

dontgoto mentioned this issue Apr 7, 2024

BUG: low variance arrays' kurtosis wrongfully set to zero #58176

Closed

4 tasks

Liam3851 mentioned this issue Aug 21, 2024

BUG: Wrong result of Kurtosis #59572

Closed

3 tasks

mroeschke added the Reduction Operations label and removed the Needs Triage label Aug 21, 2024

Liam3851 mentioned this issue Sep 5, 2024

QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715

Open

2 tasks
