
Mean and std for float32 dataframe #22385

Open
m-rossi opened this issue Aug 16, 2018 · 9 comments
Labels
Bug, Dependencies (required and optional dependencies), Reduction Operations (sum, mean, min, max, etc.)

Comments

@m-rossi

m-rossi commented Aug 16, 2018

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame(np.ones(60000).astype('float32') * 1e6)

print(np.mean(data))
print(np.mean(data.values))

print(np.std(data))
print(np.std(data.values))

which results in the following output:

0    999645.0
dtype: float32
1000000.06
0    354.936188
dtype: float32
0.0625

Problem description

I have a pandas DataFrame containing 32-bit floating-point numbers and try to calculate its mean and std, and these calculations can go horribly wrong. I am able to reproduce the issue with the example above.

I know pandas uses internal functions for mean and std instead of the NumPy functions, and this seems to be the issue here. If I add .values, the result is far more accurate. The problem disappears entirely if I use float64.
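A minimal sketch of the float64 workaround mentioned above (variable names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.ones(60000).astype("float32") * 1e6)

# Upcasting to float64 before reducing sidesteps the precision loss:
# the sum 6e10 is exactly representable in float64, so the mean is exact
mean64 = data.astype("float64").mean().iloc[0]      # 1000000.0
std64 = data.astype("float64").std(ddof=0).iloc[0]  # 0.0
```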

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger (Contributor)

Looks like a bottleneck issue:

In [6]: import bottleneck as bn

In [7]: bn.nanmean
Out[7]: <function bottleneck.reduce.nanmean>

In [8]: bn.nanmean(data.values)
Out[8]: 999645.0

You may want to check with them to see if this is expected.

You can disable bottleneck with pd.options.compute.use_bottleneck.
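For instance (a sketch using the option name above):

```python
import numpy as np
import pandas as pd

# Fall back to NumPy's reductions, which use pairwise summation
# and stay much more accurate for float32 inputs
pd.set_option("compute.use_bottleneck", False)

data = pd.DataFrame(np.ones(60000).astype("float32") * 1e6)
print(data.mean())  # close to 1e6 rather than ~999645
```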

@TomAugspurger TomAugspurger added this to the No action milestone Aug 16, 2018
@TomAugspurger (Contributor)

Perhaps a known issue: pydata/bottleneck#193

That references nansum / nanmean, but not nanstd.

I don't think it's worth silently astyping to float64 when people have float32 data and are using bottleneck, but I would be curious to hear others' thoughts.

@m-rossi (Author)

m-rossi commented Aug 16, 2018

Thanks, I will use your suggestion for now, and we should watch the issue at bottleneck. But as I said, for my real data the values were horribly wrong, especially the std.

@TomAugspurger (Contributor)

Indeed. My usual inclination is to fix this upstream and not add temporary workarounds, but given the incorrectness of the result, workarounds may be worth considering.

@agoodm

agoodm commented Aug 17, 2018

I am the one who opened the associated issue in the bottleneck repo, because it led to significant errors in a workflow I was doing with xarray; see pydata/xarray#2370.

Yes it also affects nanstd, and yes it can lead to far more significant errors than what OP is showing under the right conditions (a large enough sample with a relatively narrow distribution). The xarray example I posted showed the standard deviation being nearly two orders of magnitude higher than its actual value.
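The effect is easy to reproduce directly in NumPy. A naive single-pass float32 accumulation (only an illustration of the failure mode, not bottleneck's actual code) drifts far from the true sum, while NumPy's pairwise summation stays close:

```python
import numpy as np

x = np.ones(60000, dtype=np.float32) * np.float32(1e6)

# Naive left-to-right accumulation: each add rounds to float32,
# and once the running sum reaches ~6e10 the rounding error compounds
naive = np.float32(0.0)
for v in x:
    naive = naive + v

# np.sum uses pairwise summation, which keeps partial sums comparable
# in magnitude, so the float32 result stays near the true 6e10
pairwise = np.sum(x)

print(float(naive), float(pairwise))  # naive is off by tens of millions
```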

@TomAugspurger (Contributor)

Thanks for the context. Let's coordinate with xarray here (cc @shoyer), though it still isn't clear to me what the right thing is to do here.

@TomAugspurger TomAugspurger reopened this Aug 17, 2018
@TomAugspurger TomAugspurger added the Numeric Operations and Dependencies labels Aug 17, 2018
@gfyoung gfyoung removed this from the No action milestone Aug 19, 2018
@jbrockmendel (Member)

Any updates here?

@TomAugspurger (Contributor)

I'm not sure what's best here. Earlier I suggested upcasting float32 -> float64 so that we could keep using bottleneck. I don't think that's smart. Rather, we should keep float32 and use NumPy.

@agoodm did Xarray make a decision on what to do for float32?

@agoodm

agoodm commented Jun 13, 2019

I'd ping them in the associated issue thread I linked earlier; as far as I know, nothing has been changed on their end to address this, since the issue is still open. I think their sentiment was similar to yours regarding automatically upcasting float32 -> float64.

@mroeschke mroeschke added the Bug label Apr 5, 2020
@jbrockmendel jbrockmendel added the Reduction Operations label Sep 21, 2020
@mroeschke mroeschke removed the Numeric Operations label Jun 22, 2021