
Mean and std for float32 dataframe #22385

Open
m-rossi opened this issue Aug 16, 2018 · 9 comments
Labels
Bug, Dependencies (required and optional dependencies), Reduction Operations (sum, mean, min, max, etc.)

Comments

@m-rossi

m-rossi commented Aug 16, 2018

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame(np.ones(60000).astype('float32') * 1e6)

print(np.mean(data))
print(np.mean(data.values))

print(np.std(data))
print(np.std(data.values))

which results in the following output:

0    999645.0
dtype: float32
1000000.06
0    354.936188
dtype: float32
0.0625

Problem description

I have a pandas DataFrame containing 32-bit floating-point numbers and try to calculate its mean and std, and these calculations can go horribly wrong. I am able to reproduce the issue with the example above.

I know pandas uses internal functions for mean and std instead of the NumPy functions, and this seems to be the issue here. If I add .values, the result is far more accurate. The problem disappears entirely if I use float64.
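A minimal sketch of the float64 workaround mentioned above (variable names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.ones(60000).astype("float32") * 1e6)

# Upcasting to float64 before reducing sidesteps the precision loss:
# the sum 6e10 is exactly representable in float64, so the mean is exact
mean64 = data.astype("float64").mean().iloc[0]      # 1000000.0
std64 = data.astype("float64").std(ddof=0).iloc[0]  # 0.0
```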

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger (Contributor)

Looks like a bottleneck issue:

In [6]: import bottleneck as bn

In [7]: bn.nanmean
Out[7]: <function bottleneck.reduce.nanmean>

In [8]: bn.nanmean(data.values)
Out[8]: 999645.0

You may want to check with them to see if this is expected.

You can disable bottleneck with pd.options.compute.use_bottleneck.
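For instance (a sketch using the option name above):

```python
import numpy as np
import pandas as pd

# Fall back to NumPy's reductions, which use pairwise summation
# and stay much more accurate for float32 inputs
pd.set_option("compute.use_bottleneck", False)

data = pd.DataFrame(np.ones(60000).astype("float32") * 1e6)
print(data.mean())  # close to 1e6 rather than ~999645
```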

@TomAugspurger TomAugspurger added this to the No action milestone Aug 16, 2018
@TomAugspurger (Contributor)

Perhaps a known issue: pydata/bottleneck#193

That references nansum / nanmean, but not nanstd.

I don't think it's worth silently astyping to float64 when people have float32 data and are using bottleneck, but I would be curious to hear others' thoughts.

@m-rossi (Author)

m-rossi commented Aug 16, 2018

Thanks, I will use your suggestion for now, and we should watch the issue at bottleneck. But as I said, for my real data the values were horribly wrong, especially the std.

@TomAugspurger (Contributor)

Indeed. My usual inclination is to fix this upstream and not add temporary workarounds, but given the incorrectness of the result, workarounds may be worth considering.

@agoodm

agoodm commented Aug 17, 2018

I am the one who opened the associated issue in the bottleneck repo, because it led to significant errors in a workflow I was doing with xarray; see pydata/xarray#2370.

Yes it also affects nanstd, and yes it can lead to far more significant errors than what OP is showing under the right conditions (a large enough sample with a relatively narrow distribution). The xarray example I posted showed the standard deviation being nearly two orders of magnitude higher than its actual value.
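The effect is easy to reproduce directly in NumPy. A naive single-pass float32 accumulation (only an illustration of the failure mode, not bottleneck's actual code) drifts far from the true sum, while NumPy's pairwise summation stays close:

```python
import numpy as np

x = np.ones(60000, dtype=np.float32) * np.float32(1e6)

# Naive left-to-right accumulation: each add rounds to float32,
# and once the running sum reaches ~6e10 the rounding error compounds
naive = np.float32(0.0)
for v in x:
    naive = naive + v

# np.sum uses pairwise summation, which keeps partial sums comparable
# in magnitude, so the float32 result stays near the true 6e10
pairwise = np.sum(x)

print(float(naive), float(pairwise))  # naive is off by tens of millions
```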

@TomAugspurger (Contributor)

Thanks for the context. Let's coordinate with xarray here (cc @shoyer), though it still isn't clear to me what the right thing is to do here.

@TomAugspurger TomAugspurger reopened this Aug 17, 2018
@TomAugspurger TomAugspurger added the Numeric Operations and Dependencies labels Aug 17, 2018
@gfyoung gfyoung removed this from the No action milestone Aug 19, 2018
@jbrockmendel (Member)

Any updates here?

@TomAugspurger (Contributor)

I'm not sure what's best here. Earlier I suggested upcasting float32 -> float64 so that we could keep using bottleneck. I don't think that's smart. Rather, we should keep float32 and use NumPy.

@agoodm did Xarray make a decision on what to do for float32?

@agoodm

agoodm commented Jun 13, 2019

I'd ping them in the associated issue thread I linked earlier; as far as I know, nothing has been changed on their end to address this, since the issue is still open. I think their sentiment was similar to yours regarding automatically upcasting float32 -> float64.

@mroeschke mroeschke added the Bug label Apr 5, 2020
@jbrockmendel jbrockmendel added the Reduction Operations label Sep 21, 2020
@mroeschke mroeschke removed the Numeric Operations label Jun 22, 2021