Mean and std for float32 dataframe #22385
Comments
Looks like a bottleneck issue:

```python
In [6]: import bottleneck as bn

In [7]: bn.nanmean
Out[7]: <function bottleneck.reduce.nanmean>

In [8]: bn.nanmean(data.values)
Out[8]: 999645.0
```

You may want to check with them to see if this is expected. You can disable bottleneck with
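The option named in that comment is cut off in this capture; pandas' `compute.use_bottleneck` setting is presumably what was meant. A minimal sketch, assuming that option:

```python
import pandas as pd

# Presumed workaround: turn off the bottleneck-accelerated reductions so that
# pandas falls back to its own numpy-based nanmean/nanstd implementations.
pd.set_option('compute.use_bottleneck', False)
```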
Perhaps a known issue: pydata/bottleneck#193. That references nansum / nanmean, but not nanstd. I don't think it's worth silently astyping to float64 when people have float32 data and are using bottleneck, but I would be curious to hear others' thoughts.
Thanks, I will use your suggestion for now, and we should watch the bottleneck issue. But as I said, for my real data the values were horribly wrong, especially for the std.
Indeed. My usual inclination is to fix this upstream and not add temporary workarounds, but given the incorrectness of the result, workarounds may be worth considering.
I am the one who opened the associated issue in the bottleneck repo because it led to significant errors for a workflow I was doing with xarray; see pydata/xarray#2370. Yes, it also affects nanstd.
Thanks for the context. Let's coordinate with xarray here (cc @shoyer), though it still isn't clear to me what the right thing to do is.
Any updates here? |
I'd ping them in the associated issue thread I linked earlier; as far as I know, nothing has changed on their end to address this, since the issue is still open. I think their sentiment was similar to yours with regard to automatically upcasting float32 -> float64.
Code Sample
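The original snippet and its output were not preserved in this capture. A hypothetical minimal reproduction consistent with the rest of the thread (a float32 DataFrame of values near 10^6; the column name, value range, and array size are all assumptions):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Assumed data: a large float32 column of values clustered around 1e6,
# the regime in which single-precision accumulation loses digits.
data = pd.DataFrame(
    {'x': np.random.uniform(999_990, 1_000_010, size=1_000_000).astype(np.float32)}
)

print(data.mean(), data.std())                # pandas reductions (bottleneck, float32)
print(data.values.mean(), data.values.std())  # numpy reductions for comparison
```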
Problem description
I have a pandas DataFrame with 32-bit floating-point numbers inside and try to calculate the mean and std values, where these calculations can go horribly wrong. I am able to reproduce the issue with the example above. I know pandas uses internal functions for mean and std instead of the numpy functions; this seems to be the issue here. If I add .values the result is a lot more accurate. The problem disappears if I use float64.
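For completeness, the two user-level workarounds described here, sketched against the hypothetical `data` frame above (the names are illustrative, not from the original report):

```python
# 1) Reduce over the raw numpy array; numpy's pairwise summation keeps the
#    float32 result far more accurate than a naive running sum.
mean_np = data.values.mean()
std_np = data.values.std()

# 2) Upcast to float64 before reducing, which avoids the float32
#    accumulation problem altogether.
mean_64 = data.astype('float64').mean()
std_64 = data.astype('float64').std()
```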
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252
pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None