[BUG] Unable to retrieve nulls in float column when reading a cudf created parquet file #8688
Comments
Here are my observations: When I added the following correction code

    if col_meta["numpy_type"] in ("float64",):
        col_meta["numpy_type"] = "Float64"

it fixed this issue. The culprit is

Another area in cudf that uses this pyarrow API also shows the same behaviour:

    In [37]: gdf.to_arrow().to_pandas()
    Out[37]:
         a
    0  1.0
    1  NaN
    2  NaN
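The correction described above can be sketched in plain Python, without cudf or pyarrow. This is only an illustration: the helper name `mark_nullable_floats`, the `nullable_columns` argument, and the sample metadata dict are assumptions made for this example, not the code that was merged. The dict only mimics the `columns` list of the pandas metadata stored in a parquet file. Restricting the change to columns known to contain nulls is also an assumption; the snippet in the comment rewrites every `float64` entry.

```python
import copy


def mark_nullable_floats(pandas_meta, nullable_columns):
    """Return a copy of pandas_meta in which float64 columns listed in
    nullable_columns are tagged with the nullable "Float64" dtype name.

    Hypothetical helper for illustration; not part of cudf or pyarrow.
    """
    patched = copy.deepcopy(pandas_meta)  # leave the caller's dict untouched
    for col_meta in patched["columns"]:
        is_nullable = col_meta["name"] in nullable_columns
        if is_nullable and col_meta["numpy_type"] == "float64":
            # "Float64" (capital F) is the name of the pandas nullable dtype.
            col_meta["numpy_type"] = "Float64"
    return patched


# Sample metadata shaped like the "columns" section of pandas metadata.
meta = {
    "columns": [
        {"name": "a", "pandas_type": "float64", "numpy_type": "float64"},
        {"name": "b", "pandas_type": "float64", "numpy_type": "float64"},
    ]
}

patched = mark_nullable_floats(meta, nullable_columns={"a"})
print([c["numpy_type"] for c in patched["columns"]])  # ['Float64', 'float64']
```

Only column `a` is retagged here because it is the only one declared nullable; column `b` keeps the plain numpy dtype name.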
The difference is in our dtypes. Pandas uses its own dtype

    In [5]: pdf.a.dtype
    Out[5]: Float64Dtype()
    In [6]: str(pdf.a.dtype)
    Out[6]: 'Float64'
    In [12]: type(pdf.a.dtype)
    Out[12]: pandas.core.arrays.floating.Float64Dtype

that wraps an np dtype

    @register_extension_dtype
    class Float64Dtype(FloatingDtype):
        type = np.float64
        name = "Float64"
        __doc__ = _dtype_docstring.format(dtype="float64")

We directly use the np dtype for our numerical columns

    In [9]: gdf.a.dtype
    Out[9]: dtype('float64')
    In [10]: str(gdf.a.dtype)
    Out[10]: 'float64'
    In [13]: type(gdf.a.dtype)
    Out[13]: numpy.dtype

When generating pandas metadata, pyarrow uses
Pandas also used numpy dtypes for its columns until v0.24, when they added null support. Here are the docs from pandas, which explain that the new type used for nullable columns is an "Extension type". Notably the difference between this and the underlying numpy type:
This looks like a reasonable fix to me, I don't see any downsides to doing this. Are there any that I'm missing?
It fixes the symptom but not the issue. I filed #8707 to explain why we should use a better dtype than np.float64 for a nullable float column.
Prevents nullable columns from being read as float columns with NaNs when reading with pandas. Fixes #8688

Authors:
- Devavret Makkar (https://github.com/devavret)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8749
Describe the bug
This looks like a parquet writer bug. When a float column contains a mix of np.nan & <NA> values and is written to a parquet file, we are able to retrieve it correctly from cudf but not in pandas. But pandas is able to write this column data correctly to a parquet file, and that file can be read correctly from both cudf & pandas.
Steps/Code to reproduce bug
Expected behavior
I'd expect the cudf-written parquet file (i.e., cudf.parquet) to behave similarly to the pandas.parquet file when read by both the cudf & pandas backends.
Environment overview (please complete the following information)