[BUG] Overflow ints with nulls is being inferred as float column in CSV reader #7088
Comments
I don't think we want this behavior. Pandas infers as ...
Not actually, pandas appears to be storing them as strings:

    >>> import pandas as pd
    >>> df = pd.read_csv('a.csv')
    >>> df['3']
    0                       NaN
    1      17758512297797920768
    2                       NaN
    3       3374168267804635136
    4                       NaN
                   ...
    365                     NaN
    366                     NaN
    367                     NaN
    368    14301866441110444032
    369                     NaN
    Name: 3, Length: 370, dtype: object
    >>> df['3'][1]
    '17758512297797920768'
    >>> type(df['3'][1])
    <class 'str'>
TIL... Either way, I would defer to what makes sense from a general C++ perspective instead of forcing Pandas semantics.
IMO this is a good reason to use strings here.
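For context on why a float64 fallback is lossy here: float64 has a 53-bit mantissa, so not every integer above max(int64) survives a round trip through it, whereas a string (or uint64, where the value fits) round-trips exactly. A small plain-Python illustration, using max(uint64) rather than a value from the attachment:

```python
value = 18446744073709551615      # max(uint64), larger than max(int64)

via_float = int(float(value))     # what survives a float64 round trip
print(via_float)                  # 18446744073709551616 -- rounded to the nearest float64
print(via_float == value)         # False: the float64 path loses data

print(int(str(value)) == value)   # True: the string path round-trips exactly
```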
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

This issue has been labeled ...
Describe the bug
A CSV file contains values greater than max(int64) along with some null values. In this case, the cudf CSV reader infers the column type as float, which leads to partial loss of data.

Steps/Code to reproduce bug
a.csv: a.csv.zip (rename this to a.csv). A minimal synthetic reproduction is sketched below.
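The attached file is not reproduced here, so the following is only a minimal sketch of the same situation, with made-up values (not taken from a.csv) and a hypothetical column name, assuming a cudf build affected by this issue:

```python
import cudf

# Hypothetical stand-in for the attached a.csv: a column named "3" that mixes
# values above max(int64) (9223372036854775807) with empty fields (nulls).
csv_text = (
    "id,3\n"
    "0,\n"
    "1,9223372036854775809\n"
    "2,\n"
    "3,18446744073709551615\n"
)
with open("a.csv", "w") as f:
    f.write(csv_text)

df = cudf.read_csv("a.csv")
print(df["3"].dtype)  # reported in this issue to come back as float64
print(df["3"])        # neither large value above is exactly representable in float64
```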
Expected behavior
If the values overflow int but fit into uint, they should be inferred as a uint column.
If the values overflow both the int and uint types, they should be inferred as a string column.
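Until the inference changes, one possible workaround is to bypass inference for the affected column. This is only a sketch, assuming the column name "3" from the example above and relying on the dtype argument of cudf.read_csv:

```python
import cudf

# Force a lossless representation instead of relying on type inference;
# "3" is the (assumed) name of the affected column.
df = cudf.read_csv("a.csv", dtype={"3": "str"})
print(df["3"])  # values kept verbatim as strings, nulls preserved
```

If every non-null value in the column fits in uint64, passing "uint64" for that column instead of "str" may also avoid the loss, which would match the first behavior requested above.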
Environment overview (please complete the following information)

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details.
Additional context
Surfaced in fuzz testing: #6001