[BUG] unsigned type not being inferred correctly from csv column leading to incorrect data #6274
Comments
I've done a little digging into the reader's type inference functions, and it seems int64 is the only supported integer type at the moment:
cudf/cpp/src/io/csv/csv_gpu.cu Line 283 in 67c2034
cudf/cpp/src/io/csv/reader_impl.cu Line 590 in 67c2034
cudf/cpp/src/io/csv/reader_impl.cu Line 591 in 67c2034
cc: @jrhemstad @harrism
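To make the consequence concrete, here is a minimal sketch of what can happen when a value above the int64 maximum is forced into a 64-bit signed integer. The thread does not state the exact failure mode, so two's-complement wraparound is an assumption here, and `wrap_to_int64` is a hypothetical helper written for illustration, not a cudf function.

```python
# Assumption: out-of-range values wrap around as in two's-complement
# arithmetic. This only illustrates the kind of corruption that can occur;
# it is not taken from the cudf reader.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def wrap_to_int64(value: int) -> int:
    """Reinterpret an integer as a two's-complement 64-bit signed value."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


for text in ["5", str(INT64_MAX), str(INT64_MAX + 1), str(UINT64_MAX)]:
    # Values up to INT64_MAX survive; anything larger changes sign.
    print(text, "->", wrap_to_int64(int(text)))
```

Under that assumption, values up to int64 max are preserved, while int64 max + 1 and uint64 max come back as negative numbers.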
…t64 range (#6446)
Fix #6314, #6449 and #6274.
An integer encountered in a column belongs to one of the following categories:
Negative small integer - value ranges from -1 to int64_min
Positive small integer - value ranges from 0 to int64_max
Big integer - value ranges from int64_max+1 to uint64_max
Out of range - values that are too big for uint64 or too small for int64
If a column contains a mix of negative and positive small integers, the column is assigned dtype int64.
If a column contains a mix of positive small integers and big integers, the column is assigned dtype uint64.
If a column contains a mix of negative small integers and big integers, the column is assigned string type.
If a column has an integer that cannot be expressed as either uint64 or int64, the column is assigned string type.
Some checks are also added to correctly handle the sign and leading zeros. For example, if an integer is expressed as -0000, it is still counted as a positive small integer. After leading zeros are removed, if the number of digits is the same as that of int64_min, int64_max or uint64_max, the digits are compared lexicographically against the appropriate edge value.
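The rules above can be sketched in plain Python. This is only an illustration of the described categories and column-level decisions, not the code from the fix: the names `Kind`, `classify` and `infer_dtype` are hypothetical, the input is assumed to be an optional sign followed by ASCII digits, and the result for an empty column is an arbitrary choice.

```python
from enum import Enum

# Edge values as digit strings (magnitudes only, sign handled separately).
INT64_MIN_DIGITS = "9223372036854775808"
INT64_MAX_DIGITS = "9223372036854775807"
UINT64_MAX_DIGITS = "18446744073709551615"


class Kind(Enum):
    NEGATIVE_SMALL = "negative small"
    POSITIVE_SMALL = "positive small"
    BIG = "big"
    OUT_OF_RANGE = "out of range"


def fits(digits: str, limit: str) -> bool:
    """True if digits (no leading zeros) is numerically <= limit."""
    if len(digits) != len(limit):
        return len(digits) < len(limit)
    # Equal length: lexicographic order matches numeric order.
    return digits <= limit


def classify(field: str) -> Kind:
    """Classify one integer field given as optional sign plus digits."""
    negative = field.startswith("-")
    digits = field[1:] if field[:1] in "+-" else field
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not an integer field: {field!r}")
    digits = digits.lstrip("0") or "0"
    if digits == "0":
        return Kind.POSITIVE_SMALL  # covers "-0000" as well
    if negative:
        ok = fits(digits, INT64_MIN_DIGITS)
        return Kind.NEGATIVE_SMALL if ok else Kind.OUT_OF_RANGE
    if fits(digits, INT64_MAX_DIGITS):
        return Kind.POSITIVE_SMALL
    if fits(digits, UINT64_MAX_DIGITS):
        return Kind.BIG
    return Kind.OUT_OF_RANGE


def infer_dtype(fields: list[str]) -> str:
    """Pick a column dtype from the categories seen in the column."""
    kinds = {classify(f) for f in fields}
    if Kind.OUT_OF_RANGE in kinds:
        return "str"
    if Kind.BIG in kinds:
        return "str" if Kind.NEGATIVE_SMALL in kinds else "uint64"
    return "int64"  # assumption: also used for an empty column


print(infer_dtype(["1", "-2"]))
print(infer_dtype(["1", "18446744073709551615"]))
print(infer_dtype(["-1", "18446744073709551615"]))
```

Comparing digit strings instead of parsed numbers means the check itself never needs a type wider than 64 bits, which is presumably why the description compares against the edge values lexicographically.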
Describe the bug
When a CSV contains unsigned types (say uint64), pandas reads the series correctly, but cudf infers the dtype of the series as int64, which corrupts the data after the CSV file is loaded. A smaller version of the CSV file, generated in a fuzz test, is attached here: short-data.csv.zip
Steps/Code to reproduce bug
Expected behavior
cudf.read_csv should be able to read the data correctly and infer unsigned types as well.
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details.
Additional context
Surfaced while running fuzz tests #6001