[BUG] csv reader parameters keep_default_na and na_values not having the correct behaviour when used in combination #6680
Comments
@galipremsagar This should be the same as the other issue - have to use a list for now.
Oh, I see now. This is very similar to #6682 - the empty fields are treated as null regardless of the parameters. I expect that these two issues can be addressed in one PR.
Now I see that the empty string is actually one of the default NA values in Pandas. We don't have it in the default
Fixes #6682, #6680

Currently, empty fields are treated as N/A regardless of parsing options. However, the desired behavior is to handle empty fields the same way as fields with special values (apply default_na_values, na_filter logic). This PR irons out the behavior so it matches Pandas in this regard.

- Tries now support matching empty strings.
- The list of special NA values is now generated more robustly, so it has correct elements in any parameter combination.
- Empty string is added to the list of special NA values.
- Quoted empty string ("\"\"") is added to the NA value list if empty string ("") is included (mirrors Pandas behavior).
- Added tests for previously failing parameter combinations.
- Reworked some of the tests to check against Pandas results instead of assumed desired behavior.

Authors:
- vuule <[email protected]>
- Vukasin Milovanovic <[email protected]>

Approvers:
- Ram (Ramakrishna Prabhu)
- Christopher Harris
- Keith Kraus

URL: #6922
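As a rough illustration of the behavior this PR description outlines, the sketch below treats empty fields like any other special value: they are null only if the empty string is in the NA list, and nothing is null when NA filtering is off. The helper names `finalize_na_values` and `mark_nulls` are hypothetical and are not part of cuDF or Pandas; the real reader works on device buffers and tries, not Python sets.

```python
# Hypothetical helpers; names and structure are illustrative, not cuDF internals.

def finalize_na_values(na_values, na_filter=True):
    """Return the set of raw field texts that should be read as null."""
    if not na_filter:
        # With NA filtering disabled, no field is treated as null.
        return set()
    result = set(na_values)
    if "" in result:
        # Per the PR description: a quoted empty field ("") counts as NA
        # whenever the bare empty string does.
        result.add('""')
    return result


def mark_nulls(raw_fields, na_set):
    """Replace raw field texts found in na_set with None."""
    return [None if field in na_set else field for field in raw_fields]


row = ["1", "", '""', "NA"]  # raw, unparsed field texts from one CSV row
print(mark_nulls(row, finalize_na_values({"", "NA"})))
print(mark_nulls(row, finalize_na_values({"NA"})))
print(mark_nulls(row, finalize_na_values({"", "NA"}, na_filter=False)))
```

In this sketch the empty field stays a plain empty string unless "" is explicitly part of the NA list, which is the difference from the behavior reported in the issue.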
This should have been closed via #6922. Closing now.
Describe the bug
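cudf.read_csv supports two parameters, keep_default_na and na_values. When these two parameters are used in combination, they are expected to follow specific rules for determining which values are null. The Pandas documentation explains the behavior in detail: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Here is the relevant part from the doc:
1. If keep_default_na is True and na_values are specified, na_values is appended to the default NaN values used for parsing.
2. If keep_default_na is True and na_values are not specified, only the default NaN values are used for parsing.
3. If keep_default_na is False and na_values are specified, only the NaN values specified in na_values are used for parsing.
4. If keep_default_na is False and na_values are not specified, no strings will be parsed as NaN.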
It looks like we are not actually respecting this set of rules (except cases 1 and 2) in our csv reader. See the code sample below.
Steps/Code to reproduce bug
Expected behavior
We should be following the above set of rules when these two parameters are used in combination and match pandas behavior.
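The expected selection rule can be sketched in plain Python: user-supplied na_values always apply, and the default NA strings are included only when keep_default_na is true. This is a minimal sketch, not the reader's implementation. `effective_na_values` is a hypothetical helper, and `DEFAULT_NA` is an abbreviated, illustrative subset of the Pandas defaults rather than the full list.

```python
# Illustrative subset of the Pandas default NA strings (assumption: abbreviated).
DEFAULT_NA = frozenset({"", "#N/A", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null"})


def effective_na_values(keep_default_na=True, na_values=None):
    """Return the set of strings parsed as null for one parameter combination."""
    user_values = set(na_values or ())
    if keep_default_na:
        # Defaults plus any user-supplied values.
        return set(DEFAULT_NA) | user_values
    # Only the user-supplied values (possibly none at all).
    return user_values


# Walk through the four combinations of the two parameters.
for keep in (True, False):
    for user in (["missing"], None):
        print(keep, user, sorted(effective_na_values(keep, user)))
```

A comparison test could compute this set for each combination and check that cudf.read_csv and pandas.read_csv mark the same fields as null.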
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details.
Additional context
Surfaced while running fuzz tests: #6001