[BUG] Category dtype gives unexpected hashed values for int32 when reading from CSV #3960
Comments
Since libcudf doesn't yet support true dictionary column types, this behavior is an artifact of the way it implemented categories (as integer hashes) early on. I believe dictionary column types are currently being worked on.
@mjsamoht will cuIO's readers support using the
@taureandyernv as a workaround, you could read the CSV file with a string column and do
Categories were used in cuIO before we introduced string support. Strings were mapped to a 32-bit hash. My guess is that here the first column is interpreted as a string (because category is specified as the type) and then mapped to a 32-bit hash. So it behaves as currently designed. I'm not aware of any active effort to add
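To make the mechanism described above concrete, here is a minimal sketch of mapping CSV cells to signed 32-bit hashes. It assumes `zlib.crc32` as a stand-in hash; the thread does not say which hash function the legacy cuIO reader used, and the helper name `hash32` and the sample cells are hypothetical.

```python
import zlib


def hash32(cell: str) -> int:
    """Map a CSV cell (as text) to a signed 32-bit integer hash.

    zlib.crc32 is only a stand-in here; the real legacy reader's hash
    function is not stated in this thread.
    """
    unsigned = zlib.crc32(cell.encode("utf-8"))  # value in 0 .. 2**32 - 1
    # Reinterpret the unsigned value as a signed int32.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned


# Hypothetical integer-looking cells, read as text because the column
# was requested as "category".
cells = ["1", "2", "1", "3"]
hashed = [hash32(c) for c in cells]
```

Equal cells map to equal hashes, so the grouping survives, but the original values (1, 2, 3) are no longer visible in the column, which matches the symptom reported in this issue.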
Afaik this is just a string or integer column. There is no native CSV support for dictionary columns, so the conversion from string to dictionary would have to be done as a post-processing encoding step after read_csv() (presumably using existing dictionary column support).
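The post-processing encoding step mentioned above can be sketched in plain Python. This is an illustration of dictionary encoding in general, not libcudf's implementation; the function name `dictionary_encode` and the sample column are hypothetical.

```python
def dictionary_encode(values):
    """Return (categories, codes) for a list of already-parsed cells.

    categories holds each distinct value once (sorted); codes holds, for
    every row, the position of its value in categories.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    codes = [index[v] for v in values]
    return categories, codes


# Hypothetical column as it might look after read_csv() parsed it as strings.
column = ["b", "a", "b", "c"]
categories, codes = dictionary_encode(column)

# Decoding through the dictionary restores the original values.
decoded = [categories[c] for c in codes]
```

Unlike the hash-based approach, the original values stay recoverable because the dictionary of categories is kept alongside the integer codes.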
Please note that this is the legacy implementation's behavior. The new implementation does not support categories yet, but work is under way: #3577.
@vuule, we have an open issue, #3962, to create docs for the categorical accessor. If you are saying that this implementation is legacy, what is the ETA of the new version, and should we pursue making docs? @randerzander @davidwendt @jrhemstad. We need to know whether to pause.
@taureandyernv pause. Suspect this will be a temporary feature regression.
"Category" type reading support is no longer available in
The workaround to cast it as int32, then astype("category") is still functional.
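A minimal sketch of that two-step workaround follows. It is written against pandas so that it runs without a GPU; the thread reports that the same read-as-int32-then-`astype("category")` sequence works in cuDF. The column name `id` and the inline CSV data are hypothetical.

```python
import io

import pandas as pd  # stand-in for cudf so the sketch runs on any machine

# Hypothetical CSV content with a single integer column.
csv_text = "id\n1\n2\n1\n3\n"

# Step 1: read the column as int32 instead of requesting "category" directly.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int32"})

# Step 2: convert to a categorical column after the read.
df["id"] = df["id"].astype("category")
```

After the second step the column is categorical and its categories are the original integer values rather than hashes.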
Fixes: #11977, #3960
This PR enables support for `category` dtypes in the `dtype` parameter. It contains a workaround that enables reading columns as categorical dtypes; we can remove this workaround once `libcudf` has native support for dictionary type mapping to categorical columns.
Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #12571
Resolved by #12571
Describe the bug
When I read a column from CSV with a categorical dtype, I unexpectedly get int32 hashed values instead of a categorical column containing my original values.
Steps/Code to reproduce bug
Output of print(cdf):
Expected behavior
I expect the output to be that of print(pdf).
The current workaround is to first read the column as int32, then call astype("category").
Outputs:
Environment overview (please complete the following information)