[BUG] Category dtype gives unexpected hashed values for int32 when reading from CSV #3960

Closed
taureandyernv opened this issue Jan 28, 2020 · 8 comments
Labels
cuIO cuIO issue feature request New feature or request

Comments

@taureandyernv
Contributor

taureandyernv commented Jan 28, 2020

Describe the bug
When I read a CSV with cudf.read_csv and set a column's dtype to "category", I unexpectedly get int32 hash values instead of my original values in a categorical column.

Steps/Code to reproduce bug

import cudf
import pandas as pd

fn = 'test.csv'
lines = """id1,id2
1,45
2,3
3, 7
1, 25
"""
with open(fn, 'w') as fp:
    fp.write(lines)
pdf = pd.read_csv(fn, header=0, dtype={"id1":"category", "id2":"int32"})
cdf = cudf.read_csv(fn, header=0, dtype={"id1":"category", "id2":"int32"})

Output of print(cdf):

          id1  id2
0  -677418915   45
1  -197494169    3
2  1957796822    7
3  -677418915   25
cdf['id1']

0    -677418915
1    -197494169
2    1957796822
3    -677418915
Name: id1, dtype: int32

Expected behavior
I expect the output to be that of print(pdf)

  id1  id2
0   1   45
1   2    3
2   3    7
3   1   25
pdf['id1']

0    1
1    2
2    3
3    1
Name: id1, dtype: category
Categories (3, object): [1, 2, 3]

The current workaround is to first read the column as int32, then call astype("category"):

cdf2 = cudf.read_csv(fn, header=0, dtype={"id1":"int32", "id2":"int32"})
cdf2['id1'] = cdf2['id1'].astype("category")
cdf2['id1']

Outputs:

0    1
1    2
2    3
3    1
Name: id1, dtype: category
Categories (3, int64): [1, 2, 3]

Environment overview (please complete the following information)

  • Environment location: [Docker]
  • Method of cuDF install: [Docker]
  • tested to affect cudf 0.11, 0.12
@taureandyernv taureandyernv added Needs Triage Need team to review and classify bug Something isn't working labels Jan 28, 2020
@randerzander
Contributor

randerzander commented Jan 28, 2020

Since libcudf doesn't yet support true dictionary column types, this behavior is an artifact of the way it implemented categories (as integer hashes) early on.

I believe dictionary column types are currently being worked on. @mjsamoht will cuIO's readers support using the category dtype at read time?

@taureandyernv as a workaround, you could read the CSV file without requesting the category dtype and do astype('category') after reading:

cdf = cudf.read_csv(fn, header=0)
cdf.id1.astype('category')
0    1
1    2
2    3
3    1
Name: id1, dtype: category
Categories (3, int64): [1, 2, 3]

@randerzander randerzander added feature request New feature or request cuIO cuIO issue and removed Needs Triage Need team to review and classify bug Something isn't working labels Jan 28, 2020
@mjsamoht
Contributor

Categories were used in cuIO before we introduced string support. Strings were mapped to a 32-bit hash. My guess is that here the first column is interpreted as a string column (because category is specified as the type) and then mapped to a 32-bit hash. So it behaves as currently designed.
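
To illustrate why the values look arbitrary, here is a minimal sketch of mapping text fields to signed 32-bit hashes. It uses `zlib.crc32` purely as a stand-in: the hash function the legacy reader actually used is not stated in this thread, so the numbers will not match the output in the report. `hash32` is a hypothetical helper name.

```python
import struct
import zlib


def hash32(token: str) -> int:
    """Map a string to a signed 32-bit integer.

    CRC32 is only a stand-in here, NOT the hash the legacy reader used.
    """
    unsigned = zlib.crc32(token.encode("utf-8"))
    # Reinterpret the unsigned 32-bit value as a signed int32.
    return struct.unpack("<i", struct.pack("<I", unsigned))[0]


# The id1 column as a reader would see it: raw text fields.
id1_fields = ["1", "2", "3", "1"]
hashed = [hash32(field) for field in id1_fields]
print(hashed)
```

Equal fields hash to the same integer (rows 0 and 3), which is the same pattern as in the reported output, but the original values cannot be read back from the hashed column.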

I'm not aware of any active effort to add category support in cuIO. @vuule @OlivierNV should comment.

@OlivierNV
Contributor

Afaik this is just a string or integer column. There is no native csv support for dictionary, so the conversion from string to dictionary would have to be done as a post-processing encoding step after read_csv() (presumably using existing dictionary column support)
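
As a rough picture of what such a post-processing encoding step does, the sketch below dictionary-encodes a list of string fields in plain Python. It is not libcudf's implementation; `dictionary_encode` is a hypothetical name, and sorting the keys is an assumption made so the result is deterministic.

```python
from typing import List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return (sorted unique keys, per-row integer codes into those keys)."""
    keys = sorted(set(values))
    position = {key: i for i, key in enumerate(keys)}
    codes = [position[value] for value in values]
    return keys, codes


# The id1 column from the report, as text fields.
id1_fields = ["1", "2", "3", "1"]
keys, codes = dictionary_encode(id1_fields)
print(keys)
print(codes)

# Unlike a hash, the encoding is reversible: keys[code] recovers each value.
decoded = [keys[code] for code in codes]
```

The keys play the role of the categories and the codes are what each row stores, so the original values stay recoverable.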

@vuule
Contributor

vuule commented Jan 28, 2020

Please note that this is the behavior of the legacy implementation. The new implementation does not support categories yet, but that work is under way: #3577.
Once this is in, and the Python API is set up to use libcudf++, the behavior should be as expected.

@taureandyernv
Contributor Author

@vuule, we have an open issue, #3962, to create docs for the categorical accessor. If you are saying that this implementation is legacy, what is the ETA of the new version, and should we still pursue writing the docs? @randerzander @davidwendt @jrhemstad, we need to know whether to pause.

@harrism
Member

harrism commented Feb 10, 2020

@taureandyernv pause. Suspect this will be a temporary feature regression.

@GregoryKimball
Contributor

GregoryKimball commented Jul 2, 2022

"Category" type reading support is no longer available in read_csv:

import cudf
import pandas as pd

fn = 'test.csv'
lines = "id1,id2\n1,45\n2,3\n3, 7\n1, 25"

with open(fn, 'w') as fp:
    fp.write(lines)
pdf = pd.read_csv(fn, header=0, dtype={"id1": "category", "id2": "int32"})
cdf = cudf.read_csv(fn, header=0, dtype={"id1": "category", "id2": "int32"})
Traceback (most recent call last):
  File "cudf/_lib/csv.pyx", line 529, in cudf._lib.csv._get_cudf_data_type_from_dtype
NotImplementedError: CategoricalDtype as dtype is not yet supported in CSV reader

The workaround of reading the column as int32 and then calling astype("category") still works.

cdf2 = cudf.read_csv(fn, header=0, dtype={"id1":"int32", "id2":"int32"})
cdf2['id1'] = cdf2['id1'].astype("category")
cdf2['id1']
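
The same workaround can be wrapped in a small helper so callers keep passing a pandas-style `dtype` mapping. This is only a sketch under stated assumptions: `split_category_dtypes` and `read_csv_with_categories` are hypothetical names, only the string spelling "category" is recognized (not `CategoricalDtype` objects), and the category columns are left for the reader to infer rather than forced to int32.

```python
from typing import Any, Callable, Dict, List, Tuple


def split_category_dtypes(dtype: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Separate "category" entries from the dtypes passed to the reader."""
    def is_category(dt: Any) -> bool:
        return isinstance(dt, str) and dt == "category"

    read_dtype = {col: dt for col, dt in dtype.items() if not is_category(dt)}
    category_cols = [col for col, dt in dtype.items() if is_category(dt)]
    return read_dtype, category_cols


def read_csv_with_categories(read_csv: Callable[..., Any], path: str,
                             dtype: Dict[str, Any], **kwargs: Any) -> Any:
    """Read without the category dtypes, then cast those columns afterwards."""
    read_dtype, category_cols = split_category_dtypes(dtype)
    # Category columns are omitted from dtype, so the reader infers their type.
    df = read_csv(path, dtype=read_dtype or None, **kwargs)
    for col in category_cols:
        df[col] = df[col].astype("category")
    return df


read_dtype, category_cols = split_category_dtypes({"id1": "category", "id2": "int32"})
print(read_dtype, category_cols)
```

With cuDF installed, the intended call would be `read_csv_with_categories(cudf.read_csv, fn, {"id1": "category", "id2": "int32"}, header=0)`; that call is not exercised here.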

rapids-bot bot pushed a commit that referenced this issue Jan 21, 2023
Fixes: #11977, #3960

This PR enables support for `category` dtypes in the `dtype` parameter. It contains a workaround that enables reading columns as categorical dtypes; we can remove this workaround once `libcudf` has native support for a dictionary type that maps to categorical columns.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #12571
@galipremsagar
Contributor

Resolved by #12571
