[FEA] Int64Index for header=None in CSV reader #12582

Open
karthikeyann opened this issue Jan 19, 2023 · 3 comments
Labels
0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@karthikeyann
Contributor

Is your feature request related to a problem? Please describe.
With header=None, pandas read_csv returns integers as column names, whereas cudf read_csv returns the same integers as strings.

import cudf
from io import StringIO
import pandas as pd
csv="1, 2, 3\n4, 5, 6\n7, 8, 9\n"

In [11]: cudf.read_csv(StringIO(csv), header=None).columns
Out[11]: Index(['0', '1', '2'], dtype='object')

In [12]: pd.read_csv(StringIO(csv), header=None).columns
Out[12]: Int64Index([0, 1, 2], dtype='int64')

A similar issue will also occur in the JSON reader once "values" support lands in the JSON reader: PR #12498 (comment)

Describe the solution you'd like
The libcudf CSV reader has no way to return column-name metadata as non-string types. Provide a way for the metadata to indicate that no header information is present. Alternatively, it is possible to fix this issue at the Cython/Python layer.
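A minimal sketch of what a Python-layer fix could look like, using plain Python so it runs without a GPU. The helper name `normalize_column_names` and its parameters are hypothetical and are not part of cudf. It assumes the Python layer still knows the `header` and `names` arguments the caller passed, so it can tell auto-generated names apart from user-supplied ones.

```python
from typing import Hashable, List, Optional, Sequence


def normalize_column_names(
    returned_names: Sequence[str],
    header: object,
    user_names: Optional[Sequence[Hashable]] = None,
) -> List[Hashable]:
    """Convert auto-generated string names ('0', '1', ...) to integers.

    Hypothetical helper: `returned_names` stands in for the names the
    reader hands back, `header` and `user_names` for the caller's
    read_csv arguments.
    """
    # Names are only auto-generated when there is no header row and the
    # caller did not supply names explicitly.
    if header is None and user_names is None:
        # Extra safety check: only convert if the names are exactly the
        # stringified column positions.
        if all(name == str(i) for i, name in enumerate(returned_names)):
            return list(range(len(returned_names)))
    # Otherwise keep the names untouched.
    return list(returned_names)


print(normalize_column_names(["0", "1", "2"], header=None))
print(normalize_column_names(["0", "1", "2"], header=0))
```

With `header=None` and no explicit names, the stringified positions become integers, matching the pandas behavior shown above; in every other case the names pass through unchanged.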

Additional context
PR Comment #12498 (comment)

@karthikeyann added the "feature request" (New feature or request) and "Needs Triage" (Need team to review and classify) labels on Jan 19, 2023
@vuule
Contributor

vuule commented Jan 19, 2023

@galipremsagar this looks like a Python feature. Is anything needed from libcudf to implement this?

@karthikeyann
Contributor Author

When the header is not present, libcudf returns stringified integers as column names. The Python layer may not know whether the actual names are integer strings or were auto-generated. So libcudf may have to indicate somehow that column names are not present in the metadata. That might be the change required from libcudf (it could simply leave metadata.schema_info empty?).
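As a rough illustration of the "leave metadata.schema_info empty" idea, the sketch below shows how a consumer could treat an empty name list as "no header" and fall back to integer labels. `ReaderMetadata` and `resolve_column_labels` are hypothetical stand-ins written in plain Python; they do not reflect the actual libcudf or cudf types.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ReaderMetadata:
    """Hypothetical stand-in for the metadata a reader returns."""

    # Assumption: an empty list means "the file carried no column names".
    schema_info: List[str] = field(default_factory=list)


def resolve_column_labels(
    metadata: ReaderMetadata, num_columns: int
) -> List[Union[int, str]]:
    """Pick column labels based on whether the reader reported any names."""
    if not metadata.schema_info:
        # No names reported: generate integer labels, as pandas does.
        return list(range(num_columns))
    if len(metadata.schema_info) != num_columns:
        raise ValueError("number of names does not match number of columns")
    # Names were present in the source, so keep them as strings even if
    # they happen to look like integers.
    return list(metadata.schema_info)


print(resolve_column_labels(ReaderMetadata(), 3))
print(resolve_column_labels(ReaderMetadata(["0", "1", "2"]), 3))
```

Under this assumed convention, a header row that literally contains "0", "1", "2" stays as strings, while a file with no header gets integer labels, which removes the ambiguity described above.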

@galipremsagar
Contributor

When the header is not present, libcudf returns stringified integers as column names. The Python layer may not know whether the actual names are integer strings or were auto-generated. So libcudf may have to indicate somehow that column names are not present in the metadata. That might be the change required from libcudf (it could simply leave metadata.schema_info empty?).

+1
