[BUG] cudf::io::read_json does not initialize column_names field in 22.10 #12008

Closed
PeterGottesman opened this issue Oct 26, 2022 · 3 comments
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

PeterGottesman commented Oct 26, 2022

Describe the bug
With the release of libcudf 22.10.00, read_json no longer sets the column_names field of its return value. Instead, it sets the names in the schema_info field. This seems to have been introduced in #11364: https://github.com/rapidsai/cudf/pull/11364/files?diff=unified&w=0#diff-6e1002a16556df11f73b654cffbe18812bed655f377f52a6888c53ba6e1f04a6L566-R566.

Tests were updated to reflect the change in that PR, but I am unsure whether this is intended behavior, as no other file read functions (e.g. read_csv) were updated.

Steps/Code to reproduce bug
The code below shows how the behavior of read_json differs from that of read_csv.

#include <cudf/io/csv.hpp>
#include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>

#include <iostream>
#include <string>

int main() {
  std::string json_data = "{\"0\":12,\"1\":0.0,\"2\":abc}\n{\"0\":5,\"1\":0.1,\"2\":abb}";
  std::string csv_data = "0, 1, 2\n12,0.0,abc\n5,0.1,abb";

  cudf::io::json_reader_options json_in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{json_data.data(), json_data.size()})
      .lines(true);

  cudf::io::table_with_metadata json_result = cudf::io::read_json(json_in_options);

  cudf::io::csv_reader_options csv_in_options =
      cudf::io::csv_reader_options::builder(cudf::io::source_info{csv_data.data(), csv_data.size()});

  cudf::io::table_with_metadata csv_result = cudf::io::read_csv(csv_in_options);

  std::cout << "schema_info csv:" << csv_result.metadata.schema_info.size() << "\n";
  std::cout << "schema_info json:" << json_result.metadata.schema_info.size() << "\n\n";
  std::cout << "column_names csv:" << csv_result.metadata.column_names.size() << "\n";
  std::cout << "column_names json:" << json_result.metadata.column_names.size() << "\n";
}

Output:

# With libcudf @ tag v22.10.00:

schema_info csv:0
schema_info json:3

column_names csv:3
column_names json:0

# With libcudf @ tag v22.08.01:

schema_info csv:0
schema_info json:0

column_names csv:3
column_names json:3

Environment overview (please complete the following information)

  • Environment location: bare metal
  • Method of cuDF install: from source, tags specified above
@PeterGottesman PeterGottesman added Needs Triage Need team to review and classify bug Something isn't working labels Oct 26, 2022
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Oct 27, 2022
@kkraus14
Collaborator

cc @vuule

Contributor

vuule commented Oct 28, 2022

Thank you for opening the issue.
This is intended behavior, see #6411.

The plan is to stop using column_names in tests and Cython bindings, and to remove column_names from table_metadata once this is complete.

You can use the schema_info member to get column names when reading JSON files. It replaces column_names because it supports nested columns (like the experimental JSON parser does).
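As a rough illustration of reading names out of schema_info, here is a minimal sketch. It does not link against libcudf: the struct below is a hypothetical stand-in that mirrors the shape of cudf::io::column_name_info (a name plus a vector of children), and the helper names top_level_names and flatten_names are made up for this example. When building against libcudf, the same loops should apply to result.metadata.schema_info, assuming the field layout matches.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the shape of cudf::io::column_name_info
// (assumption: a column name plus child entries for nested columns).
struct column_name_info {
  std::string name;
  std::vector<column_name_info> children;
};

// Top-level names only: the closest equivalent of the old column_names vector.
std::vector<std::string> top_level_names(std::vector<column_name_info> const& schema)
{
  std::vector<std::string> names;
  names.reserve(schema.size());
  for (auto const& col : schema) { names.push_back(col.name); }
  return names;
}

// Depth-first walk that also reports nested columns as dotted paths, e.g. "s.x".
// The dotted-path format is a choice made for this sketch, not a libcudf convention.
void flatten_names(std::vector<column_name_info> const& schema,
                   std::string const& prefix,
                   std::vector<std::string>& out)
{
  for (auto const& col : schema) {
    std::string const path = prefix.empty() ? col.name : prefix + "." + col.name;
    out.push_back(path);
    flatten_names(col.children, path, out);
  }
}
```

For flat JSON such as the reproducer in this issue, top_level_names yields the same list that column_names used to hold; flatten_names only matters once nested columns are involved.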

Keeping the issue open in case there are follow-up questions we can help with :)

@vuule vuule self-assigned this Oct 28, 2022
@GregoryKimball
Contributor

@PeterGottesman Please feel free to re-open the issue if you have trouble with schema_info
