[BUG] `cudf::io::read_json` does not initialize `column_names` field in 22.10 #12008

PeterGottesman · 2022-10-26T23:18:52Z

Describe the bug
With the release of libcudf 22.10.00, read_json no longer sets the column_names field of its return value. Instead, it sets the names in the schema_info field. This seems to have been introduced in #11364: https://github.com/rapidsai/cudf/pull/11364/files?diff=unified&w=0#diff-6e1002a16556df11f73b654cffbe18812bed655f377f52a6888c53ba6e1f04a6L566-R566.

Tests were updated to reflect the change in that PR, but I am unsure that this is intended behavior as no other file read functions (e.g. read_csv) were updated.

Steps/Code to reproduce bug
The code below shows how the behavior of read_json differs from that of read_csv.

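#include <cudf/io/csv.hpp>
#include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>  // table_with_metadata, source_info

#include <iostream>  // std::cout
#include <string>    // std::string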

int main() {
  std::string json_data = "{\"0\":12,\"1\":0.0,\"2\":abc}\n{\"0\":5,\"1\":0.1,\"2\":abb}";
  std::string csv_data = "0, 1, 2\n12,0.0,abc\n5,0.1,abb";

  cudf::io::json_reader_options json_in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{json_data.data(), json_data.size()})
      .lines(true);

  cudf::io::table_with_metadata json_result = cudf::io::read_json(json_in_options);

  cudf::io::csv_reader_options csv_in_options =
      cudf::io::csv_reader_options::builder(cudf::io::source_info{csv_data.data(), csv_data.size()});

  cudf::io::table_with_metadata csv_result = cudf::io::read_csv(csv_in_options);

  std::cout << "schema_info csv:" << csv_result.metadata.schema_info.size() << "\n";
  std::cout << "schema_info json:" << json_result.metadata.schema_info.size() << "\n\n";
  std::cout << "column_names csv:" << csv_result.metadata.column_names.size() << "\n";
  std::cout << "column_names json:" << json_result.metadata.column_names.size() << "\n";
}

Output:

# With libcudf @ tag v22.10.00:

schema_info csv:0
schema_info json:3

column_names csv:3
column_names json:0

# With libcudf @ tag v22.08.01:

schema_info csv:0
schema_info json:0

column_names csv:3
column_names json:3

Environment overview (please complete the following information)

Environment location: bare metal
Method of cuDF install: from source, tags specified above

kkraus14 · 2022-10-27T00:46:39Z

cc @vuule

vuule · 2022-10-28T06:06:47Z

Thank you for opening the issue.
This is intended behavior, see #6411.

The plan is to stop using column_names in tests and Cython bindings, and to remove column_names from table_metadata once this is complete.

You can use the schema_info member to get column names when reading JSON files. It quickly replaced column_names because it supports nested columns (like the experimental JSON parser does).
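As a rough illustration of reading names from schema_info, here is a minimal sketch. It does not link against libcudf: `column_name_info` and `table_metadata` below are stand-in structs that only mirror the name/children layout of the real `cudf::io` types, and `top_level_names` and `collect_paths` are hypothetical helpers, not part of the library. The first helper prefers schema_info and falls back to column_names, so it would cover both the 22.08 and 22.10 outputs shown above; the second walks nested entries.

```cpp
#include <string>
#include <vector>

// Stand-in for cudf::io::column_name_info (assumption: a name plus child entries).
struct column_name_info {
  std::string name;
  std::vector<column_name_info> children;
};

// Stand-in holding only the two name-carrying fields of cudf::io::table_metadata.
struct table_metadata {
  std::vector<std::string> column_names;
  std::vector<column_name_info> schema_info;
};

// Hypothetical helper: return top-level column names, preferring schema_info
// and falling back to column_names when schema_info is empty.
std::vector<std::string> top_level_names(table_metadata const& meta)
{
  if (meta.schema_info.empty()) { return meta.column_names; }
  std::vector<std::string> names;
  names.reserve(meta.schema_info.size());
  for (auto const& col : meta.schema_info) { names.push_back(col.name); }
  return names;
}

// Hypothetical helper: append a dotted path for every column, depth first,
// so a child "b" of column "a" is reported as "a.b".
void collect_paths(std::vector<column_name_info> const& cols,
                   std::string const& prefix,
                   std::vector<std::string>& out)
{
  for (auto const& col : cols) {
    std::string const path = prefix.empty() ? col.name : prefix + "." + col.name;
    out.push_back(path);
    collect_paths(col.children, path, out);
  }
}
```

Against the real library, the same loops would presumably be applied to `json_result.metadata.schema_info` from the reproduction above; the flat column_names vector has no way to represent the child entries that `collect_paths` visits.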

Keeping the issue open in case there are follow-up questions we can help with :)

GregoryKimball · 2022-11-19T21:45:56Z

@PeterGottesman Please feel free to re-open the issue if you have trouble with schema_info

PeterGottesman added the Needs Triage and bug labels Oct 26, 2022

kkraus14 added the libcudf and cuIO labels and removed the Needs Triage label Oct 27, 2022

vuule self-assigned this Oct 28, 2022

GregoryKimball closed this as completed Nov 19, 2022
