[BUG] JSON mixed_types_as_strings feature incorrectly returns some structs as strings #14864

Closed
andygrove opened this issue Jan 24, 2024 · 6 comments · Fixed by #14939
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug

When reading JSON that contains structs with optional fields while the mixed_types_as_strings feature is enabled, the struct data is returned as a string column instead of a struct column.

Steps/Code to reproduce bug

Input file:

{ "teacher": "Bob" }
{ "student": { "name": "Carol", "age":  21 } }
{ "teacher": "Bob", "student": { "name": "Carol", "age": 21 } }

Java test code:

  Schema schema = Schema.builder()
          .column(DType.STRING, "teacher")
          .column(DType.STRUCT, "student")
          .build();
  JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .withMixedTypesAsStrings(true)
          .build();
  Table table = Table.readJSON(schema, opts, TEST_STRUCTS_JSON);
  System.out.println(table.getColumn(0).getType());

This outputs STRING instead of STRUCT.

Expected behavior
I would not consider this input to be "mixed types". We expect this to be returned as a struct column.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context

@GregoryKimball
Contributor

@karthikeyann Would you please take a look?

@karthikeyann
Contributor

karthikeyann commented Jan 30, 2024

In this example, the "teacher" column appears first, so it becomes the first column, which is a string column. The struct column is at index 1.
System.out.println(table.getColumn(1).getType()); prints STRUCT.
(Tested via a Spark unit test and in cudf-python.)
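To make the indexing point concrete, here is a minimal plain-Java sketch that models the ordering described in the comment above. It assumes columns are ordered by the first appearance of each top-level key across the input lines; the class `ColumnOrderSketch` and its method are hypothetical and are not part of cudf, and the real reader may take the order from the supplied schema instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not cudf code.
final class ColumnOrderSketch {

    // Returns the distinct top-level keys in order of first appearance.
    // Each inner array holds the top-level keys of one JSON line.
    static List<String> firstAppearanceOrder(String[][] rowKeys) {
        Set<String> ordered = new LinkedHashSet<>(); // keeps insertion order
        for (String[] keys : rowKeys) {
            for (String key : keys) {
                ordered.add(key); // re-adding an existing key keeps its position
            }
        }
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        // Top-level keys of the three lines in the original input file.
        String[][] rowKeys = {
            {"teacher"},
            {"student"},
            {"teacher", "student"}
        };
        List<String> columns = firstAppearanceOrder(rowKeys);
        // Resolve the index by name rather than hard-coding 0.
        System.out.println(columns.indexOf("student")); // prints 1
    }
}
```

Under that assumption, `getColumn(0)` in the original repro inspects "teacher", not "student", which is why it reported STRING.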

@andygrove
Contributor Author

Thanks @karthikeyann. I have either misunderstood something or my repro case is not correct. I will take a look.

@andygrove
Contributor Author

I found another issue while working on a better repro.

    Schema schema = Schema.builder()
            .column(DType.STRUCT, "student")
            .build();
    JSONOptions opts = JSONOptions.builder()
            .withLines(true)
            .withMixedTypesAsStrings(true)
            .build();
    Table table = Table.readJSON(schema, opts, TestUtils.getResourceAsFile("structs.json"));

{"teacher": "Ntflgt","student": {"name": "Odhut", "age": 10}}
{"teacher": "Pnjugo","student": {"name": "Xqnqpg", "age": 13}}
{"student":null}

ai.rapids.cudf.CudfException: CUDF failure at: /home/andy/git/nvidia/cudf/cpp/include/cudf/column/column_factories.hpp:342: Invalid, non-fixed-width type.
        at ai.rapids.cudf.Table.readJSON(Native Method)

@andygrove
Contributor Author

andygrove commented Jan 30, 2024

I now have an accurate repro:

@Test
void testMixedTypes() throws IOException {
  MultiBufferDataSource source = sourceFrom(
          TestUtils.getResourceAsFile("structs.json"));
  JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .withMixedTypesAsStrings(true)
          .build();
  TableWithMeta table = Table.readJSON(opts, source.getHostBuffers()[0],
          0, source.size());
  Table t = table.releaseTable();
  for (int i=0; i<t.getNumberOfColumns(); i++) {
    System.out.println("TYPE " + i + ": " + t.getColumn(i).getType());
  }
  TableDebug.builder().build().debug("TABLE", t);
}

Note that this requires adding a method to MultiBufferDataSource:

public HostMemoryBuffer[] getHostBuffers() {
  return hostBuffers;
}

Test data:

{"teacher": "Ntflgt","student": {"name": "Odhut", "age": 10}}
{"teacher": "Pnjugo","student": {"name": "Xqnqpg", "age": 13}}
{"student":null}
{"teacher": "Ntflgt","student": {"name": "Odhut", "age": 10}}
{"teacher": "Pnjugo","student": {"name": "Xqnqpg", "age": 13}}

Output:

TYPE 0: STRING
TYPE 1: STRING
DEBUG TABLE Table{columns=[ColumnVector{rows=5, type=STRING, nullCount=Optional.empty, offHeap=(ID: 40057 7fe71c588380)}, ColumnVector{rows=5, type=STRING, nullCount=Optional.empty, offHeap=(ID: 40058 7fe7275745d0)}], cudfTable=140630589746128, rows=5}
GPU COLUMN 0 - NC: 1 DATA: DeviceMemoryBufferView{address=0x7fe746001600, length=24, id=-1} VAL: DeviceMemoryBufferView{address=0x7fe746001500, length=64, id=-1}
COLUMN 0 - STRING
0 "Ntflgt" 4e74666c6774
1 "Pnjugo" 506e6a75676f
2 NULL
3 "Ntflgt" 4e74666c6774
4 "Pnjugo" 506e6a75676f
GPU COLUMN 1 - NC: 1 DATA: DeviceMemoryBufferView{address=0x7fe746001800, length=114, id=-1} VAL: DeviceMemoryBufferView{address=0x7fe746001f00, length=64, id=-1}
COLUMN 1 - STRING
0 "{"name": "Odhut", "age": 10}" 7b226e616d65223a20224f64687574222c2022616765223a2031307d
1 "{"name": "Xqnqpg", "age": 13}" 7b226e616d65223a202258716e717067222c2022616765223a2031337d
2 NULL
3 "{"name": "Odhut", "age": 10}" 7b226e616d65223a20224f64687574222c2022616765223a2031307d
4 "{"name": "Xqnqpg", "age": 13}" 7b226e616d65223a202258716e717067222c2022616765223a2

The student column here is returned as a string instead of a struct.

@karthikeyann
Contributor

A fix is available in PR #14939.
I tested the repro case in Python:

In [5]: df
Out[5]: 
  teacher                        student
0  Ntflgt   {'name': 'Odhut', 'age': 10}
1  Pnjugo  {'name': 'Xqnqpg', 'age': 13}
2    <NA>                           None
3  Ntflgt   {'name': 'Odhut', 'age': 10}
4  Pnjugo  {'name': 'Xqnqpg', 'age': 13}

In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   teacher  4 non-null      object
 1   student  4 non-null      struct
dtypes: object(1), struct(1)
memory usage: 390.0+ bytes

@bdice removed the "Needs Triage" (Need team to review and classify) label on Mar 4, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 7, 2024
…ng is enabled in JSON reader (#14939)

Fixes #14864

The `null` literal should be ignored (treated as null) during parsing when handling mixed types.
Unit tests for complex scenarios are added to cover this as well.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Andy Grove (https://github.com/andygrove)
  - Shruti Shivakumar (https://github.com/shrshi)
  - https://github.com/nvdbaranec

URL: #14939
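
The rule stated in the commit message above ("`null` literal should be ignored ... while handling mixed types") can be sketched in plain Java. This is only an illustration of the idea, not the libcudf implementation: the `Kind` enum, the `resolve` method, and the choice to treat any two distinct non-null kinds as "mixed" are all assumptions made for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; this is not libcudf or cudf Java code.
final class MixedTypeSketch {

    // Coarse JSON value categories, invented for this sketch.
    enum Kind { NULL, STRING, NUMBER, STRUCT, LIST }

    // Picks a column type from the value kinds seen in one column.
    // NULL values are skipped first, so a struct column with a null
    // row is not treated as mixed.
    static Kind resolve(boolean mixedTypesAsStrings, Kind... seen) {
        Set<Kind> kinds = EnumSet.noneOf(Kind.class);
        for (Kind k : seen) {
            if (k != Kind.NULL) {
                kinds.add(k);
            }
        }
        if (kinds.isEmpty()) {
            return Kind.NULL; // assumption: no type information, caller decides
        }
        if (kinds.size() == 1) {
            return kinds.iterator().next(); // a single real type, keep it
        }
        if (mixedTypesAsStrings) {
            return Kind.STRING; // genuinely mixed, fall back to strings
        }
        throw new IllegalStateException("mixed types without fallback: " + kinds);
    }

    public static void main(String[] args) {
        // The "student" column from the repro: four structs and one null.
        System.out.println(resolve(true,
                Kind.STRUCT, Kind.STRUCT, Kind.NULL, Kind.STRUCT, Kind.STRUCT)); // prints STRUCT
        // A genuinely mixed column still falls back to STRING.
        System.out.println(resolve(true, Kind.STRUCT, Kind.STRING)); // prints STRING
    }
}
```

The behavior reported in this issue corresponds to counting NULL as a distinct kind, which would make the "student" column look mixed and push it to STRING.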