Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cudf::experimental::read_json returns column with corrupted data when the input is invalid #12418

Closed
ttnghia opened this issue Dec 19, 2022 · 3 comments · Fixed by #12447
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Dec 19, 2022

From the current json test:

TEST_F(JsonReaderTest, JsonNestedDtypeSchema)

The test specifies an input:

std::string json_string = R"( [{"a":[123, {"0": 123}], "b":1.0}, {"b":1.1}, {"b":2.1}])";

Notice that the input for list a (lists of structs) is invalid: the first element is 123, which is not a struct. As such, the test expects the result of parsing the first list a to be [null, 123.0] which is reasonable.

However, if at the end of the test, we print out the result by:

cudf::test::print(result.tbl->get_column(0));

then we will get this:

List<Struct<float,>>:
Length : 3
Offsets : 0, 1, 1, 1
Null count: 2
001
   Struct<float,>:
   Length : 2:
   Null count: 1
   10
      10
      NULL, 123

The output column seems to have corrupted data. In particular, it is a lists column having only one non-null row (the first row), which in turn has only one element (offsets 0, 1]). However, the child column has 2 elements.

To overcome this bug, I temporarily disable the JsonReaderTest#JsonNestedDtypeSchema test in #12370. It should be enabled back after fixing this.

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Dec 19, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 19, 2022

CC @karthikeyann.

Also CC @GregoryKimball.

@ttnghia ttnghia changed the title [BUG] cudf::experimental::read_json wrongly parse data when the input is invalid [BUG] cudf::experimental::read_json wrongly parses data when the input is invalid Dec 19, 2022
@ttnghia ttnghia changed the title [BUG] cudf::experimental::read_json wrongly parses data when the input is invalid [BUG] cudf::experimental::read_json returns column with corrupted data when the input is invalid Dec 19, 2022
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Dec 22, 2022

Thank you @ttnghia for identifying this issue. It seems that reading this string from #11682 in the Nested JSON reader results in an unsanitized column. @elstehle do you think this was an issue also in the host-side tree algorithms and column creation, or something introduced as part of the device-side tree algorithms that @karthikeyann developed?

It would be better to not materialize a column with unsanitized nulls, but if that is difficult we could always run the sanitization function on the columns before returning them.

@GregoryKimball GregoryKimball added 1 - On Deck To be worked on next libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Dec 22, 2022
@karthikeyann
Copy link
Contributor

karthikeyann commented Dec 28, 2022

Actual List offsets are [1, 2, 2, 2].
Column printer in test utilities subtracts the first offset with rest of the offsets. so, it prints [0, 1, 1, 1].

This issue arises due to mixed types in lists, and it is not considered invalid input. Specification is,

If string/value type is mixed with struct, the string/value locations are considered null.

Issue is in list offset device-side algorithm.
Column deduction infers 123 as seperate column, and {"0": 123} struct as seperate column. Parsing of 123 is ignored, but the list offsets considered {"0": 123} as first item. so, it wrote its offset 1 to lists's first offset.
I changed the device-side algorithm to depend on parent_col_id instead of col_id. bug fix in the works now.

Also, this issue shows the need for more test cases for list types. (// TODO comment in list offset algorithm). More tests will be added as part of the fix.

rapids-bot bot pushed a commit that referenced this issue Jan 5, 2023
During construction, the new lists column is also applied a null mask. Such null superposition allows the existence of non-empty nulls, which may undefined behavior later on for various cudf APIs.

This PR fixes it by removing non-empty nulls after lists column constructions. It also fixes unused variables `stream` and `mr`.

After lists creation is corrected in this PR, a lot of tests failed because either these tests were incorrect, or the corresponding APIs were wrongly handling nulls (i.e., they generate outputs with non-empty null lists). Below is the list of the tests that are also fixed in this PR:
 - [X] COPYING_TEST
 - [X] LISTS_TEST
 - [X] PARQUET_TEST
 - [X] STRUCTS_TEST
 - [X] UTILITIES_TEST

In addition, `JSON_TEST` failed due to an unrelated bug (#12418). The bug already existed there but is discovered only now due to changes in this PR. The corresponding failed test is just temporarily disabled and will be fixed separately.

Also closes:
 * #12405

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #12370
rapids-bot bot pushed a commit that referenced this issue Jan 11, 2023
…12447)

Fixes the bug in list offsets in mixed types - string/value type with struct type.

In nested JSON reader, following json string
`[{"a":[123, {"0": 123}], "b":1.0}, {"b":1.1}, {"b":2.1}]`
should produce first column as

```
List<Struct<float,>>:
Length : 3
Offsets : 0, 2, 2, 2
Null count: 2
001
   Struct<float,>:
   Length : 2:
   Null count: 1
   10
      10
      NULL, 123
```

Depends on  #12330
closes #12418

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #12447
@GregoryKimball GregoryKimball removed this from libcudf Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants