[BUG] `cudf::experimental::read_json` returns column with corrupted data when the input is invalid #12418
Comments
CC @karthikeyann. Also CC @GregoryKimball.
Thank you @ttnghia for identifying this issue. It seems that reading this string from #11682 in the Nested JSON reader results in an unsanitized column. @elstehle do you think this was an issue also in the host-side tree algorithms and column creation, or something introduced as part of the device-side tree algorithms that @karthikeyann developed? It would be better to not materialize a column with unsanitized nulls, but if that is difficult we could always run the sanitization function on the columns before returning them.
Actual List offsets are
This issue arises due to mixed types in lists, and it is not considered invalid input. Specification is,
The issue is in the list offset device-side algorithm. It also shows the need for more test cases for list types (see the `// TODO` comment in the list offset algorithm). More tests will be added as part of the fix.
During construction, a null mask is also applied to the new lists column. Such null superposition allows the existence of non-empty nulls, which may cause undefined behavior later on in various cudf APIs. This PR fixes it by removing non-empty nulls after lists column construction. It also fixes unused variables `stream` and `mr`.

After lists creation is corrected in this PR, a lot of tests failed because either these tests were incorrect, or the corresponding APIs were wrongly handling nulls (i.e., they generated outputs with non-empty null lists). Below is the list of the tests that are also fixed in this PR:
- [X] COPYING_TEST
- [X] LISTS_TEST
- [X] PARQUET_TEST
- [X] STRUCTS_TEST
- [X] UTILITIES_TEST

In addition, `JSON_TEST` failed due to an unrelated bug (#12418). The bug already existed but was discovered only now due to changes in this PR. The corresponding failed test is just temporarily disabled and will be fixed separately.

Also closes:
* #12405

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Mark Harris (https://github.com/harrism)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #12370
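To make the term "non-empty null" concrete, here is a minimal host-side sketch in plain C++. It does not use libcudf: the `HostListsColumn` struct and the `host_*` helpers are hypothetical names introduced only for illustration, and the real fix runs on the device against cudf column views. A null row is "non-empty" when its offsets still span child elements; purging rebuilds the offsets so that every null row is empty and records which child elements survive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a lists column (not a libcudf type):
// offsets has one more entry than there are rows, and valid[i] is false
// when row i is null.
struct HostListsColumn {
  std::vector<std::int32_t> offsets;
  std::vector<bool> valid;
};

// Returns true if any null row still spans a non-empty range of the child.
bool host_has_nonempty_nulls(HostListsColumn const& col)
{
  assert(col.offsets.size() == col.valid.size() + 1);
  for (std::size_t i = 0; i < col.valid.size(); ++i) {
    if (!col.valid[i] && col.offsets[i + 1] != col.offsets[i]) { return true; }
  }
  return false;
}

struct PurgeResult {
  std::vector<std::int32_t> offsets;       // new offsets; null rows are empty
  std::vector<std::int32_t> child_gather;  // child indices that are kept
};

// Rebuilds the offsets so that null rows cover no child elements, and
// collects the indices of the child elements referenced by valid rows.
PurgeResult host_purge_nonempty_nulls(HostListsColumn const& col)
{
  assert(col.offsets.size() == col.valid.size() + 1);
  PurgeResult out;
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < col.valid.size(); ++i) {
    if (col.valid[i]) {
      for (std::int32_t j = col.offsets[i]; j < col.offsets[i + 1]; ++j) {
        out.child_gather.push_back(j);
      }
    }
    out.offsets.push_back(static_cast<std::int32_t>(out.child_gather.size()));
  }
  return out;
}
```

For example, a column with offsets `0, 2, 5, 5` whose second row is null has a non-empty null (the null row spans child elements 2 to 4); after purging, the offsets become `0, 2, 2, 2` and only child elements 0 and 1 are kept.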
…12447)

Fixes the bug in list offsets in mixed types - string/value type with struct type.

In the nested JSON reader, the following JSON string
`[{"a":[123, {"0": 123}], "b":1.0}, {"b":1.1}, {"b":2.1}]`
should produce the first column as
```
List<Struct<float,>>:
Length : 3
Offsets : 0, 2, 2, 2
Null count: 2
001
Struct<float,>:
Length : 2:
Null count: 1
10
10
NULL, 123
```

Depends on #12330
closes #12418

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #12447
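The expected offsets `0, 2, 2, 2` and null count of 2 follow directly from the per-row list lengths: the first row's list `a` has two elements, and the other two rows have no `a` key at all. The sketch below shows that arithmetic on the host in plain C++. It is not the device-side algorithm from the PR, and `ListOffsets` and `build_list_offsets` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: offsets plus the number of null rows.
struct ListOffsets {
  std::vector<std::int32_t> offsets;
  std::size_t null_count = 0;
};

// row_lengths[i] holds the number of elements in row i's list, or
// std::nullopt when the row has no list (a null row). Null rows add
// nothing to the running total, so their begin and end offsets coincide.
ListOffsets build_list_offsets(std::vector<std::optional<std::int32_t>> const& row_lengths)
{
  ListOffsets out;
  out.offsets.reserve(row_lengths.size() + 1);
  out.offsets.push_back(0);
  std::int32_t running = 0;
  for (auto const& len : row_lengths) {
    if (len.has_value()) {
      assert(*len >= 0);
      running += *len;
    } else {
      ++out.null_count;
    }
    out.offsets.push_back(running);
  }
  return out;
}
```

Calling `build_list_offsets({2, std::nullopt, std::nullopt})` models the three rows of the JSON string above and yields offsets `0, 2, 2, 2` with a null count of 2.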
From the current json test:
The test specifies an input:
Notice that the input for list `a` (lists of structs) is invalid: the first element is `123`, which is not a struct. As such, the test expects the result of parsing the first list `a` to be `[null, 123.0]`, which is reasonable.

However, if at the end of the test, we print out the result by:
then we will get this:
The output column seems to have corrupted data. In particular, it is a lists column having only one non-null row (the first row), which in turn has only one element (offsets `[0, 1]`). However, the child column has 2 elements.

To work around this bug, I temporarily disabled the `JsonReaderTest#JsonNestedDtypeSchema` test in #12370. It should be re-enabled after this is fixed.
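The mismatch reported here (offsets ending at 1 while the child column holds 2 elements) is the kind of inconsistency a simple sanity check can flag. Below is a minimal host-side sketch in plain C++; `offsets_match_child` is a hypothetical helper, not a libcudf function. It assumes an unsliced lists column, where the offsets start at 0 and the last offset equals the child size; sliced columns can legitimately violate both conditions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical consistency check for an unsliced lists column:
// offsets must be non-empty, start at 0, never decrease, and end exactly
// at the number of elements in the child column.
bool offsets_match_child(std::vector<std::int32_t> const& offsets, std::size_t child_size)
{
  if (offsets.empty() || offsets.front() != 0) { return false; }
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) { return false; }
  }
  return static_cast<std::size_t>(offsets.back()) == child_size;
}
```

With the values from this report, `offsets_match_child({0, 1}, 2)` returns `false`, whereas the offsets expected by the fix, `{0, 2, 2, 2}` with a two-element child, pass the check.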