JSON reader: Collect more corner cases in JSON reader parsing #4821

Open
HaoYang670 opened this issue Feb 18, 2022 · 7 comments
Labels
documentation: Improvements or additions to documentation
task: Work required that improves the product but is not user facing

Comments

@HaoYang670
Collaborator

HaoYang670 commented Feb 18, 2022

Tracking #9 (comment) and #4138.
There are some odd corner cases in Spark's JSON parsing whose behavior is difficult to match on the GPU. We need to find more of them.
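
The comparisons below were produced with a small harness along these lines (a minimal sketch; the file path and session setup are assumptions, and each test input is written to the file as newline-delimited JSON):

import cudf
from pyspark.sql import SparkSession

path = "corner_case.json"  # hypothetical file holding one test input

spark = SparkSession.builder.getOrCreate()

# Spark reads newline-delimited JSON by default; in the default PERMISSIVE
# mode, unparseable rows surface in a _corrupt_record column.
spark.read.json(path).show()

# cuDF's reader also expects one record per line when lines=True.
print(cudf.read_json(path, lines=True))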

@HaoYang670 added the question (Further information is requested) and ? - Needs Triage (Need team to review and classify) labels on Feb 18, 2022
@HaoYang670
Collaborator Author

HaoYang670 commented Feb 18, 2022

Case sensitivity

1

{"a": 1, "b": 2}
{"A": 3, "B": 4}

is parsed as

+----+----+
|   a|   b|
+----+----+
|   1|   2|
|null|null|
+----+----+

in Spark, and

      a     b     A     B
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0

in cuDF

2

{"a": 1, "B": 2}
{"A": 3, "b": 4}

is parsed as

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`, `b`

in Spark, and

      a     B     A     b
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0

in cuDF
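
The duplicate-column error in case 2 comes from Spark's duplicate check being case-insensitive by default (spark.sql.caseSensitive is false). Flipping it, a sketch not verified against this exact Spark version, should let case 2 parse with four distinct columns:

# With case sensitivity on, "a"/"A" and "b"/"B" are distinct columns,
# so the duplicate-column AnalysisException no longer fires.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.read.json("case2.json").show()  # "case2.json" is a hypothetical path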

Number

1

{"a": 1.0}
{"a": 0.1}
{"a": .1} 

is parsed as

+---------------+----+
|_corrupt_record|   a|
+---------------+----+
|           null| 1.0|
|           null| 0.1|
|      {"a": .1}|null|
+---------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

     a
0  1.0
1  0.1
2  0.1

in cuDF
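
For reference, the leading-dot form .1 is invalid per the JSON grammar, and strict parsers such as Python's stdlib reject it as well:

import json

json.loads('{"a": 0.1}')  # parses fine
json.loads('{"a": .1}')   # raises json.JSONDecodeError ("Expecting value")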

Empty file

1

{}

is parsed as

++
||
++
||
++

in Spark, and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:609: Error determining column names.

in cuDF

Comments

(although JSON does not support comments)

1

// comment at first line
{"a": 1}

is parsed as

+--------------------+----+
|     _corrupt_record|   a|
+--------------------+----+
|// comment at fir...|null|
|                null|   1|
+--------------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file.

in cuDF

2

{"a": 1}
// comment at last line

is parsed as

+--------------------+----+
|     _corrupt_record|   a|
+--------------------+----+
|                null|   1|
|// comment at las...|null|
+--------------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

      a
0   1.0
1  <NA>

in cuDF

3

// {"a": 0}
// {"a": 1}

is parsed as

org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default).

in Spark, and

   a
0  0
1  1

in cuDF
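
Spark's JSON source does expose an allowComments option (Jackson-style // and /* */ comments); whether the GPU reader honors it is a separate question. A sketch, with a hypothetical path:

# allowComments lets Jackson tolerate //-style comments, so the commented
# lines above parse (or are skipped) instead of landing in _corrupt_record.
spark.read.option("allowComments", "true").json("commented.json").show()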

@HaoYang670
Collaborator Author

HaoYang670 commented Feb 21, 2022

String

1: How many colons are there?

{"a":::::::::::}

is parsed as

org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default).

in Spark, and

            a
0  ::::::::::

in cuDF

2: something beyond 0xffff

(the code point of 𝞧 is 0x1d747, above 0xffff, so it is encoded as a surrogate pair in UTF-16)

{"a": "𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧"}
{"a": "𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧"}
{"a": "vvvvvvvvvvvvvvvvvvvv"}
{"a": "vvvvvvvvvvvvvvvvvvvvv"}

is parsed as

+--------------------+
|                   a|
+--------------------+
|𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧?...|
|𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧|
|vvvvvvvvvvvvvvvvvvvv|
|vvvvvvvvvvvvvvvvv...|
+--------------------+

in Spark (note the stray ? where the first row was truncated), and

                       a
0      𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧
1         𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧
2   vvvvvvvvvvvvvvvvvvvv
3  vvvvvvvvvvvvvvvvvvvvv

in cuDF
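
A plausible explanation for the ? (an assumption based on show()'s documented 20-character truncation, not verified against Spark internals): when a cell exceeds 20 characters, Spark keeps the first 17 UTF-16 code units and appends "...", and each 𝞧 occupies two code units, so the cut lands in the middle of a surrogate pair. A plain-Python illustration:

c = "\U0001d747"  # 𝞧: one code point, two UTF-16 code units (a surrogate pair)
s = c * 11        # 22 UTF-16 code units in total

# Keeping the first 17 code units leaves 8 whole characters plus a lone
# high surrogate, which renders as ? (or U+FFFD) in most terminals.
u16 = s.encode("utf-16-le")[: 17 * 2]
print(u16.decode("utf-16-le", errors="replace") + "...")

# The 10-character row is exactly 20 code units, so it is not truncated.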

@HaoYang670
Collaborator Author

HaoYang670 commented Feb 21, 2022

How many lines are there?

This tests the following rule from the JSON grammar (json.org):

element
    ws value ws

1

{"a": "This is the first line"}





is parsed as

+--------------------+
|                   a|
+--------------------+
|This is the first...|
+--------------------+

in Spark, and

                        a
0  This is the first line
1                    <NA>
2                    <NA>
3                    <NA>
4                    <NA>
5                    <NA>

in cuDF

2

{"a": "Is this the first line?"}

is parsed as

+--------------------+
|                   a|
+--------------------+
|Is this the first...|
+--------------------+

in Spark, and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file.

in cuDF.

@HaoYang670
Collaborator Author

Array and Struct

1

{"a": [1,2]}

is parsed as

+------+
|     a|
+------+
|[1, 2]|
+------+

in Spark, and

    a
0  [1

in cuDF.
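
The truncated [1 suggests the cuDF reader split the value at the comma rather than parsing the nested array. On the Spark side, an explicit schema (a sketch; the element type and path are assumptions) avoids inference and reads it as a proper array column:

# Explicit nested schema, so Spark does not need to infer the array type.
spark.read.schema("a ARRAY<BIGINT>").json("array_case.json").show()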

@HaoYang670
Collaborator Author

@revans2 Could you please help review? Are these the corner cases you had in mind?

@revans2
Collaborator

revans2 commented Feb 22, 2022

@HaoYang670 If I knew all of the corner cases, I would have documented them on the original issue. This is looking really good. But one thing you need to be careful of is that anything with _corrupt_record as a column will fall back to the CPU for parsing. So the analyses above that compare against cuDF are misleading in those cases, because Spark never actually parsed the data on the GPU. You can work around this by providing the schema to the JSON read command. Please check each query to make sure it really was parsed on the GPU.
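
A minimal sketch of that workaround (the column type and file path are assumptions): an explicit schema skips schema inference and the implicit _corrupt_record column, keeping the read eligible for the GPU.

# With an explicit schema there is no inference pass and no
# _corrupt_record column, so the plugin can keep the read on the GPU.
spark.read.schema("a DOUBLE").json("corner_case.json").show()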

@jlowe added the task (Work required that improves the product but is not user facing) label and removed the question (Further information is requested) and ? - Needs Triage (Need team to review and classify) labels on Feb 22, 2022
@sameerz added the documentation (Improvements or additions to documentation) label on Feb 22, 2022
@GaryShen2008
Collaborator

We'll need to create a CI job that runs fuzz tests frequently to see if we can find more corner cases.
Moving this to 22.08 since it's not in the 22.06 target.
