JSON reader: Collect more corner cases in JSON reader parsing #4821

Open
HaoYang670 opened this issue Feb 18, 2022 · 7 comments
Labels
documentation: Improvements or additions to documentation
task: Work required that improves the product but is not user facing

Comments

@HaoYang670
Collaborator

HaoYang670 commented Feb 18, 2022

Tracking #9 (comment) and #4138.
There are some odd corner cases in Spark's JSON parsing whose behavior is difficult to match on the GPU. We need to find more of them.
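
The comparisons below were produced with a small harness along these lines (a minimal sketch; the file path and session setup are assumptions, and each test input is written to the file as newline-delimited JSON):

import cudf
from pyspark.sql import SparkSession

path = "corner_case.json"  # hypothetical file holding one test input

spark = SparkSession.builder.getOrCreate()

# Spark reads newline-delimited JSON by default; in the default PERMISSIVE
# mode, unparseable rows surface in a _corrupt_record column.
spark.read.json(path).show()

# cuDF's reader also expects one record per line when lines=True.
print(cudf.read_json(path, lines=True))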

@HaoYang670 added the question (Further information is requested) and ? - Needs Triage (Need team to review and classify) labels on Feb 18, 2022
@HaoYang670
Collaborator Author

HaoYang670 commented Feb 18, 2022

Case sensitivity

1

{"a": 1, "b": 2}
{"A": 3, "B": 4}

is parsed as

+----+----+
|   a|   b|
+----+----+
|   1|   2|
|null|null|
+----+----+

in Spark, and

      a     b     A     B
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0

in cuDF

2

{"a": 1, "B": 2}
{"A": 3, "b": 4}

is parsed as

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`, `b`

in Spark, and

      a     B     A     b
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0

in cuDF
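
The duplicate-column error in case 2 comes from Spark's duplicate check being case-insensitive by default (spark.sql.caseSensitive is false). Flipping it, a sketch not verified against this exact Spark version, should let case 2 parse with four distinct columns:

# With case sensitivity on, "a"/"A" and "b"/"B" are distinct columns,
# so the duplicate-column AnalysisException no longer fires.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.read.json("case2.json").show()  # "case2.json" is a hypothetical path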

Number

1

{"a": 1.0}
{"a": 0.1}
{"a": .1} 

is parsed as

+---------------+----+
|_corrupt_record|   a|
+---------------+----+
|           null| 1.0|
|           null| 0.1|
|      {"a": .1}|null|
+---------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

     a
0  1.0
1  0.1
2  0.1

in cuDF
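
For reference, the leading-dot form .1 is invalid per the JSON grammar, and strict parsers such as Python's stdlib reject it as well:

import json

json.loads('{"a": 0.1}')  # parses fine
json.loads('{"a": .1}')   # raises json.JSONDecodeError ("Expecting value")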

Empty file

1

{}

is parsed as

++
||
++
||
++

in Spark, and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:609: Error determining column names.

in cuDF

Comments

(although JSON does not support comments)

1

// comment at first line
{"a": 1}

is parsed as

+--------------------+----+
|     _corrupt_record|   a|
+--------------------+----+
|// comment at fir...|null|
|                null|   1|
+--------------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file.

in cuDF

2

{"a": 1}
// comment at last line

is parsed as

+--------------------+----+
|     _corrupt_record|   a|
+--------------------+----+
|                null|   1|
|// comment at las...|null|
+--------------------+----+

in Spark (this will fall back to the CPU for parsing because it produces a _corrupt_record column), and

      a
0   1.0
1  <NA>

in cuDF

3

// {"a": 0}
// {"a": 1}

is parsed as

org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default).

in Spark, and

   a
0  0
1  1

in cuDF
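
Spark's JSON source does expose an allowComments option (Jackson-style // and /* */ comments); whether the GPU reader honors it is a separate question. A sketch, with a hypothetical path:

# allowComments lets Jackson tolerate //-style comments, so the commented
# lines above parse (or are skipped) instead of landing in _corrupt_record.
spark.read.option("allowComments", "true").json("commented.json").show()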

@HaoYang670
Collaborator Author

HaoYang670 commented Feb 21, 2022

String

1: How many colons are there?

{"a":::::::::::}

is parsed as

org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default).

in Spark, and

            a
0  ::::::::::

in cuDF

2: something beyond 0xffff

(the code point of 𝞧 is 0x1d747, above 0xffff, so it is encoded as a surrogate pair in UTF-16)

{"a": "𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧"}
{"a": "𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧"}
{"a": "vvvvvvvvvvvvvvvvvvvv"}
{"a": "vvvvvvvvvvvvvvvvvvvvv"}

is parsed as

+--------------------+
|                   a|
+--------------------+
|𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧?...|
|𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧|
|vvvvvvvvvvvvvvvvvvvv|
|vvvvvvvvvvvvvvvvv...|
+--------------------+

in Spark (note the stray ? where the first row was truncated), and

                       a
0      𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧
1         𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧𝞧
2   vvvvvvvvvvvvvvvvvvvv
3  vvvvvvvvvvvvvvvvvvvvv

in cuDF
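
A plausible explanation for the ? (an assumption based on show()'s documented 20-character truncation, not verified against Spark internals): when a cell exceeds 20 characters, Spark keeps the first 17 UTF-16 code units and appends "...", and each 𝞧 occupies two code units, so the cut lands in the middle of a surrogate pair. A plain-Python illustration:

c = "\U0001d747"  # 𝞧: one code point, two UTF-16 code units (a surrogate pair)
s = c * 11        # 22 UTF-16 code units in total

# Keeping the first 17 code units leaves 8 whole characters plus a lone
# high surrogate, which renders as ? (or U+FFFD) in most terminals.
u16 = s.encode("utf-16-le")[: 17 * 2]
print(u16.decode("utf-16-le", errors="replace") + "...")

# The 10-character row is exactly 20 code units, so it is not truncated.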

@HaoYang670
Collaborator Author

HaoYang670 commented Feb 21, 2022

How many lines are there?

This tests the following rule from the JSON grammar (json.org):

element
    ws value ws

1

{"a": "This is the first line"}





is parsed as

+--------------------+
|                   a|
+--------------------+
|This is the first...|
+--------------------+

in Spark, and

                        a
0  This is the first line
1                    <NA>
2                    <NA>
3                    <NA>
4                    <NA>
5                    <NA>

in cuDF

2

{"a": "Is this the first line?"}

is parsed as

+--------------------+
|                   a|
+--------------------+
|Is this the first...|
+--------------------+

in Spark, and

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file.

in cuDF.

@HaoYang670
Collaborator Author

Array and Struct

1

{"a": [1,2]}

is parsed as

+------+
|     a|
+------+
|[1, 2]|
+------+

in Spark, and

    a
0  [1

in cuDF.
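
The truncated [1 suggests the cuDF reader split the value at the comma rather than parsing the nested array. On the Spark side, an explicit schema (a sketch; the element type and path are assumptions) avoids inference and reads it as a proper array column:

# Explicit nested schema, so Spark does not need to infer the array type.
spark.read.schema("a ARRAY<BIGINT>").json("array_case.json").show()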

@HaoYang670
Collaborator Author

@revans2 Could you please help review? Are these the corner cases you had in mind?

@revans2
Collaborator

revans2 commented Feb 22, 2022

@HaoYang670 If I knew all of the corner cases, I would have documented them on the original issue. This is looking really good. But one thing you need to be careful of is that anything with _corrupt_record as a column will fall back to the CPU for parsing. So the analyses above that compare against cuDF are misleading in those cases, because Spark never actually parsed the data on the GPU. You can work around this by providing the schema to the JSON read command. Please check each query to make sure it really was parsed on the GPU.
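
A minimal sketch of that workaround (the column type and file path are assumptions): an explicit schema skips schema inference and the implicit _corrupt_record column, keeping the read eligible for the GPU.

# With an explicit schema there is no inference pass and no
# _corrupt_record column, so the plugin can keep the read on the GPU.
spark.read.schema("a DOUBLE").json("corner_case.json").show()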

@jlowe added the task (Work required that improves the product but is not user facing) label and removed the question (Further information is requested) and ? - Needs Triage (Need team to review and classify) labels on Feb 22, 2022
@sameerz added the documentation (Improvements or additions to documentation) label on Feb 22, 2022
@GaryShen2008
Collaborator

We'll need to create a CI job that runs fuzz tests frequently to see if we can find more corner cases.
Moving this to 22.08 since it's not in the 22.06 target.
