# JSON reader: Collect more corner cases in JSON reader parsing #4821
### Case sensitivity

**Case 1**

```
{"a": 1, "b": 2}
{"A": 3, "B": 4}
```

is parsed as

```
+----+----+
|   a|   b|
+----+----+
|   1|   2|
|null|null|
+----+----+
```

in Spark, and as

```
      a     b     A     B
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0
```

in cuDF.

**Case 2**

```
{"a": 1, "B": 2}
{"A": 3, "b": 4}
```

fails in Spark with

```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `a`, `b`
```

and is parsed as

```
      a     B     A     b
0   1.0   2.0  <NA>  <NA>
1  <NA>  <NA>   3.0   4.0
```

in cuDF.
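The Spark results above are presumably tied to the analyzer's case-insensitive default (`spark.sql.caseSensitive` is `false` unless overridden). A quick, unverified way to probe this, assuming the second input is saved locally as `case_sensitivity.json` (a made-up name), is to toggle the flag and compare what the reader infers:

```python
# Hypothetical probe (not from the issue): read the same two-line JSON file
# with case-insensitive (default) and case-sensitive analysis and compare.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.caseSensitive", "false")  # Spark's default
# Expected to reproduce the duplicate-column error shown above.
spark.read.json("case_sensitivity.json").printSchema()

spark.conf.set("spark.sql.caseSensitive", "true")
# With case-sensitive analysis, a, B, A, b may be kept as distinct columns.
spark.read.json("case_sensitivity.json").printSchema()
```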
{"a": 0.1}
{"a": .1} is parsed as +---------------+----+
|_corrupt_record| a|
+---------------+----+
| null| 1.0|
| null| 0.1|
| {"a": .1}|null|
+---------------+----+ in Spark (This will fall back to the CPU for parsing as it is a _corrupt_record), and a
0 1.0
1 0.1
2 0.1 in CUDF Empty file1{} is parsed as ++
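For context, the `_corrupt_record` column above comes from Spark's default `PERMISSIVE` parse mode. A minimal sketch of surfacing it explicitly (the file name `numbers.json` is an assumption, not from the issue):

```python
# Minimal sketch, assuming "numbers.json" holds the three number lines above.
# In PERMISSIVE mode (the default) Spark keeps malformed rows in the
# corrupt-record column instead of dropping them (DROPMALFORMED) or failing
# (FAILFAST).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("numbers.json"))
df.show()
```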
### Empty file

**Case 1**

```
{}
```

is parsed as

```
++
||
++
||
++
```

in Spark, and fails in cuDF with

```
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:609: Error determining column names.
```
{"a": 1} is parsed as +--------------------+----+
| _corrupt_record| a|
+--------------------+----+
|// comment at fir...|null|
| null| 1|
+--------------------+----+ in Spark (This will fall back to the CPU for parsing as it is a _corrupt_record), and RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file. in CUDF 2{"a": 1}
// comment at last line is parsed as +--------------------+----+
| _corrupt_record| a|
+--------------------+----+
| null| 1|
|// comment at las...|null|
+--------------------+----+ in Spark (This will fall back to the CPU for parsing as it is a _corrupt_record), and a
0 1.0
1 <NA> in CUDF 3// {"a": 0}
// {"a": 1} is parsed as org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). in Spark, and a
0 0
1 1 in CUDF |
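The `AnalysisException` in case 3 is Spark refusing to run a query whose inferred schema contains nothing but the corrupt-record column. One way to still inspect such input is to supply a schema explicitly; a minimal sketch, with `comments_only.json` being an assumed file holding the two commented-out lines:

```python
# Minimal sketch (assumed file name): pairing a real column with the
# corrupt-record column in an explicit schema sidesteps the "only
# _corrupt_record" restriction quoted above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .schema("a INT, _corrupt_record STRING")
      .json("comments_only.json"))
df.show(truncate=False)
```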
### String

**Case 1: How many colons are there?**

```
{"a":::::::::::}
```

fails in Spark with

```
org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default).
```

and is parsed as

```
            a
0  ::::::::::
```

in cuDF.

**Case 2: something beyond …**

How many lines are there? Test this rule:

Example 1:

```
{"a": "This is the first line"}
```

is parsed as

```
+--------------------+
|                   a|
+--------------------+
|This is the first...|
+--------------------+
```

in Spark, and as

```
                         a
0  This is the first line
1                    <NA>
2                    <NA>
3                    <NA>
4                    <NA>
5                    <NA>
```

in cuDF.

Example 2:

```
{"a": "Is this the first line?"}
```

is parsed as

```
+--------------------+
|                   a|
+--------------------+
|Is this the first...|
+--------------------+
```

in Spark, and fails in cuDF with

```
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/io/json/reader_impl.cu:371: Input data is not a valid JSON file.
```
### Array and Struct

**Case 1**

```
{"a": [1,2]}
```

is parsed as

```
+------+
|     a|
+------+
|[1, 2]|
+------+
```

in Spark, and as

```
    a
0  [1
```

in cuDF.
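For reference, all of the comparisons above can be reproduced with a small side-by-side read along these lines (the file name and session setup are assumptions, not part of the issue):

```python
# Minimal sketch: feed the same JSON Lines file to Spark (CPU) and cuDF (GPU)
# and print both results. "corner_case.json" is a placeholder path.
import cudf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.json("corner_case.json").show()              # Spark's view
print(cudf.read_json("corner_case.json", lines=True))   # cuDF's view
```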
@revans2 Could you please help to review? Are these the …
@HaoYang670 If I knew all of the corner cases, I would have documented them on the original issue. This is looking really good. But one thing you need to be careful of is that anything with …
We'll need to create a CI job to run fuzzing tests frequently to see if we can find more corner cases.
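Such a job could start from something as simple as the loop sketched below; the generator, the file layout, and the pass/fail criterion are all assumptions rather than an existing test:

```python
# Hypothetical fuzz-compare loop: generate small, intentionally messy JSON
# Lines files, read each one with Spark and with cuDF, and flag any input
# where exactly one of the two readers raises.
import json
import random

import cudf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def random_doc() -> str:
    """Build a small JSON Lines payload, occasionally corrupting a line."""
    lines = []
    for _ in range(random.randint(1, 4)):
        key = random.choice(["a", "A", "b"])
        value = random.choice([1, 0.1, "text", None, [1, 2]])
        line = json.dumps({key: value})
        if random.random() < 0.2:
            line = line.replace('"', "", 1)  # drop one quote to corrupt the line
        lines.append(line)
    return "\n".join(lines) + "\n"


for i in range(100):
    path = f"/tmp/fuzz_{i}.json"
    with open(path, "w") as f:
        f.write(random_doc())

    cpu_error = gpu_error = None
    try:
        spark.read.json(path).collect()
    except Exception as e:
        cpu_error = e
    try:
        cudf.read_json(path, lines=True)
    except Exception as e:
        gpu_error = e

    if (cpu_error is None) != (gpu_error is None):
        print(f"Mismatch for {path}: CPU={cpu_error!r}, GPU={gpu_error!r}")
```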
Track #9 (comment) and #4138
There are some odd corner cases in JSON reader parsing in Spark whose behavior is difficult to match on the GPU. We need to find more of them.