Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AUDIT] Investigate change in "partial results" semantics for JSON read #7143

Open
mythrocks opened this issue Nov 22, 2022 · 0 comments
Open
Labels
audit_3.4.0 Audit related tasks for 3.4.0 bug Something isn't working

Comments

@mythrocks
Copy link
Collaborator

Description
SPARK-40646 introduces a subtle change to how partial parse results are reported when reading JSON input. If the parse on one column fails, the column that follows is also reported as null.

We need to check whether GpuReadJsonFileFormat produces null values in the same way.

Steps/Code to reproduce bug
From the commit message, reading the following input:

{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}

via the following schema:

val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path")

produces the following result:

a	                b
=================================
null	                null
{"x":2,"y":null}	{"x":2}

Expected behavior
The results should have been:

a	                b
=================================
null	                {"x":1}
{"x":2,"y":null}	{"x":2}

Note: This was apparently a regression introduced by SPARK-33134.

@mythrocks mythrocks added bug Something isn't working ? - Needs Triage Need team to review and classify audit_3.4.0 Audit related tasks for 3.4.0 and removed ? - Needs Triage Need team to review and classify labels Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit_3.4.0 Audit related tasks for 3.4.0 bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant