Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] fix issues with repeated JSON columns #11361

Open
revans2 opened this issue Aug 19, 2024 · 1 comment
Open

[BUG] fix issues with repeated JSON columns #11361

revans2 opened this issue Aug 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@revans2
Copy link
Collaborator

revans2 commented Aug 19, 2024

Describe the bug
To be clear I don't think this is a very important use case. Especially because Spark in inconsistent here too.

A JSON object can technically have repeated keys in it.

{"data":100,"data":200}

The JSON spec goes out of it's way to not specify how to interpret the data in JSON. Just what is and is not technically valid JSON. So repeated keys is ambiguous.

So for json_tuple the last non-null value always wins.

scala> Seq("""{"data":100,"data":200}""","""{"data":null,"data":200}""","""{"data":100,"data":null}""").toDF("json").selectExpr("*", "json_tuple(json, 'data')").show(false)
+------------------------+---+
|json                    |c0 |
+------------------------+---+
|{"data":100,"data":200} |200|
|{"data":null,"data":200}|200|
|{"data":100,"data":null}|100|
+------------------------+---+

Data type does not really matter because everything is returned as a String

scala> Seq("""{"data":100,"data":"200"}""","""{"data":null,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "json_tuple(json, 'data')").show(false)
+----------------------------+-------+
|json                        |c0     |
+----------------------------+-------+
|{"data":100,"data":"200"}   |200    |
|{"data":null,"data":false}  |false  |
|{"data":100.012,"data":null}|100.012|
+----------------------------+-------+

For from_json it is much more complicated because types are involved.

Here it appears to be that last always wins if the input data can be manipulated to match the desired type. And null can always be manipulated???

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data string')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{200}          |
|{"data":"100","data":200}   |{200}          |
|{"data":null,"data":false}  |{false}        |
|{"data":true,"data":false}  |{false}        |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+

In this the last row is actually a null, not a string with "null" in it.
If I ask for it as a long, then the "200" for the first row is ignored because it cannot be made into a long.

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data long')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{100}          |
|{"data":"100","data":200}   |{200}          |
|{"data":null,"data":false}  |{null}         |
|{"data":true,"data":false}  |{null}         |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+

Double is similar, but even though "100.012" can be made into a double, the last null value still wins, which is different from json_tuple

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data double')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{100.0}        |
|{"data":"100","data":200}   |{200.0}        |
|{"data":null,"data":false}  |{null}         |
|{"data":true,"data":false}  |{null}         |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+

Nesting appears to be very similar. It goes with last wins, so long as the data can be manipulated into the desired data type.

scala> Seq("""{"data":100,"data":{"a":50,"a":100}}""","""{"data":{"b":100},"data":{"a":200}}""","""{"data":{"a":100},"data":{"b":200}}""","""{"data":{"a":101,"a":102},"data":null}""","""{"data":{"a":100},"data":"test"}""").toDF("json").selectExpr("*", "from_json(json, 'data struct<a:string>')").show(false)
+--------------------------------------+---------------+
|json                                  |from_json(json)|
+--------------------------------------+---------------+
|{"data":100,"data":{"a":50,"a":100}}  |{{100}}        |
|{"data":{"b":100},"data":{"a":200}}   |{{200}}        |
|{"data":{"a":100},"data":{"b":200}}   |{{null}}       |
|{"data":{"a":101,"a":102},"data":null}|{null}         |
|{"data":{"a":100},"data":"test"}      |{{100}}        |
+--------------------------------------+---------------+

get_json_object appears to always be first non-null match wins.

scala> Seq("""{"data":100,"data":{"a":50,"a":100}}""","""{"data":{"b":100},"data":{"a":200}}""","""{"data":{"a":100},"data":{"b":200}}""","""{"data":{"a":101,"a":102},"data":null}""","""{"data":{"a":100},"data":"test"}""").toDF("json").selectExpr("*", "get_json_object(json, '$.data.a')").show(false)
+--------------------------------------+-------------------------------+
|json                                  |get_json_object(json, $.data.a)|
+--------------------------------------+-------------------------------+
|{"data":100,"data":{"a":50,"a":100}}  |50                             |
|{"data":{"b":100},"data":{"a":200}}   |200                            |
|{"data":{"a":100},"data":{"b":200}}   |100                            |
|{"data":{"a":101,"a":102},"data":null}|101                            |
|{"data":{"a":100},"data":"test"}      |100                            |
+--------------------------------------+-------------------------------+
scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "get_json_object(json, '$.data')").show(false)
+----------------------------+-----------------------------+
|json                        |get_json_object(json, $.data)|
+----------------------------+-----------------------------+
|{"data":100,"data":"200"}   |100                          |
|{"data":"100","data":200}   |100                          |
|{"data":null,"data":false}  |false                        |
|{"data":true,"data":false}  |true                         |
|{"data":100.012,"data":null}|100.012                      |
+----------------------------+-----------------------------+

Happily our implementation for get_json_object is already fairly close on this. But not 100% of the way there. It is missing the not null part and only is doing the first match part.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 19, 2024
@mattahrens
Copy link
Collaborator

Potential scope: throw exception when repeated key is encountered in JSON parsing in the plugin.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants