[FEA] Add support for get_json_object #6985

kuhushukla · 2020-12-11T15:37:14Z

Is your feature request related to a problem? Please describe.
Allow cudf support for https://spark.apache.org/docs/2.4.5/api/sql/index.html#get_json_object

Describe the solution you'd like
Provide C++ support for get_json_object, which extracts a JSON object from a JSON string based on the specified JSON path and returns the extracted object as a JSON string. It returns null if the input JSON string is invalid.

Describe alternatives you've considered
N/A

harrism · 2020-12-13T22:44:44Z

Is this for loading a JSON file? Or is it for treating the contents of a strings column as JSON text? The answer affects whether or not this is a cuIO feature request.

revans2 · 2020-12-14T17:08:04Z

Sorry, this was not described very well. Fair warning: this is not simple in the least, but we have a lot of customers who really want this.

get_json_object in Spark takes a string column and a path. Each string in the column is parsed as JSON. Then the path is applied to the parsed JSON object to produce a new JSON object. The result is then turned back into a string.

The example from the Spark documentation is:

SELECT get_json_object('{"a":"b"}', '$.a');

produces the string b

The $ indicates the root of the JSON tree, and the .a is the key to the JSON dictionary.
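As a rough illustration of the parse, select, and re-serialize steps described above, here is a minimal Python sketch. It is not the Spark or cudf implementation; the function name `get_json_object_simple` is hypothetical, and it only handles a single top-level key (the `$.a` case).

```python
import json

def get_json_object_simple(json_text, key):
    """Hypothetical helper: emulate get_json_object for a path of the form `$.<key>`."""
    try:
        parsed = json.loads(json_text)          # step 1: parse the string as JSON
    except (TypeError, ValueError):
        return None                             # invalid input yields null
    if not isinstance(parsed, dict) or key not in parsed:
        return None                             # path does not match
    value = parsed[key]                         # step 2: apply the path
    if isinstance(value, str):
        return value                            # a string result is returned without quotes
    return json.dumps(value, separators=(",", ":"))  # step 3: back to a JSON string

print(get_json_object_simple('{"a":"b"}', "a"))  # prints b
```

The compact separators are an assumption made so that non-string results look like the compact JSON strings Spark prints.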

I'll try to post all of the operations that path supports and some more examples.

The main thing is that I don't know how generic this type of operator is. I don't see much in pandas that would provide similar functionality. The closest I could come up with is json_normalize, but that is not really that close because we want to parse the string and json_normalize assumes that it is already parsed...

revans2 · 2020-12-14T18:12:29Z

OK, here is a bit more information. This function is trying to be compatible with the Hive get_json_object function and is very similar to the SQL Server JSON_QUERY function. In all cases the path is a subset of JSONPath (https://goessner.net/articles/JsonPath/).

In the case of Hive (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object)

A limited version of JSONPath is supported:

    $ : Root object
    . : Child operator
    [] : Subscript operator for array
    * : Wildcard for []

From reading through the Spark code it appears to be the same.
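To make the four operators concrete, here is a small Python sketch of how such a path could be split into steps. The function name `parse_path` and the `(kind, value)` step format are assumptions for illustration only; they are not taken from Hive, Spark, or cudf.

```python
import re

# One step is either `.key`, `[<digits>]`, or `[*]`.
_STEP = re.compile(r"\.([^.\[\]]+)|\[(\d+)\]|\[(\*)\]")

def parse_path(path):
    """Split a path such as `$.a[*].b` into steps; return None if it is malformed."""
    if not path.startswith("$"):
        return None                      # every path must start at the root
    steps, pos = [], 1
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            return None                  # e.g. an empty subscript `[]`
        key, index, wildcard = m.groups()
        if key is not None:
            steps.append(("key", key))
        elif index is not None:
            steps.append(("index", int(index)))
        else:
            steps.append(("wildcard", None))
        pos = m.end()
    return steps

print(parse_path("$.a[*].b"))  # [('key', 'a'), ('wildcard', None), ('key', 'b')]
```

Returning None for a malformed path lets a caller map it directly to a null result.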

If a path does not match, null is returned.

+-------------------------------+
|get_json_object({"a":"b"}, $.c)|
+-------------------------------+
|                           null|
+-------------------------------+

You can access array elements with [index]

+----------------------------------------------------+
|get_json_object({"a": ["b", "c", "d", "e"]}, $.a[0])|
+----------------------------------------------------+
|                                                   b|
+----------------------------------------------------+

If no index is given you get a null back, but * as the index is a wildcard that matches all entries.

+--------------------------------------------------------------+
|get_json_object({"a": [{"b": 1}, {"b": 2}, "d", "e"]}, $.a[*])|
+--------------------------------------------------------------+
|[{"b":1},{"b":2},"d","e"]                                     |
+--------------------------------------------------------------+

This even works with nesting

+----------------------------------------------------------------+
|get_json_object({"a": [{"b": 1}, {"b": 2}, "d", "e"]}, $.a[*].b)|
+----------------------------------------------------------------+
|[1,2]                                                           |
+----------------------------------------------------------------+
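The examples above can be reproduced with a short recursive evaluator. This is only a sketch under stated assumptions: the names `apply_steps` and `get_json_object_sketch` are hypothetical, the path is passed in pre-split as `(kind, value)` steps rather than as a string, and wildcard results are always wrapped in an array. It does not try to match every Spark edge case.

```python
import json

def apply_steps(node, steps):
    """Return the list of values under `node` matched by the remaining steps."""
    if not steps:
        return [node]
    (kind, value), rest = steps[0], steps[1:]
    if kind == "key" and isinstance(node, dict) and value in node:
        return apply_steps(node[value], rest)
    if kind == "index" and isinstance(node, list) and 0 <= value < len(node):
        return apply_steps(node[value], rest)
    if kind == "wildcard" and isinstance(node, list):
        matches = []
        for item in node:                      # try every array entry
            matches.extend(apply_steps(item, rest))
        return matches
    return []                                  # step does not apply to this node

def get_json_object_sketch(json_text, steps):
    try:
        root = json.loads(json_text)
    except (TypeError, ValueError):
        return None                            # invalid JSON yields null
    matches = apply_steps(root, steps)
    if not matches:
        return None                            # no match yields null
    if any(kind == "wildcard" for kind, _ in steps):
        result = matches                       # assumption: wildcard results form an array
    else:
        result = matches[0]
    if isinstance(result, str):
        return result                          # strings come back unquoted
    return json.dumps(result, separators=(",", ":"))

doc = '{"a": [{"b": 1}, {"b": 2}, "d", "e"]}'
print(get_json_object_sketch(doc, [("key", "a"), ("wildcard", None)]))
# [{"b":1},{"b":2},"d","e"]
print(get_json_object_sketch(doc, [("key", "a"), ("wildcard", None), ("key", "b")]))
# [1,2]
```

Entries that do not have the requested key, such as "d" and "e" here, simply contribute nothing to the result, which is why the nested example yields [1,2].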

So, like I said, this is not trivial, but it is somewhat standards-based and it is supported by other SQL implementations, which is why we think it belongs in cudf.

chenrui17 · 2021-01-21T07:28:09Z

When is this feature likely to be supported?

sameerz · 2021-02-12T02:02:39Z

@chenrui17 we will target support in RAPIDS 0.19, with the spark-rapids work in 0.5. Related PR: #7286

github-actions · 2021-03-14T02:31:56Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sameerz · 2021-03-14T22:40:37Z

We are benchmarking the draft PR #7286. This is still active.

nvdbaranec · 2021-05-18T15:33:19Z

Done

kuhushukla added the feature request, Needs Triage, and Spark labels Dec 11, 2020

kuhushukla mentioned this issue Dec 11, 2020

[FEA] Add java bindings for get_json_object #6986

Closed

kuhushukla mentioned this issue Dec 11, 2020

[FEA] Support new operators on the GPU NVIDIA/spark-rapids#1374

Closed

11 tasks

harrism added the libcudf label Dec 13, 2020

kkraus14 removed the Needs Triage label Dec 15, 2020

chenrui17 mentioned this issue Jan 20, 2021

[QST]List<Struct<String, String>> is same as List<Map<String, String>> ? NVIDIA/spark-rapids#1554

Closed

github-actions bot added the inactive-30d label Mar 14, 2021

github-actions bot removed the inactive-30d label Mar 14, 2021

sameerz assigned nvdbaranec Apr 12, 2021

nvdbaranec closed this as completed May 18, 2021
