[FEA] Add support for get_json_object #6985

Closed
kuhushukla opened this issue Dec 11, 2020 · 8 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@kuhushukla
Contributor

Is your feature request related to a problem? Please describe.
Allow cudf support for https://spark.apache.org/docs/2.4.5/api/sql/index.html#get_json_object

Describe the solution you'd like
Provide C++ support for get_json_object, which extracts a JSON object from a JSON string based on the specified JSON path and returns the extracted object as a JSON string. It returns null if the input JSON string is invalid.

Describe alternatives you've considered
N/A

@kuhushukla kuhushukla added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Dec 11, 2020
@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Dec 13, 2020
@harrism
Member

harrism commented Dec 13, 2020

Is this for loading a JSON file? Or is it for treating the contents of a strings column as JSON text? The answer affects whether or not this is a cuIO feature request.

@revans2
Contributor

revans2 commented Dec 14, 2020

Sorry, this was not described very well. Full warning: this is not simple in the least, but we have a lot of customers who really want this.

get_json_object in Spark takes a string column and a path. The string column is parsed as JSON. Then the path is applied to the parsed JSON object to produce a new JSON object. The result is then turned back into a string.

The example from the Spark documentation is

SELECT get_json_object('{"a":"b"}', '$.a');

produces the string b

The $ indicates the root of the JSON tree, and the .a is the key to the JSON dictionary.
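
To make the three steps concrete (parse, apply the path, turn the result back into a string), here is a minimal CPU-side sketch in Python. It is not the cudf or Spark implementation; the function name get_json_object_simple is hypothetical, and it only handles paths of the form $.key1.key2. It assumes, based on the example above, that a string result is returned without quotes and that anything else is returned as compact JSON text.

```python
import json


def get_json_object_simple(json_text, path):
    """Illustrative only: supports paths like $.key1.key2 (no arrays, no wildcards)."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)  # step 1: parse the string as JSON
    except (TypeError, ValueError):
        return None  # invalid input yields null
    # step 2: apply the path, one dictionary key at a time
    for key in [k for k in path[1:].split(".") if k]:
        if not isinstance(node, dict) or key not in node:
            return None  # path does not match
        node = node[key]
    # step 3: turn the result back into a string
    if isinstance(node, str):
        return node
    return json.dumps(node, separators=(",", ":"))


print(get_json_object_simple('{"a":"b"}', "$.a"))
```

The print call mirrors the SQL example: the root is the parsed dictionary and .a selects the value stored under the key a.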

I'll try to post all of the operations that path supports and some more examples.

The main thing is that I don't know how generic this type of operator is. I don't see much in pandas that would provide similar functionality. The closest I could come up with is json_normalize, but that is not really that close, because we want to parse the string and json_normalize assumes that it is already parsed...

@revans2
Contributor

revans2 commented Dec 14, 2020

OK, here is a bit more information. This function is trying to be compatible with the Hive get_json_object function and is very similar to the SQL Server JSON_QUERY function. In all cases the path is a subset of JSONPath (https://goessner.net/articles/JsonPath/).

The Hive documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object) describes it this way:

A limited version of JSONPath is supported:

    $ : Root object
    . : Child operator
    [] : Subscript operator for array
    * : Wildcard for []
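
As a rough illustration of how small this path subset is, the four operators above can be tokenized with one regular expression. This is only a sketch: parse_path and the step names ("key", "index", "wildcard") are made up for this example and are not taken from Hive, Spark, or cudf.

```python
import re

# One alternative per operator that may follow the leading `$`:
# .name (child), [n] (subscript), [*] (wildcard subscript).
_STEP = re.compile(r"\.([^.\[\]]+)|\[(\d+)\]|\[(\*)\]")


def parse_path(path):
    """Return a list of steps, or None if the path is outside the subset."""
    if not path.startswith("$"):
        return None
    steps, pos = [], 1
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            return None  # e.g. an empty subscript `[]`
        name, index, star = m.groups()
        if name is not None:
            steps.append(("key", name))
        elif index is not None:
            steps.append(("index", int(index)))
        else:
            steps.append(("wildcard",))
        pos = m.end()
    return steps


print(parse_path("$.a[*].b"))
```

Returning None for anything the pattern does not recognize keeps the "bad path means null" decision in one place.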

From reading through the Spark code it appears to be the same.

If a path does not match, a null is returned.

+-------------------------------+
|get_json_object({"a":"b"}, $.c)|
+-------------------------------+
|                           null|
+-------------------------------+

You can access array elements with [index].

+----------------------------------------------------+
|get_json_object({"a": ["b", "c", "d", "e"]}, $.a[0])|
+----------------------------------------------------+
|                                                   b|
+----------------------------------------------------+

If no index is given you get a null back, but * as the index is a wildcard that matches all entries.

+--------------------------------------------------------------+
|get_json_object({"a": [{"b": 1}, {"b": 2}, "d", "e"]}, $.a[*])|
+--------------------------------------------------------------+
|[{"b":1},{"b":2},"d","e"]                                     |
+--------------------------------------------------------------+

This even works with nesting.

+----------------------------------------------------------------+
|get_json_object({"a": [{"b": 1}, {"b": 2}, "d", "e"]}, $.a[*].b)|
+----------------------------------------------------------------+
|[1,2]                                                           |
+----------------------------------------------------------------+
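
The four outputs above can be reproduced with a short recursive walk over the parsed JSON. This is a sketch under stated assumptions, not Spark's algorithm: the path is passed as a pre-split list of steps (the tuple shapes are hypothetical), a wildcard simply drops elements that do not match the rest of the path, and behavior in cases not shown in this thread may well differ from Spark.

```python
import json

_MISSING = object()  # sentinel: distinguishes "no match" from a JSON null


def _walk(node, steps):
    if not steps:
        return node
    step, rest = steps[0], steps[1:]
    if step[0] == "key":
        if isinstance(node, dict) and step[1] in node:
            return _walk(node[step[1]], rest)
        return _MISSING
    if step[0] == "index":
        if isinstance(node, list) and step[1] < len(node):
            return _walk(node[step[1]], rest)
        return _MISSING
    # wildcard: apply the remaining steps to every element, keep the matches
    if not isinstance(node, list):
        return _MISSING
    results = (_walk(item, rest) for item in node)
    matches = [r for r in results if r is not _MISSING]
    return matches if matches else _MISSING


def evaluate(json_text, steps):
    """steps is a list such as [("key", "a"), ("wildcard",), ("key", "b")]."""
    try:
        root = json.loads(json_text)
    except (TypeError, ValueError):
        return None  # invalid JSON yields null
    result = _walk(root, steps)
    if result is _MISSING:
        return None  # path did not match
    if isinstance(result, str):
        return result  # strings come back unquoted
    return json.dumps(result, separators=(",", ":"))


doc = '{"a": [{"b": 1}, {"b": 2}, "d", "e"]}'
print(evaluate(doc, [("key", "a"), ("wildcard",), ("key", "b")]))
```

In the nested case the elements "d" and "e" have no key b, so they are skipped and only the two numbers remain.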

So, like I said, this is not trivial, but it is somewhat standards-based and it is supported by other SQL implementations, which is why we think it belongs in cudf.

@chenrui17
Contributor

chenrui17 commented Jan 21, 2021

When is this feature likely to be supported?

@sameerz
Contributor

sameerz commented Feb 12, 2021

@chenrui17 we will target support in RAPIDS 0.19, with the spark-rapids work in 0.5. Related PR: #7286

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Contributor

sameerz commented Mar 14, 2021

We are benchmarking the draft PR #7286. This is still active.

@nvdbaranec
Contributor

Done
