Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_json_object() implementation #7286

Merged
merged 40 commits into from
Mar 31, 2021

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Feb 2, 2021

An implementation of get_json_object().

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object

The fundamental functionality here is running a JSONPath query on each row in an input column of json strings.

JSONPath spec: https://tools.ietf.org/id/draft-goessner-dispatch-jsonpath-00.html

For review purposes, the key entry point is parse_json_path(). Each thread of the kernel processes 1 row via this function. The behavior is recursive in nature but we maintain our own context stack to do it in loop fashion.

parse_json_path is just the high level controlling logic, with most of the heavy lifting happening in the json_state parser class. Though the "heavy lifting" is pretty much just traditional string parsing code.

The path to optimization here (I'll open a separate cudf issue for this) is

  • Change parse_json_path to work on a warp basis. So each row in the column would be processed by one warp.
  • Make the json_state parser class thread/warp aware (the class would just store its tid and operate accordingly). I think this is reasonably straightforward to do as most of the cuIO decoding kernels behave like this.

@nvdbaranec nvdbaranec requested review from a team as code owners February 2, 2021 23:14
@nvdbaranec nvdbaranec added 2 - In Progress Currently a work in progress feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Feb 2, 2021
@nvdbaranec nvdbaranec marked this pull request as draft February 3, 2021 00:00
@harrism
Copy link
Member

harrism commented Feb 3, 2021

@nvdbaranec the docs you linked for get_json_object do not say what the function does. Can you please provide a description of what the code in this PR does?

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 3, 2021

From here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.get_json_object

Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.

@kkraus14 kkraus14 added the Spark Functionality that helps Spark RAPIDS label Feb 3, 2021
@harrism
Copy link
Member

harrism commented Feb 3, 2021

Let's discuss whether libcudf is the right home for this functionality.

@harrism
Copy link
Member

harrism commented Feb 3, 2021

Let's discuss whether libcudf is the right home for this functionality.

Criteria:

  1. Is it broadly applicable for libcudf users?
    A: I don't know. So far I know it's important to one Spark use case.
  2. Is it uniquely enabled by libcudf features and internals? (would it be a lot harder to support outside of libcudf)
    A: It seems to be fairly independent, custom CUDA code that is not heavily dependent on other libcudf features or data structures. So I think the answer is no.

(I'm sure there are other criteria, but those two come quickly to mind.)

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 3, 2021

  1. Is it broadly applicable for libcudf users?

@randerzander just recently pinged me inquiring supporting a function like this as well, so I think we can assume some type of functionality like this is broadly applicable. @randerzander do you need it purely in the case of parsing input JSONLines or do you have use cases where you end up with a String column of JSON formatted strings?

@randerzander
Copy link
Contributor

  1. Is it broadly applicable for libcudf users?

@randerzander just recently pinged me inquiring supporting a function like this as well, so I think we can assume some type of functionality like this is broadly applicable. @randerzander do you need it purely in the case of parsing input JSONLines or do you have use cases where you end up with a String column of JSON formatted strings?

Yes, there are use-cases where you need to extract the string of whatever existed (could be more nested JSON, could be a numeric, a string, a list) at the specified JSON path.

…rt. Code is still purely naive and probably doesn't

handle all possible error conditions well.
@github-actions github-actions bot added the CMake CMake build issue label Feb 10, 2021
…s to point to new location for get_json_object().

Use a grid stride loop in core kernel.  Use some thrust_optionals where appropriate. Compute and return null count
instead of just leaving it unknown.
cpp/include/cudf/strings/detail/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/json.hpp Show resolved Hide resolved
cpp/include/cudf/strings/substring.hpp Outdated Show resolved Hide resolved
cpp/tests/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Outdated Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Outdated Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Outdated Show resolved Hide resolved
in substring.hpp.  Add strings_json doxygen group. Make sure JSONPath terminology is used consistently. Other small PR review cleanup.
@nvdbaranec nvdbaranec requested a review from davidwendt March 30, 2021 22:51
@nvdbaranec
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some headers that I think can be removed.

cpp/src/strings/json/json_path.cu Outdated Show resolved Hide resolved
cpp/src/strings/json/json_path.cu Show resolved Hide resolved
cpp/benchmarks/CMakeLists.txt Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec requested a review from davidwendt March 31, 2021 01:19
@jrhemstad
Copy link
Contributor

@gpucibot

@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

@nvdbaranec nvdbaranec added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 31, 2021
@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b937112 into rapidsai:branch-0.19 Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.