get_json_object() implementation #7286
@nvdbaranec the docs you linked for
Let's discuss whether libcudf is the right home for this functionality.
Criteria:
(I'm sure there are other criteria, but those two come quickly to mind.)
@randerzander just recently pinged me asking about support for a function like this as well, so I think we can assume some type of functionality like this is broadly applicable. @randerzander do you need it purely for parsing JSONLines input, or do you have use cases where you end up with a string column of JSON-formatted strings?
Yes, there are use cases where you need to extract the string of whatever existed at the specified JSON path (it could be more nested JSON, a numeric, a string, or a list).
…rt. Code is still purely naive and probably doesn't handle all possible error conditions well.
…nstead of doing the parsing on the gpu.
…s to point to new location for get_json_object(). Use a grid stride loop in core kernel. Use some thrust_optionals where appropriate. Compute and return null count instead of just leaving it unknown.
in substring.hpp. Add strings_json doxygen group. Make sure JSONPath terminology is used consistently. Other small PR review cleanup.
rerun tests
Just some headers that I think can be removed.
@gpucibot merge
An implementation of get_json_object().
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
The fundamental functionality here is running a JSONPath query on each row in an input column of JSON strings.
JSONPath spec: https://tools.ietf.org/id/draft-goessner-dispatch-jsonpath-00.html
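To make the idea of a JSONPath query a little more concrete, here is a minimal host-side C++ sketch that splits a path such as $.store.book[0].title into a sequence of operators. It is an illustration only and not the code in this PR: the names path_op_kind, path_op and tokenize_json_path are hypothetical, and only a small subset of JSONPath is covered (root, .name, ['name'], [N] and the * wildcard).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical operator kinds for a small JSONPath subset (not cuDF's types).
enum class path_op_kind { ROOT, CHILD, INDEX, WILDCARD };

struct path_op {
  path_op_kind kind;
  std::string name;  // meaningful for CHILD only
  int index;         // meaningful for INDEX only, otherwise -1
};

// Splits `path` into operators. Returns false if the path falls outside the
// supported subset. Quoted names containing ']' are not handled, and large
// indices are not checked for overflow.
bool tokenize_json_path(const std::string& path, std::vector<path_op>& ops)
{
  ops.clear();
  if (path.empty() || path[0] != '$') return false;  // must start at the root
  ops.push_back(path_op{path_op_kind::ROOT, "", -1});

  std::size_t pos = 1;
  while (pos < path.size()) {
    if (path[pos] == '.') {
      // ".name" or ".*": read up to the next '.' or '['
      std::size_t start = ++pos;
      while (pos < path.size() && path[pos] != '.' && path[pos] != '[') ++pos;
      if (pos == start) return false;  // empty name, e.g. "$.."
      std::string name = path.substr(start, pos - start);
      if (name == "*") {
        ops.push_back(path_op{path_op_kind::WILDCARD, "", -1});
      } else {
        ops.push_back(path_op{path_op_kind::CHILD, name, -1});
      }
    } else if (path[pos] == '[') {
      // "[N]", "['name']" or "[*]": read up to the closing ']'
      std::size_t start = ++pos;
      while (pos < path.size() && path[pos] != ']') ++pos;
      if (pos == path.size()) return false;  // missing ']'
      std::string body = path.substr(start, pos - start);
      ++pos;  // step past ']'
      if (body == "*") {
        ops.push_back(path_op{path_op_kind::WILDCARD, "", -1});
      } else if (body.size() >= 2 && body.front() == '\'' && body.back() == '\'') {
        ops.push_back(path_op{path_op_kind::CHILD, body.substr(1, body.size() - 2), -1});
      } else {
        if (body.empty()) return false;
        int index = 0;
        for (char c : body) {
          if (c < '0' || c > '9') return false;  // only non-negative integers
          index = index * 10 + (c - '0');
        }
        ops.push_back(path_op{path_op_kind::INDEX, "", index});
      }
    } else {
      return false;  // unexpected character
    }
  }
  return true;
}
```

In a per-row evaluation, an operator list like this would be produced once for the query and then applied to every JSON string in the column.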
For review purposes, the key entry point is parse_json_path(). Each thread of the kernel processes one row via this function. The behavior is recursive in nature, but we maintain our own context stack so that it runs as a loop. parse_json_path() is just the high-level controlling logic, with most of the heavy lifting happening in the json_state parser class, though the "heavy lifting" is pretty much just traditional string-parsing code.

The path to optimization here (I'll open a separate cudf issue for this) is twofold. First, parse_json_path() would work on a warp basis, so each row in the column would be processed by one warp. Second, the json_state parser class would become thread/warp aware (the class would just store its tid and operate accordingly). I think this is reasonably straightforward to do, as most of the cuIO decoding kernels behave like this.
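The "recursive behavior driven by our own context stack" idea from the description can be illustrated with a small host-side C++ sketch. It is not the PR's parse_json_path(); it only shows the general technique on a simpler task: finding the end of a nested JSON object or array with a loop and a fixed-size stack instead of recursion. The names find_container_end and max_depth, and the depth limit of 16, are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical nesting limit. Device code usually cannot grow a stack
// dynamically, so a small fixed-size array stands in for the call stack.
constexpr int max_depth = 16;

// Returns the index one past the end of the object or array that begins at
// json[start], or -1 if the input is malformed, unterminated, or nested more
// deeply than max_depth.
long find_container_end(const std::string& json, std::size_t start)
{
  if (start >= json.size() || (json[start] != '{' && json[start] != '[')) return -1;

  char expected_close[max_depth];  // explicit context stack
  int depth      = 0;
  bool in_string = false;

  for (std::size_t pos = start; pos < json.size(); ++pos) {
    char c = json[pos];
    if (in_string) {
      if (c == '\\') {
        ++pos;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
      continue;  // brackets inside strings are ignored
    }
    switch (c) {
      case '"': in_string = true; break;
      case '{':
      case '[':
        // "recursive call": push the closer we expect for this container
        if (depth == max_depth) return -1;
        expected_close[depth++] = (c == '{') ? '}' : ']';
        break;
      case '}':
      case ']':
        // "return": pop, checking that the closer matches the opener
        if (depth == 0 || expected_close[depth - 1] != c) return -1;
        --depth;
        if (depth == 0) return static_cast<long>(pos + 1);
        break;
      default: break;
    }
  }
  return -1;  // ran out of input before the container closed
}
```

A fixed-depth stack like this bounds the memory each thread needs up front, at the cost of rejecting input nested more deeply than the chosen limit.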