-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON parser integration #11717
JSON parser integration #11717
Conversation
…f-column-type-conversion
Co-authored-by: Elias Stehle <[email protected]>
Co-authored-by: Elias Stehle <[email protected]>
… fea-json-tree-traversal
…n-explicit-schema
…f-column-type-conversion
…re/leaf-column-type-conversion
cpp/src/io/json/json_tree.cu
Outdated
case token_t::ValueBegin: | ||
return NC_STR; // NC_VAL; | ||
// NV_VAL is removed because type inference and | ||
// reduce_to_column_tree category collapsing takes care of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a comment for the reviewer, and not future readers of the code. Perhaps:
case token_t::ValueBegin: | |
return NC_STR; // NC_VAL; | |
// NV_VAL is removed because type inference and | |
// reduce_to_column_tree category collapsing takes care of this. | |
case token_t::ValueBegin: | |
return NC_STR; // type inference and reduce_to_column_tree category collapsing will later convert this to a value |
WDYT? (I don't know the details of the type inference, so my proposed comment might be wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very confident on many of the details, but some potential for a few cleanups?
cpp/src/io/json/json_tree.cu
Outdated
@@ -225,6 +228,7 @@ tree_meta_t get_tree_representation(device_span<PdaTokenT const> tokens, | |||
// TODO: make it own function. | |||
rmm::device_uvector<size_type> parent_token_ids(num_tokens, stream); | |||
rmm::device_uvector<size_type> initial_order(num_tokens, stream); | |||
// TODO re-write the algorithm to work only on nodes, not tokens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth producing a tracking issue?
cpp/tests/io/json_tree.cpp
Outdated
@@ -289,7 +289,7 @@ tree_meta_t2 get_tree_representation_cpu(device_span<PdaTokenT const> tokens_gpu | |||
case token_t::StructBegin: return NC_STRUCT; | |||
case token_t::ListBegin: return NC_LIST; | |||
case token_t::StringBegin: return NC_STR; | |||
case token_t::ValueBegin: return NC_VAL; | |||
case token_t::ValueBegin: return NC_STR; // NC_VAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case token_t::ValueBegin: return NC_STR; // NC_VAL; | |
case token_t::ValueBegin: return NC_STR; |
Maybe add the same comment that was introduced above?
cpp/tests/io/json_tree.cpp
Outdated
cuio_json::NC_FN, cuio_json::NC_LIST, cuio_json::NC_STRUCT, cuio_json::NC_STRUCT, | ||
cuio_json::NC_FN, cuio_json::NC_STR, cuio_json::NC_FN, cuio_json::NC_STR, | ||
cuio_json::NC_FN, cuio_json::NC_VAL}; | ||
cuio_json::NC_FN, cuio_json::NC_STR}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If NC_VAL
is no longer used at all, does it make sense to remove it from the enums? Or is it still used in the non-experimental engines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's removed for a specific null behaviour (null literals are identified as values now). But it might be needed in future (null literals are not checked, it might be change in future). NC_VAL does not add value now, but will be required when we add other features of json parser.
python/cudf/cudf/tests/test_json.py
Outdated
bytes_file = BytesIO() | ||
data = { | ||
"c1": [{"f1": "sf11", "f2": "sf21"}, {"f1": "sf12", "f2": "sf22"}], | ||
"c2": [["l11", "l21"], ["l12", "l22"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes came in as part of #11746, so not sure if you need to merge trunk for the diff to disappear?
cpp/src/io/json/json_column.cu
Outdated
auto to_int = [](auto v) { return std::to_string(static_cast<int>(v)); }; | ||
auto print_vec = [](auto const& cpu, auto const name, auto converter) { | ||
for (auto const& v : cpu) | ||
printf("%3s,", converter(v).c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: std::format
once we move to C++20 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Definitely. looking forward to use std::format
.
cpp/src/io/json/json_column.cu
Outdated
[] __device__(NodeT type_a, NodeT type_b) -> NodeT { | ||
auto is_a_leaf = (type_a == NC_VAL || type_a == NC_STR); | ||
auto is_b_leaf = (type_b == NC_VAL || type_b == NC_STR); | ||
// (v+v=v, *+*=*, *+v=*, *+#=E, NESTED+VAL=NESTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some upfront comments on the meaning of the symbols here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add details.
cpp/src/io/json/json_column.cu
Outdated
// restore unique_col_ids order | ||
std::sort(h_range_col_id_it, h_range_col_id_it + num_columns, [](auto const& a, auto const& b) { | ||
return thrust::get<1>(a) < thrust::get<1>(b); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the cost of slightly higher host memory pressure, is it easier just to do the original sort into a copy?
cpp/src/io/json/json_column.cu
Outdated
// restore col_ids, TODO is this required? | ||
// thrust::copy( | ||
// rmm::exec_policy(stream), original_col_ids.begin(), original_col_ids.end(), col_ids.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. not required. There are few optimisations pending here.
cpp/src/io/json/json_column.cu
Outdated
col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0); | ||
} | ||
|
||
// For string columns return ["offsets", "char"] schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment name doesn't match code name.
cpp/src/io/json/json_column.cu
Outdated
// gpu tree generation | ||
return get_tree_representation(tokens_gpu, token_indices_gpu, stream); | ||
}(); // IILE used to free memory of token data. | ||
// print_tree(input, gpu_tree, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave as executable code behind #ifdef NJP_DEBUG_PRINT
?
Benchmark ResultsDated 21-Sept-2022 nested_json_gpu_parser[0] Quadro GV100
|
Benchmark ResultsDated 27-Sept-2022 nested_json_gpu_parser[0] Quadro GV100
|
Description
Integrates the experimental full gpu json parser. This PR is to check integration of various parts of nested json parser.
All test passes and no cuda memory errors.
Depends on #11610 , #11714, and #11746
Benchmark Results
Dated 27-Sept-2022
nested_json_gpu_parser
[0] Quadro GV100
Checklist