
JSON parser integration #11717

Closed

Conversation

@karthikeyann (Contributor) commented Sep 20, 2022

Description

Integrates the experimental full-GPU JSON parser. This PR checks the integration of the various parts of the nested JSON parser.

All tests pass, with no CUDA memory errors.
Depends on #11610, #11714, and #11746

Benchmark Results

Dated 27-Sept-2022

nested_json_gpu_parser

[0] Quadro GV100

string_size Samples CPU Time Noise GPU Time Noise Elem/s bytes_per_second peak_memory_usage
2^20 = 1048576 2192x 5.784 ms 11.10% 5.777 ms 11.10% 181.518M 181518461 17.149 MiB
2^21 = 2097152 960x 6.158 ms 5.10% 6.150 ms 5.09% 340.980M 340979979 34.106 MiB
2^22 = 4194304 192x 7.798 ms 15.82% 7.791 ms 15.82% 538.363M 538362570 68.020 MiB
2^23 = 8388608 848x 10.036 ms 4.47% 10.029 ms 4.47% 836.432M 836432150 135.848 MiB
2^24 = 16777216 80x 16.555 ms 9.49% 16.547 ms 9.49% 1.014G 1013889148 271.504 MiB
2^25 = 33554432 552x 27.182 ms 7.52% 27.175 ms 7.52% 1.235G 1234774323 542.817 MiB
2^26 = 67108864 80x 48.934 ms 3.19% 48.927 ms 3.19% 1.372G 1371621284 1.060 GiB
2^27 = 134217728 166x 90.397 ms 1.64% 90.390 ms 1.64% 1.485G 1484872039 2.120 GiB
2^28 = 268435456 87x 172.521 ms 0.99% 172.514 ms 0.99% 1.556G 1556019625 4.239 GiB
2^29 = 536870912 43x 351.175 ms 0.77% 351.170 ms 0.77% 1.529G 1528807741 8.479 GiB
2^30 = 1073741824 11x 667.835 ms 0.18% 667.831 ms 0.18% 1.608G 1607803624 16.957 GiB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

elstehle and others added 30 commits September 4, 2022 08:22
@github-actions bot added the "Python: Affects Python cuDF API" label Sep 26, 2022
@karthikeyann requested a review from a team as a code owner September 26, 2022 20:10
@karthikeyann added the "5 - DO NOT MERGE: Hold off on merging; see PR for details" label Sep 26, 2022
Comment on lines 64 to 67
case token_t::ValueBegin:
return NC_STR; // NC_VAL;
// NV_VAL is removed because type inference and
// reduce_to_column_tree category collapsing takes care of this.
Contributor

This feels like a comment for the reviewer, and not future readers of the code. Perhaps:

Suggested change
case token_t::ValueBegin:
return NC_STR; // NC_VAL;
// NV_VAL is removed because type inference and
// reduce_to_column_tree category collapsing takes care of this.
case token_t::ValueBegin:
return NC_STR; // type inference and reduce_to_column_tree category collapsing will later convert this to a value

WDYT? (I don't know the details of the type inference, so my proposed comment might be wrong)

Contributor @wence- left a comment

Not very confident about many of the details, but I see potential for a few cleanups.

@@ -225,6 +228,7 @@ tree_meta_t get_tree_representation(device_span<PdaTokenT const> tokens,
// TODO: make it own function.
rmm::device_uvector<size_type> parent_token_ids(num_tokens, stream);
rmm::device_uvector<size_type> initial_order(num_tokens, stream);
// TODO re-write the algorithm to work only on nodes, not tokens.
Contributor

Is it worth producing a tracking issue?

@@ -289,7 +289,7 @@ tree_meta_t2 get_tree_representation_cpu(device_span<PdaTokenT const> tokens_gpu
case token_t::StructBegin: return NC_STRUCT;
case token_t::ListBegin: return NC_LIST;
case token_t::StringBegin: return NC_STR;
case token_t::ValueBegin: return NC_VAL;
case token_t::ValueBegin: return NC_STR; // NC_VAL;
Contributor

Suggested change
case token_t::ValueBegin: return NC_STR; // NC_VAL;
case token_t::ValueBegin: return NC_STR;

Maybe add the same comment that was introduced above?

cuio_json::NC_FN, cuio_json::NC_LIST, cuio_json::NC_STRUCT, cuio_json::NC_STRUCT,
cuio_json::NC_FN, cuio_json::NC_STR, cuio_json::NC_FN, cuio_json::NC_STR,
cuio_json::NC_FN, cuio_json::NC_VAL};
cuio_json::NC_FN, cuio_json::NC_STR};
Contributor

If NC_VAL is no longer used at all, does it make sense to remove it from the enums? Or is it still used in the non-experimental engines?

Contributor Author

It's removed for a specific null behaviour (null literals are identified as values now), but it might be needed in the future (null literals are not checked yet; this may change). NC_VAL does not add value now, but it will be required when we add other features of the JSON parser.

bytes_file = BytesIO()
data = {
"c1": [{"f1": "sf11", "f2": "sf21"}, {"f1": "sf12", "f2": "sf22"}],
"c2": [["l11", "l21"], ["l12", "l22"]],
Contributor

These changes came in as part of #11746, so not sure if you need to merge trunk for the diff to disappear?

auto to_int = [](auto v) { return std::to_string(static_cast<int>(v)); };
auto print_vec = [](auto const& cpu, auto const name, auto converter) {
for (auto const& v : cpu)
printf("%3s,", converter(v).c_str());
Contributor

TODO: std::format once we move to C++20 :)

Contributor Author

Yes, definitely! Looking forward to using std::format.
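For reference, a self-contained host-side sketch of the converter/print_vec pattern from the snippet above, with the eventual std::format form shown only as a comment (the free-function shapes and data here are illustrative, not the actual cuDF helpers):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Mirrors the to_int/print_vec lambdas above as free functions; the
// commented line sketches what a C++20 std::format rewrite could look like.
inline std::string to_int(char v) { return std::to_string(static_cast<int>(v)); }

template <typename Vec, typename Converter>
void print_vec(Vec const& cpu, char const* name, Converter converter)
{
  std::printf("%s: ", name);
  for (auto const& v : cpu) {
    std::printf("%3s,", converter(v).c_str());
    // C++20: out += std::format("{:>3},", converter(v));
  }
  std::printf("\n");
}
```

For example, `print_vec(std::vector<char>{'S', 'L', 'V'}, "node_types", to_int);` prints each element's integer value on one line.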

[] __device__(NodeT type_a, NodeT type_b) -> NodeT {
auto is_a_leaf = (type_a == NC_VAL || type_a == NC_STR);
auto is_b_leaf = (type_b == NC_VAL || type_b == NC_STR);
// (v+v=v, *+*=*, *+v=*, *+#=E, NESTED+VAL=NESTED)
Contributor

Add some upfront comments on the meaning of the symbols here?

Contributor Author

Will add details.
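Spelling the shorthand out for future readers: below is one plausible host-side reading of the `(v+v=v, *+*=*, *+v=*, *+#=E, NESTED+VAL=NESTED)` rule. The enum values and combine logic are my interpretation for illustration, not the actual cuDF device functor:

```cpp
// Guessed symbol meanings: v = NC_VAL, * = NC_STR, # = NC_FN (field name),
// E = error; NESTED = NC_STRUCT or NC_LIST.
enum NodeT { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

inline NodeT combine(NodeT a, NodeT b)
{
  bool const is_a_leaf = (a == NC_VAL || a == NC_STR);
  bool const is_b_leaf = (b == NC_VAL || b == NC_STR);
  if (a == b) return a;                       // v+v=v, *+*=*
  if (is_a_leaf && is_b_leaf) return NC_STR;  // *+v=*: VAL and STR collapse to string
  if (is_a_leaf != is_b_leaf) {
    NodeT const nested = is_a_leaf ? b : a;
    NodeT const leaf   = is_a_leaf ? a : b;
    if (leaf == NC_VAL && (nested == NC_STRUCT || nested == NC_LIST))
      return nested;                          // NESTED+VAL=NESTED
  }
  return NC_ERR;                              // everything else, e.g. *+#=E
}
```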

Comment on lines 454 to 457
// restore unique_col_ids order
std::sort(h_range_col_id_it, h_range_col_id_it + num_columns, [](auto const& a, auto const& b) {
return thrust::get<1>(a) < thrust::get<1>(b);
});
Contributor

At the cost of slightly higher host memory pressure, is it easier just to do the original sort into a copy?
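The trade-off this question raises can be sketched on the host with std:: containers (thrust/rmm omitted; `sort_and_restore` and `sorted_copy` are hypothetical names for the two options, not cuDF functions):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (a) current approach: tag each id with its position, sort for processing,
// then sort a second time by the stored position to restore the input order.
inline std::vector<int> sort_and_restore(std::vector<int> ids)
{
  std::vector<std::pair<int, std::size_t>> tagged;
  for (std::size_t i = 0; i < ids.size(); ++i) tagged.emplace_back(ids[i], i);
  std::sort(tagged.begin(), tagged.end());  // processing happens in this order
  std::sort(tagged.begin(), tagged.end(),   // second sort restores input order
            [](auto const& a, auto const& b) { return a.second < b.second; });
  std::vector<int> out;
  for (auto const& p : tagged) out.push_back(p.first);
  return out;
}

// (b) reviewer's suggestion: sort a copy and leave the original untouched.
inline std::vector<int> sorted_copy(std::vector<int> const& ids)
{
  std::vector<int> copy = ids;  // extra host memory, but one sort fewer
  std::sort(copy.begin(), copy.end());
  return copy;
}
```

Option (b) spends one extra array of host memory but drops the second sort entirely.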

Comment on lines 543 to 545
// restore col_ids, TODO is this required?
// thrust::copy(
// rmm::exec_policy(stream), original_col_ids.begin(), original_col_ids.end(), col_ids.begin());
Contributor

Is it?

Contributor Author

No, it's not required. There are a few optimisations pending here.

col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0);
}

// For string columns return ["offsets", "char"] schema
Contributor

Comment name doesn't match code name.

// gpu tree generation
return get_tree_representation(tokens_gpu, token_indices_gpu, stream);
}(); // IILE used to free memory of token data.
// print_tree(input, gpu_tree, stream);
Contributor

Leave as executable code behind #ifdef NJP_DEBUG_PRINT ?
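A minimal sketch of that suggestion, assuming a printf-style debug helper (`NJP_DBG` and `example` are hypothetical names; only the NJP_DEBUG_PRINT macro name comes from the review):

```cpp
#include <cstdio>

// Debug printing stays as real, compiled code instead of a commented-out
// call; it only produces output when built with -DNJP_DEBUG_PRINT.
#ifdef NJP_DEBUG_PRINT
#define NJP_DBG(...) std::printf(__VA_ARGS__)
#else
#define NJP_DBG(...) ((void)0)  // expands to a no-op in normal builds
#endif

inline void example() { NJP_DBG("tree nodes: %d\n", 42); }
```

Compared to leaving `// print_tree(input, gpu_tree, stream);` commented out, the guarded call keeps the debug path greppable and less likely to rot.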

@github-actions bot removed the "CMake: CMake build issue" label Sep 27, 2022
@karthikeyann (Contributor Author)

Benchmark Results

Dated 21-Sept-2022

nested_json_gpu_parser

[0] Quadro GV100

string_size Samples CPU Time Noise GPU Time Noise Elem/s
2^20 = 1048576 2501x 5.975 ms 7.89% 5.968 ms 7.88% 175.689M
2^21 = 2097152 672x 6.677 ms 4.13% 6.670 ms 4.13% 314.402M
2^22 = 4194304 592x 8.323 ms 4.69% 8.316 ms 4.69% 504.346M
2^23 = 8388608 1008x 12.778 ms 13.18% 12.771 ms 13.18% 656.833M
2^24 = 16777216 750x 19.987 ms 5.99% 19.980 ms 5.99% 839.719M
2^25 = 33554432 80x 37.238 ms 4.29% 37.230 ms 4.29% 901.272M
2^26 = 67108864 220x 68.162 ms 1.98% 68.154 ms 1.98% 984.658M
2^27 = 134217728 80x 131.745 ms 1.09% 131.738 ms 1.09% 1.019G
2^28 = 268435456 11x 260.182 ms 0.35% 260.175 ms 0.35% 1.032G
2^29 = 536870912 11x 520.952 ms 0.19% 520.946 ms 0.19% 1.031G
2^30 = 1073741824 11x 1.057 s 0.27% 1.057 s 0.27% 1.016G


@karthikeyann marked this pull request as draft September 27, 2022 19:36
Labels

  • 5 - DO NOT MERGE: Hold off on merging; see PR for details
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.