Use make_strings_children in parse_data nested json reader #12382

karthikeyann · 2022-12-14T13:37:30Z

Description

Use make_strings_children utility in parse_data nested json reader
Addresses part of #12167

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/include/cudf/io/detail/data_casting.cuh

randerzander · 2022-12-15T19:20:20Z

run benchmarks

DaceT · 2022-12-21T20:57:07Z

run benchmarks

…_strings_children

karthikeyann · 2022-12-31T18:42:12Z

Benchmark Results

nested_json_gpu_parser

[0] Quadro GV100

string_size	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s	bytes_per_second	peak_memory_usage
2^20 = 1048576	3056x	4.093 ms	6.08%	4.087 ms	6.07%	256.555M	256554796	9.201 MiB
2^21 = 2097152	1376x	4.357 ms	2.02%	4.351 ms	2.01%	482.009M	482008813	18.164 MiB
2^22 = 4194304	1520x	5.033 ms	1.91%	5.027 ms	1.90%	834.390M	834390240	36.090 MiB
2^23 = 8388608	2273x	6.582 ms	2.61%	6.576 ms	2.60%	1.276G	1275592364	71.942 MiB
2^24 = 16777216	1481x	10.114 ms	8.88%	10.108 ms	8.87%	1.660G	1659820960	143.645 MiB
2^25 = 33554432	848x	17.180 ms	4.78%	17.175 ms	4.78%	1.954G	1953714331	287.051 MiB
2^26 = 67108864	474x	31.625 ms	2.57%	31.620 ms	2.57%	2.122G	2122376400	573.864 MiB
2^27 = 134217728	264x	56.838 ms	1.30%	56.833 ms	1.30%	2.362G	2361620677	1.121 GiB
2^28 = 268435456	137x	109.830 ms	1.63%	109.824 ms	1.63%	2.444G	2444226569	2.241 GiB
2^29 = 536870912	67x	224.780 ms	0.91%	224.776 ms	0.91%	2.388G	2388474330	4.482 GiB
2^30 = 1073741824	36x	427.308 ms	0.98%	427.302 ms	0.98%	2.513G	2512840746	8.963 GiB

codecov · 2022-12-31T19:36:34Z

Codecov Report

Base: 86.58% // Head: 85.69% // Decreases project coverage by -0.88% ⚠️

Coverage data is based on head (a47415d) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.02   #12382      +/-   ##
================================================
- Coverage         86.58%   85.69%   -0.89%     
================================================
  Files               155      155              
  Lines             24368    24803     +435     
================================================
+ Hits              21098    21255     +157     
- Misses             3270     3548     +278

Impacted Files	Coverage Δ
python/cudf/cudf/_version.py	`1.41% <0.00%> (-98.59%)`	⬇️
python/cudf/cudf/core/buffer/spill_manager.py	`72.50% <0.00%> (-7.50%)`	⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py	`90.04% <0.00%> (-2.81%)`	⬇️
python/cudf/cudf/utils/dtypes.py	`77.77% <0.00%> (-1.69%)`	⬇️
python/cudf/cudf/options.py	`86.11% <0.00%> (-1.59%)`	⬇️
python/cudf/cudf/core/single_column_frame.py	`94.30% <0.00%> (-1.27%)`	⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py	`98.38% <0.00%> (-1.01%)`	⬇️
python/dask_cudf/dask_cudf/io/parquet.py	`91.81% <0.00%> (-0.59%)`	⬇️
python/cudf/cudf/core/multiindex.py	`91.66% <0.00%> (-0.51%)`	⬇️
python/cudf/cudf/core/algorithms.py	`90.00% <0.00%> (-0.48%)`	⬇️
... and 37 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

davidwendt · 2023-01-03T15:09:11Z

cpp/include/cudf/io/detail/data_casting.cuh

  }
-  return out_it;
+  return bytes;


When is bytes != num_chars_written?
This function seems overly complex. Maybe I'm missing something?

constexpr size_type write_utf8_char(char_utf8 character, char*& out_it) { auto const bytes = (out_it == nullptr) ? strings::detail::bytes_in_char_utf8(character) : strings::detail::from_char_utf8(character, out_it); out_it += bytes; return bytes; }

There does not seem to be a need for the template as well.

I think, this is written to allow the compiler to unroll the loop.
I will change this to above suggestion.

davidwendt · 2023-01-03T15:15:48Z

cpp/include/cudf/io/detail/data_casting.cuh

+      col_size,
+      std::move(offsets),
+      std::move(chars),
+      cudf::detail::null_count(static_cast<bitmask_type*>(null_mask.data()), 0, col_size, stream),


I was curious if this could be computed in the functor and if that would be faster than doing the null_count() calculation here.

null_count method will be faster because it handles 32-bit per thread, while functor will handle 1 bit per thread, which will eventually lead to lesser atomicAdds to global memory too.

Besides, functor's shared memory is not in our control since we don't know threads per block. So, we can't use BlockReduce, which will increase no of atomicAdds to the global memory location.

cpp/src/io/json/json_column.cu

…cified

cpp/include/cudf/io/detail/data_casting.cuh

…_strings_children

vuule

looks good, just some coding style suggestions.

vuule · 2023-01-09T23:30:52Z

cpp/include/cudf/io/detail/data_casting.cuh

-  return out_it;
+  auto const bytes = (out_it == nullptr) ? strings::detail::bytes_in_char_utf8(character)
+                                         : strings::detail::from_char_utf8(character, out_it);
+  if (out_it) out_it += bytes;


Suggested change

if (out_it) out_it += bytes;

if (out_it != nullpt) { out_it += bytes; }

vuule · 2023-01-09T23:36:21Z

cpp/include/cudf/io/detail/data_casting.cuh

+  __device__ void operator()(size_type idx)
+  {
+    if (not bit_is_set(null_mask, idx)) {
+      if (!d_chars) d_offsets[idx] = 0;


Suggested change

if (!d_chars) d_offsets[idx] = 0;

if (d_chars == nullptr) { d_offsets[idx] = 0; }

vuule · 2023-01-09T23:36:38Z

cpp/include/cudf/io/detail/data_casting.cuh

+
+    // Check if the value corresponds to the null literal
+    auto const is_null_literal =
+      (!d_chars) &&


Suggested change

(!d_chars) &&

(d_chars == nullptr) &&

vuule · 2023-01-09T23:37:07Z

cpp/include/cudf/io/detail/data_casting.cuh

+      serialized_trie_contains(options.trie_na, {in_begin, static_cast<std::size_t>(num_in_chars)});
+    if (is_null_literal) {
+      clear_bit(null_mask, idx);
+      if (!d_chars) d_offsets[idx] = 0;


Suggested change

if (!d_chars) d_offsets[idx] = 0;

if (d_chars == nullptr) { d_offsets[idx] = 0; }

vuule · 2023-01-09T23:38:21Z

cpp/include/cudf/io/detail/data_casting.cuh

+    auto str_process_info = process_string(in_begin, in_end, d_buffer, options);
+    if (str_process_info.result != data_casting_result::PARSING_SUCCESS) {
+      clear_bit(null_mask, idx);
+      if (!d_chars) d_offsets[idx] = 0;


Suggested change

if (!d_chars) d_offsets[idx] = 0;

if (d_chars == nullptr) { d_offsets[idx] = 0; }

vuule · 2023-01-09T23:39:02Z

cpp/include/cudf/io/detail/data_casting.cuh

+      clear_bit(null_mask, idx);
+      if (!d_chars) d_offsets[idx] = 0;
+    } else {
+      if (!d_chars) d_offsets[idx] = str_process_info.bytes;


Suggested change

if (!d_chars) d_offsets[idx] = str_process_info.bytes;

if (d_chars == nullptr) { d_offsets[idx] = str_process_info.bytes; }

karthikeyann · 2023-01-10T00:10:39Z

just some coding style suggestions.

String processing code is full of if(out_ptr) style code. My preference is to follow the same style and also because it's concise.

vuule · 2023-01-10T00:18:17Z

just some coding style suggestions.

String processing code is full of if(out_ptr) style code. My preference is to follow the same style and also because it's concise.

Well, I didn't follow the same style in my cudf::string changes and got @davidwendt 's approval. I think the takeaway is that it does not really matter :D.

Not a big deal, this is probably a job for a tool to enforce/apply.

karthikeyann · 2023-01-10T13:57:20Z

\merge

karthikeyann · 2023-01-10T13:57:38Z

/merge

use make_strings_children in parse_data nested json reader

aac94a0

karthikeyann added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. tech debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 14, 2022

karthikeyann added this to the Nested JSON reader milestone Dec 14, 2022

karthikeyann self-assigned this Dec 14, 2022

karthikeyann requested a review from a team as a code owner December 14, 2022 13:37

karthikeyann requested review from bdice, elstehle, davidwendt and vuule and removed request for bdice December 14, 2022 13:37

davidwendt requested changes Dec 14, 2022

View reviewed changes

cpp/include/cudf/io/detail/data_casting.cuh Outdated Show resolved Hide resolved

cpp/include/cudf/io/detail/data_casting.cuh Outdated Show resolved Hide resolved

karthikeyann added 3 commits December 31, 2022 23:19

address review comments

d33a6fc

Merge branch 'branch-23.02' of github.com:rapidsai/cudf into enh-json…

a6361ff

…_strings_children

null count, fix reference pass bug

4ccdafc

karthikeyann requested a review from davidwendt December 31, 2022 18:26

davidwendt requested changes Jan 3, 2023

View reviewed changes

karthikeyann and others added 4 commits January 4, 2023 17:23

Merge branch 'branch-23.02' into enh-json_strings_children

dd1d07e

remove usage of UNKNOWN_NULL_COUNT

02dd511

Merge branch 'branch-23.02' into enh-json_strings_children

67ca4d1

review comments, move code due to order of evaluation in C++ is unspe…

4b18cb4

…cified

Merge branch 'branch-23.02' into enh-json_strings_children

8bc5ff4

karthikeyann requested a review from davidwendt January 5, 2023 17:21

davidwendt reviewed Jan 5, 2023

View reviewed changes

cpp/include/cudf/io/detail/data_casting.cuh Outdated Show resolved Hide resolved

davidwendt approved these changes Jan 5, 2023

View reviewed changes

karthikeyann added 2 commits January 9, 2023 14:00

fix nullptr += issue, use size_type instread of int32_t

d38e6f8

Merge branch 'branch-23.02' of github.com:rapidsai/cudf into enh-json…

a4bf586

…_strings_children

davidwendt approved these changes Jan 9, 2023

View reviewed changes

Merge branch 'branch-23.02' into enh-json_strings_children

44c2f97

vuule reviewed Jan 9, 2023

View reviewed changes

fix merge issue

a47415d

karthikeyann requested a review from vuule January 10, 2023 00:11

vuule approved these changes Jan 10, 2023

View reviewed changes

rapids-bot bot merged commit 0b8eb42 into rapidsai:branch-23.02 Jan 10, 2023

jlowe mentioned this pull request Jan 10, 2023

Update to latest cudf changes on branch-23.02 NVIDIA/spark-rapids-jni#868

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use make_strings_children in parse_data nested json reader #12382

Use make_strings_children in parse_data nested json reader #12382

karthikeyann commented Dec 14, 2022 •

edited

Loading

randerzander commented Dec 15, 2022

DaceT commented Dec 21, 2022

karthikeyann commented Dec 31, 2022

codecov bot commented Dec 31, 2022 •

edited

Loading

davidwendt Jan 3, 2023

karthikeyann Jan 5, 2023

davidwendt Jan 3, 2023

karthikeyann Jan 5, 2023

vuule left a comment

vuule Jan 9, 2023

vuule Jan 9, 2023

vuule Jan 9, 2023

vuule Jan 9, 2023

vuule Jan 9, 2023

vuule Jan 9, 2023

karthikeyann commented Jan 10, 2023 •

edited

Loading

vuule commented Jan 10, 2023

karthikeyann commented Jan 10, 2023

karthikeyann commented Jan 10, 2023

	if (out_it) out_it += bytes;
	if (out_it != nullpt) { out_it += bytes; }

	if (!d_chars) d_offsets[idx] = 0;
	if (d_chars == nullptr) { d_offsets[idx] = 0; }

	if (!d_chars) d_offsets[idx] = str_process_info.bytes;
	if (d_chars == nullptr) { d_offsets[idx] = str_process_info.bytes; }

Use make_strings_children in parse_data nested json reader #12382

Use make_strings_children in parse_data nested json reader #12382

Conversation

karthikeyann commented Dec 14, 2022 • edited Loading

Description

Checklist

randerzander commented Dec 15, 2022

DaceT commented Dec 21, 2022

karthikeyann commented Dec 31, 2022

Benchmark Results

nested_json_gpu_parser

[0] Quadro GV100

codecov bot commented Dec 31, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vuule left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

karthikeyann commented Jan 10, 2023 • edited Loading

vuule commented Jan 10, 2023

karthikeyann commented Jan 10, 2023

karthikeyann commented Jan 10, 2023

karthikeyann commented Dec 14, 2022 •

edited

Loading

codecov bot commented Dec 31, 2022 •

edited

Loading

karthikeyann commented Jan 10, 2023 •

edited

Loading