Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. #9931
Conversation
Codecov Report
```
@@             Coverage Diff              @@
##           branch-22.02    #9931    +/- ##
============================================
- Coverage        10.49%    10.42%   -0.07%
============================================
  Files              119       119
  Lines            20305     20475     +170
============================================
+ Hits              2130      2134       +4
- Misses           18175     18341     +166
```
Continue to review full report at Codecov.
I have tested this patch locally with the RAPIDS Accelerator tests that were originally failing and I have not seen any failures.

rerun tests
Looks good to me; it seems like it's just fixing a bunch of off-by-one out-of-bounds accesses. I have a few very minor suggestions, feel free to incorporate or not.
```cpp
auto const group_start       = inner_offsets[outer_offsets[group_index]];
auto const group_end         = inner_offsets[outer_offsets[group_index + 1]];
auto const num_weights       = group_end - group_start;
auto const last_weight_index = group_end - 1;
```
Minor nit: you could inline this so that it's only calculated when the `num_weights == 0` check is false. Should be negligible though, since I assume `num_weights` is almost always nonzero and you're already avoiding actually indexing into `cumulative_weights` except when needed.
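In other words, something like the following, where the surrounding use of `cumulative_weights` is a hypothetical reconstruction of the suggestion, not the PR's code:

```cpp
#include <cstdio>

int main()
{
  // Hypothetical data standing in for a group's cumulative weights.
  double const cumulative_weights[] = {1.0, 3.0, 6.0};
  int const group_start = 0;
  int const group_end   = 3;
  int const num_weights = group_end - group_start;

  // The suggestion: compute group_end - 1 at the use site so it is only
  // evaluated for non-empty groups, instead of as a named local above.
  auto const total_weight =
    num_weights == 0 ? 0.0 : cumulative_weights[group_end - 1];

  std::printf("total weight = %f\n", total_weight);
  return 0;
}
```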
I'm pretty sure the compiler will optimize this correctly, and I think `group_end - 1` is a bit on the cryptic side. This code is already drilling through 2 layers of offsets, which is confusing to begin with :)
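To illustrate those two layers with made-up values: `outer_offsets` maps each group to a range of inner lists, and `inner_offsets` maps each list to a range of values, so composing them yields the group's value range. A minimal sketch, with hypothetical data:

```cpp
#include <cassert>

int main()
{
  // Hypothetical data: 2 groups over 3 inner lists over 9 values.
  int const outer_offsets[] = {0, 2, 3};     // group 0 -> lists [0, 2), group 1 -> lists [2, 3)
  int const inner_offsets[] = {0, 4, 7, 9};  // list 0 -> values [0, 4), list 1 -> [4, 7), ...

  int const group_index = 0;
  auto const group_start = inner_offsets[outer_offsets[group_index]];      // 0
  auto const group_end   = inner_offsets[outer_offsets[group_index + 1]];  // 7
  auto const num_weights = group_end - group_start;                        // 7

  // Group 0 spans value indices [0, 7); its last valid index is group_end - 1.
  assert(group_start == 0 && group_end == 7 && num_weights == 7);
  return 0;
}
```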
```cpp
double total_weight;
size_type group_size, group_start;
thrust::tie(total_weight, group_size, group_start) = group_info(group_index);
```
I assume you need a `tie` here because `thrust::tuple` doesn't support structured bindings?
Correct. Same as `cudf/cpp/src/groupby/sort/group_tdigest.cu` line 349 in fbf769f: `// NOTE: can't use structured bindings here.`
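For context, a minimal host-side sketch of the workaround; `group_info_example` is a hypothetical stand-in, and the stated reason is my reading (at the time, `thrust::tuple` presumably lacked the `std::tuple_size`/`std::tuple_element` machinery that structured bindings require):

```cpp
#include <thrust/tuple.h>

// Hypothetical stand-in for the PR's group_info() callable.
thrust::tuple<double, int, int> group_info_example()
{
  return thrust::make_tuple(42.0, 8, 16);
}

int main()
{
  // Structured bindings (auto [w, n, s] = ...) won't compile against this
  // thrust::tuple, so the variables are declared up front and filled in
  // via thrust::tie instead.
  double total_weight;
  int group_size, group_start;
  thrust::tie(total_weight, group_size, group_start) = group_info_example();
  return 0;
}
```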
@gpucibot merge
Fixes NVIDIA/spark-rapids#4060

The issue was relatively straightforward. There is a section of code in the bucket-generation step that detects "gaps" that would be generated during the reduction step; it was incorrectly indexing into the list of cumulative weights for input values. The fundamental change was to turn the `TotalWeightIter` iterator, which just returned the total weight for an input group, into a `GroupInfoFunc` functor that returns the total weight as well as group size info that is used to index the cumulative weights correctly.
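To make the shape of that change concrete, here is a hedged sketch; apart from the names `TotalWeightIter` and `GroupInfoFunc` and the tuple layout implied by the `thrust::tie` call above, everything here (member names, offset layout) is hypothetical, not the PR's actual code:

```cpp
#include <thrust/tuple.h>

using size_type = int;  // stand-in for cudf::size_type

// Before (conceptually): a functor usable behind TotalWeightIter that
// yields only the group's total weight.
struct total_weight_fn {
  double const* total_weights;
  __device__ double operator()(size_type group_index) const
  {
    return total_weights[group_index];
  }
};

// After (conceptually): a GroupInfoFunc-style functor that also returns the
// group's size and starting position, so callers can index the cumulative
// weights within the correct bounds instead of running one past the end.
struct group_info_fn {
  double const* total_weights;
  size_type const* group_offsets;  // length num_groups + 1, hypothetical layout
  __device__ thrust::tuple<double, size_type, size_type>
  operator()(size_type group_index) const
  {
    auto const start = group_offsets[group_index];
    auto const size  = group_offsets[group_index + 1] - start;
    return thrust::make_tuple(total_weights[group_index], size, start);
  }
};
```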