Fix mean computation for the geometric distribution in the data generator #15282
Conversation
CC @shrshi
Parquet reader benchmarks with only list columns are faster by 50%, while the ones with mixed types are 20-30x faster.
// In the current implementation, the geometric distribution is
// approximated by the absolute value of a normal distribution
auto const gauss_std_dev   = geometric_as_gauss_std_dev(dist.lower_bound, dist.upper_bound);
auto const half_gauss_mean = gauss_std_dev * sqrt(2. / M_PI);
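For reference, the `sqrt(2. / M_PI)` factor is the standard mean of a half-normal distribution; a short derivation (added here for context, not part of the PR diff):

```latex
% Mean of |X| for X ~ N(0, sigma^2), i.e. a half-normal distribution:
\mathbb{E}\left[\lvert X\rvert\right]
  = 2\int_{0}^{\infty} x \,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/(2\sigma^{2})}\,dx
  = \frac{2}{\sigma\sqrt{2\pi}} \left[-\sigma^{2} e^{-x^{2}/(2\sigma^{2})}\right]_{0}^{\infty}
  = \sigma\sqrt{\frac{2}{\pi}}
```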
@PointKernel made some additional changes; requested your review so I don't sneak them past you :)
Just one nit, otherwise looks great to me! Thank you :)
/merge
Description
Since we moved random data generation to the GPU, the geometric distribution has been approximated by a half-normal distribution. However, the mean computation wasn't updated, causing a ~20% higher computed mean than the actual generated values.
Another issue that exacerbated the problem is the implicit conversion to ints in the random generator, which truncates rather than rounds. This effectively lowered the mean of the generated values by 0.5.
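A minimal standalone sketch (my own illustration, not cudf code) of the truncation effect described above; it draws half-normal samples and compares the truncated, rounded, and exact sample means:

```cpp
#include <cmath>
#include <cstdio>
#include <random>

int main()
{
  std::mt19937 gen{42};
  // Half-normal samples, mirroring the generator's geometric approximation
  std::normal_distribution<double> dist{0.0, 100.0};

  int const n      = 1'000'000;
  double sum_exact = 0.0, sum_trunc = 0.0, sum_round = 0.0;
  for (int i = 0; i < n; ++i) {
    double const x = std::abs(dist(gen));
    sum_exact += x;
    sum_trunc += static_cast<int>(x);  // implicit conversion: truncates toward zero
    sum_round += std::lround(x);       // the fix: round to the nearest integer
  }
  std::printf("exact mean:     %.3f\n", sum_exact / n);  // ~ sigma * sqrt(2/pi) = ~79.8
  std::printf("truncated mean: %.3f\n", sum_trunc / n);  // ~79.3, biased low by ~0.5
  std::printf("rounded mean:   %.3f\n", sum_round / n);  // ~79.8, matches the exact mean
}
```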
Together, these issues led to list columns where the last row held more than 20% of the total column data. This single huge row caused poor performance in many benchmarks; for example, Parquet files ended up with a few huge pages and load imbalance during decode.
This PR fixes the mean computation to reflect the actual distribution, and rounds the random values when converting to ints. The result is a correct distribution of the number of elements in each randomly generated list.
Checklist