Fix mean computation for the geometric distribution in the data generator #15282
Conversation
CC @shrshi
Parquet reader benchmarks with only list columns are faster by 50%, while the ones with mixed types are 20-30x faster.
// In the current implementation, the geometric distribution is
// approximated by the absolute value of a normal distribution
auto const gauss_std_dev   = geometric_as_gauss_std_dev(dist.lower_bound, dist.upper_bound);
auto const half_gauss_mean = gauss_std_dev * sqrt(2. / M_PI);
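For reference, the `sqrt(2. / M_PI)` factor is the standard mean of a half-normal distribution; a short derivation (added here for context, not part of the PR diff):

```latex
% Mean of |X| for X ~ N(0, sigma^2), i.e. a half-normal distribution:
\mathbb{E}\left[\lvert X\rvert\right]
  = 2\int_{0}^{\infty} x \,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/(2\sigma^{2})}\,dx
  = \frac{2}{\sigma\sqrt{2\pi}} \left[-\sigma^{2} e^{-x^{2}/(2\sigma^{2})}\right]_{0}^{\infty}
  = \sigma\sqrt{\frac{2}{\pi}}
```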
@PointKernel made some additional changes; requested your review so I don't sneak them past you :)
Just one nit, otherwise looks great to me! Thank you :)
/merge
Description
Since we moved random data generation to the GPU, the geometric distribution has been approximated by a half-normal distribution. However, the mean computation wasn't updated, causing a ~20% higher computed mean than the actual generated values.
Another issue that exacerbated the problem is the implicit conversion to ints in the random generator, which truncates rather than rounds. This effectively lowered the mean of the generated values by 0.5.
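A minimal standalone sketch (my own illustration, not cudf code) of the truncation effect described above; it draws half-normal samples and compares the truncated, rounded, and exact sample means:

```cpp
#include <cmath>
#include <cstdio>
#include <random>

int main()
{
  std::mt19937 gen{42};
  // Half-normal samples, mirroring the generator's geometric approximation
  std::normal_distribution<double> dist{0.0, 100.0};

  int const n      = 1'000'000;
  double sum_exact = 0.0, sum_trunc = 0.0, sum_round = 0.0;
  for (int i = 0; i < n; ++i) {
    double const x = std::abs(dist(gen));
    sum_exact += x;
    sum_trunc += static_cast<int>(x);  // implicit conversion: truncates toward zero
    sum_round += std::lround(x);       // the fix: round to the nearest integer
  }
  std::printf("exact mean:     %.3f\n", sum_exact / n);  // ~ sigma * sqrt(2/pi) = ~79.8
  std::printf("truncated mean: %.3f\n", sum_trunc / n);  // ~79.3, biased low by ~0.5
  std::printf("rounded mean:   %.3f\n", sum_round / n);  // ~79.8, matches the exact mean
}
```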
Together, these issues led to list columns where the last row held more than 20% of the total column data. This single huge row caused poor performance in many benchmarks; for example, Parquet files ended up with a few huge pages and load imbalance during decode.
This PR fixes the mean computation to reflect the actual distribution, and rounds the random values when converting to ints. The result is a correct distribution of the number of elements in each randomly generated list.
Checklist