Add support for Hyper Log Log Plus Plus (HLL++) #11638
base: branch-25.02
d42d80a
to
1945192
}
}

case class GpuHLL(childExpr: Expression, relativeSD: Double)
Let's use the full name, like GpuHyperLogLogPlusPlus, to better reflect the CPU version.
Done
ReductionAggregation.HLL(numRegistersPerSketch), DType.STRUCT)
override lazy val groupByAggregate: GroupByAggregation =
  GroupByAggregation.HLL(numRegistersPerSketch)
override val name: String = "CudfHLL"
Not sure if "PlusPlus" is necessary.
- override val name: String = "CudfHLL"
+ override val name: String = "CudfHyperLogLogPlusPlus"
Done.
0a4939f
to
eb00c2b
Signed-off-by: Chong Gao <[email protected]>
Ready to review except test cases.
Looks good
expr[HyperLogLogPlusPlus](
  "Aggregation approximate count distinct",
  ExprChecks.reductionAndGroupByAgg(TypeSig.LONG, TypeSig.LONG,
    Seq(ParamCheck("input", TypeSig.cpuAtomics, TypeSig.all))),
nit: Using cpuAtomics for a GPU field gets to be kind of confusing. Could you please create a gpuAtomics instead?
Description
Spark approx_count_distinct description link
Spark accepts one column (which can be a nested column) and a double literal relativeSD.
Currently only TypeSig.cpuAtomics types are supported; nested types are deferred to a follow-up task.
Closes #5199
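For readers wondering how the relativeSD literal relates to sketch size, here is a minimal sketch. It assumes the same formula that, to my understanding, Spark's CPU HyperLogLogPlusPlusHelper uses to derive the precision from relativeSD. The object and method names are hypothetical and are not part of this PR; how the GPU side computes numRegistersPerSketch is not shown here and may differ.

```scala
// Hypothetical helper for illustration only; not code from this PR.
object RelativeSdToPrecision {
  // Precision p = number of bits used to address a register.
  // Assumed formula: p = ceil(2 * log2(1.106 / relativeSD)).
  def precision(relativeSD: Double): Int =
    math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt

  // Number of registers in one sketch: m = 2^p.
  def numRegisters(relativeSD: Double): Int = 1 << precision(relativeSD)

  def main(args: Array[String]): Unit = {
    // 0.05 is the default relativeSD of approx_count_distinct in Spark.
    for (rsd <- Seq(0.05, 0.01)) {
      println(s"relativeSD=$rsd precision=${precision(rsd)} registers=${numRegisters(rsd)}")
    }
  }
}
```

Under this assumption, a smaller relativeSD means more registers per sketch: relativeSD = 0.05 gives precision 9 (512 registers) and relativeSD = 0.01 gives precision 14 (16384 registers).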
Depending on JNI/cuDF PRs:
rapidsai/cudf#17133
NVIDIA/spark-rapids-jni#2522
TODO
Follow-up
Support nested types in a follow-up task, because nested-type support depends on xxhash64 supporting nested types.
Perf test
Correctness
For int columns, the results are identical between CPU and GPU.
Signed-off-by: Chong Gao [email protected]