Add support for Hyper Log Log Plus Plus (HLL++) #11638

Draft
wants to merge 1 commit into
base: branch-25.02

res-life
Collaborator

@res-life res-life commented Oct 21, 2024

Description

Spark approx_count_distinct description link
Spark accepts one column (which may be a nested column) and a double literal, relativeSD.
Currently only TypeSig.cpuAtomics types are supported; nested types are deferred to a follow-up task.
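
To make the role of relativeSD concrete, here is a minimal sketch of how a relative standard deviation can be turned into a precision and a register count (the numRegistersPerSketch value that appears later in this PR). The formula is the one I recall from Spark's CPU-side HyperLogLog++ helper; treat it as an assumption and check it against the Spark source. The object and method names are hypothetical.

```scala
object HllPrecision {
  // Precision p = number of hash bits used to pick a register.
  // Assumed formula (recalled from Spark's CPU helper, not taken from this PR).
  def precisionFor(relativeSD: Double): Int =
    math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt

  // A sketch with precision p holds 2^p registers.
  def numRegisters(p: Int): Int = 1 << p

  def main(args: Array[String]): Unit = {
    val p = precisionFor(0.05) // 0.05 is Spark's documented default relativeSD
    println(s"p = $p, registers = ${numRegisters(p)}")
  }
}
```

A smaller relativeSD gives a larger p and therefore a larger sketch per group, which is why the parameter has to be a literal known at planning time.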

closes #5199

Depending on JNI/cuDF PRs:
rapidsai/cudf#17133
NVIDIA/spark-rapids-jni#2522

TODO

  • Add more test cases

Follow-up

Support nested types: deferred to follow-up tasks, because nested-type support here depends on xxhash64 supporting nested types first.
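
For context on why the hash matters: every input value is first reduced to a 64-bit hash, and the sketch only ever sees that hash. Below is a minimal sketch of the standard HLL++ register update, assuming the 64-bit hash has already been computed (the xxhash64 step itself is not shown). The names are hypothetical, and this is not the cuDF or JNI implementation.

```scala
object HllRegisterUpdate {
  // Split a 64-bit hash into (register index, rank) for precision p.
  def indexAndRank(hash: Long, p: Int): (Int, Int) = {
    // The top p bits select the register.
    val idx = (hash >>> (64 - p)).toInt
    // The remaining bits, padded with a 1 so the value is never zero.
    val w = (hash << p) | (1L << (p - 1))
    // Rank = position of the first set bit, counted from 1.
    val rank = java.lang.Long.numberOfLeadingZeros(w) + 1
    (idx, rank)
  }

  // Each register keeps the maximum rank it has seen.
  def update(registers: Array[Int], hash: Long, p: Int): Unit = {
    val (idx, rank) = indexAndRank(hash, p)
    if (rank > registers(idx)) registers(idx) = rank
  }
}
```

Because the update depends only on the hash bits, a type is supportable as soon as it can be hashed consistently on CPU and GPU, which is the dependency described above for nested types.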

Perf test

// group by
import org.apache.spark.sql.functions
spark.range(10000000).repartition(5).withColumn("m", functions.expr("id % 10")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// reduction
spark.range(10000000).repartition(5).createOrReplaceTempView("tab")
spark.time(spark.sql("select APPROX_COUNT_DISTINCT(id) from tab ").show())

| num_groups | CPU time (hot runs) | GPU time (hot runs) | speedup |
|---|---|---|---|
| 10 | 1106ms, 1020ms, 1059ms | 196ms, 208ms, 188ms | 5.38x |
| 1,000,000 | 5135ms, 5307ms, 5487ms | 1447ms, 1565ms, 1497ms | 3.53x |
| reduction | 942ms, 1041ms, 973ms | 169ms, 165ms, 180ms | 5.75x |
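
The speedup column can be recomputed from the hot-run timings. The sketch below assumes speedup means mean CPU time divided by mean GPU time; the PR does not state how the figure was derived, so that definition and the helper names are assumptions.

```scala
object Speedup {
  // Arithmetic mean of a set of timings in milliseconds.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Assumed definition: ratio of mean CPU time to mean GPU time.
  def speedup(cpuMs: Seq[Double], gpuMs: Seq[Double]): Double =
    mean(cpuMs) / mean(gpuMs)

  def main(args: Array[String]): Unit = {
    // Timings from the reduction row.
    val s = speedup(Seq(942.0, 1041.0, 973.0), Seq(169.0, 165.0, 180.0))
    println(f"reduction speedup = $s%.2fx")
  }
}
```

Under that definition, the reduction timings reproduce the reported 5.75x.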

Correctness

For int column, the results are identical between CPU and GPU.

GPU result:
+---+-------------------------+
|  m|approx_count_distinct(id)|
+---+-------------------------+
|  0|                  1009779|
|  1|                   912573|
|  2|                   994262|
|  3|                   962191|
|  4|                   957975|
|  5|                   969328|
|  6|                   975973|
|  7|                  1017056|
|  8|                   989262|
|  9|                   954960|
+---+-------------------------+

CPU result:
+---+-------------------------+
|  m|approx_count_distinct(id)|
+---+-------------------------+
|  0|                  1009779|
|  1|                   912573|
|  2|                   994262|
|  3|                   962191|
|  4|                   957975|
|  5|                   969328|
|  6|                   975973|
|  7|                  1017056|
|  8|                   989262|
|  9|                   954960|
+---+-------------------------+

Correctness for the other types still needs to be verified.
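
Besides matching the CPU, the estimates can be checked against the exact answer. In the group-by test above, spark.range(10000000) with id % 10 puts exactly 1,000,000 distinct ids in each group, so the relative error of each estimate is easy to compute. A minimal sketch, with hypothetical names, using the values from the result tables:

```scala
object HllErrorCheck {
  // Relative error of an estimate against the exact distinct count.
  def relativeError(estimate: Long, exact: Long): Double =
    math.abs(estimate - exact).toDouble / exact

  // Each of the 10 groups holds exactly 1,000,000 distinct ids.
  val exactPerGroup = 1000000L

  // approx_count_distinct(id) for m = 0..9, copied from the tables above.
  val estimates: Seq[Long] = Seq(1009779L, 912573L, 994262L, 962191L, 957975L,
    969328L, 975973L, 1017056L, 989262L, 954960L)

  def main(args: Array[String]): Unit =
    estimates.zipWithIndex.foreach { case (est, m) =>
      println(f"m=$m%d relative error=${relativeError(est, exactPerGroup)}%.4f")
    }
}
```

relativeSD is a standard deviation, not a hard bound, so an individual group can exceed it; a check for other types would likely compare error statistics over many groups rather than a single value.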

Signed-off-by: Chong Gao [email protected]

}
}

case class GpuHLL(childExpr: Expression, relativeSD: Double)
Collaborator

Let's use the full name, e.g. GpuHyperLogLogPlusPlus, to better reflect the CPU version.

Collaborator Author

Done

ReductionAggregation.HLL(numRegistersPerSketch), DType.STRUCT)
override lazy val groupByAggregate: GroupByAggregation =
GroupByAggregation.HLL(numRegistersPerSketch)
override val name: String = "CudfHLL"
Collaborator

Not sure if "PlusPlus" is necessary.

Suggested change
- override val name: String = "CudfHLL"
+ override val name: String = "CudfHyperLogLogPlusPlus"

Collaborator Author

Done.

@res-life res-life changed the title [Do not review] Add Hyper Log Log PLus Plus(HLL++) [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Oct 24, 2024
@res-life res-life changed the title [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) Oct 31, 2024
Signed-off-by: Chong Gao <[email protected]>
@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life
Collaborator Author

Ready for review, except for the test cases.

Collaborator

@revans2 revans2 left a comment

Looks good

expr[HyperLogLogPlusPlus](
"Aggregation approximate count distinct",
ExprChecks.reductionAndGroupByAgg(TypeSig.LONG, TypeSig.LONG,
Seq(ParamCheck("input", TypeSig.cpuAtomics, TypeSig.all))),
Collaborator

nit: Using cpuAtomics for a GPU field gets to be kind of confusing. Could you please create a gpuAtomics instead?
