
[GLUTEN-8528][CH]Support approx_count_distinct #8550

Open · wants to merge 8 commits into base: main

Conversation


@taiyang-li taiyang-li commented Jan 16, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8528)

How was this patch tested?

Newly added UTs (unit tests).

@taiyang-li taiyang-li changed the title [GLUTEN-8528][CH]Support approx count distinct [GLUTEN-8528][CH]Support approx_count_distinct Jan 16, 2025
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Jan 16, 2025
Run Gluten ClickHouse CI on ARM

@taiyang-li (Contributor Author)

@CodiumAI-Agent /review


CodiumAI-Agent commented Feb 5, 2025

PR Reviewer Guide 🔍

(Review updated until commit 92e9224)

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

8528 - Partially compliant

Compliant requirements:

  • Implement support for approx_count_distinct functionality.
  • Ensure compatibility with Spark's approx_count_distinct function.
  • Add necessary tests to validate the implementation.

Non-compliant requirements: (none listed)

Requires further human verification:

  • Validate the correctness of the approx_count_distinct implementation through integration testing.
  • Verify the performance impact of the changes in a real-world scenario.
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The logic for handling HyperLogLogPlusPlus in lines 279-289 introduces additional nodes for relativeSDLiteral. Ensure this does not inadvertently affect other aggregate functions or introduce unexpected behavior.

val extraNodes = aggregateFunc match {
  case hll: HyperLogLogPlusPlus =>
    val relativeSDLiteral = Literal(hll.relativeSD)
    Seq(
      ExpressionConverter
        .replaceWithExpressionTransformer(relativeSDLiteral, child.output)
        .doTransform(args))
  case _ => Seq.empty
}

nodes ++ extraNodes
Edge Case Handling

The parser implementation for approx_count_distinct (lines 48-148) should be reviewed for edge cases, such as invalid input types or unexpected argument counts.

template <typename NameStruct>
class AggregateFunctionParserApproxCountDistinct final : public AggregateFunctionParser
{
public:
    static constexpr auto name = NameStruct::spark_name;

    AggregateFunctionParserApproxCountDistinct(ParserContextPtr parser_context_) : AggregateFunctionParser(parser_context_) { }
    ~AggregateFunctionParserApproxCountDistinct() override = default;

    String getName() const override { return NameStruct::spark_name; }

    String getCHFunctionName(const CommonFunctionInfo &) const override { return NameStruct::ch_name; }

    String getCHFunctionName(DataTypes & types) const override
    {
        /// Always invoked during second stage, the first argument is expr, the second argument is relative_sd.
        /// 1. Remove the second argument because types are used to create the aggregate function.
        /// 2. Replace the first argument type with UInt64 or Nullable(UInt64) because uniqHLLPP requires it.
        types.resize(1);
        const auto old_type = types[0];
        types[0] = std::make_shared<DataTypeUInt64>();
        if (old_type->isNullable())
            types[0] = std::make_shared<DataTypeNullable>(types[0]);

        return NameStruct::ch_name;
    }

    Array parseFunctionParameters(
        const CommonFunctionInfo & func_info, ActionsDAG::NodeRawConstPtrs & arg_nodes, ActionsDAG & actions_dag) const override
    {
        if (func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE
            || func_info.phase == substrait::AGGREGATION_PHASE_INITIAL_TO_RESULT
            || func_info.phase == substrait::AGGREGATION_PHASE_UNSPECIFIED)
        {
            const auto & arguments = func_info.arguments;
            const size_t num_args = arguments.size();
            const size_t num_nodes = arg_nodes.size();
            if (num_args != num_nodes || num_args > 2 || num_args < 1 || num_nodes > 2 || num_nodes < 1)
                throw Exception(
                    ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH,
                    "Function {} takes 1 or 2 arguments in phase {}",
                    getName(),
                    magic_enum::enum_name(func_info.phase));

            Array params;
            if (num_args == 2)
            {
                const auto & relative_sd_arg = arguments[1].value();
                if (relative_sd_arg.has_literal())
                {
                    auto [_, field] = parseLiteral(relative_sd_arg.literal());
                    params.push_back(std::move(field));
                }
                else
                    throw Exception(ErrorCodes::BAD_ARGUMENTS, "Second argument of function {} must be literal", getName());
            }
            else
            {
                params.push_back(0.05);
            }

            const auto & expr_arg = arg_nodes[0];
            const auto * is_null_node = toFunctionNode(actions_dag, "isNull", {expr_arg});
            const auto * hash_node = toFunctionNode(actions_dag, "sparkXxHash64", {expr_arg});
            const auto * null_node
                = addColumnToActionsDAG(actions_dag, std::make_shared<DataTypeNullable>(std::make_shared<DataTypeUInt64>()), {});
            const auto * if_node = toFunctionNode(actions_dag, "if", {is_null_node, null_node, hash_node});
            /// Replace the first argument expr with if(isNull(expr), null, sparkXxHash64(expr))
            arg_nodes[0] = if_node;
            arg_nodes.resize(1);

            return params;
        }
        else
        {
            if (arg_nodes.size() != 1)
                throw Exception(
                    ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH,
                    "Function {} takes 1 argument in phase {}",
                    getName(),
                    magic_enum::enum_name(func_info.phase));

            const auto & result_type = arg_nodes[0]->result_type;
            const auto * aggregate_function_type = checkAndGetDataType<DataTypeAggregateFunction>(result_type.get());
            if (!aggregate_function_type)
                throw Exception(
                    ErrorCodes::BAD_ARGUMENTS,
                    "The first argument type of function {} in phase {} must be AggregateFunction, but is {}",
                    getName(),
                    magic_enum::enum_name(func_info.phase),
                    result_type->getName());

            return aggregate_function_type->getParameters();
        }
    }

    Array getDefaultFunctionParameters() const override { return {0.05}; }
};

static const AggregateFunctionParserRegister<AggregateFunctionParserApproxCountDistinct<ApproxCountDistinctNameStruct>> registerer_approx_count_distinct;
}
Test Coverage

The added tests focus on specific scenarios for HyperLogLogPlusPlus. Ensure that all edge cases, such as empty inputs or extreme values, are covered.

#include <gtest/gtest.h>
#include <AggregateFunctions/AggregateFunctionUniqHyperLogLogPlusPlus.h>
#include "IO/ReadBufferFromString.h"

using namespace DB;

static std::vector<UInt64> random_uint64s
    = {17956993516945311251ULL,
       4306050051188505054ULL,
       14289061765075743502ULL,
       16763375724458316157ULL,
       6144297519955185930ULL,
       18446472757487308114ULL,
       16923578592198257123ULL,
       13557354668567515845ULL,
       15328387702200001967ULL,
       15878166530370497646ULL};

static void initSmallHLL(HyperLogLogPlusPlusData & hll)
{
    for (auto x : random_uint64s)
        hll.add(x);
}

static void initLargeHLL(HyperLogLogPlusPlusData & hll)
{
    for (auto x : random_uint64s)
    {
        for (size_t i = 0; i < 100; ++i)
            hll.add(x * (i+1));
    }
}

TEST(HyperLogLogPlusPlusDataTest, Small)
{
    HyperLogLogPlusPlusData hll;
    initSmallHLL(hll);
    EXPECT_EQ(hll.query(), 10);
}

TEST(HyperLogLogPlusPlusDataTest, Large)
{
    HyperLogLogPlusPlusData hll;
    initLargeHLL(hll);
    EXPECT_EQ(hll.query(), 806);
}

TEST(HyperLogLogPlusPlusDataTest, Merge) {
    HyperLogLogPlusPlusData hll1;
    initSmallHLL(hll1);

    HyperLogLogPlusPlusData hll2;
    initLargeHLL(hll2);

    hll1.merge(hll2);
    EXPECT_EQ(hll1.query(), 806);
}

TEST(HyperLogLogPlusPlusDataTest, SerializeAndDeserialize) {
    HyperLogLogPlusPlusData hll1;
    initLargeHLL(hll1);

    WriteBufferFromOwnString write_buffer;
    hll1.serialize(write_buffer);

    ReadBufferFromString read_buffer(write_buffer.str());
    HyperLogLogPlusPlusData hll2;
    hll2.deserialize(read_buffer);

    EXPECT_EQ(hll2.query(), 806);
}


taiyang-li commented Feb 10, 2025

The native approx_count_distinct implementation is about 25x faster than vanilla Spark's (5.82 s vs. 149.915 s on the query below):

0: jdbc:hive2://localhost:10000/> select approx_count_distinct(id, 0.001), approx_count_distinct(id, 0.01), approx_count_distinct(id, 0.1) from range(1000);    
+----------------------------+----------------------------+----------------------------+
| approx_count_distinct(id)  | approx_count_distinct(id)  | approx_count_distinct(id)  |
+----------------------------+----------------------------+----------------------------+
| 999                        | 996                        | 928                        |
+----------------------------+----------------------------+----------------------------+
1 row selected (5.82 seconds)
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = false; 
+-----------------------+--------+
|          key          | value  |
+-----------------------+--------+
| spark.gluten.enabled  | false  |
+-----------------------+--------+
1 row selected (0.137 seconds)
0: jdbc:hive2://localhost:10000/> select approx_count_distinct(id, 0.001), approx_count_distinct(id, 0.01), approx_count_distinct(id, 0.1) from range(1000);     
+----------------------------+----------------------------+----------------------------+
| approx_count_distinct(id)  | approx_count_distinct(id)  | approx_count_distinct(id)  |
+----------------------------+----------------------------+----------------------------+
| 999                        | 996                        | 928                        |
+----------------------------+----------------------------+----------------------------+
1 row selected (149.915 seconds)

@CodiumAI-Agent

Persistent review updated to latest commit 92e9224

@taiyang-li taiyang-li marked this pull request as ready for review February 10, 2025 08:57

@zhanglistar (Contributor)

Let's enable the Spark HLL UTs to see what happens.


@taiyang-li (Contributor Author)

done.


@lgbo-ustc (Contributor)

LGTM

@zhanglistar (Contributor) left a comment

Better to add some comments about HLLPP for future readers.

#include <DataTypes/DataTypeNullable.h>
#include <Poco/Logger.h>
#include <Common/logger_useful.h>
#include "DataTypes/DataTypeAggregateFunction.h"

Use <> instead of "" in the include directive.

@@ -25,6 +25,7 @@
#include <Parser/TypeParser.h>
#include <Common/CHUtil.h>
#include <Common/Exception.h>
#include <Common/logger_useful.h>

Unused header.


inline static const std::vector<std::vector<double>> BIAS_DATA = {
// precision 4
{10,

Please reformat this with 10 elements per line.


struct HyperLogLogPlusPlusData
{
explicit HyperLogLogPlusPlusData(double relative_sd_ = 0.05)

Use Float64 for consistency

Successfully merging this pull request may close these issues.

[CH] support approx_count_distinct
4 participants