-
Notifications
You must be signed in to change notification settings - Fork 59
Conversation
感觉公式有点诡异,把各个机器上的primary个数(或者secondary的个数,或者总数)求个方差不行吗?这样就不用调balancer了吧? |
原本是考虑过用公式计算,但觉得这个主要是想反映出做balance是不是有帮助,以及balance是不是完成,所以改成这个思路 |
balance是不是有帮助,每次跑完看有没有action生成出来就行,balancer有没有在跑,可以看config_context下的proposal_actions。但是统计是否均衡这一状态,我觉得还是用公式来根据现在的状态好。 比如你可以把你想知道的这些状态都加上flag或者counter:
把这些指标都列出来,大家也更能知道现在系统在发生什么。把这些状态最后糅合到一个百分比里面,运维的时候还是不知道到底发生了什么。 |
好的,考虑以下修改:
|
目前做了以下修改:
后续考虑添加公式计算的balance score |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
总体来看感觉有些破坏模块间的界限,我觉得还是增加接口好一些。具体给的建议仅供参考。
@@ -76,7 +76,7 @@ class server_load_balancer | |||
// ret: | |||
// if any balancer proposal is generated, return true. Or-else, false | |||
// | |||
virtual bool balance(meta_view view, migration_list &list) = 0; | |||
virtual bool balance(meta_view view, migration_list &list, bool balance_checker) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
注释没有跟着改
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
void server_state::update_balancer_perf_counters() | ||
{ | ||
for (auto action : _temporary_list) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move_primary, add_secondary,copy_primary,其实都是balancer相关的功能,放到server_state这里有点破坏模块间的界限了,建议连同counter放到greedy_balancer下。
比如你可以新增加一个report_actions之类的接口,仔细设计下吧。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
添加了report接口,统计本轮balance proposal的情况,并更新perf counters
if (level == meta_function_level::fl_steady) { | ||
ddebug("check if balanced without doing replica migration coz meta server is in level(%s)", | ||
_meta_function_level_VALUES_TO_NAMES.find(level)->second); | ||
_meta_svc->get_balancer()->balance({&_all_apps, &_nodes}, _temporary_list, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里传一个true和false其实有些不清晰,包括代码看到balancer里面也是,if判断搞得有乱,相信你在写这些code的时候也觉得逻辑不直观吧。
我觉得这里可以给balancer的基类加一个接口: balance_with_opts(meta_view, action_list, opts);
opts可以设计成const std::map<std::string, std::string>&类型的。
比如你这里的需求可以这么传opts: {"only_move_primary": "false"}。
这个函数的语义就是用指定的option跑balancer, 跑完之后再把options修改回原来的状态。
跑完之后,就可以调用balancer->report_actions(_temporay_list)来上传你的counter了。
有了这个接口后,你balancer内部函数里面的一些balance_check的判断应该就都可以不要了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
关于这一块,暂时保留了balance_checker的判断,因为report接口同样用了这个参数判断更新哪些perf counter,另外感觉opts的方式有点绕。。。如果后面讨论觉得opts更合适我可以再改
@@ -316,6 +320,8 @@ class server_state | |||
|
|||
// for load balancer | |||
migration_list _temporary_list; | |||
int _balance_checker_operation_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同样的,balancer相关的状态,用register_command的形式放到balancer里面。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
注册了command,同时也添加了接口,以便cluster_info里打印出proposal总数
@@ -327,6 +333,11 @@ class server_state | |||
dsn_handle_t _ctrl_add_secondary_enable_flow_control; | |||
dsn_handle_t _ctrl_add_secondary_max_count_for_one_node; | |||
|
|||
perf_counter_wrapper _balance_operation_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
balancer相关的状态放balancer里
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -201,7 +201,7 @@ struct configuration_update_request | |||
2:dsn.layer2.partition_configuration config; | |||
3:config_type type = config_type.CT_INVALID; | |||
4:dsn.rpc_address node; | |||
5:dsn.rpc_address host_node; // only used by stateless apps | |||
5:dsn.rpc_address host_node; // only used by stateless apps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以把这个注释为depriciated,哪天谁有空可以把这个stateless 的feature删掉,我们完全用不上。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented as depreciated
struct configuration_balancer_request | ||
{ | ||
1:dsn.gpid gpid; | ||
2:list<configuration_proposal_action> action_list; | ||
3:optional bool force = false; | ||
4:balancer_request_type balance_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
新增字段用optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bool primary_balancer_globally(); | ||
|
||
bool copy_secondary_per_app(const std::shared_ptr<app_state> &app); | ||
bool secondary_balancer_globally(); | ||
|
||
void greedy_balancer(); | ||
void greedy_balancer(bool balance_checker = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default参数是google code style禁止的,我本人也不太喜欢这个feature。还是尽量少用吧。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
了解了,感谢这么详细的comments!
fb86225
to
b69aeef
Compare
replica-server: support remote command useless-dir-reserve-seconds (#…
Merge new commits
添加了在cluster_info时计算集群各节点的primary和all partition count的标准差,以衡量集群均衡程度 |
::memset(t_operation_counters, 0, sizeof(t_operation_counters)); | ||
|
||
// init perf counters | ||
_balance_operation_count.init_app_counter("eon.server_state", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"eon.greedy_balancer"
下同
return result; | ||
} | ||
|
||
double greedy_load_balancer::mean_stddev(std::vector<unsigned> result_set, bool partial_sample) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::vector&
return !t_migration_result->empty(); | ||
} | ||
|
||
void greedy_load_balancer::report(dsn::replication::migration_list list, bool balance_checker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const migration_list&
@@ -45,11 +45,15 @@ class greedy_load_balancer : public simple_load_balancer | |||
public: | |||
greedy_load_balancer(meta_service *svc); | |||
virtual ~greedy_load_balancer(); | |||
bool balance(meta_view view, migration_list &list) override; | |||
bool balance(meta_view view, migration_list &list, bool balance_checker) override; | |||
void report(migration_list list, bool balance_checker) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const migration_list&
|
||
double greedy_load_balancer::mean_stddev(std::vector<unsigned> result_set, bool partial_sample) | ||
{ | ||
double sum = std::accumulate(std::begin(result_set), std::end(result_set), 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
求标准差的函数可以放到utility下面去
if (primary_count.size() <= 1 || partition_count.size() <= 1) | ||
return result; | ||
|
||
double primary_stddev = mean_stddev(primary_count, partial_sample); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
你们可以内部讨论下要不要把标准差转到一个[0-1]之间的数(比如sigmoid之类的,当然会麻烦一些)。直接用标准差的好处是,对节点间不平均的数量有个大致的描述,但可能就不是之前你们想要的百分比的样子了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
感觉标准差还是比较直观的,就先这样了
// list: the returned balance results | ||
// balance_checker: make proposals for balance checker if true, otherwise make real balancer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我还是觉得加个balance_checker,让函数的行为难以用语言描述。
比如这句注释读起来就让人觉得很懵。因为从接口上来看,balance本来就是一个没有side effect的接口。不应该有real或者不real之分的。
不过你权衡下吧,如果觉得这样更好,我也不bb了。
// balance_checker: report the count of balance operation to be done if true, otherwise report | ||
// both the operation count and action details done done balancer | ||
// | ||
virtual void report(migration_list list, bool balance_checker) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const migration_list&
42e2e97
to
35c0843
Compare
replica-server: add counters of recent_read_fail_count and recent_wri…
include/dsn/utility/stddev.h
Outdated
@@ -7,9 +7,9 @@ | |||
#include <vector> | |||
|
|||
namespace dsn { | |||
namespace stddev { | |||
namespace math { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
那这里还不如改为dsn::utils 名字空间,和其他工具函数保持一致。
另外文件名改为math.h/math.cpp,这样以后有其他数学相关的函数也都放在这里。
src/core/core/strings.cpp
Outdated
@@ -183,5 +183,13 @@ std::string string_md5(const char *buffer, unsigned length) | |||
|
|||
return result; | |||
} | |||
|
|||
std::string to_string_with_precision(const double double_val, const int precision) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
用 fmtlib,这个函数没必要写,http://fmtlib.net/latest/syntax.html
@@ -201,7 +201,7 @@ struct configuration_update_request | |||
2:dsn.layer2.partition_configuration config; | |||
3:config_type type = config_type.CT_INVALID; | |||
4:dsn.rpc_address node; | |||
5:dsn.rpc_address host_node; // only used by stateless apps | |||
5:dsn.rpc_address host_node; // depreciated, only used by stateless apps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deprecated
include/dsn/utility/math.h
Outdated
namespace dsn { | ||
namespace utils { | ||
|
||
double mean_stddev(const std::vector<unsigned> &result_set, bool partial_sample); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个函数没有一点注释?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个应该还好吧,函数名一看就明白,公式也很简单
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned 要改成 uint32_t(当然你提前要 include )这是 cpplint 标准写法。
src/dist/replication/test/meta_test/balancer_simulator/balancer_simulator.cpp
Show resolved
Hide resolved
else | ||
stddev = sqrt(accum / (result_set.size())); | ||
|
||
stddev = ((double)((int)((stddev + 0.005) * 100))) / 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里只是为了做四舍五入吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
四舍五入保留小数点后两位
{ | ||
std::string result("unknown"); | ||
if (args.empty()) { | ||
result = std::string("total=" + std::to_string(t_operation_counters[ALL_COUNT])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
做一个 short circuit return吧,不然if...else...层次会很多
result = std::string("total=" + std::to_string(t_operation_counters[ALL_COUNT])); | ||
} else { | ||
if (args[0] == "total") { | ||
result = std::string("total=" + std::to_string(t_operation_counters[ALL_COUNT])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
后面这些类似
bool primary_partial_sample = false; | ||
bool partial_sample = false; | ||
|
||
for (auto iter = view.nodes->begin(); iter != view.nodes->end();) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
写成range based loop吧, 另外只用到了iter->second
,可以重新定义一个变量,这样代码更简短直观
namespace utils { | ||
|
||
double mean_stddev(const std::vector<uint32_t> &result_set, bool partial_sample) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以加个assert(result_set.size() > 1)
{ | ||
std::string result("unknown"); | ||
if (args.empty()) { | ||
return std::string("total=" + std::to_string(t_operation_counters[ALL_COUNT])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我的意思是:
if (condition1) {
return xxx;
}
if (condition2) {
return xxx;
}
...
这样就没有else了
add cluster balance indicator (XiaoMi#214)
add cluster balance indicator according to issue #237