server: improve capacity unit calculation #339

Merged 10 commits on Jun 18, 2019
Changes from 4 commits
5 changes: 4 additions & 1 deletion src/server/config-server.ini
@@ -274,16 +274,19 @@ falcon_path = /v1/push

[pegasus.collector]
cluster = onebox

available_detect_app = @APP_NAME@
available_detect_alert_script_dir = ./package/bin
available_detect_alert_email_address =
available_detect_interval_seconds = 3
available_detect_alert_fail_count = 30
available_detect_timeout = 5000

app_stat_interval_seconds = 10

cu_stat_app = stat
cu_stat_app = @APP_NAME@
cu_fetch_interval_seconds = 8
st_fetch_interval_seconds = 60

[pegasus.clusters]
onebox = @LOCAL_IP@:34601,@LOCAL_IP@:34602,@LOCAL_IP@:34603
3 changes: 3 additions & 0 deletions src/server/config.ini
@@ -285,16 +285,19 @@

[pegasus.collector]
cluster = %{cluster.name}

available_detect_app = temp
available_detect_alert_script_dir = ./package/bin
available_detect_alert_email_address =
available_detect_interval_seconds = 3
available_detect_alert_fail_count = 30
available_detect_timeout = 5000

app_stat_interval_seconds = 10

cu_stat_app = stat
cu_fetch_interval_seconds = 8
st_fetch_interval_seconds = 3600
Member:
Why does this differ so much from the configuration above?

Contributor Author:
Improvement 3: add storage size statistics. By default they are collected once per hour and record the storage space of each table (the same as the app_stat command). Since the values change little, there is no need to collect them more often.

Contributor Author:
fixed


[pegasus.clusters]
%{cluster.name} = %{meta.server.list}
71 changes: 64 additions & 7 deletions src/server/info_collector.cpp
@@ -26,6 +26,7 @@ namespace server {

DEFINE_TASK_CODE(LPC_PEGASUS_APP_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
DEFINE_TASK_CODE(LPC_PEGASUS_CU_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
DEFINE_TASK_CODE(LPC_PEGASUS_ST_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
Contributor:
Same here: let's favor readability over a short name.

Contributor Author:
fixed


info_collector::info_collector()
{
@@ -65,6 +66,17 @@ info_collector::info_collector()
"cu_fetch_interval_seconds",
8, // default value 8s
"capacity unit fetch interval seconds");
_cu_fetch_retry_count = 3;
Member:
Write this as std::min(3, _cu_fetch_interval_seconds), so that the retries do not overlap with the next fetch when cu_fetch_interval_seconds is configured to a very small value.

Contributor Author:
fixed
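
A minimal sketch of the clamp suggested above, assuming the wait between retries stays fixed at one second. `clamp_retry_count` is a hypothetical helper used only for illustration; it is not part of this PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the PR): cap the number of retries so that,
// with a one-second wait between retries, all retries of one round finish
// before the next timer tick fires.
inline uint32_t clamp_retry_count(uint32_t fetch_interval_seconds)
{
    const uint32_t default_retry_count = 3; // the value assigned in the diff
    // Both arguments have type uint32_t, so std::min deduces a single type.
    return std::min(default_retry_count, fetch_interval_seconds);
}
```

With the default interval of 8 seconds this yields 3 retries; with an interval of 2 seconds it yields 2. Note that `std::min(3, _cu_fetch_interval_seconds)` written literally mixes `int` and `uint32_t` and would not compile, so a typed constant or an unsigned literal is needed.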

_cu_fetch_retry_wait_seconds = 1;
Member:
If this is always 1, make it a static constant.

Contributor Author:
fixed


_st_fetch_interval_seconds =
Contributor:
Rename it to _storage_fetch_interval_seconds; otherwise this config name is very hard on readers.

Member:
Spell it out in full: _storage_size_fetch_interval_seconds?

Contributor Author:
Then I'll change it to storage_size_fetch_interval_seconds.

Contributor Author:
I'll change it to storage_size_fetch_interval_seconds in the config file, but the variable name can stay as it is: it is not ambiguous and should not cause confusion.

Contributor Author:
fixed

(uint32_t)dsn_config_get_value_uint64("pegasus.collector",
"st_fetch_interval_seconds",
3600, // default value 1h
"storage size fetch interval seconds");
_st_fetch_retry_count = 3;
// _st_fetch_retry_wait_seconds is in range of [1, 60]
_st_fetch_retry_wait_seconds = std::min(60u, std::max(1u, _st_fetch_interval_seconds / 10));
}
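
The retry wait for storage size statistics is derived from the fetch interval and clamped to [1, 60] seconds. A small sketch of that expression as a standalone function; the name `storage_retry_wait_seconds` is hypothetical and only mirrors the line in the constructor.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical free function mirroring the expression in the diff:
// one tenth of the fetch interval, clamped to the range [1, 60] seconds.
inline uint32_t storage_retry_wait_seconds(uint32_t fetch_interval_seconds)
{
    const uint32_t lower = 1;
    const uint32_t upper = 60;
    return std::min(upper, std::max(lower, fetch_interval_seconds / 10));
}
```

For the default interval of 3600 seconds the wait is 60 seconds; for an interval of 60 seconds it is 6 seconds; for anything below 10 seconds integer division gives 0, which the lower bound raises to 1.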

info_collector::~info_collector()
@@ -88,10 +100,18 @@ void info_collector::start()
_cu_stat_timer_task =
::dsn::tasking::enqueue_timer(LPC_PEGASUS_CU_STAT_TIMER,
&_tracker,
[this] { on_capacity_unit_stat(); },
[this] { on_capacity_unit_stat(_cu_fetch_retry_count); },
std::chrono::seconds(_cu_fetch_interval_seconds),
0,
std::chrono::minutes(1));

_st_stat_timer_task =
::dsn::tasking::enqueue_timer(LPC_PEGASUS_ST_STAT_TIMER,
&_tracker,
[this] { on_storage_size_stat(_st_fetch_retry_count); },
std::chrono::seconds(_st_fetch_interval_seconds),
0,
std::chrono::minutes(1));
}

void info_collector::stop() { _tracker.cancel_outstanding_tasks(); }
@@ -230,21 +250,34 @@ info_collector::AppStatCounters *info_collector::get_app_counters(const std::str
return counters;
}

void info_collector::on_capacity_unit_stat()
void info_collector::on_capacity_unit_stat(int remaining_retry_count)
{
ddebug("start to stat capacity unit");
std::vector<node_capacity_unit_stat> nodes_stat;
if (!get_capacity_unit_stat(&_shell_context, nodes_stat)) {
derror("get capacity unit stat failed");
if (remaining_retry_count > 0) {
derror("get capacity unit stat failed, remaining_retry_count = %d, "
"wait %u seconds to retry",
remaining_retry_count,
_cu_fetch_retry_wait_seconds);
::dsn::tasking::enqueue(LPC_PEGASUS_CU_STAT_TIMER,
&_tracker,
[=] { on_capacity_unit_stat(remaining_retry_count - 1); },
0,
std::chrono::seconds(_cu_fetch_retry_wait_seconds));
} else {
derror("get capacity unit stat failed, remaining_retry_count = 0, no retry anymore");
}
return;
}
for (auto elem : nodes_stat) {
if (!has_capacity_unit_updated(elem.node_address, elem.timestamp)) {
for (node_capacity_unit_stat &elem : nodes_stat) {
if (elem.node_address.empty() || elem.timestamp.empty() ||
!has_capacity_unit_updated(elem.node_address, elem.timestamp)) {
dinfo("recent read/write capacity unit value of node %s has not updated",
elem.node_address.c_str());
continue;
}
_result_writer->set_result(elem.timestamp, elem.node_address, elem.dump_to_json());
_result_writer->set_result(elem.timestamp, "cu@" + elem.node_address, elem.dump_to_json());
}
}

@@ -258,10 +291,34 @@ bool info_collector::has_capacity_unit_updated(const std::string &node_address,
return true;
}
if (timestamp > find->second) {
_cu_update_info[node_address] = timestamp;
find->second = timestamp;
return true;
}
return false;
}
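
The update check above can be read as a per-node "last seen timestamp" map. A self-contained sketch under two assumptions: the collapsed part of the function inserts the timestamp on first sight of a node, and timestamps are strings in a fixed-width format that sorts lexicographically. The class name `update_tracker` is hypothetical, and the locking done through `_cu_update_info_lock` is left out.

```cpp
#include <map>
#include <string>

// Hypothetical standalone version of the check (not the PR's class).
// Locking is omitted; the real collector guards the map with a lock.
class update_tracker
{
public:
    // Returns true only when `timestamp` is newer than the last one
    // recorded for `node_address`, and records it in that case.
    bool has_updated(const std::string &node_address, const std::string &timestamp)
    {
        auto find = _last.find(node_address);
        if (find == _last.end()) {
            _last[node_address] = timestamp; // first report from this node
            return true;
        }
        if (timestamp > find->second) {
            find->second = timestamp; // reuse the iterator, no second lookup
            return true;
        }
        return false; // same or older timestamp: nothing new to write
    }

private:
    std::map<std::string, std::string> _last;
};
```

Assigning through `find->second` instead of indexing the map again, as the diff does, avoids a second lookup for a key that is already known to exist.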

void info_collector::on_storage_size_stat(int remaining_retry_count)
{
ddebug("start to stat storage size");
Member:
Remove this log line; it serves no purpose.

Contributor Author:
It is still needed: it shows that the collector is working normally, the same as in available_detector.

Member (@acelyc111, Jun 18, 2019):
My feeling is that a periodic task like this does not need to log in the normal case. Log only on the error paths; otherwise the log file fills up with entries like this one.
For debugging there are other tools, such as pstack.

Contributor Author:
It does not produce many log entries, and it does help when tracking down problems, so I'll leave it as is for now.

app_storage_size_stat st_stat;
if (!get_storage_size_stat(&_shell_context, st_stat)) {
if (remaining_retry_count > 0) {
derror("get storage size stat failed, remaining_retry_count = %d, "
"wait %u seconds to retry",
remaining_retry_count,
_st_fetch_retry_wait_seconds);
::dsn::tasking::enqueue(LPC_PEGASUS_ST_STAT_TIMER,
&_tracker,
[=] { on_storage_size_stat(remaining_retry_count - 1); },
0,
std::chrono::seconds(_st_fetch_retry_wait_seconds));
} else {
derror("get storage size stat failed, remaining_retry_count = 0, no retry anymore");
}
return;
}
_result_writer->set_result(st_stat.timestamp, "st", st_stat.dump_to_json());
Member:
If it has to be abbreviated, sz might be a bit better.

Contributor Author:
Shouldn't storage_size be ss?

Contributor Author:
fixed

}

} // namespace server
} // namespace pegasus
10 changes: 9 additions & 1 deletion src/server/info_collector.h
@@ -66,9 +66,11 @@ class info_collector
void on_app_stat();
AppStatCounters *get_app_counters(const std::string &app_name);

void on_capacity_unit_stat();
void on_capacity_unit_stat(int remaining_retry_count);
Member:
Rename it to retry_count; the function itself has no notion of "remaining".

Contributor Author:
I think "remaining" makes the meaning clearer: it is the number of retries still allowed. With a bare retry_count, retry_count=3 could be read as "this call is the third retry".
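
To illustrate the "remaining" semantics, here is a self-contained sketch of the countdown. `fetch_with_retry` is a hypothetical stand-in: the `fetch` callback plays the role of `get_capacity_unit_stat()`, and the delayed re-enqueue on the task queue is replaced by a direct recursive call so that the example runs on its own.

```cpp
#include <functional>

// Hypothetical stand-in for the collector's retry logic (not the PR's code).
// Returns the total number of fetch attempts that were made.
inline int fetch_with_retry(const std::function<bool()> &fetch, int remaining_retry_count)
{
    if (fetch()) {
        return 1; // success: no retry needed
    }
    if (remaining_retry_count > 0) {
        // One retry is consumed, so the next call is allowed one fewer.
        return 1 + fetch_with_retry(fetch, remaining_retry_count - 1);
    }
    return 1; // retries exhausted: give up until the next timer tick
}
```

Starting with `remaining_retry_count = 3`, a fetch that always fails is attempted four times in total: the initial call plus three retries. The parameter always answers "how many more retries are allowed", never "which retry is this".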

bool has_capacity_unit_updated(const std::string &node_address, const std::string &timestamp);

void on_storage_size_stat(int remaining_retry_count);

private:
dsn::task_tracker _tracker;
::dsn::rpc_address _meta_servers;
@@ -86,7 +88,13 @@ class info_collector
// for writing cu stat result
std::unique_ptr<result_writer> _result_writer;
uint32_t _cu_fetch_interval_seconds;
uint32_t _cu_fetch_retry_count;
uint32_t _cu_fetch_retry_wait_seconds;
::dsn::task_ptr _cu_stat_timer_task;
uint32_t _st_fetch_interval_seconds;
uint32_t _st_fetch_retry_count;
uint32_t _st_fetch_retry_wait_seconds;
::dsn::task_ptr _st_stat_timer_task;
::dsn::utils::ex_lock_nr _cu_update_info_lock;
// mapping 'node address' --> 'last updated timestamp'
std::map<std::string, string> _cu_update_info;