feat(handler) expose dataplane status on control plane #98
Conversation
The config_hash metric is useful for catching "DP has inconsistent configs across the cluster for x time". But this will create a new metric every time the config is flipped, so the time series is not continuous.
kong/plugins/prometheus/exporter.lua
                                               {"node_id", "hostname", "ip"})
metrics.dataplane_config_hash = prometheus:gauge("dataplane_config_hash",
                                                 "Config hash numeric value of the data plane",
                                                 {"node_id", "hostname", "ip"})
Would adding the currently expected config hash as another metric series help here?
With that additional time series, one can compare which data planes are out of sync and which ones are in sync.
cc @wyndigo
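A rough sketch of what that could look like in exporter.lua, assuming the control plane knows its expected hash at scrape time; the metric name `expected_config_hash` and the `expected_hash` variable are placeholders, not part of this PR:

```lua
-- Hypothetical additional gauge on the control plane: the config hash it
-- currently expects data planes to apply, declared next to the existing
-- per-data-plane gauges during exporter initialization.
metrics.expected_config_hash = prometheus:gauge("expected_config_hash",
                                                "Config hash numeric value the control plane expects",
                                                {"node_id", "hostname", "ip"})

-- At scrape time (expected_hash stands in for however the CP would obtain
-- its own current hash):
-- metrics.expected_config_hash:set(config_hash_to_number(expected_hash), labels)
```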
Rethinking this, this seems wrong.
The control plane node should export a series with the expected config hash, and data plane nodes should export the current config hash. That minimizes the moving parts and reflects the state of the world much more accurately. (In the current code, we are relying on the control plane-data plane communication not being buggy and reporting the right config hash.)
@hbagdi That's a good point. We can also export the current hash here.
I was also thinking about future metrics we could expose on the DP side. The problem is that if we want to compare metrics from different sides, both metrics need to carry the same set of labels. Or we could use ignoring() on labels, but that seems to push more logic into Prometheus. So I think that even though it's a good idea to expose the hash on the DP side, making it available on the CP is also worthwhile.
We would need to record the current config hash in the control plane, and there's currently no such logic in clustering.lua.
I'm adding a metric for the DP for now; that will also benefit generic DB-less Kong.
I'm not sure I understand. Why would it be so?
So @wyndigo's idea is to convert the config_hash, which is an MD5 hex string, into its numeric value. Then it's no longer a label and we can still compare differences between DPs.
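For reference, a minimal sketch of such a conversion; the actual `config_hash_to_number` helper in exporter.lua may be implemented differently:

```lua
-- Minimal sketch: interpret the 32-character MD5 hex string as a hexadecimal
-- integer so it can be stored as a gauge value. A 128-bit value does not fit
-- a Lua double exactly, so the result is only useful for "same or different"
-- comparisons across nodes, not as a meaningful quantity.
local function config_hash_to_number(hash_str)
  return tonumber(hash_str, 16)
end
```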
    metrics.dataplane_last_seen:set(status.last_seen, labels)
    metrics.dataplane_config_hash:set(config_hash_to_number(status.config_hash), labels)
  end
end
Please do not output the config hash of data planes on the control plane. Output only the data planes this control plane is seeing and the config hash that it expects them to have.
The idea here is to give visibility of two kinds:
- The control plane reports the data planes it is seeing; this gives visibility into connections as they are working.
- The control plane reports the desired hash and data planes report the current hash. The operator can then look for split-brains and similar problems here.
cc @wyndigo Does that align well?
@hbagdi I'm okay with adding an additional config_hash on the DP. Mismatches can then be compared on the DP side as well.
But I feel it's also important to gather it on the CP. Consider use cases where users bring their own DP; it will not be easy to ship metrics around. Actually, the current way of gathering upstream/service metrics on the DP feels odd to me. Ideally those states could be synced across the cluster and at the same time made available on the CP.
Also, as I proposed, even with a managed DP it's likely deployed differently from the CP, which likely makes the labels inconsistent.
Okay, I see the problem.
In that case, we should think about better names for the metrics. `dataplane_config_hash` and `config_current_hash` communicate nothing about where the metrics are coming from or what they mean.
I'm picking up this PR again as there are more metrics about hybrid mode I would like to get in. @hbagdi Do you have suggestions on the naming here? `config_current_hash` is removed as I can't expose it on the CP anymore. We only have `data_plane_*` on the CP now.
if cp_metrics then
  -- Cleanup old metrics
  metrics.data_plane_last_seen:reset()
  metrics.data_plane_config_hash:reset()
The code will reset the metrics and then could fail to list data planes on line 377.
I suggest we do the following:
- list the data planes and check err
- reset the metrics
- start the loop to populate the metrics (a sketch of this ordering follows below)
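A minimal sketch of that ordering, with a hypothetical `list_data_planes()` helper standing in for whatever DAO call the exporter actually uses:

```lua
-- 1. List the data planes first and bail out on error, keeping the previously
--    exported values instead of leaving half-reset metrics behind.
local data_planes, err = list_data_planes()  -- hypothetical helper
if err then
  kong.log.err("failed to list data planes: ", err)
  return
end

-- 2. Reset only once we know we have fresh data to repopulate with.
metrics.data_plane_last_seen:reset()
metrics.data_plane_config_hash:reset()

-- 3. Populate the metrics.
for _, data_plane in ipairs(data_planes) do
  local labels = { data_plane.id, data_plane.hostname, data_plane.ip }
  metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
  metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
end
```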
I think it would be more useful to just fail hard and not emit any DP metrics at all on a DB error.
Even if we list the DPs first, there's still a possibility that we expose partial data.
The one case that can be solved is when no DPs are listed at all, in which case we keep exposing the old metrics.
But that may silence the errors to the end user: if the plugin continuously fails to list the data planes, it's not detectable from the Prometheus side at all. Maybe a "kong_prometheus_errors" counter would help (sketched below), but that's a different story.
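For what it's worth, such a counter could look roughly like this; the metric name and wiring are assumptions, not part of this PR:

```lua
-- Hypothetical error counter so repeated collection failures become visible
-- on the Prometheus side instead of being silently swallowed.
metrics.exporter_errors = prometheus:counter("exporter_errors_total",
                                             "Errors encountered while collecting metrics",
                                             {"source"})

-- e.g. when listing data planes fails:
-- metrics.exporter_errors:inc(1, { "clustering_data_planes" })
```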
This highlights another problem. From what I can read, it seems that every time there is a request on /metrics, there will be a SQL query. Is that right? If yes, that seems like too much computation to do for each request on /metrics. Can we please introduce some form of caching here?
Synced with @hbagdi offline; we will continue with the current approach for now to avoid adding complexity to the code.
Added https://github.com/Kong/kong-plugin-prometheus/issues/132 for tracking it.
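For the record, the caching idea tracked there could be as simple as memoizing the listing for a few seconds; the TTL and the `list_data_planes()` helper below are illustrative assumptions:

```lua
local cached_data_planes
local cached_at = 0
local CACHE_TTL = 10  -- seconds, arbitrary value for illustration

local function get_data_planes()
  -- Serve the cached result while it is fresh so every scrape of /metrics
  -- does not translate into a database query.
  local now = ngx.now()
  if cached_data_planes and (now - cached_at) < CACHE_TTL then
    return cached_data_planes
  end

  local rows, err = list_data_planes()  -- hypothetical DAO helper, as above
  if err then
    return nil, err
  end

  cached_data_planes, cached_at = rows, now
  return cached_data_planes
end
```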
local labels = { data_plane.id, data_plane.hostname, data_plane.ip }

metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
What is the utility of this number?
It can help detect if a data plane is stuck at a different version than the others, but this is not a linearly increasing number, so the absolute value is of no help on its own.
@wyndigo Any thoughts?
Yes, this is to detect inconsistent config between DPs. Ideally they should be compared against the hash on the current CP, but due to the API change I'm currently not able to just export it; that needs some change on the Kong side. I will address that in later PRs.
But for now, inconsistency of config hashes across DPs already indicates something bad. One can monitor for it with something like `count(count_values("config_hash", data_plane_config_hash)) > 1`.
- Fix exporter to attach subsystem label to memory stats [#118](Kong/kong-plugin-prometheus#118)
- Expose dataplane status on control plane; new metrics `data_plane_last_seen`, `data_plane_config_hash` and `data_plane_version_compatible` are added. [#98](Kong/kong-plugin-prometheus#98)
This PR adds a series of metrics to expose connected Data Plane status on the Control Plane side.
sample output