This repository has been archived by the owner on Aug 7, 2021. It is now read-only.

feat(handler) expose dataplane status on control plane #98

Merged: 7 commits merged into master from feat/cp-status on May 27, 2021

Conversation

@fffonion (Contributor) commented Sep 15, 2020

This PR adds a series of metrics that expose the status of connected Data Planes on the Control Plane
side.

Sample output:

curl -s localhost:8001/metrics|grep data_plane
# HELP kong_data_plane_config_hash Config hash value of the data plane
# TYPE kong_data_plane_config_hash gauge
kong_data_plane_config_hash{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1"} 1.7158931820287e+38
# HELP kong_data_plane_last_seen Last time data plane contacted control plane
# TYPE kong_data_plane_last_seen gauge
kong_data_plane_last_seen{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1"} 1600190275
# HELP kong_data_plane_version_compatible Version compatible status of the data plane, 0 is incompatible
# TYPE kong_data_plane_version_compatible gauge
kong_data_plane_version_compatible{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1",kong_version="2.4.1"} 1

@fffonion fffonion requested review from hbagdi and a team September 15, 2020 11:15
@fffonion (Contributor Author)

The config_hash metric is useful for catching "DP has inconsistent configs across the cluster for x time".

But this will create a new metric every time the config is flipped, so the time series is not continuous.
We need to verify whether that will cause trouble in alerting. For example, I can imagine having a
count(kong_dataplane_last_seen) query for the expected data plane count.

@fffonion fffonion force-pushed the feat/cp-status branch 2 times, most recently from ec2ff51 to 8a1f0a2 on September 15, 2020 12:52
kong/plugins/prometheus/exporter.lua (Outdated)
{"node_id", "hostname", "ip"})
metrics.dataplane_config_hash = prometheus:gauge("dataplane_config_hash",
"Config hash numeric value of the data plane",
{"node_id", "hostname", "ip"})
Member

Would adding the currently expected config hash as another metric series help here?
With that additional time series, one could compare which data planes are out of sync and which ones are in sync.
cc @wyndigo

Member

Rethinking this, this seems wrong.
The control plane node should export a series for the expected config hash, and data plane nodes should export their current config hash. That minimizes the moving parts and reflects the state of the world much more accurately. (In the current code, we are relying on the communication between the control plane and the data planes not being buggy and reporting the right config hash.)

Contributor Author

@hbagdi That's a good point. We can also export the current hash here.
I was also thinking about future metrics we could expose on the DP side. The problem is that if we want to compare metrics from different sides, both metrics should contain the same set of labels. Or we could use ignoring() on labels, but that seems to put more logic into Prometheus. So I think that, although it's a good idea to expose the hash on the DP side, making it available on the CP is also worthwhile.

Contributor Author

We need a record of the current config hash on the control plane; there's currently no such logic in clustering.lua.
I'm adding a metric for the DP for now, which will also benefit generic DB-less Kong.

@hbagdi (Member) commented Sep 16, 2020

> But this will create a new metric every time the config is flipped, so the time series is not continuous.

I'm not sure I understand. Why would it be so?

@fffonion (Contributor Author)

> But this will create a new metric every time the config is flipped, so the time series is not continuous.

> I'm not sure I understand. Why would it be so?

If we put config_hash as a label on the metrics, you would expect, for example,
dataplane_last_seen{node_id="UUID", config_hash="hash1"} to exist before the flip and
dataplane_last_seen{node_id="UUID", config_hash="hash2"} to exist after it. There would be two colored lines in
the Prometheus graph, but they actually refer to the same data plane node.

So @wyndigo's idea is to convert the config_hash, which is an MD5 hex string, into its numeric value. Then it's no longer a label and we can still compare differences between DPs.
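For illustration, a minimal sketch of what such a hash-to-number conversion could look like in Lua (the actual config_hash_to_number() helper in exporter.lua may differ):

```lua
-- Sketch only: convert the MD5 hex string into a Lua number so it can be used
-- as a gauge value instead of a label. The exact helper in exporter.lua may differ.
local function config_hash_to_number(hash_str)
  -- A 32-character hex string interpreted as base 16 becomes a large double
  -- (e.g. 1.7158931820287e+38 in the sample output above). Precision is lost,
  -- but the value is still stable enough to compare across data planes.
  return tonumber("0x" .. hash_str)
end
```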

    metrics.dataplane_last_seen:set(status.last_seen, labels)
    metrics.dataplane_config_hash:set(config_hash_to_number(status.config_hash), labels)
  end
end
Member

Please do not output the config hash of the data planes on the control plane. Output only the data planes this control plane is seeing and the config hash that it expects them to have.

The idea here is to give visibility of two kinds:

  1. The control plane reports the data planes it is seeing; this gives visibility into whether the connections are working.
  2. The control plane reports the desired hash and the data planes report their current hash. The operator can then look for split-brain and similar problems here (a sketch follows below).

cc @wyndigo Does that align well?
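A hypothetical sketch of that two-sided split, reusing the prometheus:gauge() style shown elsewhere in this review. The metric names, label sets, and the CP/DP placement are illustrative assumptions, not what was eventually merged:

```lua
-- Sketch only: one gauge exported by the control plane, one by each data plane.
-- Names and labels are hypothetical.

-- Control plane side: the config hash the CP currently expects DPs to apply.
metrics.control_plane_expected_config_hash = prometheus:gauge(
  "control_plane_expected_config_hash",
  "Config hash the control plane expects data planes to be running",
  {"node_id", "hostname"})

-- Data plane side: the config hash this DP has actually applied.
metrics.data_plane_current_config_hash = prometheus:gauge(
  "data_plane_current_config_hash",
  "Config hash currently applied by this data plane",
  {"node_id", "hostname", "ip"})
```

Comparing the two series would then show which data planes are out of sync with the control plane.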

Contributor Author

@hbagdi I'm okay with adding an additional config_hash metric on the DP. Mismatches could then be compared on the DP side as well.
But I feel it's also important to gather it on the CP. Consider use cases where users bring their own DPs: it will not be easy to ship metrics around. Actually, the current way of gathering upstream/service metrics on the DP feels odd to me. Ideally those states could be synced across the cluster and at the same time made available on the CP.
Also, as I proposed, even with managed DPs they are likely deployed in a different way than the CP, which likely makes the labels inconsistent.

Member

Okay, I see the problem.

In that case, we should think about better names for the metrics.
dataplane_config_hash and config_current_hash communicate nothing about where the metrics are coming from or what they mean.

@fffonion (Contributor Author) commented May 17, 2021

I'm picking up this PR again, as there are more hybrid-mode metrics I would like to
get in. @hbagdi Do you have suggestions on the naming here? config_current_hash is
removed, as I can't expose it on the CP anymore. We only have data_plane_* on the CP now.

if cp_metrics then
  -- Cleanup old metrics
  metrics.data_plane_last_seen:reset()
  metrics.data_plane_config_hash:reset()
Member

The code will reset the metrics and then could fail to list the data planes on line 377.
I suggest we do the following (a sketch follows the list):

  1. list the data planes and check for errors
  2. reset the metrics
  3. start the loop to populate the metrics.
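A minimal sketch of that ordering, assuming the clustering_data_planes DAO and the metric objects discussed in this review (the page() call and error handling are assumptions; the merged code may differ):

```lua
-- Sketch only: list first, then reset, then repopulate.
-- kong.db.clustering_data_planes:page() is an assumed way to list data planes.
local data_planes, err = kong.db.clustering_data_planes:page()
if err then
  kong.log.err("failed to list data planes: ", err)
  return  -- keep the previously exported series rather than resetting them
end

-- Reset only once we know fresh data is available to replace the old series.
metrics.data_plane_last_seen:reset()
metrics.data_plane_config_hash:reset()

for _, data_plane in ipairs(data_planes) do
  local labels = { data_plane.id, data_plane.hostname, data_plane.ip }
  metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
  metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
end
```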

Contributor Author

I think it would be useful to just fail hard and not emit any DP metrics at all on a DB error.
Actually, if we list the DPs first, there is still a possibility that we expose partial data.
The one case that would be solved is when no DPs are listed at all and we keep exposing the old metrics,
but that may silence the errors for the end user: if the plugin keeps failing to list the data planes,
that is not detectable at all on the Prometheus side. Maybe a "kong_prometheus_errors" counter would help, but
that's a different story.
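For illustration, a hypothetical sketch of the error counter mused about above, in the same prometheus:counter() style as the rest of the exporter (the metric name, label, and call site are assumptions; nothing like this is part of this PR):

```lua
-- Sketch only: a counter that would make collection failures visible to Prometheus.
metrics.prometheus_errors = prometheus:counter("prometheus_errors",
  "Errors encountered while collecting metrics",
  {"reason"})

-- ...then, wherever listing the data planes fails:
-- metrics.prometheus_errors:inc(1, {"clustering_data_planes"})
```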

Member

This highlights another problem. From what I can read, it seems that every time there is a request on /metrics, there will be a SQL query. Is that right? If so, that seems like too much work to do for each request on /metrics. Can we please introduce some form of caching here?
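A minimal sketch of what such caching could look like, assuming a simple TTL cache inside the exporter (the TTL, names, and the DAO call are illustrative; as the next comment notes, caching was deferred to a follow-up issue rather than implemented here):

```lua
-- Sketch only: time-based cache so /metrics does not hit the database on every scrape.
-- The 5-second TTL and the page() call are assumptions.
local CACHE_TTL = 5  -- seconds
local cached_data_planes
local cached_at = 0

local function get_data_planes()
  if cached_data_planes and (ngx.now() - cached_at) < CACHE_TTL then
    return cached_data_planes
  end

  local data_planes, err = kong.db.clustering_data_planes:page()
  if err then
    return nil, err
  end

  cached_data_planes = data_planes
  cached_at = ngx.now()
  return cached_data_planes
end
```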

Contributor Author

Synced with @hbagdi offline; we will continue with the current approach for now to avoid adding complexity to the code.
Added https://github.com/Kong/kong-plugin-prometheus/issues/132 for tracking it.

local labels = { data_plane.id, data_plane.hostname, data_plane.ip }

metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
Member

What is the utility of this number?
It can help detect whether a data plane is stuck at a different config version than the others, but since this is not a linearly increasing number, the absolute value itself is of no help.

@wyndigo Any thoughts?

Contributor Author

Yes, this is to detect inconsistent config between DPs. Ideally they should be compared against the hash on the current CP,
but due to the API change I'm currently not able to export that; it needs some changes on the Kong side.
I will address that in later PRs.
But for now, inconsistent config hashes across DPs already indicate something bad. One can monitor that
with a count_values() query over kong_data_plane_config_hash, for example something like count(count_values("config_hash", kong_data_plane_config_hash)) > 1.

@fffonion fffonion merged commit 26d4190 into master May 27, 2021
@fffonion fffonion deleted the feat/cp-status branch May 27, 2021 07:50
fffonion added a commit to Kong/kong that referenced this pull request Jun 2, 2021
- Fix exporter to attach subsystem label to memory stats
  [#118](Kong/kong-plugin-prometheus#118)
- Expose data plane status on the control plane; new metrics `data_plane_last_seen`,
  `data_plane_config_hash`, and `data_plane_version_compatible` are added.
  [#98](Kong/kong-plugin-prometheus#98)
dndx pushed a commit to Kong/kong that referenced this pull request Jun 2, 2021