This repository has been archived by the owner on Aug 7, 2021. It is now read-only.

feat(handler) expose dataplane status on control plane #98

Merged: 7 commits merged into master from feat/cp-status on May 27, 2021

Conversation

@fffonion (Contributor) commented Sep 15, 2020

This PR adds a series of metrics that expose the status of connected Data Planes on the Control Plane
side.

Sample output:

curl -s localhost:8001/metrics|grep data_plane
# HELP kong_data_plane_config_hash Config hash value of the data plane
# TYPE kong_data_plane_config_hash gauge
kong_data_plane_config_hash{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1"} 1.7158931820287e+38
# HELP kong_data_plane_last_seen Last time data plane contacted control plane
# TYPE kong_data_plane_last_seen gauge
kong_data_plane_last_seen{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1"} 1600190275
# HELP kong_data_plane_version_compatible Version compatible status of the data plane, 0 is incompatible
# TYPE kong_data_plane_version_compatible gauge
kong_data_plane_version_compatible{node_id="d4e7584e-b2f2-415b-bb68-3b0936f1fde3",hostname="ubuntu-bionic",ip="127.0.0.1",kong_version="2.4.1"} 1

@fffonion fffonion requested review from hbagdi and a team September 15, 2020 11:15
@fffonion (Contributor Author)

The config_hash metric is useful for catching "DP has inconsistent configs across the cluster for x time".

But this will create a new metric every time the config is flipped, so the time series is not continuous.
We need to verify whether that will cause trouble in alerting. For example, I can imagine having a
count(kong_dataplane_last_seen) query for the expected data plane count.

@fffonion fffonion force-pushed the feat/cp-status branch 2 times, most recently from ec2ff51 to 8a1f0a2 on September 15, 2020 12:52
kong/plugins/prometheus/exporter.lua (Outdated)
{"node_id", "hostname", "ip"})
metrics.dataplane_config_hash = prometheus:gauge("dataplane_config_hash",
"Config hash numeric value of the data plane",
{"node_id", "hostname", "ip"})
Member

Would adding the currently expected config hash as another metric series help here?
With that additional time series, one could compare which data planes are out of sync and which ones are in sync.
cc @wyndigo

Member

Rethinking this, this seems wrong.
The control plane node should export a series for the expected config hash, and data plane nodes should export their current config hash. That minimizes the moving parts and reflects the state of the world much more accurately. (In the current code, we are relying on the communication between the control plane and the data planes not being buggy and reporting the right config hash.)

Contributor Author

@hbagdi That's a good point. We can also export the current hash here.
I was also thinking about future metrics we could expose on the DP side. The problem is that if we want to compare metrics from different sides, both metrics should contain the same set of labels. Or we could use ignoring() on labels, but that seems to put more logic into Prometheus. So I think that, although it's a good idea to expose the hash on the DP side, making it available on the CP is also worthwhile.

Contributor Author

We need a record of the current config hash on the control plane; there's currently no such logic in clustering.lua.
I'm adding a metric for the DP for now, which will also benefit generic DB-less Kong.

@hbagdi (Member) commented Sep 16, 2020

> But this will create a new metric every time the config is flipped, so the time series is not continuous.

I'm not sure I understand. Why would it be so?

@fffonion (Contributor Author)

> But this will create a new metric every time the config is flipped, so the time series is not continuous.

> I'm not sure I understand. Why would it be so?

If we put config_hash as a label on the metrics, you would expect, for example,
dataplane_last_seen{node_id="UUID", config_hash="hash1"} to exist before the flip and
dataplane_last_seen{node_id="UUID", config_hash="hash2"} to exist after it. There would be two colored lines in
the Prometheus graph, but they actually refer to the same data plane node.

So @wyndigo's idea is to convert the config_hash, which is an MD5 hex string, into its numeric value. Then it's no longer a label and we can still compare differences between DPs.
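For illustration, a minimal sketch of what such a hash-to-number conversion could look like in Lua (the actual config_hash_to_number() helper in exporter.lua may differ):

```lua
-- Sketch only: convert the MD5 hex string into a Lua number so it can be used
-- as a gauge value instead of a label. The exact helper in exporter.lua may differ.
local function config_hash_to_number(hash_str)
  -- A 32-character hex string interpreted as base 16 becomes a large double
  -- (e.g. 1.7158931820287e+38 in the sample output above). Precision is lost,
  -- but the value is still stable enough to compare across data planes.
  return tonumber("0x" .. hash_str)
end
```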

    metrics.dataplane_last_seen:set(status.last_seen, labels)
    metrics.dataplane_config_hash:set(config_hash_to_number(status.config_hash), labels)
  end
end
Member

Please do not output the config hash of the data planes on the control plane. Output only the data planes this control plane is seeing and the config hash that it expects them to have.

The idea here is to give visibility of two kinds:

  1. The control plane reports the data planes it is seeing; this gives visibility into whether the connections are working.
  2. The control plane reports the desired hash and the data planes report their current hash. The operator can then look for split-brain and similar problems here (a sketch follows below).

cc @wyndigo Does that align well?
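A hypothetical sketch of that two-sided split, reusing the prometheus:gauge() style shown elsewhere in this review. The metric names, label sets, and the CP/DP placement are illustrative assumptions, not what was eventually merged:

```lua
-- Sketch only: one gauge exported by the control plane, one by each data plane.
-- Names and labels are hypothetical.

-- Control plane side: the config hash the CP currently expects DPs to apply.
metrics.control_plane_expected_config_hash = prometheus:gauge(
  "control_plane_expected_config_hash",
  "Config hash the control plane expects data planes to be running",
  {"node_id", "hostname"})

-- Data plane side: the config hash this DP has actually applied.
metrics.data_plane_current_config_hash = prometheus:gauge(
  "data_plane_current_config_hash",
  "Config hash currently applied by this data plane",
  {"node_id", "hostname", "ip"})
```

Comparing the two series would then show which data planes are out of sync with the control plane.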

Contributor Author

@hbagdi I'm okay with adding an additional config_hash metric on the DP. Mismatches could then be compared on the DP side as well.
But I feel it's also important to gather it on the CP. Consider use cases where users bring their own DPs: it will not be easy to ship metrics around. Actually, the current way of gathering upstream/service metrics on the DP feels odd to me. Ideally those states could be synced across the cluster and at the same time made available on the CP.
Also, as I proposed, even with managed DPs they are likely deployed in a different way than the CP, which likely makes the labels inconsistent.

Member

Okay, I see the problem.

In that case, we should think about better names for the metrics.
dataplane_config_hash and config_current_hash communicate nothing about where the metrics are coming from or what they mean.

@fffonion (Contributor Author) commented May 17, 2021

I'm picking up this PR again, as there are more hybrid-mode metrics I would like to
get in. @hbagdi Do you have suggestions on the naming here? config_current_hash is
removed, as I can't expose it on the CP anymore. We only have data_plane_* on the CP now.

if cp_metrics then
  -- Cleanup old metrics
  metrics.data_plane_last_seen:reset()
  metrics.data_plane_config_hash:reset()
Member

The code will reset the metrics and then could fail to list the data planes on line 377.
I suggest we do the following (a sketch follows the list):

  1. list the data planes and check for errors
  2. reset the metrics
  3. start the loop to populate the metrics.
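A minimal sketch of that ordering, assuming the clustering_data_planes DAO and the metric objects discussed in this review (the page() call and error handling are assumptions; the merged code may differ):

```lua
-- Sketch only: list first, then reset, then repopulate.
-- kong.db.clustering_data_planes:page() is an assumed way to list data planes.
local data_planes, err = kong.db.clustering_data_planes:page()
if err then
  kong.log.err("failed to list data planes: ", err)
  return  -- keep the previously exported series rather than resetting them
end

-- Reset only once we know fresh data is available to replace the old series.
metrics.data_plane_last_seen:reset()
metrics.data_plane_config_hash:reset()

for _, data_plane in ipairs(data_planes) do
  local labels = { data_plane.id, data_plane.hostname, data_plane.ip }
  metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
  metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
end
```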

Contributor Author

I think it would be useful to just fail hard and not emit any DP metrics at all on a DB error.
Actually, if we list the DPs first, there is still a possibility that we expose partial data.
The one case that would be solved is when no DPs are listed at all and we keep exposing the old metrics,
but that may silence the errors for the end user: if the plugin keeps failing to list the data planes,
that is not detectable at all on the Prometheus side. Maybe a "kong_prometheus_errors" counter would help, but
that's a different story.
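For illustration, a hypothetical sketch of the error counter mused about above, in the same prometheus:counter() style as the rest of the exporter (the metric name, label, and call site are assumptions; nothing like this is part of this PR):

```lua
-- Sketch only: a counter that would make collection failures visible to Prometheus.
metrics.prometheus_errors = prometheus:counter("prometheus_errors",
  "Errors encountered while collecting metrics",
  {"reason"})

-- ...then, wherever listing the data planes fails:
-- metrics.prometheus_errors:inc(1, {"clustering_data_planes"})
```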

Member

This highlights another problem. From what I can read, it seems that every time there is a request on /metrics, there will be a SQL query. Is that right? If so, that seems like too much work to do for each request on /metrics. Can we please introduce some form of caching here?
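A minimal sketch of what such caching could look like, assuming a simple TTL cache inside the exporter (the TTL, names, and the DAO call are illustrative; as the next comment notes, caching was deferred to a follow-up issue rather than implemented here):

```lua
-- Sketch only: time-based cache so /metrics does not hit the database on every scrape.
-- The 5-second TTL and the page() call are assumptions.
local CACHE_TTL = 5  -- seconds
local cached_data_planes
local cached_at = 0

local function get_data_planes()
  if cached_data_planes and (ngx.now() - cached_at) < CACHE_TTL then
    return cached_data_planes
  end

  local data_planes, err = kong.db.clustering_data_planes:page()
  if err then
    return nil, err
  end

  cached_data_planes = data_planes
  cached_at = ngx.now()
  return cached_data_planes
end
```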

Contributor Author

Synced with @hbagdi offline; we will continue with the current approach for now to avoid adding complexity to the code.
Added https://github.com/Kong/kong-plugin-prometheus/issues/132 for tracking it.

local labels = { data_plane.id, data_plane.hostname, data_plane.ip }

metrics.data_plane_last_seen:set(data_plane.last_seen, labels)
metrics.data_plane_config_hash:set(config_hash_to_number(data_plane.config_hash), labels)
Member

What is the utility of this number?
It can help detect whether a data plane is stuck at a different config version than the others, but since this is not a linearly increasing number, the absolute value itself is of no help.

@wyndigo Any thoughts?

Contributor Author

Yes, this is to detect inconsistent config between DPs. Ideally they should be compared against the hash on the current CP,
but due to the API change I'm currently not able to export that; it needs some changes on the Kong side.
I will address that in later PRs.
But for now, inconsistent config hashes across DPs already indicate something bad. One can monitor that
with a count_values() query over kong_data_plane_config_hash, for example something like count(count_values("config_hash", kong_data_plane_config_hash)) > 1.

@fffonion fffonion merged commit 26d4190 into master May 27, 2021
@fffonion fffonion deleted the feat/cp-status branch May 27, 2021 07:50
fffonion added a commit to Kong/kong that referenced this pull request Jun 2, 2021
- Fix exporter to attach subsystem label to memory stats
  [#118](Kong/kong-plugin-prometheus#118)
- Expose data plane status on the control plane; new metrics `data_plane_last_seen`,
  `data_plane_config_hash`, and `data_plane_version_compatible` are added.
  [#98](Kong/kong-plugin-prometheus#98)
dndx pushed a commit to Kong/kong that referenced this pull request Jun 2, 2021