chore(networkmonitor): tool to discover and provide metrics on peers #1290

alrevuelta · 2022-10-21T15:20:19Z

Related to #1010

Description:

Adds a tool that constantly tries to discover new peers in the network.
It also tries to connect to them.
See README.md for instructions and available metrics.
The goal is to run a long-lived instance of this tool, with a Grafana dashboard we can monitor.
Some metrics are exposed to Prometheus to be fetched. Labels are constrained according to @jakubgs's feedback.
Other metrics, related to individual peers' information, are exposed directly via a REST API, ready to be accessed by e.g. Grafana. Note that these metrics are not exposed to Prometheus.

Limitations:

Naive implementation, with all data stored in memory. If the network scales to thousands of nodes, performance might be an issue.
Routing table is not emptied.

Usage:

$ make networkmonitor
$ ./build/networkmonitor --log-level=INFO --b="enr:-Nm4QOdTOKZJKTUUZ4O_W932CXIET-M9NamewDnL78P5u9DOGnZlK0JFZ4k0inkfe6iY-0JAaJVovZXc575VV3njeiABgmlkgnY0gmlwhAjS3ueKbXVsdGlhZGRyc7g6ADg2MW5vZGUtMDEuYWMtY24taG9uZ2tvbmctYy53YWt1djIucHJvZC5zdGF0dXNpbS5uZXQGH0DeA4lzZWNwMjU2azGhAo0C-VvfgHiXrxZi3umDiooXMGY9FvYj5_d1Q4EeS7eyg3RjcIJ2X4N1ZHCCIyiFd2FrdTIP"

And:

See prometheus metrics http://localhost:8008/metrics
See custom metrics http://localhost:8009/allpeersinfo

status-im-auto · 2022-10-21T15:37:26Z

Jenkins Builds

Click to see older builds (24)

❔	Commit	#️⃣	Finished (UTC)	Duration	Platform	Result
❌	`ed5418d`	#1	2022-10-21 15:37:25	~16 min	`linux`	📄`log`
✔️	`ed5418d`	#1	2022-10-21 15:44:58	~24 min	`macos`	📦`bin`

❌	`a0723ef`	#2	2022-10-25 18:12:32	~15 min	`linux`	📄`log`
✔️	`a0723ef`	#2	2022-10-25 18:18:04	~20 min	`macos`	📦`bin`

✔️	`99d06c6`	#3	2022-10-26 07:27:04	~15 min	`linux`	📦`bin`
✔️	`99d06c6`	#3	2022-10-26 07:31:58	~20 min	`macos`	📦`bin`

❌	`c1ced43`	#4	2022-10-26 15:56:22	~18 min	`linux`	📄`log`
✔️	`c1ced43`	#4	2022-10-26 15:59:43	~22 min	`macos`	📦`bin`

✔️	`3a1e5ed`	#5	2022-10-27 10:38:54	~18 min	`linux`	📦`bin`
✔️	`3a1e5ed`	#5	2022-10-27 10:42:26	~21 min	`macos`	📦`bin`

✔️	`47c0e75`	#6	2022-11-01 16:31:39	~16 min	`linux`	📦`bin`
✔️	`47c0e75`	#6	2022-11-01 16:36:21	~20 min	`macos`	📦`bin`

❌	`4a6f160`	#7	2022-11-01 16:41:42	~17 min	`linux`	📄`log`
✔️	`4a6f160`	#7	2022-11-01 16:47:57	~23 min	`macos`	📦`bin`

✔️	`3d4e426`	#8	2022-11-03 10:09:33	~18 min	`linux`	📦`bin`
✔️	`3d4e426`	#8	2022-11-03 10:14:52	~24 min	`macos`	📦`bin`

✔️	`08d936a`	#9	2022-11-03 10:17:14	~18 min	`linux`	📦`bin`
✔️	`08d936a`	#9	2022-11-03 10:20:05	~20 min	`macos`	📦`bin`

✔️	`0102b78`	#10	2022-11-04 12:47:05	~17 min	`linux`	📦`bin`
✔️	`0102b78`	#10	2022-11-04 12:53:25	~23 min	`macos`	📦`bin`

✔️	`dc6478b`	#11	2022-11-04 12:50:54	~17 min	`linux`	📦`bin`
✔️	`dc6478b`	#11	2022-11-04 12:54:31	~20 min	`macos`	📦`bin`

✔️	`4de12f2`	#12	2022-11-07 16:05:24	~16 min	`macos`	📦`bin`
✔️	`4de12f2`	#12	2022-11-07 16:09:17	~20 min	`linux`	📦`bin`

❔	Commit	#️⃣	Finished (UTC)	Duration	Platform	Result
✔️	`27fcb37`	#13	2022-11-08 17:04:06	~18 min	`linux`	📦`bin`
✔️	`27fcb37`	#13	2022-11-08 17:04:15	~18 min	`macos`	📦`bin`

✔️	`725cd9f`	#14	2022-11-10 07:32:40	~15 min	`linux`	📦`bin`
✔️	`725cd9f`	#14	2022-11-10 07:34:40	~17 min	`macos`	📦`bin`

tools/networkmonitor/networkmonitor.nim

jakubgs

You cannot put unconstrained values like ENRs in labels for Prometheus metrics. This will lead to an inevitable explosion in cardinality, which will kill the performance of any Prometheus instance or remote write backend:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

https://prometheus.io/docs/practices/naming/#labels

The only way around that is using some kind of limited set of buckets.

alrevuelta · 2022-10-28T12:24:07Z

Expanding a bit on the issue brought up by @jakubgs.

Problem:

I want to expose strings as metrics, which is not trivial but can be achieved with labels.

declarePublicGauge networkmonitor_example, "Description", labels = ["peerId", "info"]

We update it with a dummy 0.0 value, ok.

networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection1"])
networkmonitor_example.set(0.0, labelValues = ["peerId2", "ip2,enr2,wakuflags2,lastconnection2"])

But if we modify one label value (here peerId1 goes from lastconnection1 to lastconnection2), we get another time series. That is something that can break Prometheus at some point.

networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection2"])

Proof of it: peerId1 now has two datapoints, and we only need one.

networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 1666957272.0
networkmonitor_example{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 1666957272.0

networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 1666957272.0

Solution

As suggested by Jakub, I tried using a limited set of buckets, but since I need strings I had to use labels, which brings back the same issue as at the beginning. However, I found an alternative that should fix the problem of having infinite time-series points.

Going into the metrics tables, it's possible to update an existing key if it is already present, or create a new one if not. With this we no longer have an ever-increasing number of time-series points.

proc addOrUpdate(peerId: string, content: string, metricGauge: Gauge) =
  # Look for an existing series whose first label value is this peerId.
  for i in metricGauge.metrics.keys:
    if i[0] == peerId:
      echo "already there, update it"
      # Overwrite the second label value in place instead of creating a new series.
      metricGauge.metrics[@[i[0], i[1]]][0].labelValues[1] = content
      return
  echo "not there, add new field"
  metricGauge.set(0.0, labelValues = [peerId, content])

Long story short, it's possible to have a key-value-like set of metrics using labels, with a constrained number of time-series points.

addOrUpdate("peerId2", "ip1,enr1,wakuflags1,lastconnection2", networkmonitor_example)

@jakubgs :

Do you think this solves the issue you brought up? I have verified it by looking at localhost:8008/metrics and observing that no new points are added with new label combinations. That should be enough.
There will of course be one entry per active peer in the network. Luckily this is around 10, and it will take time to reach levels where Prometheus can't handle it. I recall you mentioned a few hundred. If you think it's safer, I can add a threshold of x entries, and after that just error and drop values.
Note that this is temporary. If the Waku network grows to thousands of nodes, we will of course have a proper DB.

jakubgs · 2022-10-28T22:47:40Z

No, this does not solve the issue as peer ids are unconstrained. Also not sure why you think you can constrain the number of labels created based on runtime data of a service that can stop and start. Every restart will make it forget what labels it has already created, in effect nullifying any attempt at constraining the number of values.

The only solutions are:

Not use any unconstrained values for labels like peer IDs at all.
Use a specific constrained set of buckets for values of labels that cannot grow.
Simply not use Prometheus as backend for data it was not intended for.
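
As a rough illustration of the second option, the sketch below maps an unbounded string onto a small fixed set of label values before it is counted. It uses only the Nim standard library; the bucket names, the sample user-agent strings, and the `toBucket` helper are assumptions made up for this example and are not taken from the PR, and a plain `CountTable` stands in for a labelled gauge.

```nim
import std/[strutils, tables]

# Hypothetical fixed label set; these names are illustrative only.
const AgentBuckets = ["nwaku", "go-waku", "js-waku"]
const OtherBucket = "other"

proc toBucket(userAgent: string): string =
  ## Maps an arbitrary user-agent string to one of a fixed set of label values.
  let lowered = userAgent.toLowerAscii()
  for bucket in AgentBuckets:
    if lowered.startsWith(bucket):
      return bucket
  # Anything unrecognised collapses into a single catch-all bucket.
  result = OtherBucket

# Stand-in for a labelled gauge: one counter per bucket, never one per peer.
var peersPerAgent = initCountTable[string]()
for agent in ["nwaku/v0.12.0", "go-waku/0.2.1", "nwaku/v0.13.0", "unknown-client-7f3a"]:
  peersPerAgent.inc(toBucket(agent))
```

However many peers are seen, and however often the process restarts, the number of distinct label values can never exceed the number of buckets plus one.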

alrevuelta · 2022-11-01T09:22:23Z

Thanks for the input @jakubgs. Perhaps Prometheus is overkill for this use case, so I would go with "Simply not use Prometheus as backend for data it was not intended for." I have found the "Grafana Infinity Datasource" plugin, which allows me to get all the info I want to display via HTTP GET. So I will connect Grafana directly to this tool (networkmonitor) to display this data without Prometheus in between. Anything against that?

alrevuelta · 2022-11-01T16:31:30Z

Since it didn't make sense to expose the "metrics" of the peers through Prometheus, I ended up setting up a custom REST API that can be used to fetch the peers' information. The idea is for a frontend like Grafana to use this information to display it. Note that on top of this other Prometheus metrics exist, but these are mere gauges. See README.

All input is appreciated!

rymnc · 2022-11-02T07:20:36Z

Do we really need the location information?

alrevuelta · 2022-11-02T07:43:14Z

@rymnc

Do we really need the location information?

Anyone can see it, so why not? I think it's interesting to see how decentralized the network is in terms of location.

LNSD · 2022-11-03T12:11:56Z

Is this PR ready to be reviewed? If not, can you put it into Draft mode? 👀

I see many TODOs scattered around the code, and also, there's a PR to merge into this (#1335)

alrevuelta · 2022-11-03T13:01:24Z

@LNSD Yes, it is ready for review, that's why I requested it to be reviewed. There are some TODOs but will be addressed in other PRs.

Regarding #1335 I will merge it to master once this one is merged, but it's based on this PR for a clearer diff. Note that #1335 is also ready for review.

jm-clius

This is a very useful tool! Thanks for thinking creatively about what we'd need here. I've added some comments below, mostly related to separating some of the logical components for future extensibility if this is to become a more general tool (such as reporting more capabilities of nodes once we have a capability discovery protocol). If suggestions are unclear, feel free to ping me to clarify. Also happy if some parts are left for future PRs.

tools/networkmonitor/networkmonitor.nim

tools/networkmonitor/networkmonitor_utils.nim

jm-clius · 2022-11-03T17:32:03Z

tools/networkmonitor/networkmonitor.nim

+    await sleepAsync(conf.refreshInterval * 1000 * 60)
+
+when isMainModule:
+  waitFor main()


I like that we keep apps simple, but perhaps to remain consistent with other apps the individual procedural blocks in main() can be extracted to separate procs. The main processing loop (under while true:) can then be named, scheduled with a timer and we'll just call runForever() after everything has been set up and started - i.e. at the end of the isMainModule: block? Hopefully this makes sense in terms of future extensibility?

fixed in 4de12f2

tools/networkmonitor/networkmonitor.nim

tools/networkmonitor/networkmonitor_utils.nim

alrevuelta · 2022-11-04T12:38:11Z

@jm-clius just fixed the comments! With an emphasis on having a clearer main that is easier to read, with high-level procedures for what's being done: start this, start that, crawl this, etc.

Also added the .push raises: that was missing, took me some time to fix everything :)

Note that I would prefer not to use timers for recurrent tasks, as it looks like they are not meant for that.

LNSD

Please, check the comments I have written so far.

I have not finished reviewing the main() function. I need to jump on something else. I will continue reviewing it later.

tools/networkmonitor/networkmonitor_metrics.nim

LNSD · 2022-11-07T10:21:55Z

tools/networkmonitor/networkmonitor_metrics.nim

+proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) =
+  router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse:
+    let values = toSeq(allPeers.keys()).mapIt(allPeers[it])
+    return RestApiResponse.response($(%values), contentType="application/json")


Using % for JSON serialization is discouraged. In nwaku, following the nimbus example, we are using nim-json-serialization. Check the waku/v2/rest module for examples.

sure, fixed in 4de12f2

LNSD · 2022-11-07T10:24:31Z

tools/networkmonitor/networkmonitor_metrics.nim

+proc startMetricsServer*(serverIp: ValidIpAddress, serverPort: Port) =
+    info "Starting metrics HTTP server", serverIp, serverPort
+
+    try:
+      startMetricsHttpServer($serverIp, serverPort)
+    except Exception as e:
+      raiseAssert("Exception while starting metrics HTTP server: " & e.msg)
+
+    info "Metrics HTTP server started", serverIp, serverPort


We are moving from exceptions to Results in the nwaku codebase. This should catch the exception and return a result.

sure, good catch, missed it in the refactoring.
I'm confused, since with {.push raises: [].} this shouldn't compile, or? As it's missing the {.raises: [xxx].}
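
The shape being asked for (catch the exception, hand back a result) can be sketched roughly as follows. This is only an illustration: `StartResult` is a minimal hand-rolled stand-in for a proper Result type, and `startServer` is a hypothetical proc that raises, standing in for the real metrics server start.

```nim
type
  # Minimal stand-in for a Result type, defined here only to keep the sketch
  # self-contained; it is not the type used in the codebase.
  StartResult = object
    ok: bool
    error: string

proc startServer(port: int) =
  ## Hypothetical server start that raises on an invalid port.
  if port <= 0 or port > 65535:
    raise newException(ValueError, "invalid port " & $port)

proc startServerSafe(port: int): StartResult {.raises: [].} =
  ## Wraps the raising call so that callers receive a value instead of an exception.
  try:
    startServer(port)
    result = StartResult(ok: true)
  except CatchableError as e:
    result = StartResult(ok: false,
      error: "Exception while starting metrics HTTP server: " & e.msg)
```

The caller then checks `ok` and decides what to do with `error`, and the `{.raises: [].}` annotation lets the compiler confirm that no catchable exception escapes the wrapper.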

LNSD · 2022-11-07T10:30:11Z

tools/networkmonitor/networkmonitor_metrics.nim

+# GET /allpeersinfo
+proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) =
+  router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse:
+    let values = toSeq(allPeers.keys()).mapIt(allPeers[it])


Please, check std/tables docs for proc values(): https://nim-lang.org/docs/tables.html#values.i%2CTableRef%5BA%2CB%5D

Suggested change

let values = toSeq(allPeers.keys()).mapIt(allPeers[it])

let values = toSeq(allPeers.values())

good catch thanks. 4de12f2
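
For reference, a small standalone check that the two forms yield the same sequence. The table here is a hypothetical stand-in with string values; in the PR the values are peer-info objects.

```nim
import std/[tables, sequtils, algorithm]

# Hypothetical stand-in for the peers table.
let allPeers = newTable[string, string]()
allPeers["peerId1"] = "ip1"
allPeers["peerId2"] = "ip2"

# Original form: walk the keys, then look each one up again.
let viaKeys = toSeq(allPeers.keys()).mapIt(allPeers[it])

# Suggested form: iterate the values directly, with no second lookup.
let viaValues = toSeq(allPeers.values())

# Table iteration order is unspecified, so compare sorted copies.
doAssert sorted(viaKeys) == sorted(viaValues)
```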

tools/networkmonitor/networkmonitor_utils.nim

LNSD · 2022-11-07T11:00:02Z

tools/networkmonitor/networkmonitor_config.nim

+  try:
+    result = chronos.seconds(parseInt(p))
+  except CatchableError as e:
+    raise newException(ConfigurationError, "Invalid timeout value")


This error message will be misleading if you add another "duration" config parameter. Make it more generic: "Invalid duration value"

fix 4de12f2
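
A sketch of what the more generic message could look like, using only the standard library. `std/times` stands in for chronos here, `ConfigurationError` is redeclared locally as a placeholder, and `parseSeconds` is a hypothetical name for the parsing helper.

```nim
import std/[strutils, times]

type ConfigurationError = object of CatchableError  # local placeholder type

proc parseSeconds(p: string): Duration =
  ## Parses a whole number of seconds; the error text does not name any
  ## particular config parameter, so it stays accurate if more are added.
  try:
    result = initDuration(seconds = parseInt(p))
  except ValueError:
    raise newException(ConfigurationError, "Invalid duration value: " & p)
```

Including the offending input in the message is an optional extra that makes the error easier to trace back to the flag that caused it.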

LNSD · 2022-11-07T11:38:00Z

tools/networkmonitor/networkmonitor.nim

+  # known issue: confutils.nim(775, 17) Error: can raise an unlisted exception: ref IOError
+  {.pop.}


This should be unnecessary if you mimic wakunode2 and the WakuConfig load method.

Mind elaborating? I was following wakunode2. Isn't it the same?

The comment is no longer necessary, given that we are wrapping the load proc with a try/catch and returning a result.

which "comment" do you mean? the "#known issue:" comment? or your comment?

tools/networkmonitor/networkmonitor.nim

alrevuelta · 2022-11-08T16:51:41Z

@LNSD Fixed all the comments. Is there anything else you want to address? No pressure if you need more time for review, just want to make sure it's not blocked for no reason.

LNSD

Given the situation and from my POV, merge it if you think it is what you should do

LNSD · 2022-11-09T08:32:41Z

waku.nimble

+  buildBinary name, "tools/wakucanary/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on"
+
+task networkmonitor, "Build network monitor tool":
+  let name = "networkmonitor"
+  buildBinary name, "tools/networkmonitor/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on"


The -d:chronicles_runtime_filtering:on flag should be specified in the nim.cfg next to the tool/app binary. Check the other binaries' nim.cfg for reference.

sure, added it in 725cd9f

The thing I don't like is that the compile logs don't show the flags you are using in nim.cfg, which imho can lead to problems. Even after increasing verbosity, the most I managed to see was this:

Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/nim.cfg' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/tools/networkmonitor/nim.cfg' [Conf]

Unless I'm missing something, chronicles_runtime_filtering is not present in the logs despite being an active flag, which imho is dangerous. Since I like being explicit, I preferred to have it directly in the makefile.

But note that I adapted the code to your suggestion following the repo "pattern".

jm-clius

LGTM. Let's get this merged, @alrevuelta. Also, thanks for robust discussions above, everyone. I think within the context of this tool, which has a specific research/monitoring aim and not a general production audience, we can always increment in terms of more app features such as SIGTERM handling, etc.

alrevuelta · 2022-11-10T09:29:19Z

since test2-ubuntu-latest is ok, bypassing branch protection due to a known issue with test2-macos-latest

LNSD reviewed Oct 24, 2022

This was referenced Oct 27, 2022

Identify each node client with libp2p UserAgent waku-org/pm#3

Closed

Waku v2 monitoring node #1010

Closed

jakubgs requested changes Oct 27, 2022

alrevuelta force-pushed the network-monitoring-tool branch from 47c0e75 to 4a6f160 Compare November 1, 2022 16:24

alrevuelta marked this pull request as ready for review November 1, 2022 16:26

alrevuelta requested review from jm-clius and LNSD November 1, 2022 16:26

alrevuelta force-pushed the network-monitoring-tool branch from 4a6f160 to 3d4e426 Compare November 3, 2022 09:50

alrevuelta mentioned this pull request Nov 3, 2022

chore(networkmonitor): add metric listing content topics + messages #1335

Merged

3 tasks

jm-clius reviewed Nov 3, 2022

alrevuelta force-pushed the network-monitoring-tool branch 2 times, most recently from 0102b78 to dc6478b Compare November 4, 2022 12:33

alrevuelta requested a review from jm-clius November 4, 2022 12:41

LNSD reviewed Nov 7, 2022

alrevuelta force-pushed the network-monitoring-tool branch from dc6478b to 4de12f2 Compare November 7, 2022 15:48

alrevuelta requested a review from LNSD November 8, 2022 07:40

alrevuelta force-pushed the network-monitoring-tool branch from 4de12f2 to 27fcb37 Compare November 8, 2022 16:45

LNSD reviewed Nov 9, 2022

jm-clius approved these changes Nov 9, 2022

alrevuelta added 21 commits November 10, 2022 08:14

chore(networkmonitor): initial prototype

31f2965

chore(networkmonitor): add cli, metrics and PoC

ea38dd9

feat(utils): add supportsCapability function + tests

23d74e6

feat(utils): add supportedCapabilites function

42dc515

chore(networkmonitor): add metrics with enr/ip/capabilities

56c648c

chore(networkmonitor): refactor + tests

c8b4b26

chore(networkmonitor): add discovered timestamp

fa6cc41

chore(networkmonitor): add metrics on connected nodes

65c16ab

chore(networkmonitor): new flags + utils file + readme

2910181

chore(networkmonitor): add user-agent metrics

82ab698

chore(networkmonitor): connect only to randomly discovered peers

417fa2d

chore(networkmonitor): get location of peer using ip

3d8fa09

chore(networkmonitor): expose peer metrics with simple rest server

4d79864

chore(networkmonitor): update README

1b9fcfd

chore(networkmonitor): fix wakunode2 to waku_node

169716f

chore(networkmonitor): fix import order

c6c3cdf

chore(networkmonitor): fix comments + refactor + pushraises

8a6548e

chore(networkmonitor): refactor + handle exceptions

df4df80

chore(networkmonitor): fix makefile after rebase

e63632e

chore(networkmonitor): address review comments 1

f15a222

chore(networkmonitor): add nim.cfg

725cd9f

alrevuelta force-pushed the network-monitoring-tool branch from 27fcb37 to 725cd9f Compare November 10, 2022 07:16

alrevuelta merged commit 7917e05 into master Nov 10, 2022

alrevuelta deleted the network-monitoring-tool branch November 10, 2022 09:29

alrevuelta mentioned this pull request Nov 14, 2022

docs: release v0.13.0 #1378

Merged
