chore(networkmonitor): tool to discover and provide metrics on peers #1290

Merged
merged 21 commits into from
Nov 10, 2022

@alrevuelta alrevuelta commented Oct 21, 2022

Related to #1010

Description:

  • Adds a tool that constantly tries to discover new peers in the network.
  • It also tries to connect to them.
  • See README.md for instructions and available metrics.
  • Goal is to run a long-lived instance of this tool, with a Grafana dashboard we can monitor.
  • Some metrics are exposed to Prometheus to be fetched. Labels are constrained according to @jakubgs's feedback.
  • Other metrics, related to individual peers' information, are exposed directly via a REST API, ready to be accessed by e.g. Grafana. Note that these metrics are not exposed to Prometheus.

Limitations:

  • Naive implementation, with all data stored in memory. If the network scales to thousands of nodes, performance might become an issue.
  • Routing table is not emptied.

Usage:

$ make networkmonitor
$ ./build/networkmonitor --log-level=INFO --b="enr:-Nm4QOdTOKZJKTUUZ4O_W932CXIET-M9NamewDnL78P5u9DOGnZlK0JFZ4k0inkfe6iY-0JAaJVovZXc575VV3njeiABgmlkgnY0gmlwhAjS3ueKbXVsdGlhZGRyc7g6ADg2MW5vZGUtMDEuYWMtY24taG9uZ2tvbmctYy53YWt1djIucHJvZC5zdGF0dXNpbS5uZXQGH0DeA4lzZWNwMjU2azGhAo0C-VvfgHiXrxZi3umDiooXMGY9FvYj5_d1Q4EeS7eyg3RjcIJ2X4N1ZHCCIyiFd2FrdTIP"

And:

  • See prometheus metrics http://localhost:8008/metrics
  • See custom metrics http://localhost:8009/allpeersinfo

status-im-auto commented Oct 21, 2022

Jenkins Builds

Click to see older builds (24)
Commit #️⃣ Finished (UTC) Duration Platform Result
ed5418d #1 2022-10-21 15:37:25 ~16 min linux 📄log
✔️ ed5418d #1 2022-10-21 15:44:58 ~24 min macos 📦bin
a0723ef #2 2022-10-25 18:12:32 ~15 min linux 📄log
✔️ a0723ef #2 2022-10-25 18:18:04 ~20 min macos 📦bin
✔️ 99d06c6 #3 2022-10-26 07:27:04 ~15 min linux 📦bin
✔️ 99d06c6 #3 2022-10-26 07:31:58 ~20 min macos 📦bin
c1ced43 #4 2022-10-26 15:56:22 ~18 min linux 📄log
✔️ c1ced43 #4 2022-10-26 15:59:43 ~22 min macos 📦bin
✔️ 3a1e5ed #5 2022-10-27 10:38:54 ~18 min linux 📦bin
✔️ 3a1e5ed #5 2022-10-27 10:42:26 ~21 min macos 📦bin
✔️ 47c0e75 #6 2022-11-01 16:31:39 ~16 min linux 📦bin
✔️ 47c0e75 #6 2022-11-01 16:36:21 ~20 min macos 📦bin
4a6f160 #7 2022-11-01 16:41:42 ~17 min linux 📄log
✔️ 4a6f160 #7 2022-11-01 16:47:57 ~23 min macos 📦bin
✔️ 3d4e426 #8 2022-11-03 10:09:33 ~18 min linux 📦bin
✔️ 3d4e426 #8 2022-11-03 10:14:52 ~24 min macos 📦bin
✔️ 08d936a #9 2022-11-03 10:17:14 ~18 min linux 📦bin
✔️ 08d936a #9 2022-11-03 10:20:05 ~20 min macos 📦bin
✔️ 0102b78 #10 2022-11-04 12:47:05 ~17 min linux 📦bin
✔️ 0102b78 #10 2022-11-04 12:53:25 ~23 min macos 📦bin
✔️ dc6478b #11 2022-11-04 12:50:54 ~17 min linux 📦bin
✔️ dc6478b #11 2022-11-04 12:54:31 ~20 min macos 📦bin
✔️ 4de12f2 #12 2022-11-07 16:05:24 ~16 min macos 📦bin
✔️ 4de12f2 #12 2022-11-07 16:09:17 ~20 min linux 📦bin
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 27fcb37 #13 2022-11-08 17:04:06 ~18 min linux 📦bin
✔️ 27fcb37 #13 2022-11-08 17:04:15 ~18 min macos 📦bin
✔️ 725cd9f #14 2022-11-10 07:32:40 ~15 min linux 📦bin
✔️ 725cd9f #14 2022-11-10 07:34:40 ~17 min macos 📦bin

@jakubgs jakubgs left a comment

You cannot put unconstrained values like ENRs in labels for Prometheus metrics. This will lead to an inevitable explosion in cardinality which will kill the performance of any Prometheus instance or remote write backend:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

https://prometheus.io/docs/practices/naming/#labels

The only way around that is using some kind of limited set of buckets.
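To illustrate the bucket idea, here is a minimal sketch using only the Nim standard library. It is not code from this PR: the bucket names and the `ageBucket` and `countByBucket` helpers are hypothetical, and a plain `CountTable` stands in for a labelled gauge. The point is that an unbounded per-peer value is mapped onto a fixed set of label values before it ever reaches a metric.

```nim
import std/tables

# Hypothetical fixed bucket names; this set never grows at runtime.
const ageBuckets = ["<1m", "<10m", "<1h", ">=1h"]

proc ageBucket(secondsSinceSeen: int): string =
  ## Maps an unbounded value onto one of the fixed bucket names.
  if secondsSinceSeen < 60: ageBuckets[0]
  elif secondsSinceSeen < 600: ageBuckets[1]
  elif secondsSinceSeen < 3600: ageBuckets[2]
  else: ageBuckets[3]

proc countByBucket(ages: openArray[int]): CountTable[string] =
  ## One counter per bucket: at most ageBuckets.len distinct label
  ## values, no matter how many peers are observed.
  result = initCountTable[string]()
  for age in ages:
    result.inc(ageBucket(age))
```

Each bucket count could then be published under a single label such as `bucket`, so the number of time series stays bounded by the number of buckets rather than the number of peers.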

@alrevuelta
Contributor Author

Expanding a bit on the issue brought up by @jakubgs.

Problem:

I want to expose strings as metrics, which is not trivial but can be achieved with labels.

declarePublicGauge networkmonitor_example, "Description", labels = ["peerId", "info"]

We update it with a dummy 0.0 value, which works fine.

networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection1"])
networkmonitor_example.set(0.0, labelValues = ["peerId2", "ip2,enr2,wakuflags2,lastconnection2"])

But if we change one label value (here lastconnection1 becomes lastconnection2), we get another time series, which can break Prometheus at some point.

networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection2"])

As proof, peerId1 now has two data points, and we only need one.

networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 1666957272.0
networkmonitor_example{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 1666957272.0

networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 1666957272.0

Solution

As suggested by Jakub, I tried using a limited set of buckets, but since I need strings I had to use labels, which brings back the same issue as at the beginning. However, I found an alternative that should fix the problem of having infinite time series.

Going into the metrics tables, it's possible to update an existing key if it is already present, or create a new one if not. With this we no longer have an ever-increasing number of time series.

proc addOrUpdate(peerId: string, content: string, metricGauge: Gauge) =
  # Look for an existing series whose first label value is this peerId.
  for i in metricGauge.metrics.keys:
    if i[0] == peerId:
      echo "already there, update it"
      # Overwrite the second label value in place instead of adding a series.
      metricGauge.metrics[@[i[0], i[1]]][0].labelValues[1] = content
      return
  echo "not there, add new field"
  metricGauge.set(0.0, labelValues = [peerId, content])
Long story short, it's possible to have a key-value-like set of metrics using labels with a constrained number of time series.

addOrUpdate("peerId2", "ip1,enr1,wakuflags1,lastconnection2", networkmonitor_example)

@jakubgs :

  • Do you think this solves the issue you brought up? I have verified it by looking into localhost:8008/metrics, observing that no new points are added with new label combinations. That should be enough.
  • There will ofc be 1 entry per active peer in the network. Luckily this is around 10 and it will take time to reach levels that Prometheus can't handle. I recall you mentioned a few hundred. If you think it's safer I can add a threshold of x entries, and after that just error and drop values.
  • Note that this is temporary. If the Waku network grows to thousands of nodes, we will ofc have a proper DB.

@jakubgs
Contributor

jakubgs commented Oct 28, 2022

No, this does not solve the issue as peer ids are unconstrained. Also not sure why you think you can constrain the number of labels created based on runtime data of a service that can stop and start. Every restart will make it forget what labels it has already created, in effect nullifying any attempt at constraining the number of values.

The only solutions are:

  • Not use any unconstrained values for labels like peer IDs at all.
  • Use a specific constrained set of buckets for values of labels that cannot grow.
  • Simply not use Prometheus as backend for data it was not intended for.

@alrevuelta
Contributor Author

Thanks for the input @jakubgs. Perhaps Prometheus is overkill for this use case, so I would go with "Simply not use Prometheus as backend for data it was not intended for." I have found the "Grafana Infinity Datasource" plugin, which allows me to get all the info I want to display via HTTP GET. So I will connect Grafana directly to this tool (networkmonitor) to display this data without Prometheus in between. Anything against that?

@alrevuelta alrevuelta force-pushed the network-monitoring-tool branch from 47c0e75 to 4a6f160 Compare November 1, 2022 16:24
@alrevuelta alrevuelta marked this pull request as ready for review November 1, 2022 16:26
@alrevuelta alrevuelta requested review from jm-clius and LNSD November 1, 2022 16:26
@alrevuelta
Contributor Author

Since it didn't make sense to expose the "metrics" of the peers through Prometheus, I ended up setting up a custom REST API that can be used to fetch the peers' information. The idea is for a frontend like Grafana to use this information to display it. Note that on top of this other Prometheus metrics exist, but these are mere gauges. See README.

image

All input is appreciated!

@rymnc
Contributor

rymnc commented Nov 2, 2022

Do we really need the location information?

@alrevuelta
Contributor Author

@rymnc

Do we really need the location information?

Anyone can see it, so why not? I think it's interesting to see how decentralized the network is in terms of location.

@LNSD
Contributor

LNSD commented Nov 3, 2022

Is this PR ready to be reviewed? If not, can you put it into Draft mode? 👀

I see many TODOs scattered around the code, and also, there's a PR to merge into this (#1335)

@alrevuelta
Contributor Author

@LNSD Yes, it is ready for review, that's why I requested it to be reviewed. There are some TODOs but will be addressed in other PRs.

Regarding #1335 I will merge it to master once this one is merged, but it's based on this PR for a clearer diff. Note that #1335 is also ready for review.

@jm-clius jm-clius left a comment

This is a very useful tool! Thanks for thinking creatively about what we'd need here. I've added some comments below, mostly related to separating some of the logical components for future extensibility if this is to become a more general tool (such as reporting more capabilities of nodes once we have a capability discovery protocol). If suggestions are unclear, feel free to ping me to clarify. Also happy if some parts are left for future PRs.

tools/networkmonitor/networkmonitor.nim Outdated Show resolved Hide resolved
tools/networkmonitor/networkmonitor_utils.nim Outdated Show resolved Hide resolved
await sleepAsync(conf.refreshInterval * 1000 * 60)

when isMainModule:
  waitFor main()
Contributor

I like that we keep apps simple, but perhaps to remain consistent with other apps the individual procedural blocks in main() can be extracted to separate procs. The main processing loop (under while true:) can then be named, scheduled with a timer and we'll just call runForever() after everything has been set up and started - i.e. at the end of the isMainModule: block? Hopefully this makes sense in terms of future extensibility?
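One possible shape for such a named loop is sketched below. It is only an illustration under stated assumptions: `std/asyncdispatch` is used as a stand-in for chronos so the sketch has no external dependencies, and `crawlNetwork`, `crawlLoop`, `rounds` and `maxRounds` are hypothetical names, not code from this PR.

```nim
import std/asyncdispatch

var rounds = 0  # hypothetical counter standing in for real crawl work

proc crawlNetwork() {.async.} =
  ## One iteration of the (hypothetical) discovery work.
  inc rounds

proc crawlLoop(intervalMs: int, maxRounds = -1) {.async.} =
  ## Named main loop; maxRounds < 0 means "loop until the process exits".
  while maxRounds < 0 or rounds < maxRounds:
    await crawlNetwork()
    await sleepAsync(intervalMs)

# In the application, setup would end with something like:
#   asyncCheck crawlLoop(intervalMs = 60_000)
#   runForever()
```

The `maxRounds` parameter exists only so the loop can be exercised in a bounded way, for example `waitFor crawlLoop(intervalMs = 1, maxRounds = 3)`.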

Contributor Author

fixed in 4de12f2

tools/networkmonitor/networkmonitor.nim Show resolved Hide resolved
tools/networkmonitor/networkmonitor_utils.nim Show resolved Hide resolved
@alrevuelta alrevuelta force-pushed the network-monitoring-tool branch 2 times, most recently from 0102b78 to dc6478b Compare November 4, 2022 12:33
@alrevuelta
Contributor Author

@jm-clius just fixed the comments! with an emphasis on having a clearer main, easier to read with high-level procedures on what's being done. start this, start that, crawl this, blabla.

Also added the .push raises: that was missing, took me some time to fix everything :)

Note that I would prefer not to use timers for recurrent tasks, as it looks like they are not meant for that.

@alrevuelta alrevuelta requested a review from jm-clius November 4, 2022 12:41
@LNSD LNSD left a comment

Please, check the comments I have written so far.

I have not finished reviewing the main() function. I need to jump on something else. I will continue reviewing it later.

tools/networkmonitor/networkmonitor_metrics.nim Outdated Show resolved Hide resolved
proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) =
  router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse:
    let values = toSeq(allPeers.keys()).mapIt(allPeers[it])
    return RestApiResponse.response($(%values), contentType="application/json")
Contributor

Using the % operator for JSON serialization is discouraged. In nwaku, following the nimbus example, we are using nim-json-serialization. Check the waku/v2/rest module for examples.

Contributor Author

sure, fixed in 4de12f2

Comment on lines 67 to 75
proc startMetricsServer*(serverIp: ValidIpAddress, serverPort: Port) =
  info "Starting metrics HTTP server", serverIp, serverPort

  try:
    startMetricsHttpServer($serverIp, serverPort)
  except Exception as e:
    raiseAssert("Exception while starting metrics HTTP server: " & e.msg)

  info "Metrics HTTP server started", serverIp, serverPort
Contributor

We are moving from exceptions to Results in the nwaku codebase. This should catch the exception and return a result.
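A minimal sketch of that pattern, assuming the nim-results package (`import results`, also exposed as `stew/results`). The `startServer` proc is a hypothetical stand-in for a start call that may raise, and `startServerSafe` is an illustrative name, not the proc that ended up in the PR.

```nim
import results  # assumes the nim-results package is installed

proc startServer(ip: string, port: int) =
  ## Hypothetical stand-in for a start proc that may raise.
  if port <= 0 or port > 65535:
    raise newException(ValueError, "port out of range")

proc startServerSafe(ip: string, port: int): Result[void, string] =
  ## Catches the exception at the boundary and reports it as an error value.
  try:
    startServer(ip, port)
  except CatchableError as e:
    return err("could not start metrics HTTP server: " & e.msg)
  ok()
```

The caller can then branch on `isOk` / `isErr` and read the message via `error`, instead of relying on an exception propagating out of the start proc.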

Contributor Author

sure, good catch, missed it in the refactoring.
I'm confused, since with {.push raises: [].} this shouldn't compile, or? As it's missing the {.raises: [xxx].}

# GET /allpeersinfo
proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) =
  router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse:
    let values = toSeq(allPeers.keys()).mapIt(allPeers[it])
Contributor

Please, check std/tables docs for proc values(): https://nim-lang.org/docs/tables.html#values.i%2CTableRef%5BA%2CB%5D

Suggested change
let values = toSeq(allPeers.keys()).mapIt(allPeers[it])
let values = toSeq(allPeers.values())
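For reference, a small standard-library sketch showing that both forms yield the same sequence. The `peers` table and its contents are hypothetical placeholders for `allPeers`.

```nim
import std/[tables, sequtils]

# Hypothetical peerId -> value table standing in for allPeers.
let peers = newTable[string, int]()
peers["peerA"] = 1
peers["peerB"] = 2

# Original form: collect the keys, then look each one up again.
let viaKeys = toSeq(peers.keys()).mapIt(peers[it])

# Suggested form: iterate over the values directly.
let viaValues = toSeq(peers.values())
```

Both traverse the table in the same order, so the results match, but the second form avoids one lookup per entry.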

Contributor Author

good catch thanks. 4de12f2

tools/networkmonitor/networkmonitor_utils.nim Show resolved Hide resolved
  try:
    result = chronos.seconds(parseInt(p))
  except CatchableError as e:
    raise newException(ConfigurationError, "Invalid timeout value")
Contributor

This error message will be misleading if you add another "duration" config parameter. Make it more generic: "Invalid duration value"
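A sketch of a duration parser with the more generic message, using only the standard library. Assumptions: `std/times` `Duration` stands in for the chronos duration type, `ConfigurationError` is declared locally as a stand-in for the confutils error type, and `parseDurationSeconds` is a hypothetical name.

```nim
import std/[strutils, times]

type ConfigurationError = object of CatchableError
  ## Local stand-in for the confutils error type.

proc parseDurationSeconds(p: string): Duration =
  ## Parses a whole number of seconds; the error message stays valid
  ## for any duration-like config parameter, not only timeouts.
  try:
    result = initDuration(seconds = parseInt(p))
  except ValueError:
    raise newException(ConfigurationError, "Invalid duration value")
```

For example, `parseDurationSeconds("30")` gives a 30 second duration, while `parseDurationSeconds("abc")` raises `ConfigurationError`.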

Contributor Author

fix 4de12f2

Comment on lines +220 to +221
# known issue: confutils.nim(775, 17) Error: can raise an unlisted exception: ref IOError
{.pop.}
Contributor

This should be unnecessary if you mimic wakunode2 and the WakuConfig load method.

Contributor Author

Mind elaborating? I was following wakunode2. Isn't it the same?

Contributor

The comment is no longer necessary, given that we are wrapping the load proc with a try/catch and returning a result.

Contributor Author

which "comment" do you mean? the "#known issue:" comment? or your comment?

tools/networkmonitor/networkmonitor.nim Show resolved Hide resolved
tools/networkmonitor/networkmonitor.nim Show resolved Hide resolved
tools/networkmonitor/networkmonitor.nim Show resolved Hide resolved
@alrevuelta alrevuelta force-pushed the network-monitoring-tool branch from dc6478b to 4de12f2 Compare November 7, 2022 15:48
@alrevuelta alrevuelta requested a review from LNSD November 8, 2022 07:40
@alrevuelta alrevuelta force-pushed the network-monitoring-tool branch from 4de12f2 to 27fcb37 Compare November 8, 2022 16:45
@alrevuelta
Contributor Author

@LNSD Fixed all the comments. Is there anything else you want to address? No pressure if you need more time for review, just want to make sure it's not blocked for no reason.

@LNSD LNSD left a comment

Given the situation and from my POV, merge it if you think it is what you should do

waku.nimble Outdated
Comment on lines 110 to 114
  buildBinary name, "tools/wakucanary/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on"

task networkmonitor, "Build network monitor tool":
  let name = "networkmonitor"
  buildBinary name, "tools/networkmonitor/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on"
Contributor

The -d:chronicles_runtime_filtering:on flag should be specified in the nim.cfg next to the tool/app binary. Check the other binaries' nim.cfg for reference.

Contributor Author

sure, added it in 725cd9f

The thing I don't like is that the compile logs don't show the flags set in nim.cfg, which imho can lead to problems. Even with increased verbosity, the most I managed to see was this. Used this as reference.

Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/nim.cfg' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/tools/networkmonitor/nim.cfg' [Conf]

Unless I'm missing something, chronicles_runtime_filtering is not present in the logs despite being a flag, which imho is dangerous. Since I like being explicit, I preferred to have it directly in the makefile.

But note that I adapted the code to your suggestion following the repo "pattern".

@jm-clius jm-clius left a comment

LGTM. Let's get this merged, @alrevuelta. Also, thanks for robust discussions above, everyone. I think within the context of this tool, which has a specific research/monitoring aim and not a general production audience, we can always increment in terms of more app features such as SIGTERM handling, etc.

@alrevuelta alrevuelta force-pushed the network-monitoring-tool branch from 27fcb37 to 725cd9f Compare November 10, 2022 07:16
@alrevuelta
Contributor Author

since test2-ubuntu-latest is ok, bypassing branch protection due to a known issue with test2-macos-latest

@alrevuelta alrevuelta merged commit 7917e05 into master Nov 10, 2022
@alrevuelta alrevuelta deleted the network-monitoring-tool branch November 10, 2022 09:29