-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(networkmonitor): tool to discover and provide metrics on peers #1290
Conversation
Jenkins BuildsClick to see older builds (24)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot put unconstrained values like ENRs in labels for Prometheus metrics. This will lead to an inevitable explision in cardinality which will kill the performance of any Prometheus instance or remote write backend:
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
https://prometheus.io/docs/practices/naming/#labels
The only way around that is using some kind of limit set of buckets.
Expanding a bit on the issue brought up by @jakubgs. Problem:I want to have strings as metrics, something not trivial, but can be achieved with declarePublicGauge networkmonitor_example, "Description", labels = ["peerId", "info"] We update it with a dummy networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection1"])
networkmonitor_example.set(0.0, labelValues = ["peerId2", "ip2,enr2,wakuflags2,lastconnection2"]) But if we modify one label networkmonitor_example.set(0.0, labelValues = ["peerId1", "ip1,enr1,wakuflags1,lastconnection2"]) Proof of it, networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection1"} 1666957272.0
networkmonitor_example{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId2",info="ip2,enr2,wakuflags2,lastconnection2"} 1666957272.0
networkmonitor_example{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 0.0
networkmonitor_example_created{peerId="peerId1",info="ip1,enr1,wakuflags1,lastconnection2"} 1666957272.0 SolutionAs suggested by Jakub I tried using a limited set of buckets, but since I need strings I had to use Going into proc addOrUpdate(peerId: string, content: string, metricGauge: Gauge) =
for i in networkmonitor_example.metrics.keys:
if i[0] == peerId:
echo "already there, update it"
metricGauge.metrics[@[i[0], i[1]]][0].labelValues[1] = content
return
echo "not there, add new field"
metricGauge.set(0.0, labelValues = [peerId, content]) Long story short, it's possible to use have a addOrUpdate("peerId2", "ip1,enr1,wakuflags1,lastconnection2", networkmonitor_example) @jakubgs :
|
No, this does not solve the issue as peer ids are unconstrained. Also not sure why you think you can constrain the number of labels created based on runtime data of a service that can stop and start. Every restart will make it forget what labels it has already created, in effect nullifying any attempt at constraining the number of values. The only solutions are:
|
Thanks for the input @jakubgs. Perhaps Prometheus is overkill for this use case, so would go with "Simply not use Prometheus as backend for data it was not intended for." I have found "Grafana Infinity Datasource" plugin that allows me to get all the info I want to display via http get. So I will connect grafana directly with this tool (networkmonitor) to display this data without prometheus in between. Anything against that? |
47c0e75
to
4a6f160
Compare
Since it didn't make sense to expose the "metrics" of the peers through Prometheus, I ended up setting up a custom rest api, that can be used to fetch the peers information. The idea is for a frontend like Grafana to use this information to display it. Note that on top of this other Prometheus metrics exists, but these are mere gauges. See README. All input is appreciated! |
Do we really need the location information? |
Anyone can see it, so why not? I think it's interesting to see how decentralized the network is in terms of location. |
4a6f160
to
3d4e426
Compare
Is this PR ready to be reviewed? If not, can you put it into Draft mode? 👀 I see many TODOs scattered around the code, and also, there's a PR to merge into this (#1335) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very useful tool! Thanks for thinking creatively about what we'd need here. I've added some comments below, mostly related to separating some of the logical components for future extensibility if this is to become a more general tool (such as reporting more capabilities of nodes once we have a capability discovery protocol). If suggestions are unclear, feel free to ping me to clarify. Also happy if some parts are left for future PRs.
await sleepAsync(conf.refreshInterval * 1000 * 60) | ||
|
||
when isMainModule: | ||
waitFor main() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that we keep apps simple, but perhaps to remain consistent with other apps the individual procedural blocks in main()
can be extracted to separate proc
s. The main processing loop (under while true:
) can then be named, scheduled with a timer and we'll just call runForever()
after everything has been set up and started - i.e. at the end of the isMainModule:
block? Hopefully this makes sense in terms of future extensibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 4de12f2
0102b78
to
dc6478b
Compare
@jm-clius just fixed the comments! with an emphasis on having a clearer Also added the Note that i would prefer to not user timers for recurrent tasks, as looks like they are not meant for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, check the comments I have written so far.
I have not finished reviewing the main()
function. I need to jump on something else. I will continue reviewing it later.
proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) = | ||
router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse: | ||
let values = toSeq(allPeers.keys()).mapIt(allPeers[it]) | ||
return RestApiResponse.response($(%values), contentType="application/json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The %s
for JSON serialization is discouraged. In nwaku, following the nimbus example, we are using nim-json-serializarion
. Check the waku/v2/rest
module for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, fixed in 4de12f2
proc startMetricsServer*(serverIp: ValidIpAddress, serverPort: Port) = | ||
info "Starting metrics HTTP server", serverIp, serverPort | ||
|
||
try: | ||
startMetricsHttpServer($serverIp, serverPort) | ||
except Exception as e: | ||
raiseAssert("Exception while starting metrics HTTP server: " & e.msg) | ||
|
||
info "Metrics HTTP server started", serverIp, serverPort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are moving from exceptions to Results
in the nwaku codebase. This should catch the exception and return a result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, good catch, missed it in the refactoring.
i'm confused since with {.push raises: [].}
this shouldn't compile, or? as its missing the {.raises: [xxx].}
# GET /allpeersinfo | ||
proc installHandler*(router: var RestRouter, allPeers: CustomPeersTableRef) = | ||
router.api(MethodGet, "/allpeersinfo") do () -> RestApiResponse: | ||
let values = toSeq(allPeers.keys()).mapIt(allPeers[it]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, check std/tables
docs for proc values()
: https://nim-lang.org/docs/tables.html#values.i%2CTableRef%5BA%2CB%5D
let values = toSeq(allPeers.keys()).mapIt(allPeers[it]) | |
let values = toSeq(allPeers.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch thanks. 4de12f2
try: | ||
result = chronos.seconds(parseInt(p)) | ||
except CatchableError as e: | ||
raise newException(ConfigurationError, "Invalid timeout value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message will be misleading if you add another "duration" config parameter. Make it more generic: "Invalid duration value"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix 4de12f2
# known issue: confutils.nim(775, 17) Error: can raise an unlisted exception: ref IOError | ||
{.pop.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unnecessary if you mimic wakunode2
and the WakuConfig
load method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind elaborating? I was following wakunode2. Isn't it the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is no longer necessary, given that we are wrapping the load proc with a try/catch and returning a result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which "comment" do you mean? the "#known issue:" comment? or your comment?
dc6478b
to
4de12f2
Compare
4de12f2
to
27fcb37
Compare
@LNSD Fixed all the comments. Is there anything else you want to address? No pressure if you need more time for review, just want to make sure its not blocked for no reason. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the situation and from my POV, merge it if you think it is what you should do
waku.nimble
Outdated
buildBinary name, "tools/wakucanary/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on" | ||
|
||
task networkmonitor, "Build network monitor tool": | ||
let name = "networkmonitor" | ||
buildBinary name, "tools/networkmonitor/", "-d:chronicles_log_level=TRACE -d:chronicles_runtime_filtering:on" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -d:chronicles_runtime_filtering:on"
should be specified in the nim.cfg
next to the tool/app binary. Check the other binaries' nim.cfg
for reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, added it in 725cd9f
The thing I don't like is that you don't get in the compiling logs the flags you are using in nim.cfg
, which imho can lead to problems. Increasing verbosity
the max I managed to see was this. used this as reference.
Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/nim.cfg' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/vendor/nimbus-build-system/vendor/Nim/config/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/config.nims' [Conf]
Hint: used config file '/Users/alrevuelta/Github/nwaku/tools/networkmonitor/nim.cfg' [Conf]
Unless I'm missing something chronicles_runtime_filtering
is not present in the logs while being a flag, which imho is dangerous. Since I like being explicity, I prefered to have it directly in the makefile.
But note that I adapted the code to your suggestion following the repo "pattern".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's get this merged, @alrevuelta. Also, thanks for robust discussions above, everyone. I think within the context of this tool, which has a specific research/monitoring aim and not a general production audience, we can always increment in terms of more app features such as SIGTERM handling, etc.
27fcb37
to
725cd9f
Compare
since |
Related to #1010
Description:
README.md
for instructions and available metrics.Limitations:
Usage:
And:
http://localhost:8008/metrics
http://localhost:8009/allpeersinfo