Exposing Metrics #1356

marten-seemann · 2022-03-11T14:22:26Z

libp2p currently collects a few metrics (for example, for TCP and QUIC). We also collect metrics for the resource manager, however, that is currently left as a responsibility to the application. This is how we end up with slightly different implementations (and metrics names) in go-ipfs and lotus.
In the (very near!) future, we want to expose more metrics from the swarm regarding transports, security protocols and muxers. This would have allowed us to detect the muxer prioritisation bug (ipfs/kubo#8750) a lot earlier.

We therefore should have a coherent libp2p metrics story.

Open questions:

In the past, we debated if the metrics code should live within the respective repository, or if that code should live separately and hook in via tracers. Note that this issue mostly goes away once we make progress with our repo consolidation (and that's what we should base our designs on).
Some people have expressed concerns introducing an OpenCensus / Prometheus dependency into the default build introduces too much bloat. Is that a valid concern? Does it really add too much overhead? If so, would it make sense to hide all metrics collection across our code base behind a build flag?
OpenCensus vs. Prometheus: In the past, we've used both tools, very inconsistently. If I understand correctly, OpenCensus is essentially another abstraction layer between the tracer and Prometheus, and would in theory allow the usage of other tools than Prometheus, although it might be debatable how much we care about that.

Thoughts, @vyzo @aschmahmann @mxinden @Stebalien @lanzafame?

Tracking the various components we want to instrument:

Supporting work (need to close this issue):

defer:

mxinden · 2022-03-14T18:17:50Z

3. OpenCensus vs. Prometheus: In the past, we've used both tools, very inconsistently. If I understand correctly, OpenCensus is essentially another abstraction layer between the tracer and Prometheus, and would in theory allow the usage of other tools than Prometheus, although it might be debatable how much we care about that.

Slight preference for Prometheus for the sake of simplicity. I have not had the need of tracing (enabled e.g. through OpenCensus) in a peer-to-peer application.

Stebalien · 2022-03-14T18:59:13Z

The latest is "opentelemetry" (a merger of opencensus and opentracing". I'd prefer that because, as you say, it's backend agnostic and is just a facade unless used.

In go-ipfs, we used to rely heavily on https://github.com/ipfs/go-metrics-interface/ to avoid depending on a specific metrics implementation, but opentelemetry should make that point moot.

marten-seemann · 2022-03-15T05:07:49Z

The latest is "opentelemetry" (a merger of opencensus and opentracing". I'd prefer that because, as you say, it's backend agnostic and is just a facade unless used.

Are we already using that somewhere in our stack?

marten-seemann · 2022-03-15T11:06:12Z

This doesn't look very reassuring at this point: https://opentelemetry.io/docs/instrumentation/go/manual/#creating-metrics

vyzo · 2022-03-15T13:33:15Z

I very much dislike the idea of going with the flavour of the day; applications should be free to use whatever they want, and we already have two different systems in prod (lotus used opencensus, ipfs uses raw prometheus).

Can't we just define a very thin interface and use that?

aschmahmann · 2022-03-15T14:48:52Z

Are we already using that somewhere in our stack?

I haven't personally made much use of it, but it shows up in some consumers of go-libp2p. A few I'm aware of:

Lotus (a consumer of go-libp2p) Initial switch to OpenTelemetry filecoin-project/lotus#7725
go-graphsync OpenTelemetry Tracing Request Manager ipfs/go-graphsync#271
A couple places in go-libp2p-kad-dht's FullRT Client integrate some basic opentelemetry tracing go-libp2p-kad-dht#734

Can't we just define a very thin interface and use that?

I don't have a strong preference here, but it seems like the options are:

Don't have exported metrics - not a viable option
Have hard coded metrics system(s) in every package - not great given people like multiple different backends
Use some popular logging interface (e.g. OpenTelemetry) - nice, but some people think is overkill also the fads and interfaces keep changing over time (I think the explanation of Initial switch to OpenTelemetry filecoin-project/lotus#7725 gives some view into the pain here)
Roll our own interface for metrics that's totally better than everyone else's stupid interface (e.g. https://github.com/ipfs/go-metrics-interface/) - ok, but also https://xkcd.com/927/

If people are interested in 4, but not in using go-metrics-interface I'd like to understand why and how that same argument is unlikely to be reused in 6-12 months to argue why our new interface isn't good enough and needs an alternative.

I happen to be inclined to use some standard interface so we don't have to make all of our own wrappers + interfaces here and we can let other people mess around with creating "the perfect logging interface™️" while we work on p2p networking 😄. If people want to roll our own here then that's fine by me too.

marten-seemann · 2022-03-16T09:03:15Z

I haven't personally made much use of it, but it shows up in some consumers of go-libp2p. A few I'm aware of:

Lotus (a consumer of go-libp2p) Initial switch to OpenTelemetry filecoin-project/lotus#7725

go-graphsync OpenTelemetry Tracing Request Manager ipfs/go-graphsync#271

A couple places in go-libp2p-kad-dht's FullRT Client integrate some basic opentelemetry tracing go-libp2p-kad-dht#734

Thank you for the links! These are all tracer integrations though, not metrics. I'd be happy to use OpenTelemetry for metrics as well, and it looks like their metrics support is currently in the alpha phase.

I'm a bit wary of rolling our own interface, mostly for https://xkcd.com/927/ reasons, as @aschmahmann pointed out.

guseggert · 2022-03-16T15:59:41Z

Unfortunately there isn't a good answer to OpenTelemetry vs. OpenCensus. OpenCensus is on a deprecation path, but the Go implementation of the OpenTelemetry Metrics API is not yet stable, so pick your poison.

I don't think creating our own wrapper would be a good idea. OpenTelemetry is already a wrapper and enables writing bridges to unknown future implementations. And those are likely to be written by the experts instead of us. There is an OpenCensus -> OpenTelemetry metrics bridge, but it has some limitations around the metric types that are supported due to complexities in translating. If the experts can't write a complete bridge, then I don't think we should burden ourselves with that too.

Note that the the Metrics API Spec for OpenTelemetry is marked as "stable". It is the Go implementation that is not yet stable. Here is the milestone for the stable release of the Metrics API for opentelemetry-go: https://github.com/open-telemetry/opentelemetry-go/milestone/14 (it is scheduled for March 31). Also the metrics Collector API and wire format (OTLP) are stable.

It's pretty clear that industry resources are now devoted to OpenTelemetry, and a stable Go Metrics API is imminent, so I think we should hop on that bandwagon.

BigLep · 2022-03-16T17:41:57Z

Good discussion here - thanks folks for contributing.

One thing that will be useful to get defined is the metrics we're going to measure. Things that I think are important:

names - ideally the names (or other dimensions of the metric) make clear whether this is a metric that is consistently implemented across libp2p implementations or whether it's an implementation-specific metrics (e.g., metric implemented uniquely for go-libp2p). In addition, the names should be as self-documenting as possible.
definitions - be clear about what the metrics are measuring. Especially for any timing metrics, we need to be clear about when the stopwatch starts and when it ends. I don't want to leave that up for interpretation.

In terms of where this kind of info lives, I leave it up to others but maybe this warrants a spec or a docs page?

Good stuff!

vyzo · 2022-03-17T06:56:29Z

I would like to propose something different.
How about a monitoring interface, that can also be used for collecting metrics?
That way the users can do whatever they like, and we can provide implementations for prometheus, open* and whatnot.

lanzafame · 2022-03-24T02:15:10Z

As the person who introduced OpenCensus, some background thoughts that I had at the time.

I didn't overly care about tracing at the time but appreciated the fact that tracing was bundled in.
I was mostly concerned about having something that was backend agnostic because I was working on ipfs-cluster at the time and didn't want to have to force our uses to choose prometheus over influx for example (granted things have come along way since then and everyone supports prometheus scraping even in push-model systems)
I really appreciated the concept of each library exposing the metrics that they cared about but the top level application was able to choose which it wanted to actually record. This is why I pushed hard on the Views being exposed as well as there being a DefaultViews slice being exposed so that if an application dev only wanted 1 of 10 metrics in DefaultViews they had the ability to just register that 1 metric.

Personally, I really don't like where OpenTelemetry took metrics, last I looked, OTel did away with a lot of the nice things about OC metrics. Personally, from what I have followed of OTel, they didn't really care about the metrics side of things and still don't. My guess was ego and NIH syndrome are involved. On top of that the spec is very java-esk and that has resulted in less than ideal situations in the go implementation.

Currently, still in the position of consuming metrics from libp2p applications and libraries, the things I care about are consistent prefixes, i.e. libp2p_<subsytem>_<metric>, and not having to put PRs up to remove peer ids from metric labels because it kills prometheus performance and costs money when using cloud prometheus offerings.

How about a monitoring interface, that can also be used for collecting metrics?
That way the users can do whatever they like, and we can provide implementations for prometheus, open* and whatnot.

@vyzo Please don't. It just means another package needs to be written and then the ability to discover that package will be reduced. i.e. Estuary just implemented the metrics for the libp2p resource manager, but they are now specific to the estuary codebase and anyone else wanting to have resource manager metrics has to duplicate the exact same code. There is zero user benefit to that course of action.

vyzo · 2022-03-24T05:19:59Z

@lanzafame i didnt suggest that the application implements the interface, the library would provide the default implementation.

I just want to avoid polluting the code with metrics invocations, which would be very ugly and tied to the particular choice.

Our code should be agnostic of such things.

marten-seemann · 2022-04-19T12:46:15Z

OpenTelemetry Metrics support has shipped: https://opentelemetry.uptrace.dev/guide/go-metrics.html

lanzafame · 2022-05-09T00:26:31Z

@marten-seemann it still hasn't officially shipped, metrics are still listed as Alpha: https://github.com/open-telemetry/opentelemetry-go.

achingbrain · 2022-06-06T16:04:36Z

FWIW js-libp2p has a metrics capability similar to what's being discussed here. It collects various named metrics as simple key/value pairs that are updated over time but not retained. There's no dependency on Prometheus or any other monitoring system which keeps it lightweight and flexible.

When there's a libp2p app that runs as a daemon, it makes the metrics available in some way. For example js-ipfs has a HTTP API that sets up an endpoint that Prometheus calls periodically to record the libp2p metrics, plus additional metrics about the environment - memory usage, CPU etc.

An app could write these stats out to a file instead and skip Prometheus integration entirely, whatever works for the app author.

marten-seemann · 2022-12-15T08:57:03Z

Adding another consideration to the list: performance. As it turns out, OpenCensus generates tons of garbage: #1955.

vyzo · 2022-12-15T09:10:07Z

once more i want to say i told ya so...

vyzo · 2022-12-15T09:11:04Z

Master Fu you should listen to sometimes...

marten-seemann · 2022-12-15T10:45:58Z

once more i want to say i told ya so...

@vyzo This is not really helpful, even if you did say so.

Which in this case you didn't.

You suggested to provide a default implementation, without specifying which library to use. Chances are we would have used OpenCensus for that.

Please don't take this the wrong way. I'm still happy about constructive comments that help us to come to a decision here.

marten-seemann · 2023-05-11T11:33:07Z

To finish this project, we should have a libp2p folder in the PL Grafana instance. That would allow us to pull up the libp2p metrics for every PL-run libp2p node.

In addition, @sukunrt suggested to make metrics of one libp2p node (probably on the IPFS network) available publicly, possibly linked from libp2p.io. @dhuseby, wdyt?

p-shahi · 2023-05-11T13:46:41Z

That sounds like a good idea, there should be an accompanying blog post as well.

marten-seemann added kind/enhancement A net-new feature or improvement to an existing feature P1 High: Likely tackled by core team if no one steps up exp/intermediate Prior experience is likely helpful effort/days Estimated to take multiple days, but less than a week labels Mar 11, 2022

marten-seemann mentioned this issue Mar 15, 2022

feat: export rcmgr metrics to prometheus ipfs/kubo#8785

Merged

BigLep mentioned this issue Jun 3, 2022

guide for how to set limits for an application libp2p/go-libp2p-resource-manager#28

Closed

marten-seemann mentioned this issue Jun 14, 2022

feat: Usable out of the box metrics libp2p/go-libp2p-resource-manager#50

Closed

lidel mentioned this issue Jun 20, 2022

Feature: Create a metrics view ipfs/ipfs-webui#1220

Open

p-shahi added this to go-libp2p Aug 26, 2022

BigLep moved this to 🥞 Todo in go-libp2p Aug 26, 2022

p-shahi mentioned this issue Aug 26, 2022

Guide for libp2p node monitoring libp2p/docs#158

Closed

This was referenced Nov 20, 2022

expose the security protocol on the ConnectionState #1907

Merged

swarm: minimal set of metrics #1910

Closed

marten-seemann mentioned this issue Dec 14, 2022

rcmgr: Use prometheus for metrics (OpenCensus generates garbage) #1955

Closed

This was referenced Jan 27, 2023

Libp2p introspection #775

Closed

Metrics instrumentation #522

Closed

This was referenced Feb 4, 2023

go-libp2p v0.25 #1916

Closed

metrics: make it easier to update / discover Grafana dashboard #2055

Closed

sukunrt mentioned this issue Feb 25, 2023

Next Step Microgrant: go-libp2p impovements for metrics, dialing logic and AutoNat v2 filecoin-project/devgrants#1431

Closed

p-shahi assigned sukunrt Feb 27, 2023

p-shahi mentioned this issue May 11, 2023

metrics in go-libp2p libp2p/blog#69

Closed

marten-seemann mentioned this issue Jun 14, 2023

dashboards: finish metrics effort #2362

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exposing Metrics #1356

Exposing Metrics #1356

marten-seemann commented Mar 11, 2022 •

edited by p-shahi

Loading

mxinden commented Mar 14, 2022

Stebalien commented Mar 14, 2022

marten-seemann commented Mar 15, 2022

marten-seemann commented Mar 15, 2022

vyzo commented Mar 15, 2022 •

edited

Loading

aschmahmann commented Mar 15, 2022 •

edited

Loading

marten-seemann commented Mar 16, 2022

guseggert commented Mar 16, 2022 •

edited

Loading

BigLep commented Mar 16, 2022

vyzo commented Mar 17, 2022

lanzafame commented Mar 24, 2022

vyzo commented Mar 24, 2022

marten-seemann commented Apr 19, 2022

lanzafame commented May 9, 2022

achingbrain commented Jun 6, 2022

marten-seemann commented Dec 15, 2022

vyzo commented Dec 15, 2022

vyzo commented Dec 15, 2022

marten-seemann commented Dec 15, 2022

marten-seemann commented May 11, 2023

p-shahi commented May 11, 2023

Exposing Metrics #1356

Exposing Metrics #1356

Comments

marten-seemann commented Mar 11, 2022 • edited by p-shahi Loading

mxinden commented Mar 14, 2022

Stebalien commented Mar 14, 2022

marten-seemann commented Mar 15, 2022

marten-seemann commented Mar 15, 2022

vyzo commented Mar 15, 2022 • edited Loading

aschmahmann commented Mar 15, 2022 • edited Loading

marten-seemann commented Mar 16, 2022

guseggert commented Mar 16, 2022 • edited Loading

BigLep commented Mar 16, 2022

vyzo commented Mar 17, 2022

lanzafame commented Mar 24, 2022

vyzo commented Mar 24, 2022

marten-seemann commented Apr 19, 2022

lanzafame commented May 9, 2022

achingbrain commented Jun 6, 2022

marten-seemann commented Dec 15, 2022

vyzo commented Dec 15, 2022

vyzo commented Dec 15, 2022

marten-seemann commented Dec 15, 2022

marten-seemann commented May 11, 2023

p-shahi commented May 11, 2023

marten-seemann commented Mar 11, 2022 •

edited by p-shahi

Loading

vyzo commented Mar 15, 2022 •

edited

Loading

aschmahmann commented Mar 15, 2022 •

edited

Loading

guseggert commented Mar 16, 2022 •

edited

Loading