Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus Metrics Cleanup and Expansion #2169

Closed
2 tasks
jackzampolin opened this issue Aug 28, 2018 · 32 comments
Closed
2 tasks

Prometheus Metrics Cleanup and Expansion #2169

jackzampolin opened this issue Aug 28, 2018 · 32 comments

Comments

@jackzampolin
Copy link
Member

jackzampolin commented Aug 28, 2018

Currently (excluding the go_*, promhttp_*, and process_*) we have the following prometheus metrics exported:

Metric type
consensus_block_interval_seconds histogram
consensus_block_size_bytes gauge
consensus_byzantine_validators gauge
consensus_byzantine_validators_power gauge
consensus_height gauge
consensus_missing_validators gauge
consensus_missing_validators_power gauge
consensus_num_txs gauge
consensus_rounds gauge
consensus_total_txs gauge
consensus_validators gauge
consensus_validators_power gauge
mempool_size gauge
p2p_peers gauge

There are a couple of additional metrics that have been requested by the validator community (tendermint/tendermint#2272):

Metric type tags notes
consensus_latest_block_height gauge none /status sync_info number
consensus_catching_up gauge none either 0 (syncing) or 1 (synced)
p2p_peer_receive_bytes_total counter peer_id for debugging p2p issues
p2p_peer_send_bytes_total counter peer_id for debugging p2p issues
p2p_peer_pending_send_bytes gauge peer_id for debugging p2p issues

NOTE: May be able to handle the p2p metrics by adding SendMonitor and RecvMonitor readings as histograms

Other metrics that would be amazing to add:

Metric type tags notes
mempool_num_txs gauge none the number of transactions in the mempool at time of poll
mempool_tx_size_bytes histogram none a histogram of transaction size
mempool_failed_txs gauge none number of failed transactions
mempool_recheck_times histogram none number of transactions rechecked in the mempool
p2p_num_txs gauge peer_id number of transactions submitted by each peer_id
p2p_block_parts gauge peer_id number of blockparts transmitted by peer
p2p_pending_send_bytes gauge peer_id amount of data pending to be sent to peer
consensus_block_gas gauge none amount of gas in the previous block (needs work)
consensus_block_processing_time histogram none time between BeginBlock and EndBlock
sdk_num_transactions counter type number of transactions by type
sdk_gov_proposals counter type number proposals

Other work that needs to be done on the prometheus implementation:

  • Add configuration option that allows for a metric prefix for all metrics coming out of the instance. (i.e. tm_* or gaiad_*)
  • Add ability to deactivate metric families from the config file (consensus, p2p, mempool, etc...)

related: tendermint/tendermint#2272

@alexanderbez
Copy link
Contributor

This is awesome aggregation work @jackzampolin ! Wouldn't this need to be handled on the TM side though?

@jackzampolin
Copy link
Member Author

I think a lot of it needs to be handled there, but there is also some stuff in the SDK that needs to be handled. I think its going to be critical to come up with a framework where we can tie metrics to modules and enable monitoring in some parts of the system, but not others. I thought we could use this issue as a place to define what that would look like, and which metrics we want, then open separate issues on tendermint for implementation.

@alexanderbez
Copy link
Contributor

I see I see. @xla pointed out some good insight in #1409. Coming up with an interface for modules to export some sort of metrics shouldn't be that far from trivial, but we have to be very particular about what we want to expose and track.

@Hyung-bharvest
Copy link
Contributor

I am highly interested in p2p_peer data, because it will give us strong tools to analyze malicious attack including layer 7. I would like to know "p2p_peer_pending_recieve(send)_cnt(bytes)" which implies how many packets(data) received(sent) from(to) each peer.
I also want to know "REAL" public IP(Not the IP from external_IP in config.toml) of each peer, if gaiad has it from raw tcp data for analysis and defence purpose.

@mdyring
Copy link

mdyring commented Aug 30, 2018

@jackzampolin agreed all of the above would be very intersting!

Related to #2131 it would make good sense to include some AVL stats as well as database (insert/update/delete counts + whatever else can be gathered), to further diagnose this.

It would also be valuable for monitoring moving forward.

@mdyring
Copy link

mdyring commented Aug 30, 2018

@jackzampolin I'd like to add gaia_proposals_total (or similar wording) to track number of proposals.

This should make it trivial to setup monitoring that will alert the operator that a new proposal was added.

@jackzampolin
Copy link
Member Author

@mdyring can you mention the names of the IAVL stats we should have?

@melekes
Copy link
Contributor

melekes commented Aug 31, 2018

mempool_num_txs

that's the same as existing mempool_size?

mempool_recheck_times | histogram | none | number of transactions rechecked in the mempool

the name seems inconsistent with description to me. Do you want a histogram of how much it took to recheck txs (i.e. 1s - 18 txs, 2s - 10 txs, ...) or the number of transactions rechecked (gauge)?

p2p_num_txs

peers are not aware of the term "transactions". they operate on channels with raw bytes.

number of transactions submitted by each peer_id

moreover, peers do not submit transactions (clients do). peers exchange transactions between each other using mempool.

p2p_block_parts | gauge | peer_id | number of blockparts transmitted by peer

same. it should be either (consensus_proposal_block_parts) or (blockstore_block_parts)

consensus_block_gas | gauge | none | amount of gas in the previous block (needs work)

as far as I know, it's not currently implemented.

@mdyring
Copy link

mdyring commented Aug 31, 2018

@jackzampolin well basically anything that could be used to pinpoint performance issues.

I am not that familiar with the AVL codebase, but I think the below might be useful.

Metric type tags notes
tm_avl_immutable_gets_total gauge none Incremented for each immutable_tree get*()
tm_avl_mutable_gets_total gauge none Incremented for each mutable_tree get*()
tm_avl_mutable_sets_total gauge none Incremented for each mutable_tree set()
tm_avl_mutable_removes_total gauge none Incremented for each mutable_tree remove()
tm_avl_mutable_rollbacks_total gauge none Incremented for each mutable_tree rollback()
tm_avl_mutable_load_versions_total gauge none Incremented for each mutable_tree load_version()
tm_avl_mutable_save_versions_total gauge none Incremented for each mutable_tree save_version()
tm_avl_mutable_delete_versions_total gauge none Incremented for each mutable_tree delete_version()
tm_avl_nodedb_saves_total gauge none Incremented for each nodedb save()

@xla
Copy link

xla commented Aug 31, 2018

@jackzampolin Great work, thanks for putting this together. I'd propose to break the issue apart in three distinct pieces:

  • Cosmos metrics - anything which can be tracked fromt the SDK codebase
  • Tendermint metrics - extension, rework of the prom metric
  • Low-level libs - likely raw performance for resource utilisation

It will make the discussion easier and we should also think in this context how we can expose the monitoring facilities of tm meaningful to the SDK.

@xla
Copy link

xla commented Aug 31, 2018

As @melekes pointed out most of the metrics concerning blocks and transactions belong in the realm of consensus.

mempool_num_txs

that's the same as existing mempool_size?

It seems both are inaccurate, we should encode the nature of the transactions and probably have metrics for the stages a transaction can go through, so the operator has insight into how many are pending, how many have been processed overall.

@jackzampolin
Copy link
Member Author

@mslipper and team will be handling this work. I think next steps here are to open issues in tm, sdk and iavl that detail metrics there and begin working on implementation per our call today.

@mslipper
Copy link
Contributor

mslipper commented Sep 6, 2018

Proposal for how we can handle this:

Currently, we have a MetricsProvider function that returns Metrics structs for the cs, p2p, and mempl packages. Since we're aiming to provide additional metrics for more packages than just these, I'd like to propose making the following changes to Tendermint's Metrics subsystem:

The Metrics structs as they are defined right now return the raw go-kit Counter, Gauge, and Histogram interfaces. Since Go doesn't support union types yet, I suggest we wrap the supported metrics for a given package in an interface with the following signature:

type Metrics interface {
	Counters() []Counter
	Gauges() []Gauge
	Histograms() []Histogram
	Family() string
}

Family() is used to identify the class of metrics exposed by the Metrics in order to programmatically enable/disable them.

We'll need to refactor MetricsProvider to return a slice of Metrics interfaces rather than the three package-specific structs it does now.

To support command-line flags that enable or disable particular metrics, we'll allow an additional string slice argument to NewNode that defines a list of metric categories to enable. By default, all metrics will be enabled. On node startup, the the node will iterate through the slice of Metrics objects provided by MetricsProvider disable any that are not in the list of enabled metric families.

@jackzampolin
Copy link
Member Author

This looks great @mslipper, nice and extensible as well as being a small change from the current implementation. 👍 from me

@melekes
Copy link
Contributor

melekes commented Sep 7, 2018

Since we're aiming to provide additional metrics for more packages than just these, I'd like to propose making the following changes to Tendermint's Metrics subsystem

most probably, but here I don't see any metrics outside of existing consensus, mempool and p2p packages.

What problem exactly are you trying to solve?

@jackzampolin
Copy link
Member Author

@melekes What about the IAVL and sdk module specific (gov mentioned) metrics mentioned above? This comment above also speaks to this:

I think its going to be critical to come up with a framework where we can tie metrics to modules and enable monitoring in some parts of the system, but not others. I thought we could use this issue as a place to define what that would look like, and which metrics we want, then open separate issues on tendermint for implementation.

@melekes
Copy link
Contributor

melekes commented Sep 7, 2018

Add ability to deactivate metric families from the config file

Prometheus metrics are cheap (it's just counters in RAM, which are periodically read by outside process). I am pretty sure you can configure Prometheus server to not read some metrics. Just want to make sure that you understand that it might be easier to expose everything & read only what user needs VERSUS framework where we enable monitoring in some parts of the system, but not others.

@jackzampolin
Copy link
Member Author

@melekes If you don't think that should be a requirement how would it change the implementation? I think having each module export metrics makes metrics a first class citizen in our ecosystem and allows devs an easy way to add them when building. Do you have any issues with this approach?

@jackzampolin
Copy link
Member Author

Lets go ahead and move forward with this proposal!

@melekes
Copy link
Contributor

melekes commented Sep 8, 2018

I think having each module export metrics makes metrics a first class citizen in our ecosystem and allows devs an easy way to add them when building.

This sounds amazing! I am not opposite to adding more metrics.

The thing I am struggling to understand is why each module needs to export:

type Metrics interface {
	Counters() []Counter
	Gauges() []Gauge
	Histograms() []Histogram
	Family() string
}

@melekes
Copy link
Contributor

melekes commented Sep 8, 2018

if you don't think that should be a requirement how would it change the implementation?

this requires no changes in Tendermint.

@jackzampolin
Copy link
Member Author

@melekes That way each module can export metrics of any type. Some of those functions will likely return empty slices in practice. The family name is for specifying the metric namespace.

@melekes
Copy link
Contributor

melekes commented Sep 11, 2018

OK, maybe I don't understand something. Please go ahead with the proposal.

@melekes
Copy link
Contributor

melekes commented Sep 11, 2018

That way each module can export metrics of any type

you don't need to export metrics (as an interface or something else) in order to export them to Prometheus!!!

@xla
Copy link

xla commented Sep 11, 2018

I share @melekes concerns, the proposed solution seems to not address the original problem, which should be challenged. Can we expand why we need such granularity on metric configuration. This would only be reasonable if certain metrics impose performance hits. As long as this is not the case we should only support toggling of metrics and the metircs provider (prometheus for now).

@jackzampolin
Copy link
Member Author

@melekes You do need to register the metrics with prometheus however. If the application has multiple modules and each module has some metrics it wants to expose to operators/clients then having an interface to export those metrics seems sensible.

@xla can you expand on proposed solution seems to not address the original problem?

And that is correct, as long as there is no performance hit we would only need to support configuration for toggling metrics and the provider.

@melekes
Copy link
Contributor

melekes commented Sep 11, 2018

You do need to register the metrics with prometheus however.

True, but you don't need to export any functions. All you need is to call it (you can do it in the constructor within the package). The fact that Tendermint exposes Metrics struct and PrometheusMetrics() / NopMetrics() functions makes sense for Tendermint because some of our users use Tendermint as a library and they specifically requested the ability to change default MetricsProvider.

  1. https://github.com/tendermint/tendermint/blob/af1a405768daf8c1e8a99fca3f331b93baa20851/mempool/metrics.go#L21
  2. https://github.com/go-kit/kit/blob/master/metrics/prometheus/prometheus.go#L58

@mslipper
Copy link
Contributor

Taking discussion to Slack.

@xla
Copy link

xla commented Sep 17, 2018

@xla can you expand on proposed solution seems to not address the original problem?

The original problem is that we want to expose a variety of new metrics, which boils down to set a gauge, add a counter or observe a histogram in the right place in the code. Fine grained control down to the metric level is introducing complexity that is asking for overhead we should just shy away from, as it will open to quite complex configuration matrix. In the most naive way we can always collect those metrics and have the configuration influence if the prometheus endpoint is enabled or not.

Addition of metrics should be confined to the package boundaries and not neccesitate any changes to how we setup metrics. The one change we should drive is the support for a common prefix (namespace in the context of prometheus) as mentioned in tendermint/tendermint#2272 Speaking from the tendermint perspective only here.

mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 19, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 21, 2018
xla pushed a commit to tendermint/tendermint that referenced this issue Sep 25, 2018
* Add additional metrics to p2p and consensus
Partially addresses cosmos/cosmos-sdk#2169.
* WIP
* Updates from code review
* Updates from code review
* Add instrumentation namespace to configuration
* Fix test failure
* Updates from code review
* Add quotes
* Add atomic load
* Use storeint64
* Use addInt64 in writePacketMsgTo
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 27, 2018
xla pushed a commit to tendermint/tendermint that referenced this issue Oct 10, 2018
* Add additional metrics
Continues addressing cosmos/cosmos-sdk#2169.
* Add nop metrics to fix NPE
* Tweak buckets, code review
* Update buckets
* Update docs with new metrics
* Code review updates
@hleb-albau
Copy link
Contributor

@melekes So, as far as I understand, there is global DefaultRegisterer. To add your cosmos-based app metrics, all you need is to register metrics using default Prometheus functions. Is it correct?

@jackzampolin
Copy link
Member Author

Going to go ahead and close this. @hleb-albau please open another issue for documenting the process of adding more metrics.

@tnachen tnachen reopened this Aug 23, 2019
@jackzampolin
Copy link
Member Author

This issue confuses tendermint and sdk metrics. I'm going to close this in anticipation of @tnachen opening another issue on prometheus metrics in SDK based apps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants