Providing Caliper Metrics to Prometheus #1353

Open
davidkel opened this issue May 24, 2022 · 1 comment
Labels: component/core, enhancement, epic

davidkel commented May 24, 2022

This doesn't cover the caliper capability that can extract prometheus data into a report as defined by the benchmark configuration file; however, that report may also be able to include the caliper data sent to prometheus.

There are two ways to get caliper data to prometheus, and both are configured through a benchmark file. There is a scrape method, which requires the following info:

- metricPath: override for the metrics path to be scraped (default /metrics).
- scrapePort: override for the port to be used when configuring the scrape server (default 3000).
- processMetricCollectInterval: time interval for default metrics collection, enabled when present
- defaultLabels: object of key:value pairs to augment the default labels applied to the exposed metrics during collection.
- histogramBuckets: override for the histogram buckets to be used for collection of caliper_tx_e2e_latency
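
For illustration, here is a hedged sketch of what the scrape configuration might look like in a benchmark file. The `monitors.transaction` nesting, the `prometheus` module name and the shape of `histogramBuckets` are assumptions here, not confirmed keys:

```yaml
# Hypothetical benchmark-file snippet for the scrape method (key names/nesting assumed)
monitors:
  transaction:
  - module: prometheus            # assumed module name for the scrape observer
    options:
      metricPath: /caliper-metrics        # override the default /metrics path
      scrapePort: 3100                    # override the default port 3000
      processMetricCollectInterval: 100   # enables default process metrics collection
      defaultLabels:                      # extra labels added to every exposed metric
        benchmark: my-benchmark
        run: run-1
      histogramBuckets: [0.5, 1, 2, 5, 10]  # buckets for caliper_tx_e2e_latency (shape assumed)
```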

as well as a push method, which requires a prometheus push gateway server and the following config:

- pushInterval: push interval in milliseconds
- pushUrl: URL for the Prometheus Push Gateway
- processMetricCollectInterval: time interval for default metrics collection, enabled when present
- defaultLabels: object of key:value pairs to augment the default labels applied to the exposed metrics during collection.
- histogramBuckets: override for the histogram buckets to be used for collection of caliper_tx_e2e_latency
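
Similarly, a hedged sketch of the push configuration; again the module name (`prometheus-push`) and nesting are assumptions, and the push gateway URL is a placeholder:

```yaml
# Hypothetical benchmark-file snippet for the push method (key names/nesting assumed)
monitors:
  transaction:
  - module: prometheus-push       # assumed module name for the push observer
    options:
      pushInterval: 5000                              # push every 5 seconds
      pushUrl: "http://pushgateway.example.com:9091"  # placeholder Push Gateway URL
      processMetricCollectInterval: 100
      defaultLabels:
        benchmark: my-benchmark
      histogramBuckets: [0.5, 1, 2, 5, 10]
```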

These are configured at the worker level, which causes some issues:

  1. If we use the scrape method and we have, say, 10 workers launched via a forked process, then each worker listens on its own unique port based on the scrapePort in the config (also 3000 is not a great default port). That means we have to configure prometheus to scrape from 10 sources, making changing the number of workers (or running a different benchmark with a different worker count) arduous, as it requires you to change the prometheus configuration (see the configuration sketch after this list). However it may be possible to use prometheus service discovery to help (it supports Azure VMs, EC2 instances and docker for example, but nothing I can see yet for general VMs).
  2. If we use non-forked workers then there is a problem: if we have multiple workers on a VM, how do we ensure ports don't clash? You could have a unique benchmark file per worker specifying a different scrape port, or you could run each worker in a docker container and remap the same internal port to a different exposed port, for example, but on the whole it's a bit horrible.
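
To make point 1 concrete, a Prometheus configuration for 10 forked workers has to enumerate every worker port by hand (hostname and ports below are placeholders) and must be edited whenever the worker count changes:

```yaml
# prometheus.yml excerpt: one static target per worker
scrape_configs:
  - job_name: caliper-workers
    metrics_path: /metrics
    static_configs:
      - targets:
          - worker-host:3000   # worker 0
          - worker-host:3001   # worker 1
          - worker-host:3002   # worker 2
          # ... and so on, one entry per worker up to worker-host:3009
```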

It makes more sense to have a single scrape port made available from the manager process. It would be good to expose the individual worker stats as well as the combined stats as viewed and output by the default manager observer. This would basically remove the need for the push gateway and make the caliper manager effectively take on the role of the push gateway.
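
With a single scrape endpoint on the manager, the Prometheus side would collapse to one static target regardless of the worker count (hostname and port are placeholders):

```yaml
# prometheus.yml excerpt: a single manager endpoint, independent of worker count
scrape_configs:
  - job_name: caliper-manager
    static_configs:
      - targets: ['manager-host:3000']
```

Individual worker stats could presumably still be distinguished via a per-worker label on the metrics the manager exposes, alongside the combined view.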

This would also be a great way to graph caliper's take on how it is loading the SUT. The question is: should this be part of the benchmark file configuration? Personally I don't think so, but I think it's currently there for convenience for the scrape and push methods, whereas really they are worker configuration details.

My proposal would be to:

  1. keep the push gateway mechanism for a worker
  2. introduce a scrape mechanism at the manager
  3. find a recipe that can make scraping directly from workers a viable option in multiple environments, e.g. native VMs, K8s etc. (see the service discovery sketch after this list)
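
For item 3, standard Prometheus service discovery might be part of the recipe in some environments. As a hedged sketch, in Kubernetes the workers could be discovered by a pod label rather than a static target list (the label name and value here are assumptions):

```yaml
# prometheus.yml excerpt: discover caliper worker pods via Kubernetes service discovery
scrape_configs:
  - job_name: caliper-workers-k8s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: caliper-worker      # keep only pods labelled app=caliper-worker (label assumed)
        action: keep
```

As noted above, nothing equivalent appears to exist for plain VMs, so a manager-side endpoint or file-based discovery would still be needed there.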
davidkel commented:

When we move the scrape mechanism from worker to manager we will lose the ability to scrape system metrics for individual workers, which the prometheus client currently provides as it exposes them directly via prometheus. We would still want to capture the same information and forward it back to the manager to collate and scrape, so maybe #1043 can help with capturing those metrics so we can forward them back.
