
SPIKE: Export and analysis of benchmarking metrics #1399

Closed
1 of 2 tasks
doodlesbykumbi opened this issue Apr 6, 2021 · 3 comments
Comments


doodlesbykumbi commented Apr 6, 2021

Overview

The goal here is to have a good answer to the question, "Given some well-defined metrics that we know how to measure, how do we record and export them and to where?"

OpenTelemetry supports many exporters, e.g. CloudWatch or Prometheus.

The idea is to show an end-to-end pipeline that:

  1. Takes, as input, metrics from some example source that is interchangeable with input from Secretless (see SPIKE: Secretless benchmark metrics are defined #1398). For example, we can use an HTTP server that generates fake metrics when its routes are called (a sketch follows below).
  2. Exports to some metrics backend. Prometheus is a good open-source option.
  3. Analyses the results. Grafana can be used to query Prometheus.

The spikes are meant to de-risk the general approach.
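As a rough illustration of (1), here is a minimal sketch of such a fake-metrics server. It uses the plain Prometheus Go client (prometheus/client_golang) rather than OpenTelemetry, just to show the shape of the pipeline; the route, port, metric name and labels are all placeholders.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// fakeRequests is a placeholder metric; a real experiment would record
// latencies, byte counts, etc. measured from Secretless.
var fakeRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fake_requests_total",
		Help: "Number of calls to the fake workload route.",
	},
	[]string{"method", "path"},
)

func main() {
	prometheus.MustRegister(fakeRequests)

	// Every call to /work generates a labelled data point.
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		fakeRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
		w.Write([]byte("ok\n"))
	})

	// Prometheus scrapes this route.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Pointing Prometheus at :8080/metrics and graphing rate(fake_requests_total[1m]) in Grafana would exercise the pipeline end to end.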

Definition of done

  • Proof of concept of a pipeline for measuring metrics, exporting them, and carrying out some statistical analysis.
  • A list of recommendations for components that can be used as a metrics backend and for carrying out analysis (e.g. Grafana, CloudWatch Insights), with pros and cons included.
@doodlesbykumbi doodlesbykumbi changed the title Exporting metrics and analysing them SPIKE: Export and analysis of benchmarking metrics Apr 6, 2021
@BradleyBoutcher BradleyBoutcher self-assigned this Apr 7, 2021

BradleyBoutcher commented Apr 14, 2021

After researching Grafana more, I've revised my initial plan. This article details an architecture I think we should imitate: https://dzone.com/articles/go-microservices-part-15-monitoring-with-prometheu. Rather than the metrics running locally or being viewable during the Jenkins run, I think we should use a remote Grafana instance to monitor "snapshot builds" of Secretless, which use some reusable configuration.

Essentially, when Secretless succeeds in a Jenkins pipeline, Jenkins then deploys a snapshot instance of that commit to a remote cluster. In the same cluster, we have an instance of Prometheus and Grafana. We build a simple discovery service that polls the namespace where Secretless instances run and builds a list of endpoints for Prometheus to query. The discovery service outputs a simple JSON document that Prometheus references and then uses to scrape the noted endpoints. Then, Grafana can be used to analyze the data from Prometheus.
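To make the discovery service concrete, here is a hedged sketch of its output step: it writes the kind of JSON targets file that Prometheus can consume via its file_sd_configs mechanism. How the endpoints are actually discovered (e.g. listing pods in the snapshot namespace) is left out of the sketch, and the file path, job label and addresses are placeholders.

package main

import (
	"encoding/json"
	"log"
	"os"
)

// targetGroup mirrors the JSON shape Prometheus expects from file_sd_configs:
// a list of {"targets": [...], "labels": {...}} objects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// writeTargets dumps the discovered Secretless metrics endpoints to the file
// that Prometheus watches.
func writeTargets(path string, endpoints []string) error {
	groups := []targetGroup{{
		Targets: endpoints,
		Labels:  map[string]string{"job": "secretless-snapshot"},
	}}
	data, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	// Placeholder endpoints; a real discovery service would poll the cluster
	// on an interval and rewrite the file whenever the set changes.
	err := writeTargets("/etc/prometheus/targets/secretless.json",
		[]string{"10.0.0.12:2222", "10.0.0.13:2222"})
	if err != nil {
		log.Fatal(err)
	}
}

Prometheus then picks up changes to that file automatically via a file_sd_configs entry in its scrape configuration.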

In the article, they use the following setup, which we'll adjust to fit our needs:

  • adding a /metrics endpoint to each microservice, served by the Prometheus httphandler.
  • instrumenting our Go code so the latencies and response sizes of our RESTful endpoints are made available at /metrics.
  • writing and deploying a 'docker swarm mode'-specific discovery microservice which lets Prometheus know where to find /metrics endpoints to scrape in an ever-changing microservice landscape.
  • deploying the Prometheus server in our Docker Swarm mode cluster.
  • deploying Grafana in our Docker Swarm mode cluster.
  • querying and graphing in Grafana.

However, instead of Docker Swarm, we run this in a remote Kubernetes cluster. There are lots of excellent guides on using Helm to deploy Prometheus and Grafana, like this one: https://www.fosstechnix.com/install-prometheus-and-grafana-on-kubernetes-using-helm/

I can't speak to the implementation of a /metrics endpoint being integrated into Secretless, but I think that would be the most efficient route to take. This would also make it possible to run metrics locally using the same setup. I'm going to use this article to try to set up the exact same Kubernetes configuration I describe above, but locally, with a mock Secretless server.

I should also call out that, since Secretless can proxy to multiple endpoints, exposing metrics on a single /metrics route for the Secretless server will need to account for this, and have some way of aggregating results while keeping them identifiable by endpoint. Prometheus does support a labelling ("tagging") system for this; it would just be a matter of formatting the metrics output, which we'll need to do anyway (a sketch follows below).
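As a hedged sketch of keeping per-endpoint results identifiable behind a single /metrics route (again using prometheus/client_golang; the metric name, label names and bucket choice are invented for illustration):

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// proxyLatency aggregates all proxied connections behind one /metrics route
// while keeping them distinguishable via the "service" and "backend" labels.
var proxyLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "secretless_proxy_request_duration_seconds",
		Help:    "Latency of proxied requests, by service listener and backend.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"service", "backend"},
)

func init() {
	prometheus.MustRegister(proxyLatency)
}

// observe would be called from each proxy handler once a request completes.
func observe(service, backend string, elapsed time.Duration) {
	proxyLatency.WithLabelValues(service, backend).Observe(elapsed.Seconds())
}

func main() {
	// Example: record a 42ms request proxied by the "pg" listener to "postgres".
	observe("pg", "postgres", 42*time.Millisecond)
}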

An important point to remember is cleanup. Since Prometheus and Grafana are meant to be long-lived monitoring solutions, we'll need to set up some kind of simple cleanup service for the cluster to remove stale Secretless instances. When monitoring locally, this will be trivial.

@doodlesbykumbi

@BradleyBoutcher I think the pipeline you have in mind (metrics -> prometheus -> grafana) is exactly what we want for the POC.

Essentially, when Secretless succeeds in a Jenkins pipeline...

Glad you're already thinking about how this would work in CI. For the moment, the goal is to have the pipeline defined and working against a single Secretless (latest release) instance. With the pipeline defined, it'll then be possible to define and run benchmarking "experiments" on the current Secretless snapshot.

We want to keep the POC lightweight, so we should be thinking of deploying the components locally with Docker/Docker Compose. A subset of the steps you describe above should get us what we want:

  1. Run a mock HTTP server that generates some metrics (using OpenTelemetry) when its endpoints are called. We're currently exploring OpenTelemetry, so this might be a useful example to work with: https://github.com/open-telemetry/opentelemetry-go/tree/main/example/prom-collector. The Prometheus exporter for OpenTelemetry helps with exposing that /metrics route.
  2. Run Prometheus and have it pull from the route in (1).
  3. Run Grafana, query Prometheus in (2), and prove that analysis is possible on the metrics from (1).

For (1) it doesn't have to be a mock HTTP server; you could have some method that populates the fake metrics data, like how temperature is set to random values in a loop in the OpenTelemetry example https://github.com/open-telemetry/opentelemetry-go/blob/main/example/prom-collector/main.go#L121. However, an HTTP server seems like a natural way to have fine-grained, dynamic control over the generated metrics.
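For reference, here is a minimal sketch of (1) wired up with the OpenTelemetry Go SDK and its Prometheus exporter. It is an assumption-laden sketch: the API shown is the current exporter/metric API (which differs in detail from the prom-collector example linked above), and the port, instrument name and attributes are placeholders.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter acts as a Reader for the OpenTelemetry SDK and
	// publishes to the default Prometheus registry.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

	meter := otel.Meter("benchmark-mock")
	fakeRequests, _ := meter.Int64Counter("fake_requests") // error ignored for brevity

	// Each call to /work produces a data point labelled with method and path,
	// mirroring how real Secretless metrics could be labelled per route.
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		fakeRequests.Add(r.Context(), 1,
			metric.WithAttributes(
				attribute.String("method", r.Method),
				attribute.String("path", r.URL.Path),
			))
		w.Write([]byte("ok\n"))
	})

	// Expose the scrape endpoint for Prometheus.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2222", nil))
}

With Prometheus scraping :2222/metrics, the method and path attributes surface as Prometheus labels that Grafana can filter and group by.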

I should also call out that, since Secretless can proxy to multiple endpoints... Prometheus does support a "tagging" system that supports this...

You're right. I think as part of making the POC complete we'd want to explore Prometheus labels within the context of OpenTelemetry. In part that's why I suggest using an HTTP server for (1): for the POC we can label the metrics for a given route with the corresponding HTTP method and path.


doodlesbykumbi commented May 7, 2021

Outcome

The outcome from this spike is available in the POC in the telemetry branch.

version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.1.0
    volumes:
      - ./prometheus.yml/:/etc/prometheus/prometheus.yml
      # - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana
    # user: "472"
    depends_on:
      - prometheus
    ports:
      - 3000:3000

The branch demonstrates a locally runnable pipeline (via docker-compose) of

  1. Secretless Prometheus metrics endpoint on :2222
  2. Prometheus
  3. Grafana

The pipeline works as follows:

  1. Secretless, for an experiment, is configured to proxy some connection of interest
  2. Secretless runs, collects (labelled) metrics via OpenTelemetry and advertises a Prometheus metrics endpoint
  3. Prometheus is set up to pull from the Secretless Prometheus metrics endpoint at some regular interval
  4. Grafana is set up to use Prometheus as a datasource
  5. Analysis can be carried out in Grafana. Some examples are provided in the branch (e.g. averages and percentiles for latency).

Remaining questions

  • What is the impact of enabling telemetry, if any?
  • What is a good UX for toggling telemetry on and off?
  • What are the pros and cons of push vs. pull metric collection, and how do they impact the data available at analysis time?
  • At present the implementation relies on a Prometheus pull metrics endpoint. What configuration options (e.g. polling interval) are available, and what impact do they have on the data available at analysis time?

@doodlesbykumbi doodlesbykumbi self-assigned this May 7, 2021