
SPIKE: Export and analysis of benchmarking metrics #1399

Closed
1 of 2 tasks
doodlesbykumbi opened this issue Apr 6, 2021 · 3 comments
Comments


doodlesbykumbi commented Apr 6, 2021

Overview

The goal here is to have a good answer to the question, "Given some well-defined metrics that we know how to measure, how do we record and export them and to where?"

OpenTelemetry supports many exporters, e.g. CloudWatch or Prometheus.

The idea is to show an end-to-end pipeline that:

  1. Takes, as input, metrics from some example source that is interchangeable with input from Secretless (see SPIKE: Secretless benchmark metrics are defined #1398). For example, we can use an HTTP server that generates fake metrics when its routes are called (a sketch follows below).
  2. Exports to some metrics backend. Prometheus is a good open-source option.
  3. Analyses the results. Grafana can be used to query Prometheus.

The spikes are meant to de-risk the general approach.
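As a rough illustration of (1), here is a minimal sketch of such a fake-metrics server. It uses the plain Prometheus Go client (prometheus/client_golang) rather than OpenTelemetry, just to show the shape of the pipeline; the route, port, metric name and labels are all placeholders.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// fakeRequests is a placeholder metric; a real experiment would record
// latencies, byte counts, etc. measured from Secretless.
var fakeRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fake_requests_total",
		Help: "Number of calls to the fake workload route.",
	},
	[]string{"method", "path"},
)

func main() {
	prometheus.MustRegister(fakeRequests)

	// Every call to /work generates a labelled data point.
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		fakeRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
		w.Write([]byte("ok\n"))
	})

	// Prometheus scrapes this route.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Pointing Prometheus at :8080/metrics and graphing rate(fake_requests_total[1m]) in Grafana would exercise the pipeline end to end.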

Definition of done

  • Proof of concept of a pipeline for measuring metrics, exporting them, and carrying out some statistical analysis.
  • A list of recommendations for components that can be used as a metrics backend and for carrying out analysis (e.g. Grafana, CloudWatch Insights), with pros and cons included.
@doodlesbykumbi doodlesbykumbi changed the title Exporting metrics and analysing them SPIKE: Export and analysis of benchmarking metrics Apr 6, 2021
@BradleyBoutcher BradleyBoutcher self-assigned this Apr 7, 2021

BradleyBoutcher commented Apr 14, 2021

After researching Grafana more, I've revised my initial plan. This article details an architecture I think we should imitate: https://dzone.com/articles/go-microservices-part-15-monitoring-with-prometheu. Rather than the metrics running locally or being viewable during the Jenkins run, I think we should use a remote Grafana instance to monitor "snapshot builds" of Secretless, which use some reusable configuration.

Essentially, when Secretless succeeds in a Jenkins pipeline, Jenkins then deploys a snapshot instance of that commit to a remote cluster. In the same cluster, we have an instance of Prometheus and Grafana. We build a simple discovery service that polls the namespace where Secretless instances run and builds a list of endpoints for Prometheus to query. The discovery service outputs a simple JSON document that Prometheus references and then uses to scrape the noted endpoints. Then, Grafana can be used to analyze the data from Prometheus.
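To make the discovery service concrete, here is a hedged sketch of its output step: it writes the kind of JSON targets file that Prometheus can consume via its file_sd_configs mechanism. How the endpoints are actually discovered (e.g. listing pods in the snapshot namespace) is left out of the sketch, and the file path, job label and addresses are placeholders.

package main

import (
	"encoding/json"
	"log"
	"os"
)

// targetGroup mirrors the JSON shape Prometheus expects from file_sd_configs:
// a list of {"targets": [...], "labels": {...}} objects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// writeTargets dumps the discovered Secretless metrics endpoints to the file
// that Prometheus watches.
func writeTargets(path string, endpoints []string) error {
	groups := []targetGroup{{
		Targets: endpoints,
		Labels:  map[string]string{"job": "secretless-snapshot"},
	}}
	data, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	// Placeholder endpoints; a real discovery service would poll the cluster
	// on an interval and rewrite the file whenever the set changes.
	err := writeTargets("/etc/prometheus/targets/secretless.json",
		[]string{"10.0.0.12:2222", "10.0.0.13:2222"})
	if err != nil {
		log.Fatal(err)
	}
}

Prometheus then picks up changes to that file automatically via a file_sd_configs entry in its scrape configuration.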

In the article, they use the following setup, which we'll adjust to fit our needs:

  • adding a /metrics endpoint to each microservice, served by the Prometheus httphandler.
  • instrumenting our Go code so the latencies and response sizes of our RESTful endpoints are made available at /metrics.
  • writing and deploying a 'docker swarm mode'-specific discovery microservice which lets Prometheus know where to find /metrics endpoints to scrape in an ever-changing microservice landscape.
  • deploying the Prometheus server in our Docker Swarm mode cluster.
  • deploying Grafana in our Docker Swarm mode cluster.
  • querying and graphing in Grafana.

However, instead of Docker Swarm, we run this in a remote Kubernetes cluster. There are lots of excellent guides on using Helm to deploy Prometheus and Grafana, like this one: https://www.fosstechnix.com/install-prometheus-and-grafana-on-kubernetes-using-helm/

I can't speak to the implementation of a /metrics endpoint being integrated into Secretless, but I think that would be the most efficient route to take. This would also make it possible to run metrics locally using the same setup. I'm going to use this article to try to set up the exact same Kubernetes configuration I describe above, but locally, with a mock Secretless server.

I should also call out that, since Secretless can proxy to multiple endpoints, exposing metrics on a single /metrics route for the Secretless server will need to account for this, and have some way of aggregating results while keeping them identifiable by endpoint. Prometheus does support a labelling ("tagging") system for this; it would just be a matter of formatting the metrics output, which we'll need to do anyway (a sketch follows below).
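As a hedged sketch of keeping per-endpoint results identifiable behind a single /metrics route (again using prometheus/client_golang; the metric name, label names and bucket choice are invented for illustration):

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// proxyLatency aggregates all proxied connections behind one /metrics route
// while keeping them distinguishable via the "service" and "backend" labels.
var proxyLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "secretless_proxy_request_duration_seconds",
		Help:    "Latency of proxied requests, by service listener and backend.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"service", "backend"},
)

func init() {
	prometheus.MustRegister(proxyLatency)
}

// observe would be called from each proxy handler once a request completes.
func observe(service, backend string, elapsed time.Duration) {
	proxyLatency.WithLabelValues(service, backend).Observe(elapsed.Seconds())
}

func main() {
	// Example: record a 42ms request proxied by the "pg" listener to "postgres".
	observe("pg", "postgres", 42*time.Millisecond)
}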

An important point to remember is cleanup. Since Prometheus and Grafana are meant to be long-lived monitoring solutions, we'll need to set up some kind of simple cleanup service for the cluster to remove stale Secretless instances. When monitoring locally, this will be trivial.

@doodlesbykumbi

@BradleyBoutcher I think the pipeline you have in mind (metrics -> prometheus -> grafana) is exactly what we want for the POC.

Essentially, when Secretless succeeds in a Jenkins pipeline...

Glad you're already thinking about how this would work in CI. For the moment, the goal is to have the pipeline defined and working against a single Secretless (latest release) instance. With the pipeline defined, it'll then be possible to define and run benchmarking "experiments" on the current Secretless snapshot.

We want to keep the POC lightweight, so we should be thinking of deploying the components locally with Docker/Docker Compose. A subset of the steps you describe above should get us what we want:

  1. Run a mock HTTP server that generates some metrics (using OpenTelemetry) when its endpoints are called. We're currently exploring OpenTelemetry, so this might be a useful example to work with: https://github.com/open-telemetry/opentelemetry-go/tree/main/example/prom-collector. The Prometheus exporter for OpenTelemetry helps with exposing that /metrics route.
  2. Run Prometheus and have it pull from the route in (1).
  3. Run Grafana, query Prometheus in (2), and prove that analysis is possible on the metrics from (1).

For (1) it doesn't have to be a mock HTTP server; you could have some method that populates the fake metrics data, like how temperature is set to random values in a loop in the OpenTelemetry example https://github.com/open-telemetry/opentelemetry-go/blob/main/example/prom-collector/main.go#L121. However, an HTTP server seems like a natural way to have fine-grained, dynamic control over the generated metrics.
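For reference, here is a minimal sketch of (1) wired up with the OpenTelemetry Go SDK and its Prometheus exporter. It is an assumption-laden sketch: the API shown is the current exporter/metric API (which differs in detail from the prom-collector example linked above), and the port, instrument name and attributes are placeholders.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter acts as a Reader for the OpenTelemetry SDK and
	// publishes to the default Prometheus registry.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

	meter := otel.Meter("benchmark-mock")
	fakeRequests, _ := meter.Int64Counter("fake_requests") // error ignored for brevity

	// Each call to /work produces a data point labelled with method and path,
	// mirroring how real Secretless metrics could be labelled per route.
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		fakeRequests.Add(r.Context(), 1,
			metric.WithAttributes(
				attribute.String("method", r.Method),
				attribute.String("path", r.URL.Path),
			))
		w.Write([]byte("ok\n"))
	})

	// Expose the scrape endpoint for Prometheus.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2222", nil))
}

With Prometheus scraping :2222/metrics, the method and path attributes surface as Prometheus labels that Grafana can filter and group by.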

I should also call out that, since Secretless can proxy to multiple endpoints... Prometheus does support a "tagging" system that supports this...

You're right. I think as part of making the POC complete we'd want to explore Prometheus labels within the context of OpenTelemetry. In part that's why I suggest using an HTTP server for (1): for the POC we can label the metrics for a given route with the corresponding HTTP method and path.


doodlesbykumbi commented May 7, 2021

Outcome

The outcome from this spike is available in the POC in the telemetry branch.

version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.1.0
    volumes:
      - ./prometheus.yml/:/etc/prometheus/prometheus.yml
      # - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana
    # user: "472"
    depends_on:
      - prometheus
    ports:
      - 3000:3000

The branch demonstrates a locally runnable pipeline (via docker-compose) of

  1. Secretless Prometheus metrics endpoint on :2222
  2. Prometheus
  3. Grafana

The pipeline works as follows:

  1. Secretless, for an experiment, is configured to proxy some connection of interest
  2. Secretless runs, collects (labelled) metrics via OpenTelemetry and advertises a Prometheus metrics endpoint
  3. Prometheus is set up to pull from the Secretless Prometheus metrics endpoint at some regular interval
  4. Grafana is set up to use Prometheus as a datasource
  5. Analysis can be carried out in Grafana. Some examples are provided in the branch (e.g. averages and percentiles for latency).

Remaining questions

  • What is the impact of enabling telemetry, if any?
  • What is a good UX for toggling telemetry on and off?
  • What are the pros and cons of push vs. pull metric collection, and how do they impact the data available at analysis time?
  • At present the implementation relies on a Prometheus pull metrics endpoint. What configuration options (e.g. polling interval) are available, and what impact do they have on the data available at analysis time?

@doodlesbykumbi doodlesbykumbi self-assigned this May 7, 2021