[Action/Falco] Investigate benchmark test data collection #50

nikimanoledaki · 2024-02-07T14:25:43Z

The first benchmark test for Falco will be a baseline test.

We can start with a script that runs as a Cron Job in Kubernetes. In the future, we can automate this using self-hosted GitHub Action runners (see Proposal 1: Actions Runner Controller (ARC)).

Pre-requisite

Add a Kubernetes Cron Job manifest in clusters/projects/falco/cron.yaml

Benchmark Steps

1. For the given duration (e.g. 15 min): sleep OR create some kernel stress
2. Write the Prometheus metrics in the output of the job for each of the metrics (e.g. Kepler's kepler_container_joules_total ) for the given duration (e.g. 15min).

Acceptance Criteria

Provide a script with reproducible steps
Have a set duration for the workload, for example, 15 minutes
Track and export the metrics that are the output of this test in a consistent manner


raymundovr · 2024-02-07T18:29:56Z

Hi,

I'd love to work on this one, however, I'd like to get some help to clarify the reqs before jumping into it.

1. I notice in the parent issue that another script has been implemented for an "infra-component" and that this issue could work in a similar fashion, correct? If so, could you point me out to this?
~~2. For suggested step 2, what do you mean with point a) "do not deploy the microservice workload"?~~ Update: found it within the document.

If someone else has a clearer picture and wants to pair on this, I'm also open to it :)

nikimanoledaki · 2024-02-08T14:41:04Z

@raymundovr thank you for volunteering to contribute to this! 🥳

> I notice in the parent issue that another script has been implemented for an "infra-component" and that this issue could work in a similar fashion, correct? If so, could you point me out to this?

I think @AntonioDiTuri was referring to how we deployed some infrastructure-level components with Flux. Essentially, we can add a Kubernetes manifest in a directory watched by Flux and Flux will apply it in the cluster, like this ConfigMap: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml

As part of this issue, we would need to add a manifest for a Kubernetes CronJob in the clusters/base/ directory. The CronJob itself would contain the steps listed in the description.

Before jumping into it, I agree that we should refine the requirements a bit more. I shared this with Falco maintainer @incertum and am waiting for her feedback: falcosecurity/cncf-green-review-testing#13

nikimanoledaki · 2024-02-16T10:59:00Z

@raymundovr hi! 👋 I created an issue to bootstrap ARC (#58) but there are quite a few requirements and we have no guarantee that ARC will be up and running in time for us to go straight to the self-hosted runners solution by KubeCon.

I propose that we get a head start with the bash script + CronJob as a temporary solution. That way we're not blocked by the authorization with PAT keys, secrets, etc etc. And then when ARC is ready, we can port the individual steps to a GitHub Actions workflow (it's relatively easy to split/convert bash scripts into GA workflow steps). What matters the most is that we implement the steps themselves one way or another so that we can gather sample metrics!

To deploy the CronJob, we can add a CronJob manifest in clusters/projects/falco for Flux to deploy it. Flux can deploy raw manifest, similarly to how it's deploying this ConfigMap that contains the Kepler dashboard: https://github.com/cncf-tags/green-reviews-tooling/blob/main/clusters/base/kepler-grafana.yaml

Then we can implement the steps one by one :)

What do you think? :)

nikimanoledaki · 2024-02-16T11:08:20Z

I believe the most challenging part will be the last step, which is to find a way to gather + store the metrics per test. Pushgateway is one option, which @rossf7 suggested (Slack context), but really we will understand more once we get to that step!

raymundovr · 2024-02-17T11:47:09Z

Hi @nikimanoledaki
Thank you for elaborating further.

Unfortunately I don't have a full overview of the cluster resources and its accessibility to give a more informed opinion on when ARC could be ready, therefore for me it's ok to start with the cron job, as suggested :)
~~Will also comment out the microservices.yaml, as discussed and identified last time.~~
Update: Waiting on the Falco team to explain a bit further the purpose of the microservices in conjunction with the stress test, see here.

rossf7 · 2024-02-19T10:15:45Z

Hi @raymundovr,
It makes total sense to start with the cron job; the steps in the job and in the GitHub Action should be very similar.

Just a heads up that I've started work on adding the self-hosted runner in #63. I'll let you know once it's deployed and tested.

raymundovr · 2024-02-20T17:28:32Z

Hi @rossf7 @nikimanoledaki

I've been researching and playing with Kepler at work during the last couple of days and have gained an understanding of the architecture and of the options we could consider to obtain and set an appropriate baseline and test-scenario measurements.
I'd like to share this with you and discuss further the possible next steps :)

First, I think that creating a task/job to gather Kepler metrics and push them into Prometheus might be redundant: Kepler itself deploys with kepler-exporter, which Prometheus already scrapes, so the metrics can be observed via Grafana. The deployment that you've already launched shows this, see here.

With this in consideration, I'd like to discuss the possibility to take the corresponding measurements from Prometheus (using PromQL / API queries) and present them in a consistent way, for example:

1. Take the measurements from Prometheus on a node running Falco as its sole Deployment (i.e., nothing else is running). This came to my mind after reading the comment from the Falco team on the stress test and its interaction with Falco.
2. Launch a CronJob to sleep for a certain time and take the measurements.
3. Launch the stress test, for the same amount of time as the sleep, and take measurements.
4. Launch the stress test + demo microservice deployment and take measurements.

The measurements could be taken by a task running on a separate node that queries Prometheus, as mentioned before, for the Pod or Namespace where Falco is running. It would then output the results in a table-like format that can be used to calculate, for example, max, min, average and median.
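To make the table idea concrete, here is a minimal sketch of how a series of sampled values could be reduced to min, max, average and median. The `Summary` type and `Summarize` function are hypothetical names for this example, and the input values are made up rather than taken from the cluster.

```go
package main

import (
	"fmt"
	"sort"
)

// Summary holds basic statistics for one measured series.
// Hypothetical type, introduced only for this sketch.
type Summary struct {
	Min, Max, Mean, Median float64
}

// Summarize computes min, max, mean and median without modifying the input.
func Summarize(values []float64) (Summary, error) {
	if len(values) == 0 {
		return Summary{}, fmt.Errorf("no samples to summarize")
	}
	sorted := append([]float64(nil), values...) // copy before sorting
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}

	mid := len(sorted) / 2
	median := sorted[mid]
	if len(sorted)%2 == 0 {
		// Even count: average the two middle values.
		median = (sorted[mid-1] + sorted[mid]) / 2
	}

	return Summary{
		Min:    sorted[0],
		Max:    sorted[len(sorted)-1],
		Mean:   sum / float64(len(sorted)),
		Median: median,
	}, nil
}

func main() {
	// Illustrative samples only, e.g. watts per interval.
	samples := []float64{2.0, 4.0, 3.0, 7.0}
	s, err := Summarize(samples)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%-8s %-8s %-8s %-8s\n", "min", "max", "mean", "median")
	fmt.Printf("%-8.2f %-8.2f %-8.2f %-8.2f\n", s.Min, s.Max, s.Mean, s.Median)
}
```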

We can then decide where to store this output, or perhaps make it available as a service? We'll still need a way to trigger this measurement task, though.

What do you think?

rossf7 · 2024-02-21T10:39:59Z

Hi @raymundovr,
thanks for sharing this.

> Take the measurements from Prometheus on a node running Falco as its sole Deployment

Yes, you can query Prometheus for the Kepler metrics and we should have just Falco running on its node.
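For reference, a minimal sketch of what querying Prometheus for these metrics could look like, using the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). It only builds the request URL and decodes a response body, so it runs without cluster access. The `QueryURL`, `ParseVector` and `Reading` names, the base URL, and the canned response (including the pod name) are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// Reading is one labelled value from an instant-query result (hypothetical type).
type Reading struct {
	Labels map[string]string
	Value  float64
}

// apiResponse mirrors the subset of the instant-query response used here.
type apiResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// QueryURL builds the instant-query URL for a PromQL expression.
func QueryURL(base, promQL string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimSuffix(u.Path, "/") + "/api/v1/query"
	q := url.Values{}
	q.Set("query", promQL)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// ParseVector decodes an instant-vector response into readings.
func ParseVector(body []byte) ([]Reading, error) {
	var resp apiResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus returned status %q", resp.Status)
	}
	if resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("expected a vector result, got %q", resp.Data.ResultType)
	}
	readings := make([]Reading, 0, len(resp.Data.Result))
	for _, r := range resp.Data.Result {
		raw, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected sample value %v", r.Value[1])
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		readings = append(readings, Reading{Labels: r.Metric, Value: v})
	}
	return readings, nil
}

func main() {
	// Assumed address, e.g. reached through a port-forward.
	u, err := QueryURL("http://localhost:9090", "up")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("GET", u)

	// Canned response with made-up labels and value, standing in for a real reply.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"container_namespace":"falco","pod_name":"falco-example"},"value":[1708700000,"0.42"]}]}}`)
	readings, err := ParseVector(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range readings {
		fmt.Println(r.Labels["pod_name"], r.Value)
	}
}
```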

We've set node labels for this so you can add the selectors cncf-project=wg-green-reviews and cncf-label-sub=internal to the CronJob and the stress test pods?

For microservices-demo it doesn't look like the helm chart lets you set a node selector :( they do also support kustomize and it might let us patch the deployments?

https://github.com/GoogleCloudPlatform/microservices-demo/tree/main/helm-chart

> Then it will output a table-like format which can be used as a way to calculate, for example, max, min, average and median.

Perfect, logging the results will let us validate the test steps and the measurements.

> We can then decide where to store this output, or perhaps make it available as a service?

Yes, we still need to work on this part of the design. We could write the results to S3, for example, but let's tackle this as a later step.

cc @nikimanoledaki @AntonioDiTuri

raymundovr · 2024-02-21T20:04:07Z

Hi,

Thank you @rossf7 for the suggestion. I'm not sure that the labels are exported as selectors into Prometheus.
I'm currently under the impression that querying by namespace would be a quick and viable way.

I have started to put something together, please check https://github.com/raymundovr/sustainability-metrics/blob/main/main.go

In order to continue I'll probably need access to Prometheus, is there any chance to get that? I don't mind setting up any kind of tunnel if necessary.

What do you think of this approach?

cc @nikimanoledaki

rossf7 · 2024-02-22T19:05:00Z

Hi @raymundovr,
as discussed this morning, you now have read-only access to the cluster.

Your code looks good! When you're ready we can move it to this repo. I think we could have the go module in the root and your code in cmd/main.go. WDYT?

For the github action we can use setup-go to install go and then run the binary. I don't think we need a container image yet. The action will also need a kubeconfig so we could use port forwarding for Prometheus or make the Prometheus API public?

> I'm not sure that the labels are exported as selectors into Prometheus.
> I'm currently under the impression that querying by namespace would be a quick and viable way.

Yes, you're right we can't use the k8s labels. I think using the container_namespace prometheus label as you're doing is good.

There is another factor: the Falco team would like 3 deployments of Falco on different nodes, as Falco has multiple drivers. See falcosecurity/cncf-green-review-testing#2

So far just a single node with the modern-ebpf driver is provisioned. We could use the instance prometheus label which has the node name. I don't really like that approach but I can't see a better option right now.

AntonioDiTuri · 2024-02-23T15:56:39Z

Thanks @raymundovr and @rossf7 for moving this forward!
I took a look at the code and it looks like a good starting point.

The metrics that @raymundovr selected for the moment are:

Id: "kepler_dram",
Query: (`sum by (pod_name, container_namespace)(irate(kepler_container_dram_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`, *projectNamespace),

Id: "kepler_package",
Query: (`sum by (pod_name, container_namespace)(irate(kepler_container_package_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`, *projectNamespace),

Id: "cpu_utilization_node",
Query: (`instance:node_cpu_utilisation:rate5m{job="node-exporter", instance="%s", cluster=""} != 0`, *node),
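A self-contained sketch of how query templates like these could be filled in, assuming the `%s` placeholders are substituted with `fmt.Sprintf`. The `Query` struct and `BuildQueries` function are hypothetical and only mirror the `Id`/`Query` fields quoted above; the namespace and node name in `main` are placeholders, and the linked main.go may be structured differently.

```go
package main

import "fmt"

// Query pairs a short identifier with a rendered PromQL expression.
// Hypothetical type mirroring the Id/Query fields quoted above.
type Query struct {
	Id    string
	Query string
}

// BuildQueries renders the three query templates for a namespace and a node.
func BuildQueries(projectNamespace, node string) []Query {
	return []Query{
		{
			Id: "kepler_dram",
			Query: fmt.Sprintf(
				`sum by (pod_name, container_namespace)(irate(kepler_container_dram_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`,
				projectNamespace),
		},
		{
			Id: "kepler_package",
			Query: fmt.Sprintf(
				`sum by (pod_name, container_namespace)(irate(kepler_container_package_joules_total{container_namespace=~"%s",pod_name=~".*"}[1m]))`,
				projectNamespace),
		},
		{
			Id: "cpu_utilization_node",
			Query: fmt.Sprintf(
				`instance:node_cpu_utilisation:rate5m{job="node-exporter", instance="%s", cluster=""} != 0`,
				node),
		},
	}
}

func main() {
	// Placeholder namespace and node name for illustration.
	for _, q := range BuildQueries("falco", "node-1") {
		fmt.Printf("%s:\n  %s\n", q.Id, q.Query)
	}
}
```

Since irate over a joules counter yields joules per second, the two Kepler queries effectively report power in watts per pod.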

@nikimanoledaki since you have some experience with the kepler metrics, do you think this is enough?

For a first implementation I guess it is fine to print the metrics in the output, then we can refine it later :)

raymundovr · 2024-02-24T11:42:47Z

Thank you @rossf7 and @AntonioDiTuri for taking a look and providing feedback.
Indeed, what I'm outlining here is:

A time based test with a series of queries at a given interval.
A way to define queries that might be interesting to observe for a test. Currently all are static, but nothing prevents this from changing in the future.
A parameterized script, where important things can come as arguments. Can include desired queries later on.

What do you think @nikimanoledaki ?

raymundovr · 2024-02-24T17:15:57Z

Update: cleaned things a bit and added kepler_container_joules_total metric.

nikimanoledaki mentioned this issue Feb 7, 2024

[Tracking] Gather metrics for idle Falco #34

Open

4 tasks

nikimanoledaki added this to TAG-Environmental-Sustainability Feb 7, 2024

nikimanoledaki moved this to Backlog in TAG-Environmental-Sustainability Feb 7, 2024

nikimanoledaki added board/wg-green-reviews help wanted Extra attention is needed good first issue Good for newcomers labels Feb 7, 2024

nikimanoledaki changed the title ~~[Action/Falco] Create a script to run the idle benchmark test~~ [Action/Falco] Complete idle benchmark test Feb 16, 2024

raymundovr mentioned this issue Feb 17, 2024

Add idle-benchmark cron job definition #60

Closed

rossf7 mentioned this issue Feb 19, 2024

feat: Add GitHub Actions self hosted runner #63

Closed

nikimanoledaki changed the title ~~[Action/Falco] Complete idle benchmark test~~ [Action/Falco] Complete baseline benchmark test Feb 19, 2024

nikimanoledaki removed good first issue Good for newcomers help wanted Extra attention is needed labels Feb 20, 2024

nikimanoledaki assigned raymundovr and nikimanoledaki Feb 20, 2024

nikimanoledaki added this to the Measure the cloud native sustainability footprint of Falco manually milestone Feb 20, 2024

nikimanoledaki changed the title ~~[Action/Falco] Complete baseline benchmark test~~ [Action/Falco] Investigate benchmark test data collection Feb 20, 2024

nikimanoledaki removed this from the [Q1 24] Measure the cloud native sustainability footprint of Falco manually milestone Apr 11, 2024

nikimanoledaki added this to the [Q2 24] Deploy, Run, Report: Automate the sustainability footprint pipeline milestone Apr 11, 2024

leonardpahlke removed the status in TAG-Environmental-Sustainability Apr 18, 2024
