Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(validator): add support to validate essential metrics produced by Kepler #1834

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vprashar2929
Copy link
Collaborator

@vprashar2929 vprashar2929 commented Nov 4, 2024

This commit introduces functionality to validate essential metrics produced by Kepler
The following comparisons are included:

  • Node Exporter Comparison

    • Validates node_rapl_<package|core|dram> metrics against kepler_node_<package|core|dram>{dev}
  • Kepler Process Comparison

    • Compares kepler_process_<package|core|dram|platform|other|uncore>{latest} metrics to
      kepler_process_<package|core|dram|platform|other|uncore>{dev}
  • Kepler Node Comparison

    • Validates kepler_node_<package|core|dram|platform|other|uncore>{latest} against
      kepler_node_<package|core|dram|platform|other|uncore>{dev}

Additionally, the following changes are made to existing functionality:

  • Adds a new metric_validations.yaml file which includes promql queries for comparisons along with threshold values
  • Update the existing stressor.sh script to now support few more parameters to make it more flexible
    • warmup time: time to wait before starting the stressor
    • cooldown time: time to wait after the stressor is finished
    • repeats: number of times to repeat the stressor. Since for
      regression test we don't want to repeat the stressor multiple times
  • Adds a new validator-regression.yaml file which includes the configuration for the regression test

@vprashar2929 vprashar2929 marked this pull request as draft November 4, 2024 11:21
Copy link
Contributor

github-actions bot commented Nov 4, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request enhances the validator module by introducing a new validate_metrics command to the CLI, which validates Kepler metrics and generates reports. Key modifications include:

  • Added validate_metrics command with --duration and --report-dir options
  • Modified PrometheusJob named tuple to include dev and latest fields
  • Updated load function to initialize new fields when loading configuration from a file
  • Introduced max_mae parameter to validation_from_yaml function for Maximum Absolute Error computation
  • Added new functions: validate_metrics, ScriptResult, and write_md_report

Impact: These changes expand the validator's capabilities for handling and validating Kepler metrics, but do not affect exported function signatures or global data structures. The external interface and behavior are modified, requiring configuration file updates.

Observations/Suggestions:

  • The changes seem well-structured and organized, with clear intentions to enhance the validator's features.
  • It would be beneficial to include tests for the new validate_metrics command and its associated functions to ensure their correctness and robustness.
  • Consider adding documentation or comments to explain the purpose and usage of the new max_mae parameter and the ScriptResult and write_md_report functions.
  • Review the configuration file updates required for these changes to ensure a smooth transition for users.

@vprashar2929 vprashar2929 force-pushed the add-kep-reg branch 4 times, most recently from 827aefd to 564fa4c Compare November 5, 2024 17:49
@vprashar2929 vprashar2929 requested a review from sthaha November 5, 2024 17:53

validations:
# absolute power comparison
- name: Total - absolute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate these invariants in the same version of dev

  • kepler_node_<pkg|core|uncore|dram|other>{dev} = sum of ( process_<pkg|core|uncore|dram|other>{dev} )
  • kepler_node_<pkg|core|..> = node_exporter_rapl_<pkg|core...>
    *sum( kepler_process_bpf_cpu ) = node_exporter_cpu_time

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for kepler_node_<pkg|core|dram...>{dev} = sum of (process_<pkg|core|dram....>){dev} do you mean
MAE of sum(rate(kepler_node<pkg|core|dram>){dev}[20s]) and sum(rate(process_<pkg|core|dram>{dev}[20s]))?

@vprashar2929 vprashar2929 force-pushed the add-kep-reg branch 6 times, most recently from 33d2963 to de1649f Compare November 9, 2024 14:18
@vprashar2929 vprashar2929 changed the title feat(validator): add support to validate kepler metrics feat(validator): add support to validate essential metrics produced by Kepler Nov 9, 2024
@vprashar2929 vprashar2929 marked this pull request as ready for review November 10, 2024 14:53
metal: metal # Job name for metal metrics, default is metal

url: http://localhost:9090 # Prometheus server URL
rate_interval: 60s # Rate interval for Promql, default is 20s, typically 4 x $scrape_interval
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly using rate interval as 60s because:

Prometheus scrape Interval = 3s
Data points for 12s Interval(i.e 4* scrape interval) = 12/3 = 4 data points
Data points for 60s interval = 60/3 = 20 data points

With 20 data points, we get a smoother and more reliable estimate. When comparing two sum(rate(...)) a stable rate reduces the variability in MAE calculations leading to more accurate assessments.

@vprashar2929 vprashar2929 force-pushed the add-kep-reg branch 2 times, most recently from 28889fe to fe32d16 Compare December 16, 2024 13:06
@@ -1,5 +1,5 @@
global:
scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
scrape_interval: 3s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check why changed scrape interval, and update comment accordingly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting scrape every 3 seconds rather than every 5 seconds, over a typical time window will collect significantly more data

@vprashar2929
Copy link
Collaborator Author

Here is sample CI run that would look like for reference once we have this merged: https://github.com/sustainable-computing-io/kepler-metal-ci/actions/runs/12366281744/job/34512777104

My idea is to use the equinix runners on demand on PR's. Reviewers or authors can add a comment in the PR something like /test-regression which will trigger a workflow like this which can test if metrics produced by PR code base Kepler are off to what is already present in latest

…y Kepler

This commit introduces functionality to validate essential metrics produced by Kepler
The following comparisons are included:

- Node Exporter Comparison
   - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`

- Kepler Process Comparison
   - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to
      `kepler_process_<package|core|dram|platform|other|uncore>{dev}`

- Kepler Node Comparison
   - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against
      `kepler_node_<package|core|dram|platform|other|uncore>{dev}`

Additionally, the following changes are made to existing functionality:

- Adds a new `metric_validations.yaml` file which includes promql queries for comparisons along with threshold values
- Update the existing `stressor.sh` script to now support few more parameters to make it more flexible
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor is finished
  - repeats: number of times to repeat the stressor. Since for
    regression test we don't want to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file which includes the configuration for the regression test

Signed-off-by: vprashar2929 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants