
Fix unintended skip in metric collection on Azure Monitor #37203

Merged

Conversation

@zmoog (Contributor) commented Nov 26, 2023

Proposed commit message

Because the reference time used to decide whether to collect a metric's value was set with time.Now(), the metricset could unintentionally skip a metric collection.

Made two changes to the Azure Monitor metricset:

  • Moved referenceTime into Fetch() and truncated its value to seconds, for a more predictable reference time in comparisons.
  • Updated the MetricRegistry.NeedsUpdate() method to compare the time grain duration against referenceTime instead of "now" (see the sketch after this list).
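
Here's a minimal, standalone sketch of that decision under a fixed reference time; the type, field, and function names are illustrative assumptions, not the actual metricset code:

```go
package main

import (
	"fmt"
	"time"
)

// metricRecord is a stand-in for the registry's per-metric bookkeeping
// (field names are assumptions, not the actual Metricbeat types).
type metricRecord struct {
	lastCollectionAt time.Time     // when the metric was last collected
	timeGrain        time.Duration // e.g. 5 * time.Minute for PT5M
}

// needsUpdate reports whether a full time grain has elapsed since the last
// collection. Comparing every metric against one fixed reference time
// avoids the jitter of calling time.Now() separately per comparison,
// which is what caused the unintended skips.
func needsUpdate(referenceTime time.Time, r metricRecord) bool {
	return referenceTime.Sub(r.lastCollectionAt) >= r.timeGrain
}

func main() {
	// Computed once per Fetch() and truncated to the second, so every
	// metric in the same run is compared against the same instant.
	referenceTime := time.Now().UTC().Truncate(time.Second)

	r := metricRecord{
		lastCollectionAt: referenceTime.Add(-5 * time.Minute),
		timeGrain:        5 * time.Minute,
	}
	fmt.Println(needsUpdate(referenceTime, r)) // true: one full PT5M grain elapsed
}
```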

Current tests seem fine with a PT5M time grain and collection periods of 300s and 60s.

I am also adding some structured logging messages to track registry decisions at the debug log level.
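
Here's a standalone sketch of the kind of call that produces those entries, using Beats' logp package; the message text, call site, and example values are assumptions, though the field names match the jq filter below:

```go
package main

import (
	"time"

	"github.com/elastic/elastic-agent-libs/logp"
)

func main() {
	// Standalone setup so debug output is visible; inside Metricbeat the
	// logger configuration comes from the beat itself.
	_ = logp.DevelopmentSetup()
	logger := logp.NewLogger("azure monitor")

	referenceTime := time.Now().UTC().Truncate(time.Second)

	// Debugw attaches key/value pairs that show up as JSON fields in
	// metricbeat.log.json.
	logger.Debugw("MetricRegistry needs_update decision",
		"key", "fdd3a07a3cabd90233c083950a4bc30c",
		"needs_update", true,
		"reference_time", referenceTime,
		"now", time.Now().UTC(),
		"time_grain_start_time", referenceTime.Add(-5*time.Minute),
		"last_collection_at", referenceTime.Add(-5*time.Minute),
	)
}
```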

Here's how to parse the structured logs to get a nice table view:

```shell
$ cat metricbeat.log.json | grep "MetricRegistry" | jq -r '[.key, .needs_update, .reference_time, .now, .time_grain_start_time//"n/a", .last_collection_at//"n/a"] | @tsv'
fdd3a07a3cabd90233c083950a4bc30c        true    2023-11-26T15:51:30.000Z        2023-11-26T15:51:30.967Z        2023-11-26T15:46:30.000Z        2023-11-26T15:46:30.000Z
6ee8809577a906538473e3e5e98dc893        true    2023-11-26T15:51:30.000Z        2023-11-26T15:51:35.257Z        2023-11-26T15:46:30.000Z        2023-11-26T15:46:30.000Z
6aedb7dffafbfe9ca19e0aa01436d30a        false   2023-11-26T15:51:30.000Z        2023-11-26T15:51:35.757Z        2023-11-26T15:46:30.000Z        2023-11-26T15:48:30.000Z
```
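
Reading the sample output: the first two metrics were last collected at 15:46:30, a full PT5M time grain before the 15:51:30 reference time, so needs_update is true; the third was collected at 15:48:30, only three minutes earlier, so the registry reports false and the metric is not collected again yet.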

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Here's the Compute VM dashboard running a custom agent build from this PR:

[Screenshot: Compute VM metrics dashboard, 2023-11-26]

Logs

@botelastic botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) on Nov 26, 2023
@zmoog zmoog changed the title from "Fix unintended skip in metric collection" to "Fix unintended skip in metric collection on Azure Monitor" on Nov 26, 2023
@mergify mergify bot assigned zmoog Nov 26, 2023
mergify bot commented Nov 26, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @zmoog? 🙏.
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the digit)

@zmoog zmoog added the Team:Cloud-Monitoring label (Label for the Cloud Monitoring team) on Nov 26, 2023
@botelastic botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) on Nov 26, 2023
@elasticmachine (Collaborator)

💚 Build Succeeded


Build stats

  • Start Time: 2023-11-26T16:11:54.386+0000

  • Duration: 56 min 39 sec

Test stats 🧪

  • Failed: 0
  • Passed: 1518
  • Skipped: 96
  • Total: 1614

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine (Collaborator)

❕ Build Aborted

A new build is ongoing, so the previous builds have been aborted.


Build stats

  • Start Time: 2023-11-26T17:28:27.428+0000

  • Duration: 11 min 1 sec

Test stats 🧪

  • Failed: 0
  • Passed: 3
  • Skipped: 0
  • Total: 3


@elasticmachine (Collaborator)

💚 Build Succeeded


Build stats

  • Start Time: 2023-11-26T17:35:12.068+0000

  • Duration: 52 min 39 sec

Test stats 🧪

  • Failed: 0
  • Passed: 1518
  • Skipped: 96
  • Total: 1614

💚 Flaky test report

Tests succeeded.


@tommyers-elastic (Contributor) left a comment

LGTM

@zmoog zmoog marked this pull request as ready for review November 28, 2023 09:50
@zmoog zmoog requested a review from a team as a code owner November 28, 2023 09:50
@lucian-ioan (Contributor) left a comment

Changes LGTM.

Two review threads on x-pack/metricbeat/module/azure/client.go (outdated, resolved)
@elasticmachine (Collaborator)

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 7 min 12 sec


@elasticmachine (Collaborator)

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 7 min 25 sec


@elasticmachine (Collaborator)

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 11 min 33 sec


@elasticmachine (Collaborator)

💚 Build Succeeded


Build stats

  • Start Time: 2023-11-28T17:49:27.159+0000

  • Duration: 53 min 36 sec

Test stats 🧪

  • Failed: 0
  • Passed: 1518
  • Skipped: 96
  • Total: 1614

💚 Flaky test report

Tests succeeded.


@zmoog zmoog merged commit 110cc31 into elastic:main Nov 28, 2023
21 checks passed
@zmoog zmoog deleted the zmoog/fix-unintended-skip-collection-on-azure-metrics branch November 28, 2023 21:16
@zmoog zmoog added the backport-v8.11.0 label (Automated backport with mergify) on Nov 28, 2023
mergify bot pushed a commit that referenced this pull request Nov 28, 2023

* Fix unintended skip in metric collection

Co-authored-by: Richa Talwar <[email protected]>
(cherry picked from commit 110cc31)
zmoog added a commit that referenced this pull request Nov 28, 2023
…37222)

* Fix unintended skip in metric collection

Co-authored-by: Richa Talwar <[email protected]>
(cherry picked from commit 110cc31)

Co-authored-by: Maurizio Branca <[email protected]>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024

* Fix unintended skip in metric collection

Co-authored-by: Richa Talwar <[email protected]>
Labels

backport-v8.11.0 (Automated backport with mergify)
Team:Cloud-Monitoring (Label for the Cloud Monitoring team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure Monitor may skip a metric collection depending on the timing
5 participants