HostMetrics process scraper high CPU usage during collection on Windows Server 2019 #32947

Closed
drewftw opened this issue May 8, 2024 · 11 comments

Comments

@drewftw

drewftw commented May 8, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

The OTel Collector running on Windows Server 2019 showed CPU spikes of 3-7% each time the hostmetrics receiver ran a collection. The collection interval was set to 1 minute.

image

After testing, the issue was narrowed down to the process scraper. The following shows the OTel Collector CPU usage when only the process scraper is enabled.

image
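For reference, a minimal sketch of the receiver section for this kind of isolation run. It is trimmed from the full configuration attached to this issue and is an assumption about how the test was set up, not the exact file that was used:

```yaml
receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      # Only the process scraper is enabled; every other scraper is omitted.
      process:
        mute_process_exe_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
```

The exporters, processors, and service pipelines are assumed to stay the same as in the full configuration.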

After re-enabling every hostmetrics scraper except the process scraper, the magnitude of the CPU spikes comes down significantly (<0.5%).

Screenshot 2024-04-30 at 2 24 55 PM
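A sketch of the receiver section for the complementary run, again trimmed from the full configuration attached to this issue. The optional utilization metrics are left out here for brevity, which is a simplification and may not match the exact file that was tested:

```yaml
receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu:
      disk:
      load:
      filesystem:
      memory:
      network:
      paging:
      # "processes" reports system-wide process counts and stays enabled.
      processes:
      # The per-process "process" scraper is intentionally left out for this run.
```

Note that `processes` and `process` are separate scrapers. Only the latter is removed here.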

Steps to Reproduce

On a machine running Windows Server 2019, download v0.94.0 of the OTel Collector from https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.94.0.

Modify config.yaml to enable the hostmetrics process scraper and set the collection interval (see the config attached to this issue for an example).

Run the OTel Collector executable.

Monitor the collector's CPU usage in Task Manager, or graph it with perfmon.

Expected Result

CPU usage comparable to observed levels on Linux collectors (<0.5%)

Actual Result

CPU spikes to 3-7%

Collector version

v0.93.0

Environment information

Environment

OS: Windows Server 2019

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      load:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      network:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      processes:
      process:
        mute_process_exe_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
  docker_stats:
    collection_interval: 1m
    metrics:
      container.cpu.throttling_data.periods:
        enabled: true
      container.cpu.throttling_data.throttled_periods:
        enabled: true
      container.cpu.throttling_data.throttled_time:
        enabled: true
  prometheus:
    config:
      scrape_configs:
        - job_name: $InstanceId/otel-self-metrics-collector-$Region
          scrape_interval: 1m
          static_configs:
            - targets: ['0.0.0.0:9999']
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: normal
  otlp:
    endpoint: <endpoint>

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 2000ms
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          # comment a metric to remove from exclusion rule
          - otelcol_exporter_queue_capacity
          - otelcol_exporter_enqueue_failed_spans
          - otelcol_exporter_enqueue_failed_log_records
          - otelcol_exporter_enqueue_failed_metric_points
          - otelcol_exporter_send_failed_metric_points
          - otelcol_process_runtime_heap_alloc_bytes
          - otelcol_process_runtime_total_alloc_bytes
          - otelcol_processor_batch_timeout_trigger_send
          - otelcol_process_runtime_total_sys_memory_bytes
          - otelcol_process_uptime
          - otelcol_scraper_errored_metric_points
          - otelcol_scraper_scraped_metric_points
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added
          - scrape_duration_seconds
          # - up
  resourcedetection:
    detectors: ec2, env, system
    ec2:
      tags:
        - ^Environment$
    system:
      hostname_sources: ["os"]
      resource_attributes:
        host.id:
          enabled: true

extensions:
  health_check:
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection, filter]
    logs:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]

Log output

No response

Additional context

Additional details: Windows Server 2019 was running on an m5.xlarge EC2 instance.

drewftw added the bug and needs triage labels on May 8, 2024
Contributor

github-actions bot commented May 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@braydonk
Contributor

braydonk commented May 8, 2024

Hi @drewftw,

I made improvements to the CPU usage of the process scraper in v0.99.0 of the collector. Would you be able to update the collector and give that a try? Hopefully that should make it better.

@drewftw
Author

drewftw commented May 8, 2024

Hey @braydonk, thanks for your quick response! Sure, I can try v0.99 and see if it helps. I've been testing with v0.94 since that's the version my users are on; they haven't upgraded yet.

@braydonk
Contributor

braydonk commented May 8, 2024

Here's the issue that explains the CPU usage and how it was fixed in v0.99.0: #28849

We can't be 100% sure you aren't running into something different since this was focused on Linux, but it's worth seeing if this helps in your scenario.

@drewftw
Author

drewftw commented May 23, 2024

@braydonk We're still observing a similar pattern after upgrading to v0.99.0: CPU spikes to 5% whenever metrics are scraped. Is there anything I can investigate to provide more info?

Screenshot 2024-05-23 at 11 59 46 AM

@braydonk
Contributor

Thanks for the info @drewftw. I don't expect I'll need anything from your environment; I expect this is the same thing many users are experiencing rather than a specific breakage. The inefficiencies that existed on Linux may exist in different ways on Windows. I'll replicate the same research I did on Linux in my Windows environment.

I expect I can set aside some time next week, and I will keep this issue updated with progress.

@braydonk
Contributor

braydonk commented May 29, 2024

I had time to investigate this today and I opened a PR with details and a fix!

crobert-1 removed the needs triage label on May 29, 2024
@crobert-1
Member

Removed needs triage as a code owner has opened a PR to resolve this issue.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (stale) on Sep 27, 2024
@braydonk
Contributor

Not Stale. The PR for this is open and ready for review. Can be marked stale-exempt.
