Agent remains Unhealthy even on updating invalid integration configuration to valid input. #2954

Closed
2 tasks
amolnater-qasource opened this issue Jun 28, 2023 · 9 comments
Assignees
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amolnater-qasource

amolnater-qasource commented Jun 28, 2023

Issue

Kibana Build details:

VERSION: 8.9.0 BC1
BUILD: 64385
COMMIT: 313dac73d8d3bc5930447f732e3ae163fb1b7f70

Host OS and Browser version: All, All

Preconditions:

  1. An 8.9.0 BC1 Kibana cloud environment should be available.
  2. A few agents should be installed.

Steps to reproduce:

  1. Navigate to the Fleet > Agents tab.
  2. Select any agent and navigate to its agent policy > system-1 integration.
  3. Enter an invalid value, e.g. xxxxxx, in the Cpu metrics field.
  4. Observe that the agent becomes Unhealthy and that the appropriate error is shown under agent details.
  5. Update the field to the expected valid value: percentages.
  6. Observe that the agent remains Unhealthy even after 30 minutes.

Expected:
The agent should return to Healthy after the invalid integration configuration is updated to a valid input.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-06-28.11-00-17.mp4
ec2amaz-tc0oajr.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-06-28.11-21-38.mp4

Debug Logs:
elastic-agent-diagnostics-2023-06-28T06-08-54Z-00.zip

Definition of done

  • Agents should return to Healthy when an invalid input is switched to a valid one.
  • Tests are in place to confirm that the agent state matches what we expect when the input status changes (in both directions).
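
The second item could be approached with a round-trip test along the following lines. This is only a toy sketch: `ToyUnit`, `reload`, and `check_round_trip` are hypothetical names, not Elastic Agent or Beats code, and the set of valid values is taken from the `core.metrics` error message quoted later in this thread.

```python
# Valid options as listed in the error message reported in this issue.
VALID_CORE_METRICS = {"percentages", "ticks"}


class ToyUnit:
    """Hypothetical stand-in for an input unit; not Elastic Agent code."""

    def __init__(self):
        self.state = "HEALTHY"
        self.message = ""

    def reload(self, core_metrics):
        """Apply a new config and recompute the unit state from scratch."""
        invalid = [m for m in core_metrics if m not in VALID_CORE_METRICS]
        if invalid:
            self.state = "FAILED"
            self.message = f"invalid core.metrics value '{invalid[0]}'"
        else:
            # Expected behaviour: a valid config clears any earlier failure.
            self.state = "HEALTHY"
            self.message = ""


def check_round_trip():
    """Drive the unit valid -> invalid -> valid and check both transitions."""
    unit = ToyUnit()
    unit.reload(["xxxxxx"])
    assert unit.state == "FAILED"
    unit.reload(["percentages"])
    assert unit.state == "HEALTHY"
    return unit.state


check_round_trip()
```

A real test would replace `ToyUnit` with the actual component and assert on the state the agent reports, but the shape (go bad, then go good, and assert after each step) would stay the same.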
@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:high Short-term priority; add to current release, or definitely next. labels Jun 28, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is Done

@cmacknz
Member

cmacknz commented Jun 28, 2023

This is the failing input in state.yaml:

- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''5768'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-6d2e7b5d-b166-466c-8fb5-db5c3a512387
      : state: 4
        message: '[failed to reload inputs: 1 error: Error creating runner from config:
          1 error: error validating config: invalid core.metrics value ''xxxxxxx''
          (valid options are percentages and ticks)]'
      ? unittype: 1
        unitid: system/metrics-default
      : state: 4
        message: '[failed to reload inputs: 1 error: Error creating runner from config:
          1 error: error validating config: invalid core.metrics value ''xxxxxxx''
          (valid options are percentages and ticks)]'
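
For anyone triaging a similar diagnostics bundle, a rough way to list the non-healthy units from text shaped like the snippet above is sketched below. It uses a regular expression because the units are keyed by YAML complex keys (`? unittype: ...`). The state names are an assumption inferred from the messages in the snippet (2 appears alongside "Healthy", 4 alongside "failed to reload inputs"), `failed_units` is a hypothetical helper, and the sample ids and messages are abbreviated.

```python
import re

# Assumed meaning of the numeric codes, inferred from the snippet only.
STATE_NAMES = {2: "HEALTHY", 4: "FAILED"}

UNIT_RE = re.compile(
    r"\?\s*unittype:\s*(\d+)\s*\n"  # complex-key marker carrying the unit type
    r"\s*unitid:\s*(\S+)\s*\n"      # unit id on the following line
    r"\s*:\s*state:\s*(\d+)"        # value part starts with the state code
)


def failed_units(state_yaml_text):
    """Return (unit_id, unit_type, state_name) for every unit not in state 2."""
    results = []
    for unit_type, unit_id, state in UNIT_RE.findall(state_yaml_text):
        code = int(state)
        if code != 2:
            name = STATE_NAMES.get(code, str(code))
            results.append((unit_id, int(unit_type), name))
    return results


# Abbreviated sample in the same layout as the state.yaml excerpt.
SAMPLE = """\
- id: system/metrics-default
  state:
    state: 2
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-6d2e7b5d
      : state: 4
        message: 'invalid core.metrics value'
      ? unittype: 1
        unitid: system/metrics-default
      : state: 4
        message: 'invalid core.metrics value'
"""

for unit in failed_units(SAMPLE):
    print(unit)
```

Note that the component itself reports state 2 while both of its units report state 4, which is why the per-unit view matters here.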

The configuration for the system/metrics in pre-config.yaml looks fine, and matches what is in beat-rendered-config.yaml:

- data_stream:
    namespace: windows
  id: system/metrics-system-6d2e7b5d-b166-466c-8fb5-db5c3a512387
  meta:
    package:
      name: system
      version: 1.34.0
  name: system-2
  package_policy_id: 6d2e7b5d-b166-466c-8fb5-db5c3a512387
  revision: 3
  streams:
  - core.metrics:
    - percentages
    data_stream:
      dataset: system.core
      type: metrics
    id: system/metrics-system.core-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - core
  - cpu.metrics:
    - percentages
    - normalized_percentages
    data_stream:
      dataset: system.cpu
      type: metrics
    id: system/metrics-system.cpu-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - cpu
    period: 10s
  - data_stream:
      dataset: system.diskio
      type: metrics
    diskio.include_devices: null
    id: system/metrics-system.diskio-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - diskio
    period: 10s
  - data_stream:
      dataset: system.filesystem
      type: metrics
    id: system/metrics-system.filesystem-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - filesystem
    period: 1m
    processors:
    - drop_event:
        when:
          regexp:
            system.filesystem.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
  - data_stream:
      dataset: system.fsstat
      type: metrics
    id: system/metrics-system.fsstat-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - fsstat
    period: 1m
    processors:
    - drop_event:
        when:
          regexp:
            system.fsstat.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
  - condition: ${host.platform} != 'windows'
    data_stream:
      dataset: system.load
      type: metrics
    id: system/metrics-system.load-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - load
    period: 10s
  - data_stream:
      dataset: system.memory
      type: metrics
    id: system/metrics-system.memory-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - memory
    period: 10s
  - data_stream:
      dataset: system.network
      type: metrics
    id: system/metrics-system.network-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - network
    network.interfaces: null
    period: 10s
  - data_stream:
      dataset: system.process
      type: metrics
    id: system/metrics-system.process-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - process
    period: 10s
    process.cgroups.enabled: false
    process.cmdline.cache.enabled: true
    process.include_cpu_ticks: false
    process.include_top_n.by_cpu: 5
    process.include_top_n.by_memory: 5
    processes:
    - .*
  - data_stream:
      dataset: system.process.summary
      type: metrics
    id: system/metrics-system.process.summary-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - process_summary
    period: 10s
  - data_stream:
      dataset: system.socket_summary
      type: metrics
    id: system/metrics-system.socket_summary-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - socket_summary
    period: 10s
  - data_stream:
      dataset: system.uptime
      type: metrics
    id: system/metrics-system.uptime-6d2e7b5d-b166-466c-8fb5-db5c3a512387
    metricsets:
    - uptime
    period: 10s
  type: system/metrics
  use_output: default

This is definitely a bug, one we haven't seen before.
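
The comparison made above (the reported error names a value that no longer appears in the rendered config) can be sketched as a small check. This is a crude heuristic under stated assumptions: `rejected_value` and `is_stale_error` are hypothetical helpers, the message format is taken from the error quoted in this issue, and a plain substring test stands in for real config parsing.

```python
import re


def rejected_value(error_message):
    """Extract (key, value) from an "invalid <key> value '<value>'" message.

    Returns None if the message does not have that shape. One or two quote
    characters are accepted because state.yaml doubles the single quotes.
    """
    match = re.search(r"invalid (\S+) value '+([^']+)'+", error_message)
    return match.groups() if match else None


def is_stale_error(error_message, rendered_config_text):
    """True if the value the error complains about is absent from the config."""
    parsed = rejected_value(error_message)
    if parsed is None:
        return False
    _key, value = parsed
    return value not in rendered_config_text


message = (
    "error validating config: invalid core.metrics value 'xxxxxxx' "
    "(valid options are percentages and ticks)"
)
current_config = "core.metrics:\n- percentages\n"

print(is_stale_error(message, current_config))
```

When the check returns true, the unit state is reporting a problem that the current configuration can no longer cause, which matches the situation described in this comment.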

@LaZyDK
Contributor

LaZyDK commented Sep 12, 2023

We are seeing something similar in 8.9.2.
The first "Custom Logs" integration is not enabled, hence the lack of errors.
Skærmbillede_2023-09-12_kl__13_18_54

@AndersonQ
Member

@amolnater-qasource could you re-validate this issue? I cannot reproduce it on main or on 8.10.2. It was most likely fixed by elastic/beats#36183.

@LaZyDK
Contributor

LaZyDK commented Sep 28, 2023

The issue has gone away in 8.10.1.

@amolnater-qasource
Author

Hi @AndersonQ

Thank you for the update.

We have revalidated this issue on 8.10.2 and 8.11.0-SNAPSHOT Kibana cloud environments and found it fixed.

Observations:

  • The agent returns to Healthy after the invalid integration configuration is updated to a valid input.

Screen Recording:
8.11.0:

ec2amaz-u7odjck.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-09-29.12-04-45.mp4

8.10.2:

Edit.integration.-.Windows.Agent.policy.1.-.Agent.policies.-.Fleet.-.Elastic.-.Google.Chrome.2023-09-29.12-08-59.mp4

Build details:
VERSION: 8.11.0 SNAPSHOT
BUILD: 67332
COMMIT: c20d177a036be73d7b1180dc17e644afa260994f

Hence we are closing this issue and marking it as QA:Validated.

Thanks!!

@amolnater-qasource amolnater-qasource added the QA:Validated Validated by the QA Team label Sep 29, 2023
@harshitgupta-qasource

harshitgupta-qasource commented Jan 24, 2024

Bug Conversion

  • Test-Case not required as this particular checkpoint is already covered in exploratory testing.

Thanks!


7 participants