Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memqueue getting stuck on shutdown #37077

Merged
merged 1 commit into from
Nov 13, 2023

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Nov 9, 2023

Proposed commit message

There was a case when openState.publish could get stuck and ignore its shutdown signal, effectively preventing a Beat (confirmed with Filebeat) from gracefully terminating.

This commit fixes this by ensuring every channel read/write also checks for the shutdown signal.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist

How to test this PR locally

  1. Create a server that will reply with logs. I used my fork from flog
    Build/install and run with flog -t http -f json, this will start an HTTP server on :3000

  2. Build Filebeat from the x-pack folder

  3. Use the following filebeat.yml

    filebeat.yml

    filebeat.inputs:
      - type: httpjson
        interval: 5s
        config_version: 2
        request.url: http://localhost:3000
        response.pagination:
          - set:
              target: url.params.foo
              value: '[[.last_response.body.bytes]]'
        cursor:
          last_requested_at:
            value: '[[now]]'
        processors:
          - decode_json_fields:
              fields: ["message"]
              target: ""
              add_error_key: true
              overwrite_keys: true
    queue:
      mem:
        events: 32
        flush.min_events: 32
    
    output:
      elasticsearch:
        allow_older_versions: true
        hosts:
          - https://localhost:9200
        ssl.verification_mode: none
        protocol: https
        username: elastic
        password: changeme
    
    logging:
      level: debug
      to_stderr: true

  4. Do NOT deploy an Elasticsearch, you want the output blocked and the queue full!

  5. Start filebeat, it will log to stdout

  6. Wait until no more events are published and the only logs are metrics and reconnection attempts

    {"log.level":"info","@timestamp":"2023-11-10T18:57:08.999+0100","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":4805419008}}}},"cpu":{"system":{"ticks":30},"total":{"ticks":150,"value":150},"user":{"ticks":120}},"handles":{"limit":{"hard":524288,"soft":524288},"open":10},"info":{"ephemeral_id":"127caa76-6bef-4760-81ca-5465c5eb0604","uptime":{"ms":90041},"version":"8.12.0"},"memstats":{"gc_next":36516080,"memory_alloc":20845056,"memory_total":75824200,"rss":115867648},"runtime":{"goroutines":23}},"filebeat":{"events":{"active":33},"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":33}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":1.27,"15":0.77,"5":0.87,"norm":{"1":0.0794,"15":0.0481,"5":0.0544}}}},"ecs.version":"1.6.0"}}
    {"log.level":"error","@timestamp":"2023-11-10T18:57:09.666+0100","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(https://localhost:9200)): Get \"https://localhost:9200\": dial tcp 127.0.0.1:9200: connect: connection refused","service.name":"filebeat","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-10T18:57:09.666+0100","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(https://localhost:9200)) with 6 reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-11-10T18:57:09.666+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":284},"message":"ES Ping(url=https://localhost:9200)","service.name":"filebeat","ecs.version":"1.6.0"}
    {"log.level":"error","@timestamp":"2023-11-10T18:57:09.667+0100","log.logger":"esclientleg","log.origin":{"file.name":"transport/logging.go","file.line":38},"message":"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused","service.name":"filebeat","network":"tcp","address":"localhost:9200","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-11-10T18:57:09.667+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":288},"message":"Ping request failed with: Get \"https://localhost:9200\": dial tcp 127.0.0.1:9200: connect: connection refused","service.name":"filebeat","ecs.version":"1.6.0"}
    
  7. Hit CTRL+C or send a SIGTERM to Filebeat

  8. It should gracefully shutdown within a few seconds.

Even though the issue is within the queue implementation, some inputs will eventually shutdown (like filestream) even if the Publish call got stuck, the httpjson input will stay blocked for ever.

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 9, 2023
Copy link
Contributor

mergify bot commented Nov 9, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Nov 9, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-11-10T14:50:58.561+0000

  • Duration: 134 min 18 sec

Test stats 🧪

Test Results
Failed 0
Passed 28670
Skipped 2015
Total 30685

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr force-pushed the fix-queue-deadlock-on-shutdown branch from abcc42e to 34dba90 Compare November 10, 2023 12:24
There was a case when openState.publish could get stuck and ignore its
shutdown signal, effectively preventing a Filebeat (or any Beat) from
gracefully terminating.

This commit fixes this by ensuring every channel read/write also
checks for the shutdown signal.
@belimawr belimawr force-pushed the fix-queue-deadlock-on-shutdown branch from 34dba90 to fef001f Compare November 10, 2023 14:50
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 10, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 10, 2023
@belimawr belimawr marked this pull request as ready for review November 10, 2023 18:11
@belimawr belimawr requested a review from a team as a code owner November 10, 2023 18:11
@cmacknz cmacknz requested a review from faec November 10, 2023 18:18
@belimawr belimawr added the backport-v8.11.0 Automated backport with mergify label Nov 13, 2023
@belimawr belimawr merged commit a473880 into elastic:main Nov 13, 2023
87 checks passed
mergify bot pushed a commit that referenced this pull request Nov 13, 2023
There was a case when openState.publish could get stuck and ignore its
shutdown signal, effectively preventing a Filebeat (or any Beat) from
gracefully terminating.

This commit fixes this by ensuring every channel read/write also
checks for the shutdown signal.

(cherry picked from commit a473880)
belimawr added a commit that referenced this pull request Nov 16, 2023
There was a case when openState.publish could get stuck and ignore its
shutdown signal, effectively preventing a Filebeat (or any Beat) from
gracefully terminating.

This commit fixes this by ensuring every channel read/write also
checks for the shutdown signal.

(cherry picked from commit a473880)

---------

Co-authored-by: Tiago Queiroz <[email protected]>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
There was a case when openState.publish could get stuck and ignore its
shutdown signal, effectively preventing a Filebeat (or any Beat) from
gracefully terminating.

This commit fixes this by ensuring every channel read/write also
checks for the shutdown signal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.11.0 Automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] Filebeat process does not exit when stopping with CTRL-C
3 participants