Disk queue re-sends last segment file when there is more than one segment on disk #32560

Open · Tracked by #16
belimawr opened this issue Aug 1, 2022 · 6 comments
Labels: bug, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

belimawr (Contributor) commented Aug 1, 2022

When using the disk queue, restarting Filebeat and then sending a new event causes the disk queue to create a second segment file. While both segment files are present, the last segment is re-sent every time Filebeat restarts.

The original Discuss topic contains steps to reproduce using the file output; here I used the console output. The bug is reproducible either way.

Steps to Reproduce:

1. Use the config below and start Filebeat.

filebeat.inputs:
- type: filestream
  id: test-id
  enabled: true

  paths:
    - /tmp/test.log

output.console:
  enabled: true
  pretty: true

logging.level: debug
logging.to_files: true

queue.disk:
  max_size: 1GB
filebeat.shutdown_timeout: 30s
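Save the config as, e.g., filebeat.yml (the file name is just an example) and start Filebeat with ./filebeat -c filebeat.yml from the Filebeat directory.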

2. Add a line to the log file.

Contents of the log file (/tmp/test.log):

2022-07-29 09:00:00.000 0000001

You will see an event in the console with the line that was added to the file.

The disk queue folder (./data/diskqueue) contains two files:

-rw------- 1 root root 424 Jul 29 16:43 0.seg
-rw------- 1 root root 28 Jul 29 16:43 state.dat

3. Add a second line to the log file:

Contents of the log file (/tmp/test.log):

2022-07-29 09:00:00.000 0000001
2022-07-29 09:01:00.000 0000002

You will see an event in the console with the line that was added to the file.

4. Restart Filebeat

There are no changes in the files or folders.

5. Add a new log line

Contents of the log file (/tmp/test.log):

2022-07-29 09:00:00.000 0000001
2022-07-29 09:01:00.000 0000002
2022-07-29 09:01:00.000 0000003

You will see an event in the console with the line that was added to the file.

The disk queue creates a new segment file; now there are two:

-rw------- 1 root root 424 Jul 29 16:43 0.seg
-rw------- 1 root root 216 Jul 29 16:46 1.seg
-rw------- 1 root root 28  Jul 29 16:46 state.dat

6. Restart Filebeat

You will see the last event (2022-07-29 09:01:00.000 0000003) sent again.

Every time you restart Filebeat, the same event is re-sent.

7. Keep sending events and restarting Filebeat

If we keep sending events, the current segment files are eventually deleted and a new one is created. While there is only one segment file, no duplicated data is sent.

If Filebeat is restarted once more, a second segment file is created, and while there are two segment files, the last segment file is always re-sent when Filebeat restarts.

Logs and observations

When there are two segment files in the disk queue and Filebeat starts, the "last segment" seems to be re-sent before the registry is started. The disk queue also logs a negative number of events: Found -4 existing events on queue start.

Here are some of the debug logs sent to stderr with the console output enabled: filebeat.log

belimawr added the bug, Filebeat, and Team:Elastic-Agent-Data-Plane labels on Aug 1, 2022
elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz (Member) commented Aug 2, 2022

@leehinman I think we will want to address this as part of the work to support the disk queue under agent.

CC @faec if you have any idea on a possible root cause.

faec (Contributor) commented Aug 3, 2022

Just tried this (it still reproduces on the current main branch) and it looks like the problem is not in the disk queue per se but in the Beats shutdown: nothing ever calls diskQueue.Close. When a segment file is still being written on shutdown, diskQueue.Close saves its final state (and deletes the file if appropriate). Since that never happens, we're essentially just cutting off the file handles without writing the final metadata. (The inconsistent metadata is why the logs report a negative number of events on startup.)

Once the queue realizes the metadata on the segment file is incomplete, it rescans the file and sends all of its events. This is a conservative approach: if there is data, the queue sends it unless we are certain it has already been sent, and the conflicting metadata prevents that certainty.

So the root cause looks like something in the pipeline that isn't propagating the shutdown signal to the queue.
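To make that concrete, here is a minimal sketch of the missing propagation (hypothetical Go code, not the actual Beats pipeline; the Pipeline type, its queue field, and the Queue interface are assumptions for illustration, and only diskQueue.Close comes from the discussion above):

package pipeline

// Queue is a stand-in for the queue interface. Per the diagnosis
// above, the disk queue's Close flushes the final segment metadata
// and deletes fully-acknowledged segments before releasing file
// handles.
type Queue interface {
	Close() error
}

type Pipeline struct {
	queue Queue
	// ... other pipeline state ...
}

// Close is the kind of shutdown hook that, per the diagnosis above,
// never reaches the disk queue today. Propagating shutdown here would
// let the queue persist consistent metadata for the in-progress
// segment instead of having its file handles cut off.
func (p *Pipeline) Close() error {
	// ... stop inputs and drain in-flight batches first ...
	return p.queue.Close()
}

With a call like that in place, a restart would find consistent segment metadata and could skip the conservative rescan described above.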

faec (Contributor) commented Aug 3, 2022

(In particular, though, this means we're unlikely to see this in the shipper, since our shutdown process there is entirely different.)

botelastic bot commented Aug 18, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!

micw commented Jan 29, 2024

Hello,
I stumbled on this while testing out Filebeat and then found this issue, which has been open for more than a year.
It looks like the root cause is known and the fix is simple (close the queue properly). Any chance of getting a fix in the near future?

botelastic bot removed the Stalled label on Jan 29, 2024