Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Filebeat] Error with registry and truncated file #35571

Closed
elasticmachine opened this issue May 24, 2023 · 5 comments · Fixed by #35759
Closed

[Filebeat] Error with registry and truncated file #35571

elasticmachine opened this issue May 24, 2023 · 5 comments · Fixed by #35759
Assignees
Labels
automation bug build-failures Build failures in the CI. ci-reported Issues that have been automatically reported from the CI Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@elasticmachine
Copy link
Collaborator

💔 Build Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-05-24T19:59:22.849+0000

  • Duration: 132 min 20 sec

Test stats 🧪

Test Results
Failed 0
Passed 50869
Skipped 4792
Total 55661

Steps errors 5

Expand to view the steps failures

filebeat-goIntegTest - mage goIntegTest
  • Took 14 min 49 sec . View more details here
  • Description: mage goIntegTest
filebeat-goIntegTest - mage goIntegTest
  • Took 11 min 25 sec . View more details here
  • Description: mage goIntegTest
x-pack/metricbeat-goIntegTest - mage goIntegTest
  • Took 14 min 17 sec . View more details here
  • Description: mage goIntegTest
Dump mage variables
  • Took 0 min 28 sec . View more details here
  • Description: mage dumpVariables
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error "hudson.AbortException: script returned exit code 1"

@elasticmachine elasticmachine added automation ci-reported Issues that have been automatically reported from the CI build-failures Build failures in the CI. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 24, 2023
@elasticmachine
Copy link
Collaborator Author

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@leehinman leehinman changed the title Build 1561 for main with status FAILURE [Filebeat] Error with registry and truncated file May 30, 2023
@leehinman leehinman added the bug label May 30, 2023
@leehinman
Copy link
Contributor

In TestFilestreamTruncateBlockedOutput we see occasional errors with this test where the expected offset is never equal to the offset in the registry. after changing the test a little, I can see that the registry still has the offset of the original write, it doesn't have the offset after the second write or the truncate. Also the file does have a size of 0 on disk so the truncate function seems to have worked. This looks like a real bug in the registry.

It can be easily reproduced on arm64 Linux under UTM on OS X with:

while go test -count 1 -run TestFilestreamTruncateBlockedOutput -tags=integration; do :;  done

On OS X itself it rarely happens.

ok  	github.com/elastic/beats/v7/filebeat/input/filestream	2.053s
--- FAIL: TestFilestreamTruncateBlockedOutput (10.02s)
    environment_test.go:268:
        	Error Trace:	/home/hinman/src/beats/filebeat/input/filestream/environment_test.go:268
        	            				/home/hinman/src/beats/filebeat/input/filestream/input_integration_test.go:746
        	Error:      	Condition never satisfied
        	Test:       	TestFilestreamTruncateBlockedOutput
        	Messages:   	expected offset: '0', cursor offset: '4', err: '', filesize: '0'

@belimawr
Copy link
Contributor

I got something, there are two possible code paths that "detect" a file truncation, that depends on whether the harvester is running or not. Here is the output of the test running on a successful and failed run:

==================================================
=========== Test: File truncated, waiting for 0 offset in the registry
{"log.level":"info","@timestamp":"2023-06-13T11:59:10.793+0200","log.logger":"file_watcher","log.origin":{"file.name":"filestream/fswatch.go","file.line":159},"message":"**************** File truncated!!!!!!!!!!!!!!!!!! PrevSize: 23","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-13T11:59:10.793+0200","log.origin":{"file.name":"input-logfile/harvester.go","file.line":316},"message":"Resource 'filestream::fake-ID::native::693457-34' currently in use, waiting...","source_file":"filestream::fake-ID::native::693457-34","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-13T11:59:10.793+0200","log.origin":{"file.name":"filestream/input.go","file.line":321},"message":"Reader was closed. Closing. Offset: 23, Reader: *readfile.LimitReader","source_file":"filestream::fake-ID::native::693457-34","path":"/tmp/TestFilestreamTruncateBlockedOutput3246129483/001/test.log","state-id":"native::693457-34","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-13T11:59:10.793+0200","log.origin":{"file.name":"input-logfile/harvester.go","file.line":318},"message":"Resource 'filestream::fake-ID::native::693457-34' finally released. Now locked by me","source_file":"filestream::fake-ID::native::693457-34","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-13T11:59:12.798+0200","log.origin":{"file.name":"filestream/input.go","file.line":321},"message":"Reader was closed. Closing. Offset: 15, Reader: *readfile.LimitReader","source_file":"filestream::fake-ID::native::693457-34","path":"/tmp/TestFilestreamTruncateBlockedOutput3246129483/001/test.log","state-id":"native::693457-34","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-13T11:59:12.798+0200","log.origin":{"file.name":"input-logfile/clean.go","file.line":52},"message":"failed to start the registry cleaning routine: %!w(*errors.errorString=&{context canceled})","input_type":"filestream","ecs.version":"1.6.0"}
PASS
ok      github.com/elastic/beats/v7/filebeat/input/filestream   2.341s
==================================================
=========== Test: File truncated, waiting for 0 offset in the registry
{"log.level":"info","@timestamp":"2023-06-13T11:59:15.580+0200","log.origin":{"file.name":"filestream/input.go","file.line":319},"message":"File was truncated. Nothing to read now. Offset: 23, Path=/tmp/TestFilestreamTruncateBlockedOutput848308770/001/test.log. Reader: *readfile.LimitReader","source_file":"filestream::fake-ID::native::693885-34","path":"/tmp/TestFilestreamTruncateBlockedOutput848308770/001/test.log","state-id":"native::693885-34","ecs.version":"1.6.0"}
--- FAIL: TestFilestreamTruncateBlockedOutput (10.02s)
    environment_test.go:263: 
                Error Trace:    /home/tiago/devel/beats/filebeat/input/filestream/environment_test.go:263
                                                        /home/tiago/devel/beats/filebeat/input/filestream/input_integration_test.go:750
                Error:          Condition never satisfied
                Test:           TestFilestreamTruncateBlockedOutput
                Messages:       expecting offset 0
FAIL
exit status 1
FAIL    github.com/elastic/beats/v7/filebeat/input/filestream   10.341s

The test is run with the following script:

run_test.sh

#!/bin/bash

while
    echo "=================================================="
    go test -count 1 -run TestFilestreamTruncateBlockedOutput -tags=integration
do true ; done

In the first case the truncation is detected by the fileWatcher.watch method. The fileWatcher runs on its own goroutine and is responsible for picking up new files watched by the input. It runs on a interval set by prospector.scanner.check_interval. It sends an event that ends up being processed by fileProspector.onFSEvent that will correctly update the registry offset.

This is not the code path that is reading the file.

In the second case (that leads to test failure) is the "reading the file" codepath, when filestream.readFromSource gets the next message from the reader, it receives an error, logs a message stating the file was truncated and the offset is being set to zero (even though it is not) and returns.

It seems that by the time the prospector runs again, the file already contains more data so it does not perceive it as a truncation.

The exact commit I used with extra debug messages: belimawr@342f93a

@belimawr
Copy link
Contributor

I believe I understood the root of the problem: The file watcher relies on the modification time to identify a truncate/write event:

if prevInfo.ModTime() != info.ModTime() {
if prevInfo.Size() > info.Size() || w.resendOnModTime && prevInfo.Size() == info.Size() {
select {
case <-ctx.Done():
return
case w.events <- truncateEvent(path, info):
}
} else {
select {
case <-ctx.Done():
return
case w.events <- writeEvent(path, info):
}
}
}

However if the file is written to, read by Filebeat than truncated all of this within the same second (which is more than enough time to read a few log lines on a test), the file watcher will not detect it, leading to no registry change, hence the test fails with a timeout while waiting to read the 0 offset from the registry.

@cmacknz
Copy link
Member

cmacknz commented Jun 13, 2023

However if the file is written to, read by Filebeat than truncated all of this within the same second (which is more than enough time to read a few log lines on a test), the file watcher will not detect it, leading to no registry change, hence the test fails with a timeout while waiting to read the 0 offset from the registry.

This definitely seems like a real bug then, one that has been there for quite a while looking at the last time this code block was touched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automation bug build-failures Build failures in the CI. ci-reported Issues that have been automatically reported from the CI Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
4 participants