Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix filestream flaky truncated file test #35759

Merged
merged 6 commits into from
Jun 20, 2023

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Jun 13, 2023

What does this PR do?

This PR fixes a flaky test. The flakiness was due to the
modification date of a file not changing between two different
operations.

It also improves some of inputTestingEnvironment and fixes some other log messages.

A more detailed explanation is on the related issue.

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
    - [ ] I have added tests that prove my fix is effective or that my feature works
    - [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist
## How to test this PR locally
## Related issues
## Use cases
## Screenshots
## Logs

@belimawr belimawr added the bug label Jun 13, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 13, 2023
@belimawr belimawr force-pushed the fix-filestream-flaky-test branch from f5a5142 to 1d0f26c Compare June 13, 2023 14:49
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jun 13, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 13, 2023
@belimawr belimawr marked this pull request as ready for review June 13, 2023 14:49
@belimawr belimawr requested a review from a team as a code owner June 13, 2023 14:49
@belimawr belimawr requested review from rdner and faec June 13, 2023 14:49
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@mergify
Copy link
Contributor

mergify bot commented Jun 13, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

belimawr added 3 commits June 13, 2023 16:50
This commit fixes the formatting issue and only logs the error if it
is not `context.Canceled`
This commit fixes and improves some log messages.
This commit fixes a flaky test. The flakiness was due to the
modification date of a file not changing between two different
operations.
@belimawr belimawr force-pushed the fix-filestream-flaky-test branch from 1d0f26c to e16c542 Compare June 13, 2023 14:51
faec
faec previously approved these changes Jun 13, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jun 13, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-06-20T13:39:30.525+0000

  • Duration: 11 min 33 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

This commit brings the improvements from
elastic#35754 to this PR.
@belimawr belimawr force-pushed the fix-filestream-flaky-test branch from 08b431e to e13e183 Compare June 13, 2023 16:06
@belimawr belimawr requested review from leehinman and faec June 13, 2023 16:06
leehinman
leehinman previously approved these changes Jun 13, 2023
Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmacknz
Copy link
Member

cmacknz commented Jun 13, 2023

Based on #35571 (comment)

Would it not be better to remove the modification time check in the block below, and instead look at the size first?

if prevInfo.ModTime() != info.ModTime() {
if prevInfo.Size() > info.Size() || w.resendOnModTime && prevInfo.Size() == info.Size() {
select {
case <-ctx.Done():
return
case w.events <- truncateEvent(path, info):
}
} else {
select {
case <-ctx.Done():
return
case w.events <- writeEvent(path, info):
}
}
}

Can't this problem happen outside of our tests? It looks like we are only fixing the test, and not what seems to be an actual limitation in Filebeat:

	// Sleep for a second before modifying the file again
	// this ensures the modification time of the file will
	// be different between the last write and the truncation.
	time.Sleep(time.Second)

@belimawr
Copy link
Contributor Author

Can't this problem happen outside of our tests? It looks like we are only fixing the test, and not what seems to be an actual limitation in Filebeat:

	// Sleep for a second before modifying the file again
	// this ensures the modification time of the file will
	// be different between the last write and the truncation.
	time.Sleep(time.Second)

I've been thinking a lot about it since yesterday and I believe the filestream input works fine (maybe there is a very edge case not fully covered). On a "normal" operation that is what I expect to happen:

  1. Filestream input is reading a log file
  2. It gets truncated
  3. Filestream is on the "read path", the reader returns an error and it gets closed
  4. New data is appended to the file
  5. The Filewatcher detects the modification on the file, the new file size is smaller than before, so it's a truncation, then the cursor is set to 0.

The problem with the test is that we write data then truncate the file within the same second (that's the failure scenario), then we never touch it again. Which effectively puts the file on this "limbo". However there is no new data to be harvested.

The only edge case I see is if outside of a test scenario 3 and 4 also happen within the same second and no other data is appended to the file, in that case we'd lose this bit of data. This is a very specific, and likely very rare, edge case.

Our documentation already warns about truncating files and the problems it can bring: https://www.elastic.co/guide/en/beats/filebeat/current/file-log-rotation.html

I haven't looked the internals of our copy truncate prospector (https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_rotation_external_strategy_copytruncate), but it's supposed to help on those situations, however it's not GA yet.

Based on #35571 (comment)

Would it not be better to remove the modification time check in the block below, and instead look at the size first?

if prevInfo.ModTime() != info.ModTime() {
if prevInfo.Size() > info.Size() || w.resendOnModTime && prevInfo.Size() == info.Size() {
select {
case <-ctx.Done():
return
case w.events <- truncateEvent(path, info):
}
} else {
select {
case <-ctx.Done():
return
case w.events <- writeEvent(path, info):
}
}
}

That is indeed an option. I would change

if prevInfo.ModTime() != info.ModTime() {
to

prevInfo.ModTime() != info.ModTime() || prevInfo.Size() != info.Size()

so we don't affect the "resend on mod time" feature.

fswatcher now uses the modification time and changes in the file size
to trigger either a truncate or write event.
@belimawr
Copy link
Contributor Author

After thinking more about it and talking with @cmacknz , I decided to implement his suggestion from #35759 (comment).

@faec, @leehinman could you have a final look at it?

@belimawr belimawr requested a review from leehinman June 15, 2023 08:18
@belimawr belimawr dismissed stale reviews from leehinman and faec June 15, 2023 09:41

bfb6387 introduced changes to the fswatcher

@belimawr belimawr requested a review from leehinman June 19, 2023 15:58
@cmacknz cmacknz requested a review from a team as a code owner June 20, 2023 13:32
@cmacknz cmacknz force-pushed the fix-filestream-flaky-test branch from 5392af1 to 7d7659b Compare June 20, 2023 13:39
@cmacknz cmacknz removed the request for review from a team June 20, 2023 13:39
@cmacknz
Copy link
Member

cmacknz commented Jun 20, 2023

Accidentally pushed to this branch because I had it checked out 🤦‍♂️ , force pushed the commits away.

@belimawr belimawr merged commit bdb67bc into elastic:main Jun 20, 2023
@belimawr belimawr deleted the fix-filestream-flaky-test branch June 20, 2023 14:18
@elasticmachine
Copy link
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2023-06-20T13:33:01.783+0000

  • Duration: 87 min 11 sec

Test stats 🧪

Test Results
Failed 0
Passed 27437
Skipped 2013
Total 29450

Steps errors 1

Expand to view the steps failures

x-pack/metricbeat-goIntegTest - mage goIntegTest
  • Took 14 min 46 sec . View more details here
  • Description: mage goIntegTest

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr added the backport-v8.8.0 Automated backport with mergify label Jun 20, 2023
mergify bot pushed a commit that referenced this pull request Jun 20, 2023
fswatcher now uses the modification time and changes in the file size
to trigger either a truncate or write event.

Other small changes:
- Fixed some log formatting issue and only log the error if it
is not `context.Canceled`
- Fixes and improves some log messages.
- Brings the improvements from
#35754 to this PR.

(cherry picked from commit bdb67bc)

# Conflicts:
#	filebeat/input/filestream/filestream.go
#	filebeat/input/filestream/internal/input-logfile/clean.go
belimawr added a commit that referenced this pull request Jun 21, 2023
fswatcher now uses the modification time and changes in the file size
to trigger either a truncate or write event.

Other small changes:
- Fixed some log formatting issue and only log the error if it
is not `context.Canceled`
- Fixes and improves some log messages.
- Brings the improvements from
#35754 to this PR.

(cherry picked from commit bdb67bc)

# Conflicts:
#	filebeat/input/filestream/filestream.go
#	filebeat/input/filestream/internal/input-logfile/clean.go

---------

Co-authored-by: Tiago Queiroz <[email protected]>
@reakaleek reakaleek mentioned this pull request Jul 19, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.8.0 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] Error with registry and truncated file
5 participants