Use `fingerprint` file identity by default and migrate file state from `native` or `path` #41762

belimawr · 2024-11-22T23:23:44Z

Proposed commit message

This commit changes the default file_identity from native to
fingerprint. Any previous state from native (or path) is
automatically migrated to fingerprint when Filestream starts.

The Filestream input has always had the ability to update file identifiers,
however it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from native (inode + device ID) and
path to fingerprint without any data duplication.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Because the fingerprint is the new default file identity, files are now only ingested when they reach at least 1024 bytes. The old default behaviour can be enabled by setting the file identity to native and disabling the fingerprint in the scanner.

filebeat.inputs:
  - type: filestream
    id: "8.x-default-behaviour"
    paths:
      - /tmp/flog.log
    file_identity.native: ~
    prospector:
      scanner:
        fingerprint.enabled: false
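
To illustrate why small files are held back, here is a minimal sketch of a fingerprint check. It is not the Filebeat implementation: the `fingerprint` helper and the use of SHA-256 over the first 1024 bytes are assumptions made for the example, and the real scanner reads from disk rather than from an in-memory slice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprintLength mirrors the 1024-byte minimum mentioned above.
const fingerprintLength = 1024

// fingerprint is a hypothetical helper, not Filebeat's code.
// It hashes the first fingerprintLength bytes of data (SHA-256 is an
// assumption) and reports false when the data is too short.
func fingerprint(data []byte) (string, bool) {
	if len(data) < fingerprintLength {
		return "", false
	}
	sum := sha256.Sum256(data[:fingerprintLength])
	return hex.EncodeToString(sum[:]), true
}

func main() {
	small := make([]byte, 512)  // below the threshold
	large := make([]byte, 2048) // above the threshold

	if _, ok := fingerprint(small); !ok {
		fmt.Println("512-byte file: too small, not ingested yet")
	}
	if fp, ok := fingerprint(large); ok {
		fmt.Println("2048-byte file fingerprint:", fp)
	}
}
```

Because only the leading bytes are hashed, appending data to a file does not change its identity in this sketch, which is the property that lets a file keep its state as it grows.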

Author's Checklist

Test with dynamic config reload
Test with Kubernetes
Test with Elastic-Agent
Fix all the tests that break with the new behaviour
Investigate which integration tests are going to break in the Elastic-Agent repo

Regarding the Elastic-Agent integration tests, most tests actually use the log input because, when they were written, Filestream was not available as an integration package. The very few other tests that use Filestream either generate a log file large enough or are skipped as flaky.

How to test this PR locally

Create a log file with at least a few log lines and more than 1 KB of data (e.g. /tmp/flog.log with 15 log lines). You can use flog with Docker:
```
docker run -it --rm mingrammer/flog -n 15 > /tmp/flog.log
```

Start Filebeat with the following configuration

filebeat.yml (native)

filebeat.inputs:
  - type: filestream
    id: "test-migrate-ID"
    paths:
      - /tmp/flog.log
    file_identity.native: ~
    prospector:
      scanner:
        check_interval: 0.1s
        fingerprint.enabled: false

queue.mem:
  flush.timeout: 0s

output.file:
  path: ${path.home}
  filename: "output-file"
  rotate_on_startup: false

logging:
  level: debug
  selectors:
    - input
    - input.filestream
    - input.filestream.prospector
  metrics:
    enabled: false

Wait until the file is fully ingested (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)
Ensure all events have been published to the output (wc -l ./output-file* should return 15)
Stop Filebeat

Change the file identity to fingerprint. It's the new default, hence it's not explicitly set.

filebeat.yml (fingerprint)

filebeat.inputs:
  - type: filestream
    id: "test-migrate-ID"
    paths:
      - /tmp/flog.log
    prospector:
      scanner:
        check_interval: 0.1s

queue.mem:
  flush.timeout: 0s

output.file:
  path: ${path.home}
  filename: "output-file"
  rotate_on_startup: false

logging:
  level: debug
  selectors:
    - input
    - input.filestream
    - input.filestream.prospector
  metrics:
    enabled: false

Start Filebeat
Wait until Filebeat reaches the end of the file (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)
Ensure no extra event was published (wc -l ./output-file* should still return 15)

Add 10 more lines to the file:

docker run -it --rm mingrammer/flog -n 10 >> /tmp/flog.log

Wait until the new lines are ingested (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)
Ensure all events have been published to the output with no duplication (wc -l ./output-file* should return 25)

Related issues

Closes Use fingerprint file identity by default and migrate all existing filestream inputs to it #40197

Use cases

Dealing with identity reuse (e.g: inode reuse) without facing re-ingestion of data with Filestream input

~~## Screenshots~~

Logs

`sourceStore.UpdateIdentifiers` has always been part of `fileProspector.Init`. Its purpose is to update the identifiers in the registry if the file identity has changed. However, it was generating the wrong key and not updating the in-memory registry (`store.ephemeralStore`). This commit fixes it and also removes `sourceStore.FixUpIdentifiers`, because it was just a working version of `sourceStore.UpdateIdentifiers`. Now there is a single method to manipulate identifiers in the `sourceStore`.

This commit checks whether 'source' matches the real file by calculating the registry key using the old identifier. If they match, the registry is updated.
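
The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the `state` and `file` types, the `migrate` function, and the key strings are all hypothetical, and the real registry is not a plain map.

```go
package main

import "fmt"

// state is a hypothetical stand-in for a registry entry.
type state struct {
	Source string // path the entry was created for
	Offset int64  // bytes already ingested
}

// file describes the file currently found at a path, with the keys it
// would get under the old and the new file identity (illustrative values).
type file struct {
	OldKey string
	NewKey string
}

// migrate returns a copy of the registry in which an entry is re-keyed
// only if the old-identity key of the file now at entry.Source equals the
// key the entry is stored under. Other entries are left untouched.
func migrate(registry map[string]state, onDisk map[string]file) map[string]state {
	out := make(map[string]state, len(registry))
	for key, st := range registry {
		f, exists := onDisk[st.Source]
		if exists && f.OldKey == key {
			out[f.NewKey] = st // same offset under the new key
			continue
		}
		out[key] = st // e.g. an entry left behind by a rotated file
	}
	return out
}

func main() {
	// Two entries share a path, as happens after log rotation.
	registry := map[string]state{
		"native::1-1": {Source: "/tmp/flog.log", Offset: 900},
		"native::2-1": {Source: "/tmp/flog.log", Offset: 1500},
	}
	onDisk := map[string]file{
		"/tmp/flog.log": {OldKey: "native::2-1", NewKey: "fingerprint::abc"},
	}

	migrated := migrate(registry, onDisk)
	fmt.Println("offset under new key:", migrated["fingerprint::abc"].Offset)
	_, stale := migrated["native::1-1"]
	fmt.Println("rotated entry kept under old key:", stale)
}
```

Carrying the offset over to the new key is what avoids re-ingestion in this sketch; comparing against the old-identity key of the real file is what keeps entries from rotated files with the same path from being migrated by mistake.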

mergify · 2024-11-22T23:24:22Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2024-11-22T23:24:23Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

rdner · 2024-12-11T14:47:28Z

Let's make sure it's also tested with dynamic config reload and with the Elastic Agent control protocol.

When I worked on take_over (log->filestream input migration) I discovered that we have separate code paths for applying dynamic configuration and it requires special handling for state changes.

I'm not saying it's not handled here, just we need to include this into testing procedures.

belimawr · 2024-12-11T16:07:43Z

> Let's make sure it's also tested with dynamic config reload and with the Elastic Agent control protocol.
>
> When I worked on take_over (log->filestream input migration) I discovered that we have separate code paths for applying dynamic configuration and it requires special handling for state changes.
>
> I'm not saying it's not handled here, just we need to include this into testing procedures.

Thanks Denis! Do you mean at least a manual test or an integration test?

The prospector initialisation happens well after any code path for starting/configuring an input, so it should be totally agnostic of how the input was configured or started. So I believe those cases are also covered. However, I do agree it is good to at least perform some manual tests, just to be on the safe side.

mergify · 2024-12-12T16:30:43Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 40197-filestream-migrate-file-identity upstream/40197-filestream-migrate-file-identity
git merge upstream/main
git push upstream 40197-filestream-migrate-file-identity

elasticmachine · 2024-12-12T16:32:53Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

belimawr · 2024-12-16T16:20:53Z

The Windows test failure is unrelated to this PR, I created a flaky test issue: #42059

belimawr · 2024-12-16T16:23:12Z

I merged main onto this branch/PR, let's see if CI gets green with a re-run

belimawr added 3 commits November 21, 2024 13:25

Fix tests

a2798fe

Check if source matches the real file

a4ff07a

This commit checks if 'source' matches the real file by calculating the registry key using the old identifier, if they match, then update the registry.

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 22, 2024

mergify bot assigned belimawr Nov 22, 2024

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 22, 2024

belimawr mentioned this pull request Nov 25, 2024

Use fingerprint file identity by default and migrate all existing filestream inputs to it #40197

Open

belimawr changed the title ~~40197 filestream migrate file identity~~ Fix file identity migration on Filestream input Nov 25, 2024

belimawr added the bug label Nov 25, 2024

belimawr changed the title ~~Fix file identity migration on Filestream input~~ Enable Filestream input to change file identity to fingerprint without re-ingesting files Nov 25, 2024

pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 25, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 25, 2024

belimawr added 13 commits December 6, 2024 17:29

Improve conditions to update registry and comments

3ee0e78

Fix exiting tests

4bcebe7

Working test

12ac2f3

A working test that migrated the file identity from inode to fingerprint.

Run mage check and add all generated files

57e6129

Add unit tests for all common cases

2de77ca

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

817155f

…-migrate-file-identity

Add integration tests

c1915a4

Clean up test config

6f33fab

fix exiting tests

9bd1bf6

Add test for corner case

937e671

This commit adds a test to validate the case when there are multiple registry entries from different files but with the same path. That's the case when there is log rotation.

Update tests to use require function

fd8872a

Ensure old entries are removed from the registry

2af67ec

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

4834d43

…-migrate-file-identity

Update docs, changelog and fix lint warnings

d8404b4

belimawr added 2 commits December 11, 2024 18:21

Remove inode_marker from tests and small improvements

4e73c1e

inode_marker is not supported on Windows, so remove it from all tests. Small improvements are done to the code and documentation.

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

a91a4d4

…-migrate-file-identity

belimawr removed the backport-8.x Automated backport to the 8.x branch with mergify label Dec 11, 2024

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 11, 2024

belimawr changed the title ~~Enable Filestream input to change file identity to fingerprint without re-ingesting files~~ Use fingerprint file identity by default and migrate file state from native or path Dec 11, 2024

belimawr added 3 commits December 11, 2024 19:08

Make fingerprint the default file identity

7c8a3ae

Update old tests to use the old file identity

0feb3bb

update reference

6730cb7

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

1e92ff2

…-migrate-file-identity

belimawr removed the backport-8.x Automated backport to the 8.x branch with mergify label Dec 12, 2024

belimawr marked this pull request as ready for review December 12, 2024 16:32

belimawr requested a review from a team as a code owner December 12, 2024 16:32

belimawr requested review from rdner and faec December 12, 2024 16:32

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 12, 2024

belimawr added 3 commits December 12, 2024 13:05

Fix Filestream tests

c1693f2

Fix filestream integration tests

09002a1

Fix more tests

9758447

belimawr mentioned this pull request Dec 12, 2024

[Flaky test] TestSQSReceiver - missing call(s) to *awss3.MockSQSAPI.ReceiveMessage #41458

Open

belimawr added 2 commits December 13, 2024 18:44

Fix more tests

68c4a64

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

6feba3f

…-migrate-file-identity

Merge branch 'main' of github.com:elastic/beats into 40197-filestream…

e858f0e

…-migrate-file-identity
