filebeat: ID for filestream is required #30717

belimawr · 2022-03-07T17:13:09Z

What does this PR do?

It makes mandatory to configure an ID to all filestream inputs. If an ID is added to a existing, working, configuration, no data is duplicated.

How does it work

When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input.

Now that we're making the IDs mandatory, we need a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat.

They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation.

Why is it important?

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

~~## Author's Checklist~~

How to test this PR locally

Create a simple config that reads a static file using filestream, do not set an ID for this input
Run Filebeat (a non patched version) and wait all events to be ingested
Stop Filebeat
Add an ID to your filestream input
Start a patched version of Filebeat
No new events should be ingested
The registry should have its entries updated with the new input ID

Related issues

Closes Filestream input duplicating events after every restart #30061

~~## Use cases~~
~~## Screenshots~~
~~## Logs~~

elasticmachine · 2022-03-07T17:27:58Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2022-03-24T15:57:17.894+0000
Duration: 39 min 52 sec

Test stats 🧪

Test	Results
Failed	0
Passed	980
Skipped	197
Total	1177

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

filebeat/input/filestream/internal/input-logfile/store.go

cmacknz · 2022-03-07T19:33:38Z

Could you add a small explanation of what your proposed change is doing? I suspect this is a tricky part of the code and some more explanation/comments would help make it easier to review.

filebeat/docs/inputs/input-filestream-file-options.asciidoc

filebeat/input/filestream/internal/input-logfile/manager.go

jlind23 · 2022-03-15T14:32:22Z

@belimawr could you please backport it to 7.17.X and all versions above?

elasticmachine · 2022-03-16T11:36:47Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

belimawr · 2022-03-16T11:38:22Z

@cmacknz @kvch it's ready for review again. Let me know what you think of the python integration tests. I'm wondering if I should add more test cases or not.

v1v · 2022-03-17T09:20:56Z

/test

filebeat/tests/system/test_input.py

Co-authored-by: Craig MacKenzie <[email protected]>

Add some general exclude rules and update the linter's Go version to match `.go-version`

belimawr · 2022-03-24T10:40:16Z

I fixed all lint issues, rebased onto main and force pushed.

belimawr · 2022-03-24T10:42:08Z

dev-tools/templates/.golangci.yml

@@ -142,7 +148,7 @@ linters-settings:
    # Enable to require an explanation of nonzero length after each nolint directive. Default is false.
    require-explanation: true
    # Enable to require nolint directives to mention the specific linter being suppressed. Default is false.
-    require-specific: true
+    require-specific: false


At least one line was failing in 3+ linters, but they're reported one at a time... It looks too strict to a repo like Beats.

belimawr · 2022-03-24T10:42:32Z

.golangci.yml

@@ -114,7 +120,7 @@ linters-settings:

  gosimple:
    # Select the Go version to target. The default is '1.13'.
-    go: "1.17.6"
+    go: "1.17.8"


Now it matches the .go-version file/version.

belimawr · 2022-03-24T10:43:56Z

@rdner I did some small modifications in the lint config to reduce noisy, what do you think?

rdner

I think we should leave an explanation comment for what we exclude, so people would not have to look it up all the time.

source https://staticcheck.io/docs/checks/

dev-tools/templates/.golangci.yml

rdner · 2022-03-24T10:56:51Z

.golangci.yml

+    # it on its name.
+    - linters:
+        - stylecheck
+      text: "ST1003:"


Suggested change

text: "ST1003:"

text: "ST1003:" # Poorly chosen identifier

I had it at the top, but I agree the yaml structure didn't help:

# Exclude package name contains '-' issue because we have at least one package with # it on its name.

What about having it like that:

# Exclude package name contains '-' issue because we have at least one poorly chosen # package with'-' it on its name. - text: "ST1003:" linters: - stylecheck

I missed that, I thought that comment was not related to the particular exception but now I see it's a list and the comment is on top of the item.

I committed the changes on 63c0d7a, it should be more clear now.

The original yaml snipped I copied from is pretty confusing, but after taking some time to decipher it, I came up with something I hope is more readable.

belimawr · 2022-03-24T15:56:35Z

CI is failing: [2022-03-24T15:26:18.127Z] Error: exec: "pytest": executable file not found in %PATH% 😭

belimawr · 2022-03-24T15:56:41Z

/test

…30717) When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input. But if there are more than one input without an ID, then data is duplicated when Filebeat is restarted because the it's not possible to identify which input owns each file. We are adding a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat. They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data. Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation. Log errors are written when we detect filestream inputs without an ID or when more than one input is set with the same ID. Some files are refactored to pass the new linter, some test code also had some small refactoring. Co-authored-by: Craig MacKenzie <[email protected]> (cherry picked from commit c61b219) # Conflicts: # .golangci.yml # dev-tools/templates/.golangci.yml # filebeat/docs/inputs/input-filestream-file-options.asciidoc # filebeat/input/filestream/internal/input-logfile/manager.go # filebeat/input/filestream/parsers_integration_test.go

…stream input without ID (#30717) (#30996) When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input. But if there are more than one input without an ID, then data is duplicated when Filebeat is restarted because the it's not possible to identify which input owns each file. We are adding a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat. They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data. Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation. Log errors are written when we detect filestream inputs without an ID or when more than one input is set with the same ID. Some files are refactored to pass the new linter, some test code also had some small refactoring. Co-authored-by: Craig MacKenzie <[email protected]> (cherry picked from commit c61b219) # Conflicts: # .golangci.yml # dev-tools/templates/.golangci.yml # filebeat/docs/inputs/input-filestream-file-options.asciidoc # filebeat/input/filestream/internal/input-logfile/manager.go # filebeat/input/filestream/parsers_integration_test.go Co-authored-by: Tiago Queiroz <[email protected]>

* filebeat: Migrates registry entries from filestream input without ID (#30717) When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input. But if there are more than one input without an ID, then data is duplicated when Filebeat is restarted because the it's not possible to identify which input owns each file. We are adding a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat. They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data. Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation. Log errors are written when we detect filestream inputs without an ID or when more than one input is set with the same ID. Some files are refactored to pass the new linter, some test code also had some small refactoring. Co-authored-by: Craig MacKenzie <[email protected]> (cherry picked from commit c61b219)

…lastic#30717) When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input. But if there are more than one input without an ID, then data is duplicated when Filebeat is restarted because the it's not possible to identify which input owns each file. We are adding a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat. They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data. Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation. Log errors are written when we detect filestream inputs without an ID or when more than one input is set with the same ID. Some files are refactored to pass the new linter, some test code also had some small refactoring. Co-authored-by: Craig MacKenzie <[email protected]>

…30717) When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input. But if there are more than one input without an ID, then data is duplicated when Filebeat is restarted because the it's not possible to identify which input owns each file. We are adding a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat. They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data. Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation. Log errors are written when we detect filestream inputs without an ID or when more than one input is set with the same ID. Some files are refactored to pass the new linter, some test code also had some small refactoring. Co-authored-by: Craig MacKenzie <[email protected]>

belimawr added bug Filebeat Filebeat skip-ci Skip the build in the CI but linting backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Mar 7, 2022

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 7, 2022

mergify bot assigned belimawr Mar 7, 2022

belimawr force-pushed the filestream-add-id-migration branch from 31a5f22 to d28345b Compare March 7, 2022 17:58

belimawr removed the skip-ci Skip the build in the CI but linting label Mar 7, 2022

cmacknz reviewed Mar 7, 2022

View reviewed changes

filebeat/input/filestream/internal/input-logfile/store.go Outdated Show resolved Hide resolved

cmacknz reviewed Mar 7, 2022

View reviewed changes

filebeat/input/filestream/internal/input-logfile/store.go Show resolved Hide resolved

kvch reviewed Mar 8, 2022

View reviewed changes

filebeat/docs/inputs/input-filestream-file-options.asciidoc Show resolved Hide resolved

kvch reviewed Mar 8, 2022

View reviewed changes

filebeat/input/filestream/internal/input-logfile/manager.go Show resolved Hide resolved

belimawr force-pushed the filestream-add-id-migration branch 2 times, most recently from 861fc48 to 4f5bb40 Compare March 15, 2022 10:50

belimawr added the backport-7.17 Automated backport to the 7.17 branch with mergify label Mar 16, 2022

mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Mar 16, 2022

belimawr marked this pull request as ready for review March 16, 2022 11:36

belimawr requested review from cmacknz and kvch March 16, 2022 11:37

belimawr added the review label Mar 16, 2022

belimawr force-pushed the filestream-add-id-migration branch from cb397d1 to 2afb636 Compare March 17, 2022 10:27

cmacknz reviewed Mar 17, 2022

View reviewed changes

filebeat/tests/system/test_input.py Outdated Show resolved Hide resolved

belimawr and others added 6 commits March 24, 2022 11:38

ID is not required any more

3e98ed6

Update CHANGELOG.next.asciidoc

700f6ff

Co-authored-by: Craig MacKenzie <[email protected]>

Update filebeat/input/filestream/internal/input-logfile/manager.go

35211e7

Co-authored-by: Craig MacKenzie <[email protected]>

More PR improvements

8689786

Fix lint config

262f158

Add some general exclude rules and update the linter's Go version to match `.go-version`

make linter happy

6ff985a

belimawr force-pushed the filestream-add-id-migration branch from 3701b54 to 6ff985a Compare March 24, 2022 10:39

belimawr commented Mar 24, 2022

View reviewed changes

rdner reviewed Mar 24, 2022

View reviewed changes

rdner approved these changes Mar 24, 2022

View reviewed changes

belimawr added backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify labels Mar 24, 2022

improving linter again

63c0d7a

belimawr merged commit c61b219 into elastic:main Mar 24, 2022

belimawr deleted the filestream-add-id-migration branch March 24, 2022 17:34

mergify bot mentioned this pull request Mar 24, 2022

[7.17](backport #30717) filebeat: ID for filestream is required #30996

Merged

mergify bot mentioned this pull request Mar 24, 2022

[8.0](backport #30717) filebeat: ID for filestream is required #30997

Closed

mergify bot mentioned this pull request Mar 24, 2022

[8.1](backport #30717) filebeat: ID for filestream is required #30998

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filebeat: ID for filestream is required #30717

filebeat: ID for filestream is required #30717

belimawr commented Mar 7, 2022 •

edited

Loading

elasticmachine commented Mar 7, 2022 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

cmacknz commented Mar 7, 2022

jlind23 commented Mar 15, 2022

elasticmachine commented Mar 16, 2022

belimawr commented Mar 16, 2022

v1v commented Mar 17, 2022

belimawr commented Mar 24, 2022

belimawr Mar 24, 2022

belimawr Mar 24, 2022

belimawr commented Mar 24, 2022

rdner left a comment

rdner Mar 24, 2022

belimawr Mar 24, 2022

rdner Mar 24, 2022

belimawr Mar 24, 2022

belimawr Mar 24, 2022

belimawr commented Mar 24, 2022

belimawr commented Mar 24, 2022

filebeat: ID for filestream is required #30717

filebeat: ID for filestream is required #30717

Conversation

belimawr commented Mar 7, 2022 • edited Loading

What does this PR do?

How does it work

Why is it important?

Checklist

How to test this PR locally

Related issues

elasticmachine commented Mar 7, 2022 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

cmacknz commented Mar 7, 2022

jlind23 commented Mar 15, 2022

elasticmachine commented Mar 16, 2022

belimawr commented Mar 16, 2022

v1v commented Mar 17, 2022

belimawr commented Mar 24, 2022

belimawr Mar 24, 2022

Choose a reason for hiding this comment

belimawr Mar 24, 2022

Choose a reason for hiding this comment

belimawr commented Mar 24, 2022

rdner left a comment

Choose a reason for hiding this comment

rdner Mar 24, 2022

Choose a reason for hiding this comment

belimawr Mar 24, 2022

Choose a reason for hiding this comment

rdner Mar 24, 2022

Choose a reason for hiding this comment

belimawr Mar 24, 2022

Choose a reason for hiding this comment

belimawr Mar 24, 2022

Choose a reason for hiding this comment

belimawr commented Mar 24, 2022

belimawr commented Mar 24, 2022

belimawr commented Mar 7, 2022 •

edited

Loading

elasticmachine commented Mar 7, 2022 •

edited by jenkins-beats-ci bot

Loading