Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filebeat: ID for filestream is required #30717

Merged
merged 27 commits into from
Mar 24, 2022

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Mar 7, 2022

What does this PR do?

It makes mandatory to configure an ID to all filestream inputs. If an ID is added to a existing, working, configuration, no data is duplicated.

How does it work

When a filestream input does not have an ID, we set the default .global ID at runtime and this ID is used to identify which files in the registry belong to each input.

Now that we're making the IDs mandatory, we need a way to migrate those entries in the registry to the newly set ID. Ideally this will happen only once when the users update to this new version of Filebeat.

They way we do it is by looking the registry for files that have their path matching the input but have the input ID as .global, those entries are then copied with the new key (which uses the input ID) so Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method UpdateIdentifiers from the sourceStore but for this PR it was also needed to update the in-memory store right away. Also borrowing from UpdateIdentifiers usage, this is done during the file prospector initialisation.

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist

How to test this PR locally

  1. Create a simple config that reads a static file using filestream, do not set an ID for this input
  2. Run Filebeat (a non patched version) and wait all events to be ingested
  3. Stop Filebeat
  4. Add an ID to your filestream input
  5. Start a patched version of Filebeat
  6. No new events should be ingested
  7. The registry should have its entries updated with the new input ID

Related issues

## Use cases
## Screenshots
## Logs

@belimawr belimawr added bug Filebeat Filebeat skip-ci Skip the build in the CI but linting backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Mar 7, 2022
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 7, 2022
@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 7, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-03-24T15:57:17.894+0000

  • Duration: 39 min 52 sec

Test stats 🧪

Test Results
Failed 0
Passed 980
Skipped 197
Total 1177

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr force-pushed the filestream-add-id-migration branch from 31a5f22 to d28345b Compare March 7, 2022 17:58
@belimawr belimawr removed the skip-ci Skip the build in the CI but linting label Mar 7, 2022
@cmacknz
Copy link
Member

cmacknz commented Mar 7, 2022

Could you add a small explanation of what your proposed change is doing? I suspect this is a tricky part of the code and some more explanation/comments would help make it easier to review.

@belimawr belimawr force-pushed the filestream-add-id-migration branch 2 times, most recently from 861fc48 to 4f5bb40 Compare March 15, 2022 10:50
@jlind23
Copy link
Collaborator

jlind23 commented Mar 15, 2022

@belimawr could you please backport it to 7.17.X and all versions above?

@belimawr belimawr added the backport-7.17 Automated backport to the 7.17 branch with mergify label Mar 16, 2022
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Mar 16, 2022
@belimawr belimawr marked this pull request as ready for review March 16, 2022 11:36
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr belimawr requested review from cmacknz and kvch March 16, 2022 11:37
@belimawr
Copy link
Contributor Author

@cmacknz @kvch it's ready for review again. Let me know what you think of the python integration tests. I'm wondering if I should add more test cases or not.

@v1v
Copy link
Member

v1v commented Mar 17, 2022

/test

@belimawr belimawr force-pushed the filestream-add-id-migration branch from cb397d1 to 2afb636 Compare March 17, 2022 10:27
belimawr and others added 6 commits March 24, 2022 11:38
Co-authored-by: Craig MacKenzie <[email protected]>
Add some general exclude rules and update the linter's Go version to
match `.go-version`
@belimawr belimawr force-pushed the filestream-add-id-migration branch from 3701b54 to 6ff985a Compare March 24, 2022 10:39
@belimawr
Copy link
Contributor Author

I fixed all lint issues, rebased onto main and force pushed.

@@ -142,7 +148,7 @@ linters-settings:
# Enable to require an explanation of nonzero length after each nolint directive. Default is false.
require-explanation: true
# Enable to require nolint directives to mention the specific linter being suppressed. Default is false.
require-specific: true
require-specific: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least one line was failing in 3+ linters, but they're reported one at a time... It looks too strict to a repo like Beats.

@@ -114,7 +120,7 @@ linters-settings:

gosimple:
# Select the Go version to target. The default is '1.13'.
go: "1.17.6"
go: "1.17.8"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it matches the .go-version file/version.

@belimawr
Copy link
Contributor Author

@rdner I did some small modifications in the lint config to reduce noisy, what do you think?

Copy link
Member

@rdner rdner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave an explanation comment for what we exclude, so people would not have to look it up all the time.

source https://staticcheck.io/docs/checks/

dev-tools/templates/.golangci.yml Outdated Show resolved Hide resolved
# it on its name.
- linters:
- stylecheck
text: "ST1003:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
text: "ST1003:"
text: "ST1003:" # Poorly chosen identifier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it at the top, but I agree the yaml structure didn't help:

    # Exclude package name contains '-' issue because we have at least one package with
    # it on its name.

What about having it like that:

    # Exclude package name contains '-' issue because we have at least one poorly chosen
    # package with'-' it on its name.
    - text: "ST1003:" 
      linters:
        - stylecheck

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that, I thought that comment was not related to the particular exception but now I see it's a list and the comment is on top of the item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I committed the changes on 63c0d7a, it should be more clear now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original yaml snipped I copied from is pretty confusing, but after taking some time to decipher it, I came up with something I hope is more readable.

@belimawr belimawr added backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify labels Mar 24, 2022
@belimawr
Copy link
Contributor Author

CI is failing: [2022-03-24T15:26:18.127Z] Error: exec: "pytest": executable file not found in %PATH% 😭

@belimawr
Copy link
Contributor Author

/test

@belimawr belimawr merged commit c61b219 into elastic:main Mar 24, 2022
@belimawr belimawr deleted the filestream-add-id-migration branch March 24, 2022 17:34
mergify bot pushed a commit that referenced this pull request Mar 24, 2022
…30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit c61b219)

# Conflicts:
#	.golangci.yml
#	dev-tools/templates/.golangci.yml
#	filebeat/docs/inputs/input-filestream-file-options.asciidoc
#	filebeat/input/filestream/internal/input-logfile/manager.go
#	filebeat/input/filestream/parsers_integration_test.go
mergify bot pushed a commit that referenced this pull request Mar 24, 2022
…30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit c61b219)

# Conflicts:
#	.golangci.yml
#	dev-tools/templates/.golangci.yml
#	filebeat/docs/inputs/input-filestream-file-options.asciidoc
#	filebeat/input/filestream/internal/input-logfile/manager.go
#	filebeat/input/filestream/parsers_integration_test.go
mergify bot pushed a commit that referenced this pull request Mar 24, 2022
…30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit c61b219)

# Conflicts:
#	.golangci.yml
#	dev-tools/templates/.golangci.yml
#	filebeat/docs/inputs/input-filestream-file-options.asciidoc
#	filebeat/input/filestream/internal/input-logfile/manager.go
#	filebeat/input/filestream/parsers_integration_test.go
belimawr added a commit that referenced this pull request Mar 25, 2022
…stream input without ID (#30717) (#30996)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit c61b219)

# Conflicts:
#	.golangci.yml
#	dev-tools/templates/.golangci.yml
#	filebeat/docs/inputs/input-filestream-file-options.asciidoc
#	filebeat/input/filestream/internal/input-logfile/manager.go
#	filebeat/input/filestream/parsers_integration_test.go

Co-authored-by: Tiago Queiroz <[email protected]>
rdner pushed a commit that referenced this pull request Mar 28, 2022
* filebeat: Migrates registry entries from filestream input without ID (#30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit c61b219)
kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
…lastic#30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…30717)

When a filestream input does not have an ID, we set the default
.global ID at runtime and this ID is used to identify which files in
the registry belong to each input. But if there are more than one
input without an ID, then data is duplicated when Filebeat is
restarted because the it's not possible to identify which input owns
each file.

We are adding a way to migrate those entries in the registry to the
newly set ID. Ideally this will happen only once when the users update
to this new version of Filebeat.

They way we do it is by looking the registry for files that have their
path matching the input but have the input ID as .global, those
entries are then copied with the new key (which uses the input ID) so
Filebeat can continue its operation without duplicating data.

Most of the logic is the same of the already existing method
UpdateIdentifiers from the sourceStore but for this PR it was also
needed to update the in-memory store right away. Also borrowing from
UpdateIdentifiers usage, this is done during the file prospector
initialisation.

Log errors are written when we detect filestream inputs without an ID
or when more than one input is set with the same ID.

Some files are refactored to pass the new linter, some test code also
had some small refactoring.

Co-authored-by: Craig MacKenzie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify bug Filebeat Filebeat review Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filestream input duplicating events after every restart
7 participants