-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new fingerprint
file identity
#35734
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
a782869
to
bf25f92
Compare
This pull request is now in conflicts. Could you fix it? 🙏
|
This is a new alternative to existing options like `native`, `path` and `inode_marker`. Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content. Users can specify what amount of bytes is used to fingerprint the beginning of each file. This identity is supposed to be more stable and less affected by the environment/setup of the users.
So, we can handle removals and postponed ingestion. Also, a few performance optimizations included.
3fc3bac
to
d47387a
Compare
@@ -33,6 +33,7 @@ import ( | |||
type config struct { | |||
Reader readerConfig `config:",inline"` | |||
|
|||
ID string `config:"id"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need for better logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely clear what id represents: is it the identifier of the config object itself? The filestream input? Is it stable through restarts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an input ID in an input-level configuration. You can find it here https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html#CO11-1
filebeat/input/filestream/fswatch.go
Outdated
s.log.Debug("stat(%s) failed: %s", file, err) | ||
fileID := fd.FileID() | ||
if knownFilename, exists := uniqueIDs[fileID]; exists { | ||
s.log.Infof("%q points to an already known ingest target %q [%s==%s]. Skipping", fd.Filename, knownFilename, fileID, fileID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we promote it to WARN
level? Can be potentially noisy but very useful for troubleshooting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think INFO
is fine. There's neither any degraded behavior nor is there necessarily any action a user could take. And I know some customers are spooked by WARN
level messages in the logs. So I'd leave this as INFO
. We (maintainers) should be able to search for this message pretty easily during troubleshooting.
I realize this log message, specifically the part about the filesize, is coming from an error returned by the However, for this particular "error" about needing a minimum file size for fingerprinting, I think it might be valuable to log it as at a Side note: it might be good to include units (bytes) in the message for the file sizes - observed and expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments, mostly minor. Overall this is looking great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - nice work @rdner!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only one nit so far, still working through everything though, will do another pass tomorrow
This pull request is now in conflicts. Could you fix it? 🙏
|
This is a new alternative to existing options like `native`, `path` and `inode_marker`. Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content. Users can specify what amount of bytes is used to fingerprint the beginning of each file, optionally it's possible to set an offset from the beginning. This identity is supposed to be more stable and less affected by the environment/setup of the users. This change also contains a few performance optimisations of how we work with the filesystem and watch for file changes. (cherry picked from commit b701377) # Conflicts: # filebeat/input/filestream/config.go # filebeat/input/filestream/fswatch.go # filebeat/input/filestream/fswatch_test.go # filebeat/input/filestream/identifier.go # filebeat/input/filestream/prospector.go # filebeat/input/filestream/prospector_test.go # libbeat/metric/system/cgroup/cgcommon/metrics_test.go
* Add new `fingerprint` file identity (#35734) This is a new alternative to existing options like `native`, `path` and `inode_marker`. Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content. Users can specify what amount of bytes is used to fingerprint the beginning of each file, optionally it's possible to set an offset from the beginning. This identity is supposed to be more stable and less affected by the environment/setup of the users. This change also contains a few performance optimisations of how we work with the filesystem and watch for file changes. (cherry picked from commit b701377) # Conflicts: # filebeat/input/filestream/config.go # filebeat/input/filestream/fswatch.go # filebeat/input/filestream/fswatch_test.go # filebeat/input/filestream/identifier.go # filebeat/input/filestream/prospector.go # filebeat/input/filestream/prospector_test.go # libbeat/metric/system/cgroup/cgcommon/metrics_test.go * Resolve conflicts --------- Co-authored-by: Denis <[email protected]>
Hello 👋 |
This is a new alternative to existing options like `native`, `path` and `inode_marker`. Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content. Users can specify what amount of bytes is used to fingerprint the beginning of each file, optionally it's possible to set an offset from the beginning. This identity is supposed to be more stable and less affected by the environment/setup of the users. This change also contains a few performance optimisations of how we work with the filesystem and watch for file changes.
UPD
See #36078 for benchmarks results.
What does this PR do?
This is a new alternative to existing options like
native
,path
andinode_marker
.Unlike the existing options, this file identity does not rely on any file system metadata and uses only the file size and its content.
Users can specify what amount of bytes is used to fingerprint the beginning of each file.
This identity is supposed to be more stable and less affected by the environment/setup of the users.
Why is it important?
Some of customers experiencing problems with relying on file system metadata, so we need the alternative file identity option that does not rely on any file system metadata.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
Edge cases
offset
+length
its ingested is delayed until the file grows in sizelength
-offset
) Filebeat will fail to start with:This requirement comes from the SHA block size.
offset
orlength
values will lead to the full re-ingestion of all files matched by thepaths
.Documentation updates
How to test this PR locally
Note the path placeholders you need to fix first.
Check if the following log messages showed up in the Filebeat's output:
touch some.log
) in/path/to/your/logs
This means the scanner detected the file but its ingestion is delayed until the file size reaches the fingerprint size, which is 1024 bytes by default.
you can see that the message changed and contains the new file size:
and you should see that Filebeat started reading from it:
The file is an ingest target now and we can see a file entry in the registry log:
As you can see there is no offset yet, because we have not written a new line character and our configuration is line-based.
Now you should see this log message in Filebeat's output:
If we look at the registry again, we'll see the updated offset:
Related issues