Commit e6dd3b1

Update documentation of filestream input with the new improvements (#25303) (#25522)

(cherry picked from commit d29e006)

Co-authored-by: Noémi Ványi <[email protected]>
mergify[bot] and kvch authored May 4, 2021
1 parent ee2647e

Showing 3 changed files with 176 additions and 8 deletions.
filebeat/docs/inputs/input-filestream-file-options.asciidoc (60 additions, 2 deletions)

@@ -37,6 +37,30 @@ a `gz` extension:

See <<regexp-support>> for a list of supported regexp patterns.

===== `prospector.scanner.include_files`

A list of regular expressions to match the files that you want {beatname_uc} to
include. If a list of regexes is provided, only the files matched by the
patterns are harvested.

By default no files are filtered out, so every file that matches `paths` is
eligible for harvesting. This option is the counterpart of
`prospector.scanner.exclude_files`.

The following example configures {beatname_uc} to harvest only files that
are under `/var/log`:

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  prospector.scanner.include_files: ['^/var/log/.*']
----

NOTE: Patterns should start with `^` when matching absolute paths.

See <<regexp-support>> for a list of supported regexp patterns.

===== `prospector.scanner.symlinks`

The `symlinks` option allows {beatname_uc} to harvest symlinks in addition to
@@ -57,6 +81,12 @@ This is, for example, the case for Kubernetes log files.

Because this option may lead to data loss, it is disabled by default.
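
For example, to enable harvesting of symlinked files (a minimal sketch using
the option documented above):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  prospector.scanner.symlinks: true
----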

===== `prospector.scanner.resend_on_touch`

If this option is enabled, a file is resent if its size has not changed
but its modification time is newer than before.
It is disabled by default to avoid accidentally resending files.
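
For example, the following sketch resends a file when its modification time
changes even though its size stays the same:

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  prospector.scanner.resend_on_touch: true
----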


[float]
[id="{beatname_lc}-input-{type}-scan-frequency"]
@@ -117,6 +147,35 @@ If a file that's currently being harvested falls under `ignore_older`, the
harvester will first finish reading the file and close it after
`close.on_state_change.inactive` is reached. After that, the file will be ignored.
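
For example, the following sketch (the duration values are illustrative)
ignores files older than one day and closes the reader after five minutes of
inactivity:

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  ignore_older: 24h  # illustrative value
  close.on_state_change.inactive: 5m  # illustrative value
----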

[float]
[id="{beatname_lc}-input-{type}-ignore-inactive"]
===== `ignore_inactive`

If this option is enabled, {beatname_uc} ignores every file that has not been
updated since the selected time. Possible values are `since_first_start` and
`since_last_start`. The first value ignores every file that has not been updated since
the first start of {beatname_uc}. It is useful when the Beat might be restarted
due to configuration changes or a failure. The second value tells
the Beat to read only from files that have been updated since its last start.
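
For example, to read only from files that have been updated since the Beat
last started (a minimal sketch):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  ignore_inactive: since_last_start
----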

The files affected by this setting fall into two categories:

* Files that were never harvested
* Files that were harvested but haven't been updated within the `ignore_inactive` period.

For files that were never seen before, the offset state is set to the end of
the file. If a state already exists, the offset is not changed. In case a file is
updated again later, reading continues at the set offset position.

The setting relies on the modification time of the file to
determine if a file is ignored. If the modification time of the file is not
updated when lines are written to a file (which can happen on Windows), the
setting may cause {beatname_uc} to ignore files even though content was added
at a later time.

To remove the state of previously harvested files from the registry file, use
the `clean_inactive` configuration option.

[float]
[id="{beatname_lc}-input-{type}-close-options"]
===== `close.*`
@@ -218,7 +277,7 @@ single log event to a new file. This option is disabled by default.

[float]
[id="{beatname_lc}-input-{type}-close-timeout"]
-===== `close.reader.timeout`
+===== `close.reader.after_interval`

WARNING: Only use this option if you understand that data loss is a potential
side effect. Another side effect is that multiline events might not be
@@ -393,4 +452,3 @@ Set the location of the marker file the following way:
----
file_identity.inode_marker.path: /logs/.filebeat-marker
----

filebeat/docs/inputs/input-filestream-reader-options.asciidoc (90 additions, 0 deletions)

@@ -141,3 +141,93 @@ The default is 16384.

The maximum number of bytes that a single log message can have. All bytes after
`message_max_bytes` are discarded and not sent. The default is 10MB (10485760).
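
For example, to raise the limit to 20MB (the value is illustrative):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  message_max_bytes: 20971520  # 20MB, illustrative value
----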

[float]
===== `parsers`

This option expects a list of parsers that the log line has to go through.

Available parsers:

- `multiline`
- `ndjson`

In this example, {beatname_uc} is reading multiline messages that consist of 3 lines
and are encapsulated in single-line JSON objects.
The multiline message is stored under the key `msg`.

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  parsers:
    - ndjson:
        keys_under_root: true
        message_key: msg
    - multiline:
        type: count
        count_lines: 3
----

See the available parser settings in detail below.

[float]
===== `multiline`

Options that control how {beatname_uc} deals with log messages that span
multiple lines. See <<multiline-examples>> for more information about
configuring multiline options.
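
For instance, a pattern-based multiline parser might be configured like this
(a sketch; the pattern is illustrative):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  ...
  parsers:
    - multiline:
        type: pattern
        pattern: '^\['  # illustrative pattern: lines starting with '[' begin a new event
        negate: true
        match: after
----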

[float]
===== `ndjson`

These options make it possible for {beatname_uc} to decode logs structured as
JSON messages. {beatname_uc} processes the logs line by line, so the JSON
decoding only works if there is one JSON object per message.

The decoding happens before line filtering. You can combine JSON
decoding with filtering if you set the `message_key` option. This
can be helpful in situations where the application logs are wrapped in JSON
objects, as it happens, for example, with Docker.

Example configuration:

[source,yaml]
----
- ndjson:
    keys_under_root: true
    add_error_key: true
    message_key: log
----

*`keys_under_root`*:: By default, the decoded JSON is placed under a "json" key
in the output document. If you enable this setting, the keys are copied to the top
level of the output document. The default is false.

*`overwrite_keys`*:: If `keys_under_root` and this setting are enabled, then the
values from the decoded JSON object overwrite the fields that {beatname_uc}
normally adds (type, source, offset, etc.) in case of conflicts.

*`expand_keys`*:: If this setting is enabled, {beatname_uc} will recursively
de-dot keys in the decoded JSON, and expand them into a hierarchical object
structure. For example, `{"a.b.c": 123}` would be expanded into `{"a":{"b":{"c":123}}}`.
This setting should be enabled when the input is produced by an
https://github.com/elastic/ecs-logging[ECS logger].

*`add_error_key`*:: If this setting is enabled, {beatname_uc} adds an
"error.message" and "error.type: json" key in case of JSON unmarshalling errors
or when a `message_key` is defined in the configuration but cannot be used.

*`message_key`*:: An optional configuration setting that specifies a JSON key on
which to apply the line filtering and multiline settings. If specified, the key
must be at the top level in the JSON object and the value associated with the
key must be a string, otherwise no filtering or multiline aggregation will
occur.

*`document_id`*:: An optional configuration setting that specifies the JSON key to
set the document ID. If configured, the field will be removed from the original
JSON document and stored in `@metadata._id`.

*`ignore_decoding_error`*:: An optional configuration setting that specifies
whether JSON decoding errors should be ignored. If set to true, errors will not
be logged. The default is false.
filebeat/docs/inputs/input-filestream.asciidoc (26 additions, 6 deletions)

@@ -10,10 +10,30 @@ experimental[]
++++

Use the `filestream` input to read lines from active log files. It is the
-new, improved alternative to the `log` input. However, a few features are
-missing from it, e.g. `multiline` or other special parsing capabilities.
-These missing options are probably going to be added again. We strive to
-achieve feature parity, if possible.
+new, improved alternative to the `log` input. It comes with various improvements
+to the existing input:

1. Checking of `close_*` options happens out of band. Thus, if an output is blocked,
{beatname_uc} can close the reader, which avoids keeping too many files open.

2. Detailed metrics are available for all files that match the `paths` configuration
regardless of the `harvester_limit`. This way, you can keep track of all files,
even ones that are not actively read.

3. The order of `parsers` is configurable. So it is possible to parse JSON lines and then
aggregate the contents into a multiline event.

4. Some position updates and metadata changes no longer depend on the publishing pipeline.
Even if the pipeline is blocked, some changes are still applied to the registry.

5. Only the most recent updates are serialized to the registry. In contrast, the `log` input
has to serialize the complete registry on each ACK from the outputs. This makes registry
updates much quicker with this input.

6. The input ensures that only offset updates are written to the registry append-only log.
In contrast, the `log` input writes the complete file state.

7. Stale entries can be removed from the registry, even if there is no active input.

To configure this input, specify a list of glob-based <<filestream-input-paths,`paths`>>
that must be crawled to locate and fetch the log lines.
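
For example (a minimal sketch with illustrative paths):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  paths:
    - /var/log/messages
    - /var/log/*.log
----
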
@@ -158,10 +178,10 @@ on. If enabled, it expands a single `**` into an 8-level deep `*` pattern.
This feature is enabled by default. Set `prospector.scanner.recursive_glob` to false to
disable it.
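
For example, with recursive glob enabled a single pattern can cover nested
directories (the path is illustrative):

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: {type}
  paths:
    - /var/log/**/*.log
----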

-include::../inputs/input-filestream-reader-options.asciidoc[]
-
include::../inputs/input-filestream-file-options.asciidoc[]

+include::../inputs/input-filestream-reader-options.asciidoc[]

[id="{beatname_lc}-input-{type}-common-options"]
include::../inputs/input-common-options.asciidoc[]

Expand Down

0 comments on commit e6dd3b1

Please sign in to comment.