[filebeat][Azure Blob Storage] - Improved documentation (elastic#41252)
* improved documentation
ShourieG authored and belimawr committed Oct 18, 2024
1 parent fd28cb4 commit 7122078
Showing 2 changed files with 15 additions and 6 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.next.asciidoc
@@ -318,6 +318,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Journald input now supports filtering by facilities {pull}41061[41061]
- System module now supports reading from journald. {pull}41061[41061]
- Add support to include AWS cloudwatch linked accounts when using log_group_name_prefix to define log group names. {pull}41206[41206]
- Improved Azure Blob Storage input documentation. {pull}41252[41252]

*Auditbeat*

20 changes: 14 additions & 6 deletions x-pack/filebeat/docs/inputs/input-azure-blob-storage.asciidoc
@@ -26,8 +26,8 @@ even though it can get expensive when dealing with a very large number of files.
describing said error.

[id="supported-types"]
NOTE: NOTE: Currently only `JSON` and `NDJSON` are supported blob/file formats. Blobs/files may be also be gzip compressed.
As for authentication types, we currently have support for `shared access keys` and `connection strings`.
NOTE: `JSON`, `NDJSON` and `CSV` are supported blob/file formats. Blobs/files may also be gzip compressed.
The supported authentication types are `shared access keys`, `connection strings` and `Microsoft Entra ID RBAC`.
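For orientation, here is a minimal sketch of the two key-based authentication types (the account name, key, connection string and container name are placeholders; the `Microsoft Entra ID RBAC` fields are covered in the sample configuration referenced below):

["source", "yaml"]
----
filebeat.inputs:
# Option 1: authenticate with a shared access key
- type: azure-blob-storage
  account_name: some_account
  auth.shared_credentials.account_key: some_key
  containers:
  - name: container_1

# Option 2: authenticate with a connection string (use instead of option 1)
#- type: azure-blob-storage
#  account_name: some_account
#  auth.connection_string.uri: some_connection_string
#  containers:
#  - name: container_1
----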

[id="basic-config"]
*A sample configuration with detailed explanation for each field is given below :-*
@@ -224,10 +224,14 @@ This is a specific subfield of a container. It specifies the container name.
[float]
==== `max_workers`

This attribute defines the maximum number of workers (go routines / lightweight threads) are allocated in the worker pool (thread pool) for processing jobs
which read contents of file. More number of workers equals a greater amount of concurrency achieved. There is an upper cap of `5000` workers per container that
can be defined due to internal sdk constraints. This attribute can be specified both at the root level of the configuration as well at the container level.
The container level values will always take priority and override the root level values if both are specified.
This attribute defines the maximum number of workers allocated to the worker pool for processing jobs which read file contents.
It can be specified both at the root level of the configuration and at the container level; container level values always override
root level values if both are specified. A larger number of workers does not necessarily improve throughput, so this value should
be tuned carefully based on the number of files, the size of the files being processed and the resources available. Increasing
`max_workers` to very high values may cause resource utilization problems and lead to bottlenecks in processing. Usually a maximum
of `2000` workers is recommended. A very low `max_workers` value, on the other hand, drastically increases the number of network
calls required to fetch the blobs, which may itself become a processing bottleneck.

The batch size for workload distribution is calculated by the input so that the workload is spread evenly across all workers; for
a given `max_workers` value, the input derives the optimal batch size for that setting. The `max_workers` value should therefore be
configured based on factors such as the total number of files to be processed, the available system resources and the network bandwidth.

Example:

- Setting `max_workers=3` would result in each request fetching `3 blobs` (batch size = 3), which are then distributed among `3 workers`.
- Setting `max_workers=100` would fetch `100 blobs` (batch size = 100) per request, distributed among `100 workers`.
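
To make the root versus container level behaviour concrete, here is a minimal sketch (not part of this commit; the account name, key and container names are placeholders):

["source", "yaml"]
----
filebeat.inputs:
- type: azure-blob-storage
  account_name: some_account
  auth.shared_credentials.account_key: some_key
  max_workers: 100        # root level value, applies to every container by default
  containers:
  - name: container_1
    max_workers: 3        # container level value overrides the root level value for this container
  - name: container_2     # no override, inherits max_workers: 100 from the root level
----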

[id="attrib-poll"]
[float]
@@ -325,6 +329,8 @@ filebeat.inputs:
- regex: '/Security-Logs/'
----

The `file_selectors` operation is performed locally within the agent: the agent downloads all the files and then filters them based on the `file_selectors`. This can cause a processing bottleneck if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available.
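
Only the tail of the example above is visible in this diff, so here is a minimal self-contained sketch of `file_selectors` on a container (the account name, key, container name and regex patterns are placeholders):

["source", "yaml"]
----
filebeat.inputs:
- type: azure-blob-storage
  account_name: some_account
  auth.shared_credentials.account_key: some_key
  containers:
  - name: container_1
    file_selectors:
    # only blobs whose names match one of these regular expressions are processed
    - regex: '/Audit-Logs/'
    - regex: '/Security-Logs/'
----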

[id="attrib-expand_event_list_from_field"]
[float]
==== `expand_event_list_from_field`
@@ -385,6 +391,8 @@ filebeat.inputs:
timestamp_epoch: 1627233600
----

The Azure Blob Storage APIs don't provide a direct way to filter files by timestamp, so the input downloads all the files and then filters them based on the timestamp. This can cause a processing bottleneck if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available.

[id="container-overrides"]
*The sample configs below will explain the container level overriding of attributes a bit further :-*

