Let's consider the case where the shipper is configured to use a disk queue with the Elasticsearch output. Let's also assume we use the default protobuf encoding over gRPC. If we reuse the existing structure of the beats publishing pipeline, the data flow will look like:
```mermaid
flowchart LR
    A[Input] -->|Protobuf| B[Server]
    B --> C[Processors]
    C -->|CBOR| D[Disk Queue]
    D -->|JSON| E[Elasticsearch]
```
The diagram shows that the data must be serialized multiple times:
- To the protobuf wire format when the input sends events to the shipper using gRPC. This could optionally be replaced with JSON, but we would likely still need to deserialize it regardless.
- To CBOR when the processed events are written to the disk queue.
- To JSON when events are read back from the disk queue and sent to Elasticsearch.
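To make the repeated work concrete, here is a minimal sketch of the three encode/decode steps the current pipeline implies. It uses `structpb.Struct` as a stand-in for the shipper's actual event message, and the `fxamacker/cbor` package as an assumed CBOR encoder; neither is necessarily what the shipper uses.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/fxamacker/cbor/v2" // assumption: one possible CBOR encoder
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// 1. Input -> Server: the event travels as protobuf over gRPC.
	body, _ := structpb.NewStruct(map[string]interface{}{"message": "hello"})
	wire, _ := proto.Marshal(body) // serialization #1

	// The server deserializes it into the internal representation.
	var decoded structpb.Struct
	_ = proto.Unmarshal(wire, &decoded)
	event := decoded.AsMap()

	// 2. Processors -> Disk queue: the event is re-encoded as CBOR.
	onDisk, _ := cbor.Marshal(event) // serialization #2

	// 3. Disk queue -> Elasticsearch: decode CBOR, re-encode as JSON.
	var fromDisk map[string]interface{}
	_ = cbor.Unmarshal(onDisk, &fromDisk)
	toES, _ := json.Marshal(fromDisk) // serialization #3

	fmt.Println(string(toES))
}
```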
It seems extremely worthwhile to restructure the pipeline to reduce the number of times the data must be serialized:
```mermaid
flowchart LR
    A[Input] -->|Protobuf| B[Server]
    B -->|Protobuf| C[Disk Queue]
    C --> D[Processors]
    D -->|JSON| E[Elasticsearch]
```
In this case we would change the disk queue's serialization format to protobuf, deferring deserialization until after data has been read from the queue. This leaves us with a single transformation: from protobuf to the shipper's internal data format, and then to JSON (or whatever encoding the output requires).
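As an illustration, here is a minimal sketch of deferred deserialization, with a hypothetical `rawQueue` standing in for the disk queue and `structpb.Struct` standing in for the shipper's event message. The gRPC server enqueues the wire bytes untouched; the single decode happens only on the output side.

```go
package main

import (
	"encoding/json"
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

// rawQueue is a hypothetical stand-in for the disk queue: it stores the
// protobuf bytes exactly as they arrived from gRPC, with no re-encoding.
type rawQueue struct {
	entries [][]byte
}

func (q *rawQueue) Publish(wire []byte) { q.entries = append(q.entries, wire) }
func (q *rawQueue) Get() []byte {
	e := q.entries[0]
	q.entries = q.entries[1:]
	return e
}

func main() {
	// The server receives these bytes from gRPC and enqueues them as-is.
	body, _ := structpb.NewStruct(map[string]interface{}{"message": "hello"})
	wire, _ := proto.Marshal(body)

	q := &rawQueue{}
	q.Publish(wire)

	// Deserialization is deferred until the event is read from the queue:
	// one decode from protobuf, one encode to JSON for Elasticsearch.
	var event structpb.Struct
	_ = proto.Unmarshal(q.Get(), &event)
	doc, _ := json.Marshal(event.AsMap())
	fmt.Println(string(doc))
}
```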
If the memory queue were used instead of the disk queue, we could use the same strategy of storing serialized events in the memory queue and only decoding them when they are read from the queue. This would give us a way to deterministically calculate the number of bytes stored in the memory queue. Currently, the memory queue's size must be specified as a number of events.
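For example, a hypothetical byte-bounded memory queue could account for its size exactly, because each stored entry is an already-serialized byte slice whose length is its precise cost. This is a sketch under that assumption, not the shipper's actual queue:

```go
package queue

import "sync"

// byteBoundedQueue is a hypothetical memory queue bounded by bytes rather
// than by event count. Since entries are already-serialized protobuf,
// len(entry) is the exact memory cost of each event.
type byteBoundedQueue struct {
	mu       sync.Mutex
	entries  [][]byte
	used     int
	maxBytes int
}

// TryPublish accepts the event only if it fits within the byte budget;
// a false return lets the caller apply backpressure.
func (q *byteBoundedQueue) TryPublish(wire []byte) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used+len(wire) > q.maxBytes {
		return false
	}
	q.entries = append(q.entries, wire)
	q.used += len(wire)
	return true
}

// Get removes and returns the oldest entry, releasing its bytes.
func (q *byteBoundedQueue) Get() ([]byte, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.entries) == 0 {
		return nil, false
	}
	wire := q.entries[0]
	q.entries = q.entries[1:]
	q.used -= len(wire)
	return wire, true
}
```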
The output of this issue should be a proof of concept demonstrating that this reordering of the pipeline is possible and has the expected benefits. At minimum the work will need to include:
- Modifying the gRPC server in the shipper to stop deserializing messages so they can be passed directly to the queue. The ideal option would be to keep the existing RPC definitions but implement a no-op codec; see the gRPC encoding documentation. We may need to write a custom set of RPC handlers instead of generating them (see `elastic-agent-shipper/api/shipper_grpc.pb.go`, line 145 at ca42ed1), using a `bytes` payload with the required message type and serialization documented in the RPC call. A minimal sketch of such a codec appears after this list.
- Benchmarking the performance of the modified pipeline and comparing it to the original configuration. We do not have a set of repeatable performance tests yet, so we may choose to defer this work until we do.
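To illustrate the no-op codec idea, here is a minimal sketch built on grpc-go's `encoding.Codec` interface. The `rawFrame` type and the server wiring are assumptions for illustration, not the shipper's actual implementation:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/encoding"
	_ "google.golang.org/grpc/encoding/proto" // registers the default "proto" codec
)

// rawFrame marks a payload whose bytes should pass through without decoding.
type rawFrame struct{ data []byte }

// passthroughCodec implements encoding.Codec. For *rawFrame values it copies
// the bytes straight through; anything else falls back to the standard proto
// codec, so generated message types continue to work unchanged.
type passthroughCodec struct{ fallback encoding.Codec }

func (c passthroughCodec) Marshal(v interface{}) ([]byte, error) {
	if f, ok := v.(*rawFrame); ok {
		return f.data, nil
	}
	return c.fallback.Marshal(v)
}

func (c passthroughCodec) Unmarshal(data []byte, v interface{}) error {
	if f, ok := v.(*rawFrame); ok {
		f.data = data // no deserialization: these bytes can go to the queue as-is
		return nil
	}
	return c.fallback.Unmarshal(data, v)
}

// Name reports "proto" so the codec transparently replaces the default one
// on this server; clients need not negotiate a new content subtype.
func (c passthroughCodec) Name() string { return "proto" }

func main() {
	codec := passthroughCodec{fallback: encoding.GetCodec("proto")}
	// Hand-written handlers would ask the codec for *rawFrame; generated
	// handlers still decode via the fallback, which is why the issue notes
	// we may need custom RPC handlers instead of generated ones.
	srv := grpc.NewServer(grpc.ForceServerCodec(codec))
	_ = srv
}
```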