Control protocol checkin payloads can exceed the gRPC maximum message size when using autodiscovery #2460
Comments
Could we just go with a larger default maximum message size? Since this is an internal gRPC protocol between the components, is there any danger in doing that? I don't think so; it isn't a limit meant to guard against DDoS or anything like that.
We can increase the default size, and it will only have performance implications when the configs we transport are very large. The question with just adjusting the maximum size is determining how big it needs to be. We can make it configurable to minimize complexity here. I'm not actually sure how common this problem will be, so it seems simpler to just increase the max message size and allow configuring it as an escape hatch when the default isn't large enough. Another option would be compressing the messages when we send them; that might help mitigate the problem as well, but with a permanent CPU cost.
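As a rough illustration of the compression option, gRPC's Go client can opt into its registered gzip compressor per connection. This is only a sketch of the idea, not the agent's actual configuration; the helper name is made up.

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/encoding/gzip" // importing this package registers the gzip compressor
)

// withGzipCompression is a hypothetical helper showing how a client could ask
// for gzip compression on every RPC over a connection: smaller payloads on the
// wire, at the price of a permanent CPU cost on both ends.
func withGzipCompression() grpc.DialOption {
	return grpc.WithDefaultCallOptions(grpc.UseCompressor(gzip.Name))
}
```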
@cmacknz @blakerouse
We have a support case confirming that setting the max message size does not seem to work or have any effect. As a result, this issue is blocking the rollout of the Agent to these medium-sized K8s clusters at customers. We expect clusters that are much larger, 3K-4K endpoints, and these numbers will only increase over time.
I think the problem here is that we can adjust the maximum message size for the server side (agent) via the agent configuration, but this limit also applies on the client side (Filebeat) where we don't expose it. See https://pkg.go.dev/google.golang.org/grpc#MaxCallRecvMsgSize.
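For illustration, raising the client-side limit would look roughly like the sketch below. The address and the 32 MB value are made up; the point is that this option lives in the client's dial code (e.g. Filebeat's), not in the agent configuration.

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialControl shows where the client-side receive limit lives. gRPC's default
// maximum receive size is 4 MB, and grpc.MaxCallRecvMsgSize must be set by the
// client itself; the agent's server-side setting does not change it.
func dialControl(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder credentials
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(32*1024*1024), // hypothetical 32 MB limit
		),
	)
}
```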
This is a good point and pushes me towards trying to solve this in a way that doesn't involve manual configuration changes each time the agent is deployed on k8s. There's no real way to know how big to make the max message size ahead of time. We could consider changing the protocol to send each configuration unit in a separate message; right now they are each sent as a repeated array in a single message:

```proto
message CheckinExpected {
    // Units is the expected units the component should be running.
    repeated UnitExpected units = 1;
    // Agent info is provided only on first CheckinExpected response to the component.
    CheckinAgentInfo agent_info = 2;
    // Features are the expected feature flags configurations.
    // Added on Elastic Agent v8.7.1.
    Features features = 3;
    // Index of the either current features configuration or new the configuration provided.
    uint64 features_idx = 4;
}
```

@bvader do you have diagnostics or a sample agent policy that was experiencing this that we could use as a reference?
@cmacknz Apologies, I will get these for you.
The diagnostics were shared privately. I can confirm at least one instance contains 525 unique units. The structure of the components.yaml in this case had each unit starting with:

```yaml
units:
  - config:
      datastream:
```

Searching with `rg '\- config:' components.yaml --count` returns 525.
I believe we can get creative and get this working without having to change the protocol. The Elastic Agent only sends a unit configuration when something has changed. I would expect the main issue here is that on startup with a large number of containers, the Elastic Agent is trying to send all the new units at one time, and that one-time send hits the gRPC limit.

In the case that a large number of units are being added to the component (more than the gRPC limit allows), we add the units in increments; they don't all have to be created in one shot. That staggered rollout of the units would allow the Elastic Agent to stay under the gRPC limit.

The same logic can be applied to updating units. If we need to roll out a configuration change that affects all units and would result in every unit getting an updated configuration, that rollout could be staggered so as not to send every configuration for every unit in one shot.

In the very rare chance that the base information for all units without a configuration (which should be very small) is more than we would ever be able to send over the protocol, we could split that into two separate components, each running a set of units, to keep it below the threshold of the gRPC limit.
The protobuf messages all have a Size() method we could use to check how large they are before attempting to send them, which would help with that idea. We would have to be very careful to avoid bugs in client implementations where the staggered rollout leads to an accidentally removed unit. For example, Beats is always looking for units that were present but aren't anymore (see code). Really this is just a different type of protocol change, but it is harder for clients to know about the semantic change because the wire format hasn't actually changed. It might actually be easier to also change the message definitions at the same time, because with either approach each client implementation needs to be thoroughly retested anyway. I think I am still biased towards making the change that solves this permanently and just changing the RPC definition. We know about all the client implementations today, and this change won't get easier as time passes.
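For a rough illustration of that size check, proto.Size from the Go protobuf runtime could be used to group units into batches that stay under a limit. The function and limit below are hypothetical, and a real implementation would still have to respect the contract that a missing unit means a removed unit.

```go
package sketch

import "google.golang.org/protobuf/proto"

// batchBySize groups messages (e.g. the UnitExpected entries of a checkin)
// into batches whose combined encoded size stays under maxBytes, so that no
// single send exceeds the gRPC message limit. Purely illustrative.
func batchBySize(units []proto.Message, maxBytes int) [][]proto.Message {
	var batches [][]proto.Message
	var current []proto.Message
	size := 0
	for _, u := range units {
		s := proto.Size(u)
		if len(current) > 0 && size+s > maxBytes {
			batches = append(batches, current)
			current, size = nil, 0
		}
		current = append(current, u)
		size += s
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}
```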
We always have to be careful of that; that is the contract for how the protocol works. My suggestion would change nothing in the contract or in the protocol. From the standpoint of the component it all works the same. It would be no different than someone adding a single integration at a time over a period of time.
Not true, it's not a protocol change at all and not a change to the component at all. The change would only need to be done in the internals of the Elastic Agent.
This requires the change to happen in every component that Elastic Agent supports, whereas my change allows it to be fixed in the Elastic Agent without having to change the protocol or the contract with the spawned components.
@blakerouse and I spoke about this today and agreed the best solution will be to implement an optional, chunked transfer protocol for sending the checkin payloads between the agent and its components.
This will be conceptually similar to HTTP chunked transfer encoding.
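As a sketch of what the receiving side of such a chunked exchange might look like (the message shape and the `Done` flag below are invented for illustration, not the actual protocol change):

```go
package sketch

// unit stands in for the generated UnitExpected proto type.
type unit struct {
	ID     string
	Config string
}

// expectedChunk is a hypothetical shape for one chunk of a checkin payload:
// a slice of units plus a flag marking the final chunk, much like the
// terminating chunk in HTTP chunked transfer encoding.
type expectedChunk struct {
	Units []unit
	Done  bool
}

// reassemble accumulates chunks until the terminating one arrives and only
// then exposes the complete unit list, so a partial delivery is never
// mistaken for units having been removed.
func reassemble(chunks <-chan expectedChunk) []unit {
	var units []unit
	for c := range chunks {
		units = append(units, c.Units...)
		if c.Done {
			break
		}
	}
	return units
}
```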
My thinking on this problem now is that we may be better off changing how we generate configurations when using autodiscovery. Regardless of whether we change the control protocol, generating these giant configurations is inefficient and will still cause problems with some upcoming changes where we'd like to store these configurations in ConfigMaps or Secrets, which have a fixed size limit.
How do you see us addressing this topic? A kind of lightweight configuration for autodiscovery?
We probably still need to change the protocol, but in a different way. We could add a way to transport an input template and then the list of template variables, rather than having the agent pre-render everything, which is what it does today.
I significantly expanded the description to add more context to the problem, and provide a few more candidate solutions that are able to work in situations where we don't manage input configurations with the control protocol. |
Here is the current plan after investigation and discussion with @cmacknz:
Just wanted to note that this can happen even when running a very low pod count per node. For example, we're currently running 30 pods per node max, but we hit this because there are a bunch of pods that are completed but still not fully cleaned up.
We need to keep some pods around in this terminated phase, e.g. to debug or retry.
Intro
Kubernetes Autodiscovery allows the agent to automatically add and update inputs to its policy as containers, pods, and nodes in a Kubernetes cluster come and go.
This works well, but when autodiscovery is configured such that it will generate inputs for each pod on a node or each node in a cluster, the resulting agent policies can be so large that they cause problems. For example, the officially recommended limit for the number of pods on a node is 110, but individual Kubernetes runtimes can allow more than this. Amazon EKS allows 737 pods per node for the largest node types, and we have seen individual nodes with 400+ pods in support cases.
Problem
We have a recent example of the agent failing this way when a user was attempting to monitor 700+ pods. The agent logs were flooded with errors like:
The biggest problem caused by these large agent policies is that they begin to exceed the default size limits for the messages in the control protocol. This prevents the system from functioning by default.
This problem also affects diagnostics; see #1808. All communication between the agent and its subprocesses is affected by this problem, and it may also affect communication with Fleet Server, since the Fleet checkin payload contains an entry for each input.
Solution 1: Configurable Message Size Limits
There is a way to configure the maximum message size for the Elastic Agent itself (see elastic-agent/elastic-agent.reference.yml, lines 111 to 113 at b60b8b0), but there is a similar limit that must be configured for each client.
Even if we could allow changing the size limits everywhere, this solution is not automatic: it requires users to first experience a failure and then diagnose policy size as the cause.
Solution 2: Chunked gRPC Transfers
For a more transparent solution, we can introduce configuration chunking into the agent control protocol as described in #2460 (comment). This would solve the problem for the agent as it exists today.
However, we are also working to implement a Kubernetes operator for the Elastic Agent, and in that case the agent policy would not be transported using the control protocol but rather stored in a Kubernetes primitive like a ConfigMap or a Secret, both of which have a fixed 1 MB size limit. Changing the control protocol would not solve the problem in the case of an agent Kubernetes operator or other Kubernetes-native technology.
Solution 3: Require Components to Render Input Templates
Yet another alternative is to entirely change the way we generate inputs when using autodiscovery. Today each discovered input is rendered completely in the agent policy as the need for it is discovered. This is convenient as it requires no logic in each input, but problematic because the generated agent policies can be enormous.
For a single example, it is common to generate a filestream input to collect logs from each container on a node using an input like (see the Dynamic Logs Path documentation):
The autodiscovery logic will further expand this configuration to add processors, resulting in an even larger configuration, which is then repeated in the policy hundreds of times:
Rather than having the agent render these inputs from templates itself, we could introduce the concept of a templated input directly into the control protocol.
I am imagining that we introduce a message type that contains a base template, which could be similar to the unrendered agent input in the agent policy (repeated below for clarity), but in the same message type we also include the list of variables to substitute.

This configuration requires us to provide `${kubernetes.pod.name}`, `${kubernetes.pod.uid}`, and `${kubernetes.container.name}` for each pod on the node. A representative set of messages could look like:

This would compress the configuration down to the minimum set of information necessary to transport to each sub-process; however, it would require each component to support templated input rendering in its implementation language. This solution is not as easily generalizable as introducing a chunked control protocol, but it also solves the problem in the case where the agent policy needs to be stored in a size-limited Kubernetes object such as a ConfigMap or a Secret.
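To make the idea concrete, here is a minimal sketch of the kind of rendering each component would need to perform, assuming a made-up variable set and plain `${...}` string substitution; the real message shapes and variable names would need to be designed.

```go
package sketch

import "strings"

// podVars is a hypothetical per-pod variable set that a component would
// substitute into a shared input template, instead of the agent shipping a
// fully rendered input per pod.
type podVars struct {
	PodName, PodUID, ContainerName string
}

// renderInputs expands one input template once per discovered pod. Purely
// illustrative of the "templated input" concept described above.
func renderInputs(template string, pods []podVars) []string {
	rendered := make([]string, 0, len(pods))
	for _, p := range pods {
		r := strings.NewReplacer(
			"${kubernetes.pod.name}", p.PodName,
			"${kubernetes.pod.uid}", p.PodUID,
			"${kubernetes.container.name}", p.ContainerName,
		)
		rendered = append(rendered, r.Replace(template))
	}
	return rendered
}
```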
Solution 4: Use YAML Anchors to Avoid Repetition
Similar to the approach above to avoid duplicating information in the control protocol, we could leave the control protocol as is and attempt to eliminate the duplication using YAML anchors or custom YAML syntax in the agent policy itself.
For an example, see GitLab's documentation on simplifying configurations: https://docs.gitlab.com/ee/ci/yaml/yaml_optimization.html#anchors
Those examples use anchors in combination with map merging to attempt to perform the templating described above using the more advanced parts of the YAML syntax.
This will make the computed agent policy harder to read, and also requires that the YAML features we use be well supported in multiple implementation languages (at least Go and C++).
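As a quick illustration of that multi-language concern, the sketch below checks that anchors and merge keys survive the Go YAML parser; the policy fragment is invented, and a C++ parser would need the same verification.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// An invented policy fragment using an anchor plus merge keys to share the
// common parts of an input across many discovered pods.
const policy = `
container_logs_defaults: &defaults
  type: filestream
  data_stream:
    dataset: kubernetes.container_logs

inputs:
  - <<: *defaults
    id: container-logs-pod-a
  - <<: *defaults
    id: container-logs-pod-b
`

func main() {
	// Decode and print the expanded inputs to confirm the Go parser resolves
	// the anchor and merge keys the way the policy author intended.
	var parsed map[string]interface{}
	if err := yaml.Unmarshal([]byte(policy), &parsed); err != nil {
		panic(err)
	}
	fmt.Printf("%v\n", parsed["inputs"])
}
```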
Scope
Provide a recommended solution to this problem to ensure that the Elastic Agent can scale to monitor any size of Kubernetes cluster without arbitrary internal limits. Consider solutions beyond those proposed here, each of which has its own set of pros and cons.
Keep in mind that this problem so far only affects Kubernetes deployments, and solutions that add a fixed resource cost to all use cases (enabling compression in the control protocol, for example) should be avoided if possible.