[Testing] Input reload not working as expected under Elastic-Agent #35178

Closed · pierrehilbert opened this issue Apr 24, 2023 · 9 comments · Fixed by #35250

Labels: Team:Elastic-Agent (Label for the Agent team)
pierrehilbert (Collaborator) commented Apr 24, 2023

Related Issue

#33653

Description

Implement integration tests using our new framework to catch the input reload bug before attempting the fix.
This will allow us to ensure that the fix we provide correctly covers every workflow.
More details are in the related issue.

Acceptance Criteria

  • The publishing pipeline doesn't contain remaining events when the input is stopped
  • The harvester status is correct
  • Restarting an input doesn't generate errors
  • ...
pierrehilbert added the Team:Elastic-Agent label on Apr 24, 2023
elasticmachine (Collaborator):

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member) commented Apr 24, 2023

Do we need this to be an integration test, or to even test it under the Elastic agent at all?

Can this behaviour be isolated to just Filebeat itself? The agent is just configuring Filebeat inputs using essentially the same logic used when configuration file reloading is enabled.

Can this be a Filebeat system test, or even just a unit test?

pierrehilbert (Collaborator, Author):

I based this on the issue's description.
If we can isolate this bug to Filebeat only, I agree we can test it there directly instead of in the Agent.

belimawr (Contributor):

Can this be a Filebeat system test, or even just a unit test?

I believe it might be possible if we mock the ManagerV2 or create a similar situation. The main difference between running under Agent and the configuration file reload is that the Agent can send pieces of configuration at any time, in any order, and that seems to be the root cause of the issues. When reloading from the configuration file, Filebeat reads the whole config and reloads it in "a single step".

I've been thinking about the acceptance criteria and the first two are not really feasible:

  1. The publishing pipeline doesn't contain remaining events when the input is stopped
  2. The harvester status is correct

Here is why:

  1. Even on a standalone Filebeat, we do not flush the publishing pipeline when the input is stopped.
  2. Especially for the log input, the status of the harvester is updated as the events are acknowledged, so the state is only correct once the publishing pipeline has been emptied.

With all that said, the third one is a good high-level description of what we should be testing.

cmacknz (Member) commented Apr 26, 2023

Here's an example of a unit test that has complete control over the sequence of messages that would be sent by the agent:

srv := mockSrv([][]*proto.UnitExpected{

I am not convinced we actually need to test at this level though, and we can probably trigger this bug by just working with the beat configuration directly.

If we replace a filestream input with ID A with a copy of that input with ID B, I think we'd trigger the same behaviour. This is what the agent units are likely doing, and it is the exact behaviour you'd get if you reassigned the agent policy to another one with the same input defined in it; only the ID would change.

cmacknz (Member) commented Apr 26, 2023

Discussed today and we believe we have isolated the behaviour using only Filebeat itself. @belimawr if we need another round of discussion and brainstorming on this let me know and I'll set it up.

pierrehilbert (Collaborator, Author):

@belimawr don't hesitate to update the issue following your investigation and discussion with Craig.

belimawr (Contributor):

Yes, we isolated it. The main thing is that a standalone Filebeat doing config reload handles this issue gracefully by debouncing config reloads and retrying indefinitely until all inputs are successfully started, hence we don't see the effects of this issue.

I'll add a quick brain dump of how to reproduce/see it happening here.

Given:

  • A filebeat.yml
  • Two input configurations reading the same file but with different IDs
  • A log file that's constantly written to (like a line every second). I always use flog for that.
filebeat.yml:

filebeat.inputs:

filebeat.config:
  inputs:
    enabled: true
    path: config/*.yml
    reload.enabled: true
    reload.period: 10s

output.file:
  enabled: true

log1.yml:

- type: log
  id: log-input-1
  paths:
    - /tmp/flog.log

log2.yml:

- type: log
  id: log-input-2
  paths:
    - /tmp/flog.log

Start Filebeat with debug logs

./filebeat -e -v -d "centralmgmt"

Keep replacing the config

Copy one of the configs so Filebeat can use them

cp log1.yml config/log.yml

Wait for the harvester to start

You should see logs like

Harvester started for paths: [/tmp/flog.log]

Keep replacing the input config

Here is an example; after every command, wait a little while until you see logs informing you of the reload/start of the new harvester:

cp log1.yml config/log.yml
cp log2.yml config/log.yml
cp log1.yml config/log.yml

Eventually (it happens rather quickly) you'll see logs like:

Error creating runner from config: Can only start an input when all related states are finished: {Id: native::164253-34, Finished: false, Fileinfo: &{flog.log 950343 420 {179441601 63818185130 0x4892000} {34 164253 1 33188 1000 1000 0 0 950343 4096 1864 {1682588243 628737296} {1682588330 179441601} {1682588330 179441601} [0 0 0]}}, Source: /tmp/flog.log, Offset: 951193, Timestamp: 2023-04-27 11:39:00.031876649 +0200 CEST m=+19.152186163, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 164253-34}

Will there be any observable issue on the output/data harvested?

No, there will not. The config file reload knows how to handle this situation and Filebeat will, eventually, be running all configured inputs successfully.

What about when running under Elastic-Agent?

Well, if there is any issue when applying the new configuration sent by the Elastic-Agent, Filebeat will report the error, which makes the Elastic-Agent report unhealthy.

cmacknz (Member) commented May 2, 2023

I spoke with @belimawr today and we came up with the following proposal to test this:

  1. Add a new command line argument for the Beats management package that allows specifying the gRPC server address to connect to directly. This would involve modifying this structure:

// Config for central management
type Config struct {
    Enabled               bool                    `config:"enabled" yaml:"enabled"`
    Blacklist             ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
    RestartOnOutputChange bool                    `config:"restart_on_output_change" yaml:"restart_on_output_change"`
}
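
A rough sketch of what that change could look like follows. It only illustrates modifying the structure; whether the value comes from a CLI flag or a config setting is left open, and the field name `InsecureGRPCAddressForTesting` is a placeholder, not the final API:

// Config for central management (sketch only; the new field name is hypothetical)
type Config struct {
    Enabled               bool                    `config:"enabled" yaml:"enabled"`
    Blacklist             ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
    RestartOnOutputChange bool                    `config:"restart_on_output_change" yaml:"restart_on_output_change"`

    // InsecureGRPCAddressForTesting, when non-empty, makes the Beat connect
    // directly to this gRPC address without TLS instead of reading the
    // connection information from stdin. Placeholder name for illustration.
    InsecureGRPCAddressForTesting string `config:"insecure_grpc_address_for_testing" yaml:"insecure_grpc_address_for_testing"`
}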

  2. When the server address is configured directly, instead of trying to read the server address and mTLS configuration from stdin as it would when running under the Agent, the Beat will create a V2 protocol client that connects directly to that address without TLS. This will involve modifying this block:

agentClient, _, err := client.NewV2FromReader(os.Stdin, client.VersionInfo{
    Name:    "beat-v2-client",
    Version: version.GetDefaultVersion(),
    Meta: map[string]string{
        "commit":     version.Commit(),
        "build_time": version.BuildTime().String(),
    },
})

It will look something like the insecure configuration used in the existing unit tests:

client := client.NewV2(fmt.Sprintf(":%d", srv.Port), "", client.VersionInfo{
    Name:    "program",
    Version: "v1.0.0",
    Meta: map[string]string{
        "key": "value",
    },
}, grpc.WithTransportCredentials(insecure.NewCredentials()))
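
Putting the two snippets together, the selection logic might look roughly like the sketch below. It is only illustrative: `cfg.InsecureGRPCAddressForTesting` is the hypothetical field from the earlier sketch, `client.V2` is assumed to be the client interface returned by both constructors, and error handling is abbreviated.

var agentClient client.V2
var err error
if cfg.InsecureGRPCAddressForTesting != "" {
    // Test mode: connect directly to the configured address without TLS.
    agentClient = client.NewV2(cfg.InsecureGRPCAddressForTesting, "", client.VersionInfo{
        Name:    "beat-v2-client-for-testing",
        Version: version.GetDefaultVersion(),
    }, grpc.WithTransportCredentials(insecure.NewCredentials()))
} else {
    // Normal mode under Elastic Agent: read the address and mTLS
    // credentials from stdin, as the agent provides them on startup.
    agentClient, _, err = client.NewV2FromReader(os.Stdin, client.VersionInfo{
        Name:    "beat-v2-client",
        Version: version.GetDefaultVersion(),
    })
    if err != nil {
        return err
    }
}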

  3. We will write an integration test that starts a gRPC server with an implementation of the ElasticAgentClient service, starts Filebeat as a process, and then sequences the CheckinExpected messages in a way that reproduces this problem.

The test will only need to implement the CheckinV2 RPC; the other RPCs can be stubs. The Checkin RPC is deprecated, and we don't need Actions.

https://github.com/elastic/elastic-agent-client/blob/554dc30ca4b66d9ab41149e6ed099b859c82ad09/pkg/proto/elastic-agent-client_grpc.pb.go#L28-L34
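
For illustration, here is a minimal sketch of such a stub server. It assumes the generated `proto` package from elastic-agent-client; the import path, the `/v7` module suffix, and the generated type and function names follow the standard protoc-gen-go-grpc naming for the linked ElasticAgent service and are assumptions, not verified against a specific version:

package management_test

import (
    "net"
    "testing"

    "google.golang.org/grpc"

    "github.com/elastic/elastic-agent-client/v7/pkg/proto"
)

// scriptedAgent implements only the CheckinV2 RPC; Checkin and Actions remain
// stubs via the embedded UnimplementedElasticAgentServer.
type scriptedAgent struct {
    proto.UnimplementedElasticAgentServer
    // script holds the CheckinExpected messages to send, in order, e.g. a
    // unit with ID A followed by the same unit with ID B.
    script []*proto.CheckinExpected
}

func (s *scriptedAgent) CheckinV2(srv proto.ElasticAgent_CheckinV2Server) error {
    if len(s.script) == 0 {
        return nil
    }
    i := 0
    for {
        // Wait for the Beat to report its observed state.
        if _, err := srv.Recv(); err != nil {
            return err
        }
        // Reply with the next scripted expected state; once the script is
        // exhausted, keep repeating the last message.
        if i >= len(s.script) {
            i = len(s.script) - 1
        }
        if err := srv.Send(s.script[i]); err != nil {
            return err
        }
        i++
    }
}

// startScriptedAgent serves the stub on a random local port without TLS and
// returns the address the Beat should connect to.
func startScriptedAgent(t *testing.T, script []*proto.CheckinExpected) string {
    lis, err := net.Listen("tcp", "localhost:0")
    if err != nil {
        t.Fatal(err)
    }
    srv := grpc.NewServer()
    proto.RegisterElasticAgentServer(srv, &scriptedAgent{script: script})
    go func() { _ = srv.Serve(lis) }()
    t.Cleanup(srv.Stop)
    return lis.Addr().String()
}

The test would then point Filebeat at the returned address (using the direct-connection option from step 1) and assert on the Beat's behaviour as the scripted unit changes are applied.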

This will give us the foundation to write integration tests that directly control the configuration changes sent to Beats without having to rely on editing the agent policy in the required way. More importantly, it does not introduce a dependency on the Elastic Agent build system by requiring use of the under-development Elastic Agent integration testing framework.
