[Testing] Input reload not working as expected under Elastic-Agent #35178

Closed · pierrehilbert opened this issue Apr 24, 2023 · 9 comments · Fixed by #35250

Labels: Team:Elastic-Agent (Label for the Agent team)
pierrehilbert (Collaborator) commented Apr 24, 2023

Related Issue

#33653

Description

Implement integration tests using our new framework to catch the input reload bug before attempting the fix.
This will allow us to ensure that the fix we provide correctly covers every workflow.
More details are in the related issue.

Acceptance Criteria

  • The publishing pipeline doesn't contain remaining events when the input is stopped
  • The harvester status is correct
  • Restarting an input doesn't generate errors
  • ...
pierrehilbert added the Team:Elastic-Agent label on Apr 24, 2023
elasticmachine (Collaborator):

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member) commented Apr 24, 2023

Do we need this to be an integration test, or to even test it under the Elastic agent at all?

Can this behaviour be isolated to just Filebeat itself? The agent is just configuring Filebeat inputs using essentially the same logic used when configuration file reloading is enabled.

Can this be a Filebeat system test, or even just a unit test?

pierrehilbert (Collaborator, Author):

I based this on the issue's description.
If we can isolate this bug to Filebeat only, I agree we can test it there directly instead of in the Agent.

belimawr (Contributor):

Can this be a Filebeat system test, or even just a unit test?

I believe it might be possible if we mock the ManagerV2 or create a similar situation. The main difference between running under Agent and the configuration file reload is that the Agent can send pieces of configuration at any time, in any order, and that seems to be the root cause of the issues. When reloading from the configuration file, Filebeat reads the whole config and reloads it in "a single step".

I've been thinking about the acceptance criteria and the first two are not really feasible:

  1. The publishing pipeline doesn't contain remaining events when the input is stopped
  2. The harvester status is correct

Here is why:

  1. Even on a standalone Filebeat, we do not flush the publishing pipeline when the input is stopped.
  2. Especially for the log input, the status of the harvester is updated as the events are acknowledged, so the state is only correct once the publishing pipeline has been emptied.

With all that said, the third one is a good high-level description of what we should be testing.

cmacknz (Member) commented Apr 26, 2023

Here's an example of a unit test that has complete control over the sequence of messages that would be sent by the agent:

srv := mockSrv([][]*proto.UnitExpected{

I am not convinced we actually need to test at this level though, and we can probably trigger this bug by just working with the beat configuration directly.

If we replace a filestream input with ID A with a copy of that input with ID B, I think we'd trigger the same behaviour. This is what the agent units are likely doing, and it is the exact behaviour you'd get if you reassigned the agent policy to another one with the same input defined in it; only the ID would change.

cmacknz (Member) commented Apr 26, 2023

Discussed today and we believe we have isolated the behaviour using only Filebeat itself. @belimawr if we need another round of discussion and brainstorming on this let me know and I'll set it up.

pierrehilbert (Collaborator, Author):

@belimawr don't hesitate to update the issue following your investigation and discussion with Craig.

belimawr (Contributor):

Yes, we isolated it. The main thing is that a standalone Filebeat doing config reload handles this issue gracefully by debouncing config reloads and retrying indefinitely until all inputs are successfully started, hence we don't see the effects of this issue.

I'll add a quick brain dump of how to reproduce/see it happening here.

Given:

  • A filebeat.yml
  • Two input configurations reading the same file but with different IDs
  • A log file that's constantly written to (like a line every second). I always use flog for that.
filebeat.yml:

filebeat.inputs:

filebeat.config:
  inputs:
    enabled: true
    path: config/*.yml
    reload.enabled: true
    reload.period: 10s

output.file:
  enabled: true

log1.yml:

- type: log
  id: log-input-1
  paths:
    - /tmp/flog.log

log2.yml:

- type: log
  id: log-input-2
  paths:
    - /tmp/flog.log

Start Filebeat with debug logs

./filebeat -e -v -d "centralmgmt"

Keep replacing the config

Copy one of the configs so Filebeat can use them

cp log1.yml config/log.yml

Wait for the harvester to start

You should see logs like

Harvester started for paths: [/tmp/flog.log]

Keep replacing the input config

Here is an example; after every command, wait a little while until you see logs informing you of the reload/start of the new harvester:

cp log1.yml config/log.yml
cp log2.yml config/log.yml
cp log1.yml config/log.yml

Eventually (it happens rather quickly) you'll see logs like:

Error creating runner from config: Can only start an input when all related states are finished: {Id: native::164253-34, Finished: false, Fileinfo: &{flog.log 950343 420 {179441601 63818185130 0x4892000} {34 164253 1 33188 1000 1000 0 0 950343 4096 1864 {1682588243 628737296} {1682588330 179441601} {1682588330 179441601} [0 0 0]}}, Source: /tmp/flog.log, Offset: 951193, Timestamp: 2023-04-27 11:39:00.031876649 +0200 CEST m=+19.152186163, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 164253-34}

Will there be any observable issue on the output/data harvested?

No, there will not. The config file reload knows how to handle this situation and Filebeat will, eventually, be running all configured inputs successfully.

What about when running under Elastic-Agent?

Well, if there is any issue when applying the new configuration sent by the Elastic-Agent, Filebeat will report the error, which makes the Elastic-Agent report unhealthy.

cmacknz (Member) commented May 2, 2023

I spoke with @belimawr today and we came up with the following proposal to test this:

  1. Add a new command line argument for the Beats management package that allows specifying the gRPC server address to connect to directly. This would involve modifying this structure:

// Config for central management
type Config struct {
    Enabled               bool                    `config:"enabled" yaml:"enabled"`
    Blacklist             ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
    RestartOnOutputChange bool                    `config:"restart_on_output_change" yaml:"restart_on_output_change"`
}
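
A rough sketch of what that change could look like follows. It only illustrates modifying the structure; whether the value comes from a CLI flag or a config setting is left open, and the field name `InsecureGRPCAddressForTesting` is a placeholder, not the final API:

// Config for central management (sketch only; the new field name is hypothetical)
type Config struct {
    Enabled               bool                    `config:"enabled" yaml:"enabled"`
    Blacklist             ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
    RestartOnOutputChange bool                    `config:"restart_on_output_change" yaml:"restart_on_output_change"`

    // InsecureGRPCAddressForTesting, when non-empty, makes the Beat connect
    // directly to this gRPC address without TLS instead of reading the
    // connection information from stdin. Placeholder name for illustration.
    InsecureGRPCAddressForTesting string `config:"insecure_grpc_address_for_testing" yaml:"insecure_grpc_address_for_testing"`
}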

  2. When the server address is configured directly, instead of trying to read the server address and mTLS configuration from stdin as it would when running under the Agent, the Beat will create a V2 protocol client that connects directly to that address without TLS. This will involve modifying this block:

agentClient, _, err := client.NewV2FromReader(os.Stdin, client.VersionInfo{
    Name:    "beat-v2-client",
    Version: version.GetDefaultVersion(),
    Meta: map[string]string{
        "commit":     version.Commit(),
        "build_time": version.BuildTime().String(),
    },
})

It will look something like the insecure configuration used in the existing unit tests:

client := client.NewV2(fmt.Sprintf(":%d", srv.Port), "", client.VersionInfo{
    Name:    "program",
    Version: "v1.0.0",
    Meta: map[string]string{
        "key": "value",
    },
}, grpc.WithTransportCredentials(insecure.NewCredentials()))
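
Putting the two snippets together, the selection logic might look roughly like the sketch below. It is only illustrative: `cfg.InsecureGRPCAddressForTesting` is the hypothetical field from the earlier sketch, `client.V2` is assumed to be the client interface returned by both constructors, and error handling is abbreviated.

var agentClient client.V2
var err error
if cfg.InsecureGRPCAddressForTesting != "" {
    // Test mode: connect directly to the configured address without TLS.
    agentClient = client.NewV2(cfg.InsecureGRPCAddressForTesting, "", client.VersionInfo{
        Name:    "beat-v2-client-for-testing",
        Version: version.GetDefaultVersion(),
    }, grpc.WithTransportCredentials(insecure.NewCredentials()))
} else {
    // Normal mode under Elastic Agent: read the address and mTLS
    // credentials from stdin, as the agent provides them on startup.
    agentClient, _, err = client.NewV2FromReader(os.Stdin, client.VersionInfo{
        Name:    "beat-v2-client",
        Version: version.GetDefaultVersion(),
    })
    if err != nil {
        return err
    }
}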

  3. We will write an integration test that starts a gRPC server with an implementation of the ElasticAgentClient service, starts Filebeat as a process, and then sequences the CheckinExpected messages in a way that reproduces this problem.

The test will only need to implement the CheckinV2 RPC; the other RPCs can be stubs. The Checkin RPC is deprecated, and we don't need Actions.

https://github.com/elastic/elastic-agent-client/blob/554dc30ca4b66d9ab41149e6ed099b859c82ad09/pkg/proto/elastic-agent-client_grpc.pb.go#L28-L34
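
For illustration, here is a minimal sketch of such a stub server. It assumes the generated `proto` package from elastic-agent-client; the import path, the `/v7` module suffix, and the generated type and function names follow the standard protoc-gen-go-grpc naming for the linked ElasticAgent service and are assumptions, not verified against a specific version:

package management_test

import (
    "net"
    "testing"

    "google.golang.org/grpc"

    "github.com/elastic/elastic-agent-client/v7/pkg/proto"
)

// scriptedAgent implements only the CheckinV2 RPC; Checkin and Actions remain
// stubs via the embedded UnimplementedElasticAgentServer.
type scriptedAgent struct {
    proto.UnimplementedElasticAgentServer
    // script holds the CheckinExpected messages to send, in order, e.g. a
    // unit with ID A followed by the same unit with ID B.
    script []*proto.CheckinExpected
}

func (s *scriptedAgent) CheckinV2(srv proto.ElasticAgent_CheckinV2Server) error {
    if len(s.script) == 0 {
        return nil
    }
    i := 0
    for {
        // Wait for the Beat to report its observed state.
        if _, err := srv.Recv(); err != nil {
            return err
        }
        // Reply with the next scripted expected state; once the script is
        // exhausted, keep repeating the last message.
        if i >= len(s.script) {
            i = len(s.script) - 1
        }
        if err := srv.Send(s.script[i]); err != nil {
            return err
        }
        i++
    }
}

// startScriptedAgent serves the stub on a random local port without TLS and
// returns the address the Beat should connect to.
func startScriptedAgent(t *testing.T, script []*proto.CheckinExpected) string {
    lis, err := net.Listen("tcp", "localhost:0")
    if err != nil {
        t.Fatal(err)
    }
    srv := grpc.NewServer()
    proto.RegisterElasticAgentServer(srv, &scriptedAgent{script: script})
    go func() { _ = srv.Serve(lis) }()
    t.Cleanup(srv.Stop)
    return lis.Addr().String()
}

The test would then point Filebeat at the returned address (using the direct-connection option from step 1) and assert on the Beat's behaviour as the scripted unit changes are applied.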

This will give us the foundation to write integration tests that directly control the configuration changes sent to Beats without having to rely on editing the agent policy in the required way. More importantly, it does not introduce a dependency on the Elastic Agent build system by requiring use of the under-development Elastic Agent integration testing framework.
