
Elastic-agent must be able to manage global variables within a beat #163

Closed
fearful-symmetry opened this issue Oct 28, 2021 · 17 comments
Labels
bug Something isn't working enhancement New feature or request Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@fearful-symmetry
Contributor

fearful-symmetry commented Oct 28, 2021

CC @ruflin @jen-huang @joshdover @jlind23 @masci

What's this about?

This is a result of me trying to deal with the bug in elastic/beats#28546, where the problem stems from an inability to manage the global hostfs from within fleet. I'm going to be summarizing from an email thread a bit here.

Right now, we have no real way for a user to set and update global variables within metricbeat from the fleet UI. In addition, the config system that manages a beat config from fleet has no way to set or manage global variables; the OnReload callbacks that fire only affect config within a given module. In addition, libbeat's management of global config and state is highly piecemeal, and not well-suited to clean remote management by something like elastic-agent. These are all interrelated problems, and whatever fix comes out of this needs to deal with all of them in some way.

What do we need?

While there's a handful of use cases for management of global variables, the first and most immediate from a tech debt perspective is the system.hostfs variable that's used by various system components to specify an alternate root filesystem, in cases where a user wants to monitor a host system from within a container. Usage of this setting is spread across two metricbeat modules and libbeat, and will probably expand to other modules over time.

While I'm sure we could put something together in a week to fix hostfs, I'd like to approach this from a generic perspective, as this need is going to crop up in other areas over time, and we need a generic, scalable way of managing global state from fleet.

At minimum, it seems to me we need the following:

  • A UI element that sets variables at a Fleet Config Policy Level
  • Some component that transmits this config to beats and applies it at a global level
  • Some component(s) in libbeat that are capable of updating global config variables from fleet inputs

The workaround in the above PR is to set the global variable from the system module init, which is not a great solution, but it's the only one I have, considering we just need the bug fixed in the short term.

Analysis: What are the underlying issues?

When I was considering the requirements for a "proper" hostfs fix, I came up with a few requirements (copying here from an email):

  • Must work under metricbeat's Initialization Scheme, as well as Fleet
  • Must have an API that ensures different components are using the HostFS in a consistent manner
  • Any component that needs to know the HostFS must not carry it around in global state, and instead receive it from an API
  • It must be able to be updated with a new HostFS during runtime, and that updated value must be visible to any downstream consumers

It struck me halfway through that I was sort of describing a key-value store. Right now we don't have anything like that in fleet as far as I'm aware, but I think we should consider something like it (at least from an API perspective), as the fundamental problem this issue is trying to address is that global config within beats is done in a rather piecemeal, distributed fashion without any central gatekeeping or management. This makes sense for a CLI-based application like metricbeat/filebeat, but not so much for a remotely managed component of a multi-part system like the fleet stack.
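For illustration only, a store meeting the four requirements above could look something like the following Go sketch: a single gatekeeper for "global" settings such as `system.hostfs`, updatable at runtime, with changes visible to subscribers. This is hypothetical API design, not an existing beats or fleet component:

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a minimal thread-safe key-value config store with change
// notification, the shape of API the comment above describes.
type Store struct {
	mu       sync.RWMutex
	values   map[string]string
	watchers map[string][]chan string
}

func NewStore() *Store {
	return &Store{
		values:   map[string]string{},
		watchers: map[string][]chan string{},
	}
}

// Get returns the current value; consumers call this instead of
// carrying a copy around in their own global state.
func (s *Store) Get(key string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.values[key]
}

// Set updates a value (e.g. from a fleet policy change) and notifies
// watchers, so downstream consumers see the new value at runtime.
func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[key] = value
	for _, ch := range s.watchers[key] {
		select {
		case ch <- value:
		default: // never block Set on a slow consumer
		}
	}
}

// Watch returns a channel that receives future updates for key.
func (s *Store) Watch(key string) <-chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, 1)
	s.watchers[key] = append(s.watchers[key], ch)
	return ch
}

func main() {
	store := NewStore()
	updates := store.Watch("system.hostfs")
	store.Set("system.hostfs", "/hostfs")
	fmt.Println(store.Get("system.hostfs"), <-updates)
}
```

The point of the sketch is the contract, not the implementation: one place to set a value, one place to read it, and a subscription path so nothing needs to cache it in a package-level variable.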

If we want a solution that's more elegant than "append a flag to the exec command, restart the process", we'll need to re-evaluate how we utilize libbeat from inside elastic-agent. A meaningful, long-term solution to this problem will probably require a lot of refactoring and new code, but I think it's worth it, as large portions of libbeat were built around the assumption of an independent application that is configured locally via a CLI and yaml file. The longer we go without creating a codebase that treats remote management as a first-class use case, the more tech debt we're gonna build up.

The fleet data streams -> metricbeat module architecture we have in place now works because modules within a beat operate in a relatively independent, generic fashion, allowing something like the central management code to come along and re-implement the underlying interface. This isn't really the case with libbeat, as CLI flags and init functions happen in different places. For example, excluding testing flags, there are about four different sub-libraries in libbeat that set and fetch their own CLI flags and (I imagine in some cases) merge them with yml values using their own logic.

There's a lot of potential overlapping solutions to this problem (the KV store, spinning up inputs as independent processes and not a global beat instance, plugging modules directly into elastic-agent without metricbeat, etc, etc), and I'm not pushing for one solution, but more using the KV-store idea as an illustration for the kinds of overarching, big-picture changes I think we should be looking at.

@fearful-symmetry fearful-symmetry added bug Something isn't working enhancement New feature or request Team:Elastic-Agent Label for the Agent team labels Oct 28, 2021
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 29, 2021
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ruflin
Contributor

ruflin commented Nov 2, 2021

Thanks @fearful-symmetry for the write-up around the current issue(s) and for sharing ideas on the direction we could take this. As described, we should tackle the more general problem of "global variables" in Fleet and Elastic Agent, which then provides a solution for hostfs. I expect this to be a collaboration between the Control Plane and Fleet teams (@jen-huang) as it will also have an effect on how the policy is built, how users modify it, etc.

@jlind23
Contributor

jlind23 commented Feb 10, 2022

@ph I don't think we did enough preparatory work on our end to make this happen. I would rather deal with it in 8.3 or 8.4. Thoughts?

@ph
Contributor

ph commented Feb 10, 2022

I think I would be more drastic... Using environment variables for flag enablement is a fragile idea; I see this as tech debt that we should clean up. @fearful-symmetry got it right here.

In addition libbeat's management of global config and state is highly piecemeal, and not well-suited to clean remote management by something like elastic-agent. These are all interrelated problems, and whatever fix comes out of this needs to deal with all of these problems in some way.

I would prefer we look at this problem without having the need for an environment variable as a configuration in the context of input v2 and removing the requirement for a global variable both in input and libbeat.

@ph
Contributor

ph commented Feb 10, 2022

Just to be clear, I would suggest we have an issue removing the need for global variables completely which would fall into the data plane side.

@jlind23
Contributor

jlind23 commented Feb 10, 2022

@cmacknz would you take on this issue or should I create a new one following what @ph stated?

@cmacknz
Member

cmacknz commented Feb 10, 2022

I don't know enough about the interactions here to have an opinion on what the solution to this is. I can believe changing the metricbeat side will be easier than coordinating something across all of agent and fleet.

Hopefully @fearful-symmetry can give me a tour of the code involved here the next time we talk.

@fearful-symmetry
Contributor Author

I would prefer we look at this problem without having the need for an environment variable as a configuration in the context of input v2 and removing the requirement for a global variable both in input and libbeat.

@ph Seconded. Most of the global-variable logic floating around the code exists to manage either CLI flags or random configs thrown into the main whateverbeat.yml; once we're no longer tied to a beat with a command line, we get rid of the need for a lot of this. Considering how V2 will also isolate inputs from queues, shippers, etc., there might not be a lot of global variables left.

The most awkward component here (and the one I'm most familiar with), is hostfs. Until we refactor out all the vendored libraries like gosigar, some subcomponent in the metricbeat module will need to set hostfs as a global variable in order to make those vendor libraries happy.

@ph
Contributor

ph commented Feb 10, 2022

The most awkward component here (and the one I'm most familiar with), is hostfs. Until we refactor out all the vendored libraries like gosigar, some subcomponent in the metricbeat module will need to set hostfs as a global variable in order to make those vendor libraries happy.

Sadly, yes, it's been a while since I've gone down the go-sigar rabbit hole, so I don't know the size of that refactoring effort. But as we move forward we should reduce side-effect code (like env variables or func init()) and move to explicit configuration.

@fearful-symmetry
Contributor Author

Sadly, yes, it's been a while since I've gone down the go-sigar rabbit hole, so I don't know the size of that refactoring effort. But as we move forward we should reduce side-effect code (like env variables or func init()) and move to explicit configuration.

Agreed. That refactoring is currently in-progress (See elastic/beats#30076), but it's a huge slog, as there's tons of code to re-write and move over. There's also shirou/gopsutil, which requires a global hostfs in a few places as well.

@fearful-symmetry
Contributor Author

On the request of @cmacknz , here's a breakdown of the parts of the system code that still have discrepancies as far as hostfs handling:

  • Libraries still using shirou/gopsutil, which requires hostfs to be set globally at the library level:

    • libbeat/metric/system/diskio
    • metricbeat/module/system/network
    • metricbeat/module/system/socket_summary
  • Other uses of gopsutil that don't require hostfs:

    • The darwin implementation of "github.com/shirou/gopsutil/v3/cpu"
    • libbeat/cmd/platformcheck/platformcheck.go
  • Libraries still using elastic/go-sysinfo, which doesn't take any hostfs variable, and thus won't work under containers:

    • libbeat/metric/system/host
    • metricbeat/module/linux/memory
    • metricbeat/module/system/network_summary
  • Libraries using elastic/go-sysinfo that I don't think will care about an alternate hostfs:

    • libbeat/metric/system/process (used to fetch host memory stats, as the memory code lives in metricbeat/internal)
    • libbeat/processors/add_observer_metadata
    • libbeat/processors/add_host_metadata
    • metricbeat/module/docker/network_summary

go-sysinfo is also used in packetbeat, agent, and auditbeat. I'm not knowledgeable enough about auditbeat to say for sure, but there might be some code in there that would care about a procfs root; I'm not sure if auditbeat supports that at all.

  • Libraries still using gosigar, which requires hostfs to be set globally
    • libbeat/metric/system/cpu (the name is deceptive, this is for load averages)
    • libbeat/metric/system/diskio
    • metricbeat/module/system/filesystem
    • metricbeat/module/system/process_summary
  • Libraries still using gosigar that don't require hostfs:
    • libbeat/metric/system/cgroup/cgv1
    • libbeat/metric/system/process
    • metricbeat/internal/metrics/cpu (windows only)
    • metricbeat/module/system/socket
    • metricbeat/module/system/uptime
    • metricbeat/module/windows/service
    • packetbeat/procs

As far as the path forward here, go-sysinfo is the low-hanging fruit. We own the code, we can just copy it over with some changes and cleanup to support beat's internal hostfs resolver. However, a lot of the go-sysinfo code is used outside of metricbeat metrics gathering, and it raises the question of what parts of the beats ecosystem should use what. The original plan (which I'm still planning to stick to) is to extract any parts of go-sysinfo that are only used by metricbeat modules, and later add hostfs support to go-sysinfo in case we need it in other places.

The remaining gosigar libraries are also low-hanging fruit. It should be noted that a lot of the remaining uses of gosigar don't require any hostfs value, and are instead used to import legacy non-Linux implementations of various system APIs, as we aren't exactly overflowing with knowledgeable Darwin or Windows systems devs right now who can refactor this code. I can do it with a little self-training, but it's a lower priority.

gopsutil is going to be a bit of a snag, as we don't own the code, and a lot of the APIs it implements, like networking, aren't exactly a breeze. As an amusing aside: the last time I dug into these libraries, a lot of the code (particularly the Darwin and Windows implementations) seemed to be copy-and-pasted ad infinitum across GitHub, and I have no idea where some of it actually came from, from an authorship standpoint.

@jlind23
Contributor

jlind23 commented Mar 15, 2022

After discussing with @cmacknz, and with the current input design in V2 in mind, we decided to freeze this work for now and keep it for later, to see if the V2 design fixes these problems.

@cmacknz
Member

cmacknz commented Mar 15, 2022

Yes, there is some valuable refactoring to be done here, but completing all of the refactoring to solve this problem within metricbeat will take a long time. We are shifting focus to the V2 implementation, and we can hopefully address the root problem with V2 (via separate processes, an agent KV store, etc., as suggested in the description).

@jlind23
Contributor

jlind23 commented May 27, 2024

@cmacknz can this be closed as completed/outdated?

@cmacknz
Member

cmacknz commented May 27, 2024

This hasn't come up again since it was written, so we can close and reopen if needed.

If we still need to expose hostfs it can be added as a new setting in the agent policy. We can open a more targeted issue for that.

@cmacknz cmacknz closed this as completed May 27, 2024
@therealdwright
Contributor

@cmacknz I have been in discussion with your support team about this exact issue for almost a month now. I think a solution is still required.
