[Fleet] Agent Policy should have an option to automatically unenroll INACTIVE agents #179399

Closed
6 of 13 tasks
nimarezainia opened this issue Mar 26, 2024 · 22 comments · Fixed by #189861
Labels: Feature:Fleet, QA:Validated, Team:Fleet

Comments

@nimarezainia
Contributor

nimarezainia commented Mar 26, 2024

In reference to the agent state machine in the docs, we already have mechanisms by which an agent goes OFFLINE and, after a user-configurable amount of time, is considered INACTIVE and removed from the default view. Users can filter to the INACTIVE view to see these agents and determine whether further analysis is required.

In an INACTIVE state, the API keys allocated to the agent are still valid in case that agent becomes active again. After analysis, the user can "select all" and unenroll these agents, at which point they are removed from view and all their API keys are relinquished.

In environments with ephemeral agents, where VMs/Containers are continuously provisioned and de-provisioned, this approach may lead to many agents in the INACTIVE state consuming API keys. In these environments we should provide the users an opt-in option to automatically unenroll and clean up the agents in the INACTIVE view.

Task list

  • Define a new per-policy setting: "Inactive agent unenrollment timeout", to be provided as a number in seconds
  • Create a new kibana task that polls agents regularly and creates an unenroll action if the timeout is reached. This should only apply to agents that are already in the INACTIVE state.
    • For scalability, execute the unenrollment in batches of 5k/10K agents to avoid putting ES under pressure and run the job frequently to remove all the agents in reasonable time
  • Make sure that Audit / Activity logs show when an unenrollment happens due to an agent policy configuration.
  • Update the docs
  • Show the new option in agent policy settings. This value should be deselected by default.
Inactive agent unenrollment timeout

If configured, inactive agents will be automatically unenrolled and their API keys will be invalidated after they've been inactive for this value in seconds. This can be useful for policies containing ephemeral agents, such as those in a Docker or Kubernetes environment.
  • Remove the deprecated Unenrollment timeout field from the UI to avoid confusion

Follow up

Original description
  • Provide a checkbox/toggle in the Agent Policy which is off by default.
  • When enabled, every 24hrs, force unenroll agents that are in an INACTIVE state.
  • Audit logs / Activity logs should show that this unenrollment is happening due to a configuration in the Agent Policy.
  • Unenrolled Agents should also remove all documents in .fleet* that relate to this agent
  • Ensure this gets documented well in product as well as documentation regarding Agent state transitions.
  • allow for cancellation between when agents are going to be unenrolled and when we actually unenroll them
cc: @kpollich
@nimarezainia nimarezainia added Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team labels Mar 26, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@elasticmachine
Contributor

Pinging @elastic/fleet (Feature:Fleet)

@kpollich kpollich changed the title [Fleet] Agent Policy hould have an option to automatically unenroll INACTIVE agents [Fleet] Agent Policy should have an option to automatically unenroll INACTIVE agents Mar 28, 2024
@kpollich
Member

@nimarezainia thanks for filing this. I think the old "unenroll timeout" feature worked better for ephemeral agents, while the newer "inactivity timeout" feature works substantially better for more permanent agents (e.g. employee laptops).

I think overall providing some opt-in value for automatic unenrollment on certain policies is the way to go. I don't think having a hardcoded 24h cutoff is the way to go, though. I think we should essentially replicate the old unenrollment timeout value as a per-policy setting and it should be provided as a number in seconds.

We still have the UI for the policy-level unenrollment setting, but it's deprecated and ignored unless Fleet Server is running on a version prior to 8.7.0, for context:

image

I'd recommend adding a new timeout setting below "Inactivity timeout" that's described as the "Inactive agent unenrollment timeout", e.g.

Inactive agent unenrollment timeout

If configured, inactive agents will be automatically unenrolled and their API keys will be invalidated after they've been inactive for this value in seconds. This can be useful for policies containing ephemeral agents, such as those in a Docker or Kubernetes environment.

We may also want to add some additional clarity or reword the deprecated "Unenrollment timeout" form field as well to avoid confusion.

We should also include detailed documentation for this value and its intended use.
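
To illustrate the proposed setting, here is a minimal sketch of how an optional "number in seconds" value could be parsed so that an empty field keeps the feature off. The `AgentPolicySettings` shape, its field names, and both helper functions are hypothetical and are not taken from Fleet's actual schema.

```typescript
// Hypothetical settings shape; field names are illustrative, not Fleet's schema.
interface AgentPolicySettings {
  inactivityTimeoutSeconds: number;
  // undefined means the policy has not opted in to automatic unenrollment.
  unenrollInactiveTimeoutSeconds: number | undefined;
}

// Parse the raw form value. An empty value leaves the feature disabled.
function parseUnenrollTimeout(raw: unknown): number | undefined {
  if (raw === undefined || raw === null || raw === "") {
    return undefined;
  }
  const seconds = Number(raw);
  // Reject NaN, infinities, fractions, zero, and negative values.
  if (!isFinite(seconds) || Math.floor(seconds) !== seconds || seconds <= 0) {
    throw new Error(
      "Inactive agent unenrollment timeout must be a positive whole number of seconds"
    );
  }
  return seconds;
}

// Return a copy of the settings with the parsed timeout applied.
function withUnenrollTimeout(
  settings: AgentPolicySettings,
  raw: unknown
): AgentPolicySettings {
  return {
    inactivityTimeoutSeconds: settings.inactivityTimeoutSeconds,
    unenrollInactiveTimeoutSeconds: parseUnenrollTimeout(raw),
  };
}
```

Treating "not configured" as `undefined` rather than `0` keeps the opt-in default explicit: a policy without a value can never trigger an unenrollment.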

@nimarezainia
Contributor Author

@kpollich Happy to have it as a configurable element - off by default.
It should only apply to agents that are already in the INACTIVE state.
Therefore, I agree with your suggestion for the title "Inactive agent unenrollment timeout" (move it up to be near Inactivity Timeout).

++ to Documentation obviously. Will add as a task.

@juliaElastic
Contributor

juliaElastic commented Jul 22, 2024

Agent removal should also remove all documents in .fleet* that relate to this agent

We don't remove documents currently even for unenrolled agents. I think with the auto unenroll also, the documents should stay with unenrolled state.

Also, I think we should remove the deprecated Unenrollment timeout field from the UI. It hasn't been supported for a long time now, and it might cause confusion.

Regarding the implementation, I think automatically moving agents from the inactive to the unenrolled state is not trivial. Currently, manual unenrollment is an action, so the same action would have to be triggered automatically in order for the action doc to be updated and the API keys revoked.
We would probably need a kibana task that polls agents regularly and creates an unenroll action if the inactive unenrollment timeout is reached.
We should consider behaviour at scale. For example, with 100k agents we don't want force unenroll to be triggered for all of them at the same time, putting a heavy load on ES. We could unenroll only a limited number of agents at a time (e.g. 5k or 10k) and let the job run frequently enough that all the agents are unenrolled within a reasonable time period. This is similar to how ES works when deleting inactive API keys with eventual consistency.
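
The batching idea can be sketched as a pure selection function that a periodic task might call on each run. This is only an illustration under stated assumptions: the `InactiveAgent` record, the `PolicyTimeouts` lookup, and `selectUnenrollBatch` are hypothetical names, and a real task would query agents from Elasticsearch rather than receive an in-memory array.

```typescript
// Hypothetical, simplified agent record; real Fleet documents carry more fields.
interface InactiveAgent {
  id: string;
  policyId: string;
  inactiveSinceMs: number; // epoch milliseconds at which the agent became INACTIVE
}

// Per-policy timeout in seconds. Policies without an entry have not opted in.
type PolicyTimeouts = Record<string, number | undefined>;

// Pick at most batchSize agents whose inactive period exceeds their policy's timeout.
function selectUnenrollBatch(
  inactiveAgents: InactiveAgent[],
  timeouts: PolicyTimeouts,
  nowMs: number,
  batchSize: number
): string[] {
  const batch: string[] = [];
  for (const agent of inactiveAgents) {
    if (batch.length >= batchSize) {
      break; // leave the remainder for the next run of the task
    }
    const timeoutSeconds = timeouts[agent.policyId];
    if (timeoutSeconds === undefined) {
      continue; // policy did not opt in to automatic unenrollment
    }
    if (nowMs - agent.inactiveSinceMs >= timeoutSeconds * 1000) {
      batch.push(agent.id);
    }
  }
  return batch;
}
```

Because each run handles a bounded number of agents, a large backlog is drained over several runs instead of in one burst, which is the eventual-consistency behaviour described above.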

@kpollich
Member

We should allow for cancellation between when agents are going to be unenrolled and when we actually unenroll them

@nimarezainia
Contributor Author

nimarezainia commented Jul 25, 2024

Also, I think we should remove the deprecated Unenrollment timeout field from the UI. It hasn't been supported for a long time now, and it might cause confusion.

Yes please.

Agent removal should also remove all documents in .fleet* that relate to this agent

We don't remove documents currently even for unenrolled agents. I think with the auto unenroll also, the documents should stay with unenrolled state.

This is actually a concern from the users, particularly in ephemeral environments. These documents could add up for INACTIVE agents and frankly take up space and that is what the customer is paying for.

@juliaElastic is it possible to remove these documents for unenrolled agents?

@nimarezainia
Contributor Author

We should allow for cancellation between when agents are going to be unenrolled and when we actually unenroll them

added this to the requirement.
How do you reckon we could achieve this? Perhaps in the agent activity flyout, each batch that is being unenrolled could be shown with a cancel option?

@juliaElastic
Contributor

is it possible to remove these documents for unenrolled agents?

It is possible, though at least we should add some delay until deleting them for debug/visibility purposes. We could do something like move from inactive to unenrolled when the inactive unenrollment timeout is reached, and then have another task clean up unenrolled agents after some time.
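
The two-stage idea (unenroll first, delete documents later) can be sketched as a small decision function. The status names, the `cleanupDelaySeconds` parameter, and `nextStep` are hypothetical and only illustrate the ordering of the two tasks; they do not describe an existing Fleet API.

```typescript
type AgentStatus = "inactive" | "unenrolled";
type NextStep = "none" | "unenroll" | "delete_documents";

// Hypothetical view of an agent: its status and when it entered that status.
interface AgentState {
  status: AgentStatus;
  statusSinceMs: number; // epoch milliseconds
}

// Decide what a periodic task should do with one agent.
function nextStep(
  agent: AgentState,
  unenrollTimeoutSeconds: number | undefined, // undefined: policy has not opted in
  cleanupDelaySeconds: number, // assumed grace period kept for debugging/visibility
  nowMs: number
): NextStep {
  const elapsedSeconds = (nowMs - agent.statusSinceMs) / 1000;
  if (agent.status === "inactive") {
    if (unenrollTimeoutSeconds === undefined) {
      return "none";
    }
    return elapsedSeconds >= unenrollTimeoutSeconds ? "unenroll" : "none";
  }
  // Unenrolled agents keep their documents until the cleanup delay has passed.
  return elapsedSeconds >= cleanupDelaySeconds ? "delete_documents" : "none";
}
```

Keeping the two thresholds separate means the unenrolled documents stay available for a while after the API keys are revoked.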

@nimarezainia
Contributor Author

is it possible to remove these documents for unenrolled agents?

It is possible, though at least we should add some delay until deleting them for debug/visibility purposes. We could do something like move from inactive to unenrolled when the inactive unenrollment timeout is reached, and then have another task clean up unenrolled agents after some time.

sure. I would say that the unenrolled agents should be cleaned up in this manner at all times. Modified the description to say this.

@criamico
Contributor

criamico commented Jul 29, 2024

@juliaElastic @nimarezainia
I'm reading through the previous discussion and trying to consolidate it into a comprehensive task list. Please let me know if I'm missing something:

  • Define a new per-policy setting: "Inactive agent unenrollment timeout", to be provided as a number in seconds
  • Create a new kibana task that polls agents regularly and creates an unenroll action if the timeout is reached. This should only apply to agents that are already in the INACTIVE state.
    • For scalability, execute the unenrollment in batches of 5k/10K agents to avoid putting ES under pressure and run the job frequently to remove all the agents in reasonable time
  • Make sure that Audit / Activity logs show when an unenrollment happens due to an agent policy configuration.
  • Update the docs

UI

  • Show the new option as a dropdown in agent policy settings. This value should be deselected by default.
Inactive agent unenrollment timeout

If configured, inactive agents will be automatically unenrolled and their API keys will be invalidated after they've been inactive for this value in seconds. This can be useful for policies containing ephemeral agents, such as those in a Docker or Kubernetes environment.
  • Remove the deprecated Unenrollment timeout field from the UI to avoid confusion

EDIT - To be done in separate tickets

  • Investigate cleaning up documents in .fleet* that relate to the unenrolled agents. For this we should schedule an additional task that cleans up the documents some time after the agents were unenrolled.
  • Investigate how to enable the user to cancel the action. This should be allowed between the time when agents are scheduled to be unenrolled and the time when we actually unenroll them
    • Resurface this option in the UI. How do we want to show the user the option to cancel the unenrollment for a batch? ( see this comment)

@nchaulet
Member

nchaulet commented Jul 29, 2024

Clean up documents in .fleet* that relate to the unenrolled agents. For this we should schedule an additional task that cleans up the documents some time after the agents were unenrolled.

Should this be configurable maybe in the kibana config? it's kind of a breaking change

@criamico
Contributor

criamico commented Jul 29, 2024

Clean up documents in .fleet* that relate to the unenrolled agents. For this we should schedule an additional task that cleans up the documents some time after the agents were unenrolled.

Should this be configurable maybe in the kibana config? it's kind of a breaking change

Maybe we can move this to a separate ticket and discuss it there? I think it can be done separately anyway; this ticket is already quite big.
I edited my previous comment to highlight that this could be broken off from this issue and investigated separately.

@juliaElastic
Contributor

I'm reading through the previous discussion and trying to consolidate it into a comprehensive task list. Please let me know if I'm missing something:

Good summary! I think only the cancellation behavior is not defined too well, maybe we should move that to a separate ticket as well?
I could imagine this working like scheduled upgrades: the unenroll bulk action batches are scheduled, they show up in Agent activity, and users have a chance to cancel from the UI. This doesn't sound like a small amount of work.
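
A rough sketch of that cancellation window follows, assuming each batch is stored with a scheduled execution time. `ScheduledUnenrollBatch`, `cancelBatch`, and `dueBatches` are hypothetical names for illustration; the actual behaviour was left undefined in this discussion.

```typescript
type BatchStatus = "scheduled" | "cancelled" | "executed";

// Hypothetical record for one scheduled unenroll batch.
interface ScheduledUnenrollBatch {
  id: string;
  agentIds: string[];
  executeAtMs: number; // epoch milliseconds at which the batch may run
  status: BatchStatus;
}

// Cancelling is only allowed while the batch is still waiting for its start time.
function cancelBatch(
  batch: ScheduledUnenrollBatch,
  nowMs: number
): ScheduledUnenrollBatch {
  if (batch.status !== "scheduled" || nowMs >= batch.executeAtMs) {
    throw new Error("Batch " + batch.id + " can no longer be cancelled");
  }
  return {
    id: batch.id,
    agentIds: batch.agentIds,
    executeAtMs: batch.executeAtMs,
    status: "cancelled",
  };
}

// Batches the task should execute now: still scheduled and past their start time.
function dueBatches(
  batches: ScheduledUnenrollBatch[],
  nowMs: number
): ScheduledUnenrollBatch[] {
  return batches.filter(function (b) {
    return b.status === "scheduled" && nowMs >= b.executeAtMs;
  });
}
```

The gap between scheduling a batch and `executeAtMs` is the window in which a user could cancel from the UI.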

@criamico
Contributor

I think only the cancellation behavior is not defined too well, maybe we should move that to a separate ticket as well?

I agree, that's not too clear to me either, and it will require some investigation to define how it should be done. Also, we don't have any UX for it yet.

@criamico
Contributor

Created two follow up tickets:

I'm also updating the ticket description to reflect the previous discussion.

@nimarezainia
Contributor Author

  • Show the new option as a dropdown in agent policy settings. This value should be deselected by default.

@criamico could this not be a text box accepting a timeout value?

@criamico
Contributor

criamico commented Jul 31, 2024

@nimarezainia here's a screenshot of how it will look:

Screenshot 2024-07-31 at 16 32 14

@kpollich
Member

kpollich commented Aug 2, 2024

Would it make sense to put this behind a feature flag until the cancellation work in #189508 is done? I don't think it makes sense to allow users to opt into this behavior without some means of cancellation before agents are actually unenrolled.

@nimarezainia
Contributor Author

Would it make sense to put this behind a feature flag until the cancellation work in #189508 is done? I don't think it makes sense to allow users to opt into this behavior without some means of cancellation before agents are actually unenrolled.

I'm not sure we need to have all the pieces of this puzzle together before the feature can be used. I'm also thinking about the cleaning up of the dot indices. The offline agents clog up the UI, so this timeout would help the user who is concerned with that.

Cancellation is needed but I think can be treated as a follow on enhancement.

@harshitgupta-qasource

Hi Team,

We have executed 05 testcases under the Feature test run for the 8.16.0 release at the link:

Status:

PASS: 05

Build details:
VERSION: 8.16.0 BC2
BUILD: 79434
COMMIT: 59220e9

As the testing is completed on this feature, we are marking this as QA:Validated.

Please let us know if anything else is required from our end.
Thanks

@harshitgupta-qasource harshitgupta-qasource added QA:Validated Issue has been validated by QA and removed QA:Needs Validation Issue needs to be validated by QA labels Oct 28, 2024

8 participants