AgentOutputs migration runs indefinitely under high scale and breaks Fleet Server #1958

Closed
joshdover opened this issue Oct 6, 2022 · 2 comments
Labels
blocker bug Something isn't working v8.5.0

Comments

@joshdover
Contributor

joshdover commented Oct 6, 2022

  • Version: 8.5.0+
  • Steps to Reproduce:
    1. Enroll a large number of agents
    2. Restart the Fleet Server
    3. Observe constant "migration AgentOutput done" logs
    4. Fleet Server hangs at this point and cannot fully start up or accept requests from Agents


The migration runs in a loop whenever there are version conflicts on the agents it tried to update. At scale this can happen constantly, because the online agents check in frequently.
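A minimal sketch of that retry behavior, not the actual Fleet Server code: `runMigration`, `pass`, and `maxPasses` are hypothetical names, and the pass cap exists only so the sketch terminates (the behavior described above has no such cap).

```go
package main

import "fmt"

// runMigration repeats a migration pass until a pass reports zero version
// conflicts, or until maxPasses is reached. pass is a stand-in for one
// update round and returns the number of version conflicts it hit.
// maxPasses is an assumption added so this sketch always terminates.
func runMigration(pass func() int, maxPasses int) (passes int, done bool) {
	for passes < maxPasses {
		passes++
		if pass() == 0 {
			return passes, true
		}
	}
	return passes, false
}

func main() {
	// Healthy case: each pass selects only unmigrated agents, so the
	// backlog shrinks and the conflicts eventually stop.
	remaining := 3
	shrinking := func() int {
		if remaining > 0 {
			remaining--
		}
		return remaining
	}
	passes, done := runMigration(shrinking, 100)
	fmt.Println(passes, done) // 3 true

	// Failure case: every pass selects all agents again, so concurrent
	// checkins keep producing conflicts and the loop never finishes.
	always := func() int { return 5 }
	passes, done = runMigration(always, 100)
	fmt.Println(passes, done) // 100 false
}
```

The second case is the one reported here: the set of agents selected per pass never shrinks, so there is always something for a checkin to conflict with.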

The loop usually resolves itself eventually, because the migration code races the Agent checkins. In this case, however, it never completes, because the migration query returns all agents, even the ones that are already on the new schema:

query.Query().Bool().MustNot().Exists(fieldOutputs)
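For reference, a sketch of the Elasticsearch request body that a "must not exists" query of this kind corresponds to. This builds the JSON directly with `encoding/json` rather than with the Fleet Server query builder; `mustNotExistsQuery` is a hypothetical helper, and `"outputs"` is assumed to be the value of `fieldOutputs`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mustNotExistsQuery builds a search body that selects documents in which
// the given field does not exist (bool > must_not > exists).
func mustNotExistsQuery(field string) ([]byte, error) {
	body := map[string]any{
		"query": map[string]any{
			"bool": map[string]any{
				"must_not": map[string]any{
					"exists": map[string]any{"field": field},
				},
			},
		},
	}
	return json.Marshal(body)
}

func main() {
	// "outputs" is an assumed value for fieldOutputs.
	b, err := mustNotExistsQuery("outputs")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// {"query":{"bool":{"must_not":{"exists":{"field":"outputs"}}}}}
}
```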

The query returns all agents because the mapping for the outputs field is:

        "outputs": {
          "properties": {
            "api_key": {
              "type": "keyword"
            },
            "api_key_id": {
              "type": "keyword"
            },
            "policy_permissions_hash": {
              "type": "keyword"
            },
            "to_retire_api_key_ids": {
              "type": "object",
              "enabled": false
            },
            "type": {
              "type": "keyword"
            }
          }
        },

But the migrated agent documents look like this:

          "outputs": {
            "default": {
              "api_key": "",
              "permissions_hash": "",
              "type": "elasticsearch",
              "to_retire_api_key_ids": [
                {
                  "retired_at": "2022-10-06T11:04:57Z",
                  "id": ""
                }
              ],
              "api_key_id": ""
            }
          },

Notice the extra default key between outputs and api_key. When doing a "must not exists" query in ES, ES finds no indexed fields under outputs, because nothing under default is indexed. As a result, all agents are returned every time.
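The mismatch can be illustrated with a deliberately simplified model of how `exists` behaves on an object field. This is not how Elasticsearch is implemented; it assumes that only explicitly mapped leaf fields are indexed (no dynamic mapping for the unmapped `default` level), and `leafPaths` and `existsMatches` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// leafPaths flattens a nested document into dotted leaf paths,
// e.g. {"outputs": {"default": {"type": "x"}}} -> "outputs.default.type".
func leafPaths(prefix string, v any, out map[string]bool) {
	if m, ok := v.(map[string]any); ok {
		for k, child := range m {
			p := k
			if prefix != "" {
				p = prefix + "." + k
			}
			leafPaths(p, child, out)
		}
		return
	}
	out[prefix] = true
}

// existsMatches models `exists` on an object field: it reports true only if
// the document has a value at one of the mapped (indexed) leaf paths under
// that field. Unmapped paths are assumed not to be indexed.
func existsMatches(field string, mapped []string, doc map[string]any) bool {
	leaves := map[string]bool{}
	leafPaths("", doc, leaves)
	for _, m := range mapped {
		if strings.HasPrefix(m, field+".") && leaves[m] {
			return true
		}
	}
	return false
}

func main() {
	// Indexed leaf fields from the mapping above (to_retire_api_key_ids is
	// "enabled": false, so it is left out).
	mapped := []string{
		"outputs.api_key", "outputs.api_key_id",
		"outputs.policy_permissions_hash", "outputs.type",
	}
	// Shape of a migrated agent document, with the extra "default" level.
	migrated := map[string]any{
		"outputs": map[string]any{
			"default": map[string]any{"api_key_id": "id", "type": "elasticsearch"},
		},
	}
	fmt.Println(existsMatches("outputs", mapped, migrated)) // false
}
```

Under this model the migrated document never satisfies `exists` on outputs, so a "must not exists" query selects it on every pass.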

I see a few paths forward to solve this:

  1. Change the query to look for agents that have the old fields present, rather than agents that are missing the new fields. I think this is a good idea anyway, in case we ever have an agent that has no output and no old field either.
  2. Fix the mappings by using a dynamic_template to fit the schema of the documents correctly
    • This would require fewer changes than the next option, but I don't like it because it results in more mapped fields than necessary
  3. Change the shape of the documents to an array and use a nested field mapping
    • This requires more changes to Fleet Server and we're getting close to the 8.5 release.
    • Results in fewer mapped fields, better long-term option
    • Migrating to this later will be more work and risk overall, IMO
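Option 1 could look roughly like the sketch below: select agents where at least one of the old fields still exists. The field names `default_api_key` and `default_api_key_id` are assumptions used for illustration (the issue does not name the old fields), and `hasAnyFieldQuery` is a hypothetical helper that builds the JSON body directly instead of using the Fleet Server query builder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasAnyFieldQuery builds a search body that selects documents in which at
// least one of the given fields exists (bool > should > exists, with
// minimum_should_match set to 1).
func hasAnyFieldQuery(fields []string) ([]byte, error) {
	clauses := make([]map[string]any, 0, len(fields))
	for _, f := range fields {
		clauses = append(clauses, map[string]any{
			"exists": map[string]any{"field": f},
		})
	}
	body := map[string]any{
		"query": map[string]any{
			"bool": map[string]any{
				"should":               clauses,
				"minimum_should_match": 1,
			},
		},
	}
	return json.Marshal(body)
}

func main() {
	// Assumed names for the pre-migration fields.
	oldFields := []string{"default_api_key", "default_api_key_id"}
	b, err := hasAnyFieldQuery(oldFields)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because the old fields are plain mapped fields, this selection shrinks as agents are migrated, provided the migration also removes the old fields from each document.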

In summary, I think it would be best to do 1 and 3.

@joshdover joshdover added bug Something isn't working blocker v8.5.0 labels Oct 6, 2022
@joshdover
Contributor Author

The plan right now is to do (1) + remove the invalid mappings from ES for the 8.5.0 release. This is the least risky option.

@pjbertels

Tested in 8.5 SNAPSHOT c52257ee; looks good. Verified that the migration-related debug logs (~70 logs) appear only at the beginning.
