Consul key size limit hit with many identity entries #5597

Closed
maf23 opened this issue Oct 24, 2018 · 8 comments

@maf23
Contributor

maf23 commented Oct 24, 2018

Describe the bug
The AWS auth method creates a new entity in the identity subsystem for each new host that authenticates. These entities are never reaped and will eventually hit the limits of the storage engine. When this happens, Vault tends to crash and becomes unusable.

To Reproduce

  1. Authenticate a new machine via the AWS auth method (we used the ec2 sub method).
  2. Repeat step 1 approximately 900,000 times (I found 899767 unique IDs in my identity database when the system was in the broken state)

Expected behavior
I expected the identities to be reaped when the associated tokens expired or, alternatively, that Vault would not create a new identity entity for each new machine authenticating.

Environment:

  • Vault Server Version: 0.10.3
  • Vault CLI Version: 0.10.3
  • Server Operating System/Architecture: Linux
  • Storage backend: consul

Vault server configuration file(s):

The aws auth method configuration

# vault read auth/aws-ec2/role/problem
Key                               Value
---                               -----
allow_instance_migration          false
auth_type                         ec2
bound_account_id                  [REDACTED]
bound_ami_id                      []
bound_ec2_instance_id             <nil>
bound_iam_instance_profile_arn    []
bound_iam_principal_arn           []
bound_iam_principal_id            []
bound_iam_role_arn                []
bound_region                      []
bound_subnet_id                   []
bound_vpc_id                      []
disallow_reauthentication         false
inferred_aws_region               n/a
inferred_entity_type              n/a
max_ttl                           0
period                            0
policies                          [default gen_activities REDACTED]
resolve_aws_unique_ids            false
role_tag                          vault_role
ttl                               604800

A typical identity entry

# vault read identity/entity/id/ff6b7c9b-4253-1798-b74a-34f4bf69b032
Key                    Value
---                    -----
aliases                [map[mount_type:aws-ec2 id:REDACTED last_update_time:2018-08-09T23:23:21.953980984Z merged_from_canonical_ids:<nil> metadata:<nil> mount_accessor:auth_aws-ec2_c9f5b437 mount_path:auth/aws-ec2/ canonical_id:REDACTED creation_time:2018-08-09T23:23:21.953980984Z name:i-deadbeef]]
creation_time          2018-08-09T23:23:21.953976383Z
direct_group_ids       []
disabled               false
group_ids              []
id                     REDACTED
inherited_group_ids    []
last_update_time       2018-08-09T23:23:21.953976383Z
merged_entity_ids      <nil>
metadata               <nil>
name                   entity_38c7a0cf
policies               <nil>

Additional context
It has taken us a while to build up this number of identities, but with lots of autoscaling we eventually got there. Each new machine has a unique instance ID, which is why each one gets a new entry in the identity subsystem. In our case the machines normally live for just a few hours and the tokens expire after 7 days.

Identifying the root cause of the issue was not easy. There was nothing in the Vault server logs or on stdout/stderr, so our only clue was the error message returned to clients when they tried to authenticate: {"errors":["failed to persist packed storage entry: Unexpected response code: 413 (Value exceeds 524288 byte limit)"]}
There was no indication of which kind of storage entry was hitting that limit, and I had to modify the source to find out.

Getting the list of IDs in the identity database was also hard, since the server would crash when I asked for it via the API. I had to modify the source to print the list to the log as it was generated.

Vault seems to spread these entities among 256 buckets, which are stored in Consul. The error occurs when a storage bucket would grow beyond 512 KiB (524288 bytes), which is the maximum size of a value in Consul.
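
A rough back-of-the-envelope check of the numbers above, as a sketch only. It assumes the entities are spread evenly over the 256 buckets, and the MD5-first-byte `bucket_index` function is a hypothetical stand-in for whatever hash Vault actually uses; neither assumption comes from this report.

```python
import hashlib

CONSUL_VALUE_LIMIT = 524288   # bytes, taken from the 413 error message
BUCKET_COUNT = 256            # number of buckets reported above
ENTITY_COUNT = 899767         # unique IDs found in the broken state


def bucket_index(entity_id: str) -> int:
    """Hypothetical bucket assignment: first byte of the MD5 digest (0..255)."""
    return hashlib.md5(entity_id.encode("utf-8")).digest()[0]


def average_bytes_per_entity(entities: int = ENTITY_COUNT) -> float:
    """Average stored size per entity at which an evenly filled bucket hits the limit."""
    entities_per_bucket = entities / BUCKET_COUNT
    return CONSUL_VALUE_LIMIT / entities_per_bucket


if __name__ == "__main__":
    print(f"entities per bucket: {ENTITY_COUNT / BUCKET_COUNT:.0f}")
    print(f"bytes per entity at the limit: {average_bytes_per_entity():.0f}")
```

Under these assumptions each bucket holds roughly 3,500 entities, so the limit is reached at an average of about 150 stored bytes per entity.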

@jefferai changed the title from "The AWS auth method is a time-delayed bomb and will eventually crash your valt" to "Consul key size limit hit with many identity entries" on Oct 24, 2018
@jefferai
Member

One of the many benefits of the IAM method is that it has values that can be reused. See https://www.vaultproject.io/api/auth/aws/index.html#configure-identity-integration

@jefferai
Member

Putting on 1.0 milestone for discussion but will likely slip.

@vishalnayak
Contributor

vishalnayak commented Nov 20, 2018

@maf23

While the underlying problem is that the storage backend imposes a size limit, there could be ways to delay reaching this limit while still satisfying the needs of the workflows. The primary cause in this case is that each EC2 instance with a new alias ID (instanceId) resulted in a new entity being created in Vault.

What if there were options to configure which property of the instance gets used as the alias ID? For example, if the imageId (curl http://169.254.169.254/latest/dynamic/instance-identity/document) is used as an alias ID, then all the instances that get spun up using a specific image ID will result in a single entity. There could be other properties that could allow reusing them, thereby reducing the number of entities in Vault. Such a configuration currently is available for the IAM type, but it isn't yet for the EC2 type.

@maf23
Contributor Author

maf23 commented Nov 21, 2018

If we had an option to add them to a common entry, for example based on ImageId, would Vault not just create a new alias entry for each machine? That would basically have the same problem. Or have I misunderstood how it would work?

If the aliasing is not a problem, then being able to group them according to an instance property like ImageId or VpcId would solve the problem for us.

@vishalnayak
Contributor

@maf23 If ImageID is used as the alias ID, then there would only be one alias per AMI, regardless of the number of instances that are spun up using that same AMI.

The changes to enable this feature are being made here: #5846
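
To make the difference concrete, here is a small sketch, not Vault code. It models the assumption that one entity exists per distinct alias ID and counts how many entities a fleet would produce under each choice of alias field. The `count_entities` helper and the sample fleet are hypothetical; `instanceId` and `imageId` are fields of the instance identity document.

```python
from typing import Dict, Iterable


def count_entities(documents: Iterable[Dict[str, str]], alias_field: str) -> int:
    """Count distinct alias IDs, assuming one entity per distinct alias ID."""
    return len({doc[alias_field] for doc in documents})


# Hypothetical fleet: 1000 short-lived instances launched from two AMIs.
fleet = [
    {"instanceId": f"i-{n:08x}", "imageId": "ami-aaaa" if n % 2 else "ami-bbbb"}
    for n in range(1000)
]

if __name__ == "__main__":
    print(count_entities(fleet, "instanceId"))  # one entity per instance
    print(count_entities(fleet, "imageId"))     # one entity per AMI
```

In this model the entity count grows with every new instance when the alias is the instance ID, but stays bounded by the number of AMIs when the alias is the image ID.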

@jefferai jefferai modified the milestones: 1.0, 1.0.1 Dec 3, 2018
@jefferai jefferai modified the milestones: 1.0.1, 1.0.2 Dec 12, 2018
@chrishoffman chrishoffman modified the milestones: 1.0.2, 1.0.3 Jan 7, 2019
@jefferai jefferai modified the milestones: 1.0.3, 1.1 Feb 1, 2019
@briankassouf briankassouf modified the milestones: 1.1, 1.1.1 Mar 1, 2019
@jefferai jefferai modified the milestones: 1.1.1, 1.1.2 Apr 10, 2019
@briankassouf briankassouf modified the milestones: 1.1.2, 1.1.3 Apr 29, 2019
@briankassouf briankassouf modified the milestones: 1.1.3, 1.2 May 21, 2019
@jefferai jefferai modified the milestones: 1.2, 1.3 Jul 2, 2019
@lattwood

How can we work around this issue?

@jefferai
Member

You can upgrade to Consul 1.5.3 and increase the key value size limit or switch to the integrated Raft storage.

Closing, as this is now handled by those solutions, and it's now easier (and the default) to avoid hitting this limit.

@lattwood

@jefferai thanks!
