The List method in the etcd backend does not work as expected #23784

Closed
khodyrevyurii opened this issue Oct 22, 2023 · 9 comments · Fixed by #23872
Labels
reproduced (This issue has been reproduced by a Vault engineer), storage/etcd

Comments

@khodyrevyurii

khodyrevyurii commented Oct 22, 2023

We encountered the error described in issue 3772, but in our case it appeared only after we exceeded 1.4 million active tokens.

Describe the bug
In the current implementation, the List method always issues a Get request to etcd that returns both keys and values.
We looked at the List implementations in other backends and noticed that they fetch only the keys.

We noticed that before the problem occurs, Vault performs a list query on the path /vault/sys/expire/id/, which fails after a while.
In the logs we see an error:

{"level":"warn","ts":"2023-10-17T22:58:28.612+0500","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-763c3edf-a809-4d6f-8146-a060dc754200/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2510660365 vs. 2147483647)"}
Error: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2510660365 vs. 2147483647)
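To put the two numbers in the error into perspective, here is a small arithmetic sketch. The limit 2147483647 equals the largest 32-bit signed integer. The per-entry average is only a rough estimate and assumes about 1.4 million entries under the listed prefix, which is the token count mentioned above, not a measured figure. The helper name exceedsLimit is made up for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// exceedsLimit reports whether a message of `size` bytes is larger than
// `limit` and, if so, by how many bytes. Hypothetical helper for illustration.
func exceedsLimit(size, limit int64) (bool, int64) {
	if size > limit {
		return true, size - limit
	}
	return false, 0
}

func main() {
	const attempted = int64(2510660365) // message size from the log line
	const limit = int64(math.MaxInt32)  // 2147483647, the "max" in the log line

	over, by := exceedsLimit(attempted, limit)
	fmt.Printf("over limit: %v, by %d bytes (%.2f GiB attempted)\n",
		over, by, float64(attempted)/(1<<30))

	// Assumption: roughly 1.4 million entries were returned in one response.
	const entries = 1_400_000
	fmt.Printf("average bytes per entry: ~%d\n", attempted/entries)
}
```

If that assumption holds, each entry averages a bit under 2 KB of key plus value, so the values account for most of the response.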

To Reproduce
Create 1.5 million tokens and restart Vault

for i in {1..1500000}; do VAULT_ADDR="https://127.0.0.1:8200" VAULT_TOKEN="{{token_for_vault}}" VAULT_SKIP_VERIFY="true" vault token create; done

Expected behavior
Vault starts up without errors

Environment:

  • Vault Version: Vault v1.9.3
  • etcd Version: 3.3.19
  • Operating System/Architecture: Ubuntu 18.04

Vault server configuration file(s):

{
  "storage": {
    "etcd": {
      "address": "https://127.0.0.1:2379",
      "tls_ca_file": "${SSL}/ca.crt",
      "tls_cert_file": "${SSL}/root.crt",
      "tls_key_file": "${SSL}/root.key",
      "etcd_api": "v3",
      "request_timeout": "20s"
    }
  },
  "listener": {
    "tcp": {
      "address": "127.0.0.1:8200",
      "tls_disable_client_certs": true,
      "tls_ca_file": "${SSL}/ca.crt",
      "tls_cert_file": "${SSL}/root.crt",
      "tls_key_file": "${SSL}/root.key"
    }
  },
  "default_lease_ttl": "4h",
  "disable_mlock": true,
  "ui": true
}

Additional context

My colleagues and I think that the call at https://github.com/hashicorp/vault/blob/v1.9.3/physical/etcd/etcd3.go#L238
should be:

...
	resp, err := c.etcd.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithKeysOnly())
...
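
To illustrate why keys alone should be enough for a list operation, here is a minimal, self-contained sketch of turning raw storage keys into list entries. It is not the Vault source: the function name listKeys and the sample keys are hypothetical, and the real backend works on the etcd response rather than on a string slice.

```go
package main

import (
	"fmt"
	"strings"
)

// listKeys is a hypothetical helper (not Vault's code). Given the raw keys
// stored under prefix, it returns direct children as-is and collapses
// deeper paths into a single "child/" entry. No values are needed.
func listKeys(prefix string, rawKeys []string) []string {
	seenDirs := make(map[string]bool)
	entries := []string{}
	for _, raw := range rawKeys {
		if !strings.HasPrefix(raw, prefix) {
			continue // key is outside the listed prefix
		}
		key := strings.TrimPrefix(raw, prefix)
		if key == "" {
			continue // the prefix itself is not an entry
		}
		i := strings.Index(key, "/")
		if i == -1 {
			entries = append(entries, key) // direct child
			continue
		}
		dir := key[:i+1]
		if !seenDirs[dir] {
			seenDirs[dir] = true
			entries = append(entries, dir) // first time this subdirectory is seen
		}
	}
	return entries
}

func main() {
	// Sample keys are made up for illustration.
	raw := []string{
		"/vault/sys/expire/id/auth/token/create/abc",
		"/vault/sys/expire/id/auth/token/create/def",
		"/vault/sys/expire/id/standalone",
	}
	fmt.Println(listKeys("/vault/sys/expire/id/", raw))
}
```

Nothing in this logic reads a value, which is why requesting keys only looks sufficient for List.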
@raskchanky
Contributor

Hi @khodyrevyurii

Thanks for reporting this! I've been trying to reproduce the problem on my end but I keep running into a database space exceeded error from etcd. I'm not really an etcd expert, so I'm assuming I've done something wrong with my etcd config. Any chance you can share the etcd config you're using?

@khodyrevyurii
Author

Hi @raskchanky, sorry I didn't think of that right away.

For local testing, we used the following etcd configuration:

  ETCDCTL_API=3 \
  ETCD_CLIENT_CERT_AUTH=true \
  ETCD_CERT_FILE="$SSL/root.crt" \
  ETCD_KEY_FILE="$SSL/root.key" \
  ETCD_TRUSTED_CA_FILE="$SSL/ca.crt" \
  ETCD_LISTEN_CLIENT_URLS="https://127.0.0.1:2379" \
  ETCD_ADVERTISE_CLIENT_URLS="https://127.0.0.1:2379" \
  etcd \
    --name vault \
    --initial-cluster backup=https://127.0.0.1:2380 \
    --initial-cluster-token vault \
    --initial-advertise-peer-urls https://127.0.0.1:2380 \
    --quota-backend-bytes 21474836480 \
    --data-dir etcd &> ./etcd.log &

Vault configuration

cat > ./vault-config.json <<EOF
{
  "storage": {
    "etcd": {
      "address": "https://127.0.0.1:2379",
      "tls_ca_file": "${SSL}/ca.crt",
      "tls_cert_file": "${SSL}/root.crt",
      "tls_key_file": "${SSL}/root.key",
      "etcd_api": "v3",
      "request_timeout": "20s"
    }
  },
  "listener": {
    "tcp": {
      "address": "127.0.0.1:8200",
      "tls_disable_client_certs": true,
      "tls_ca_file": "${SSL}/ca.crt",
      "tls_cert_file": "${SSL}/root.crt",
      "tls_key_file": "${SSL}/root.key"
    }
  },
  "default_lease_ttl": "4h",
  "disable_mlock": true,
  "ui": true
}
EOF
vault server -config=vault-config.json &> ./vault.log &

@raskchanky
Contributor

@khodyrevyurii Thank you! I'll have another go at this. I think the --quota-backend-bytes bit was the piece I was missing.

@khodyrevyurii
Author

> I think the --quota-backend-bytes bit was the piece I was missing.

Yes, that's it. etcd has only a few parameters that can be tuned. This one limits the size of the database, both in RAM and on disk.
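
For reference, the quota value from the etcd command above converts as follows. This is only a unit conversion, and the helper name bytesToGiB is made up for the example.

```go
package main

import "fmt"

// gib is the number of bytes in one gibibyte.
const gib = 1 << 30

// bytesToGiB converts a byte count to GiB. Hypothetical helper for illustration.
func bytesToGiB(b int64) float64 {
	return float64(b) / gib
}

func main() {
	const quota = int64(21474836480) // value passed to --quota-backend-bytes
	fmt.Printf("quota: %.0f GiB\n", bytesToGiB(quota))
}
```

So the test instance was allowed to grow to 20 GiB, well above the roughly 2 GiB default quota described in the etcd documentation.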

I will wait for confirmation of this issue so that I can start using the suggested changes without fear.

Thanks in advance

@raskchanky added the "reproduced" label (This issue has been reproduced by a Vault engineer) on Oct 26, 2023
@raskchanky
Contributor

@khodyrevyurii A quick update for you. I believe I've succeeded in reproducing your problem. Here's an excerpt of the logs from my Vault server:

[Screenshot of the Vault server log excerpt: CleanShot 2023-10-26 at 14 31 55]

The bad news concerns the line of code that you linked in your original bug report: https://github.com/hashicorp/vault/blob/v1.9.3/physical/etcd/etcd3.go#L238

Changing that line to add clientv3.WithKeysOnly(), as you suggest, does not solve the problem; the same error still occurs. I will tinker with this a bit more and see if I can figure something else out. Thanks for your patience!

@raskchanky
Contributor

@khodyrevyurii FWIW though, the list call itself took 21.4s on my MBP, so with a request timeout of 20s in the config, it was just barely too short.

@raskchanky
Contributor

At the very least, I don't think the suggestion you've made is incorrect, as List() should only operate on keys anyway. I can open a PR for the change.

@khodyrevyurii
Author

> the list call itself took 21.4s on my MBP, so with a request timeout of 20s in the config, it was just barely too short.

Hi, @raskchanky.

Sorry, I forgot to clarify that when a context deadline error occurs, we usually raise the parameter to "request_timeout": "60s", or higher if that is still insufficient.

In our case, the problem occurs because Vault tries to read all keys and values at once, and the response exceeds the maximum gRPC message size, resulting in the error: grpc: trying to send message larger than max (2510660365 vs. 2147483647)

This is why we started looking into the problem: it seemed to us that the response etcd generates for a List() call should not exceed 2 GB.

@raskchanky
Contributor

@khodyrevyurii Got it, thanks for clarifying. My PR should merge today, so hopefully that helps.
