Memory ballooning issues with standby instance in v0.9.x #3798

Closed
TimJones opened this issue Jan 16, 2018 · 11 comments · Fixed by #3893

@TimJones

TimJones commented Jan 16, 2018

Environment:

  • Vault Version: Vault v0.9.1 ('87b6919dea55da61d7cd444b2442cabb8ede8ab1')
  • Operating System/Architecture: Kubernetes Deployment w/ official Docker image vault:0.9.1

Vault Config File:

storage "etcd" {
  etcd_api = "v3"
  address = "http://vault-backend-client:2379"
  ha_enabled = "true"
  redirect_addr = "http://vault.devops-tools.svc.cluster.local"
}
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_cert_file = "/vault/tls/tls.crt"
  tls_key_file = "/vault/tls/tls.key"
}
telemetry {
  statsd_address = "localhost:9125"
  disable_hostname = true
}

Startup Log Output:
Standby instance:

==> Vault server configuration:

                     Cgo: disabled
         Cluster Address: https://100.96.12.3:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", tls: "enabled")
               Log Level: 
                   Mlock: supported: true, enabled: true
        Redirect Address: http://vault.devops-tools.svc.cluster.local
                 Storage: etcd (HA available)
                 Version: Vault v0.9.1
             Version Sha: 87b6919dea55da61d7cd444b2442cabb8ede8ab1

==> Vault server started! Log data will stream in below:

2018/01/16 08:13:18.695251 [INFO ] core: vault is unsealed
2018/01/16 08:13:18.695750 [INFO ] core: entering standby mode

Active instance:

==> Vault server configuration:

                     Cgo: disabled
         Cluster Address: https://100.96.11.4:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", tls: "enabled")
               Log Level: 
                   Mlock: supported: true, enabled: true
        Redirect Address: http://vault.devops-tools.svc.cluster.local
                 Storage: etcd (HA available)
                 Version: Vault v0.9.1
             Version Sha: 87b6919dea55da61d7cd444b2442cabb8ede8ab1

==> Vault server started! Log data will stream in below:

2018/01/15 16:07:12.460791 [INFO ] core: vault is unsealed
2018/01/15 16:07:12.460956 [INFO ] core: entering standby mode
2018/01/15 16:07:15.269155 [INFO ] core: acquired lock, enabling active operation
2018/01/15 16:07:15.396505 [INFO ] core: post-unseal setup starting
2018/01/15 16:07:15.399116 [INFO ] core: loaded wrapping token key
2018/01/15 16:07:15.399366 [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018/01/15 16:07:15.405728 [INFO ] core: successfully mounted backend: type=kv path=secret/
2018/01/15 16:07:15.406413 [INFO ] core: successfully mounted backend: type=system path=sys/
2018/01/15 16:07:15.406453 [INFO ] core: successfully mounted backend: type=generic path=concourse/
2018/01/15 16:07:15.406707 [INFO ] core: successfully mounted backend: type=pki path=spotahome-ca/
2018/01/15 16:07:15.407726 [INFO ] core: successfully mounted backend: type=identity path=identity/
2018/01/15 16:07:15.407864 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018/01/15 16:07:15.424534 [INFO ] expiration: restoring leases
2018/01/15 16:07:15.424767 [INFO ] rollback: starting rollback manager
2018/01/15 16:07:15.442339 [INFO ] identity: entities restored
2018/01/15 16:07:15.454374 [INFO ] identity: groups restored
2018/01/15 16:07:15.459410 [INFO ] core: post-unseal setup complete
2018/01/15 16:07:15.459775 [INFO ] core/startClusterListener: starting listener: listener_address=0.0.0.0:8201
2018/01/15 16:07:15.459995 [INFO ] core/startClusterListener: serving cluster requests: cluster_listen_address=[::]:8201
2018/01/15 16:07:15.461689 [INFO ] expiration: lease restore complete
2018/01/15 17:22:10 http2: server: error reading preface from client 100.96.21.84:35748: read tcp4 100.96.11.4:8200->100.96.21.84:35748: read: connection reset by peer
2018/01/15 18:01:09 http2: server: error reading preface from client 100.96.21.84:52344: read tcp4 100.96.11.4:8200->100.96.21.84:52344: read: connection reset by peer

Expected Behavior:
A standby instance should be able to run without continually consuming more memory.

Actual Behavior:
Standby instances gradually consume more memory until they reach a preset limit and fail with OOM.
[graph: standby instance memory usage over time, climbing until the instance is OOM-killed and restarted]
You can see that when the instance gets restarted due to memory consumption, it returns to a baseline level and doesn't increase. I believe this correlates with the standby instance forwarding requests while in standby mode, and refusing connections while sealed.

Memory use of the active instance over the same time period:
[graph: active instance memory usage over the same time period]

Steps to Reproduce:
Deploy an HA Vault cluster in Kubernetes.

Important Factoids:
Running on a Kubernetes cluster. Each instance has an imposed 200MB memory limit, which is more than enough for the active instance to work.
Also observed (but not recorded) with v0.9.0.

@TimJones
Author

Just to add, the below shows quite definitively that the memory ballooning only begins when the sealed instance is unsealed & begins forwarding requests to the active instance.
[graph: standby instance memory usage, flat while sealed and ballooning after unsealing]

@jefferai
Member

What is the "still" part of the title here? This issue isn't ringing any bells.

@TimJones
Author

From the v0.9.1 CHANGELOG:

core: Fix memory ballooning when a connection would connect to the cluster port and then go away -- redux! [GH-3680]

I don't know if my client connections are classified as 'going away', but the memory ballooning seems to happen only when the instance is in standby and forwarding client requests to the active instance.

@jefferai
Member

That's a separate issue where a connection would be made, fail to authenticate, and then drop, due to status checks being run against the port.

@TimJones
Author

Understood! Should I close & re-open, or just fix the title?

@jefferai
Member

Fixing the title is fine!

@TimJones TimJones changed the title Still have memory ballooning issues with 0.9.1 Memory ballooning issues with 0.9.x Jan 16, 2018
@TimJones TimJones changed the title Memory ballooning issues with 0.9.x Memory ballooning issues with standby instance in v0.9.x Jan 16, 2018
@briankassouf
Contributor

briankassouf commented Jan 16, 2018

After looking into this a little I believe it's an issue with the etcd v3 storage backend. @xiang90, wondering if you have any ideas here.

Every forwarded request first calls core.Leader() to check whether the node is the leader. Leader() calls the LockWith() function, which in etcd's case creates a new concurrency.Session. These sessions are not being closed.
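
For illustration, the pattern being described looks roughly like the sketch below. This is not Vault's actual code; the names (etcdLock, newEtcdLock) are hypothetical, but the clientv3/concurrency calls are the real etcd client API. Each concurrency.Session grants an etcd lease and runs a keep-alive goroutine, so creating one per forwarded leadership check without ever calling Close() accumulates memory on the standby.

package main

import (
    "log"

    "github.com/coreos/etcd/clientv3"
    "github.com/coreos/etcd/clientv3/concurrency"
)

// etcdLock is a hypothetical stand-in for the HA lock object described above.
type etcdLock struct {
    session *concurrency.Session
    mutex   *concurrency.Mutex
}

// newEtcdLock creates a brand-new session on every call. The session starts a
// lease keep-alive goroutine; if nobody ever calls session.Close(), the
// session and its goroutine live for as long as the client does.
func newEtcdLock(client *clientv3.Client, key string) (*etcdLock, error) {
    session, err := concurrency.NewSession(client) // leaked: Close() is never called
    if err != nil {
        return nil, err
    }
    return &etcdLock{
        session: session,
        mutex:   concurrency.NewMutex(session, key),
    }, nil
}

func main() {
    client, err := clientv3.New(clientv3.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Simulates a standby checking leadership on every forwarded request:
    // each iteration leaves an unclosed session (and its lease) behind.
    for i := 0; i < 1000; i++ {
        if _, err := newEtcdLock(client, "core/lock"); err != nil {
            log.Fatal(err)
        }
    }
}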

@xiang90
Contributor

xiang90 commented Jan 16, 2018

@briankassouf

Is it true that every forwarded request needs a lock? I thought a standby would only hold a single long-lived lock. /cc @jefferai

@briankassouf
Contributor

@xiang90 It uses the Value() function of the lock to get the info about the leader.

vault/vault/core.go

Lines 891 to 903 in f320f00

lock, err := c.ha.LockWith(coreLockPath, "read")
if err != nil {
    return false, "", "", err
}
// Read the value
held, leaderUUID, err := lock.Value()
if err != nil {
    return false, "", "", err
}
if !held {
    return false, "", "", nil
}

The Value() function doesn't use the concurrency.Session object, so maybe the session could be created only when we call the Lock/Unlock functions?
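
For what it's worth, a session-free Value() could look roughly like the sketch below. This is only an illustration, not the change that went into #3893, and the key layout is an assumption (Vault's etcd3 backend may store the leader value differently); the point is that a plain client.Get is enough to inspect the lock holder, with no concurrency.Session involved.

package etcdlock // hypothetical package, for illustration only

import (
    "context"

    "github.com/coreos/etcd/clientv3"
)

// lockValue reads the lock key with a plain Get. The exact key under which
// the holder's value is stored is an assumption, not Vault's actual layout.
func lockValue(ctx context.Context, client *clientv3.Client, key string) (held bool, value string, err error) {
    resp, err := client.Get(ctx, key)
    if err != nil {
        return false, "", err
    }
    if len(resp.Kvs) == 0 {
        // No key under the lock path: nobody currently holds the lock.
        return false, "", nil
    }
    return true, string(resp.Kvs[0].Value), nil
}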

@xiang90
Contributor

xiang90 commented Jan 16, 2018

@briankassouf

OK, I see. I think you are right. Would you like to get it fixed by lazily creating the session?
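
A minimal sketch of what "lazily creating the session" could mean, under the assumption that the session is only needed for Lock()/Unlock(): the session is created on first Lock() rather than in LockWith(). The names here are hypothetical and this is not the actual fix from #3893.

package etcdlock

import (
    "context"
    "sync"

    "github.com/coreos/etcd/clientv3"
    "github.com/coreos/etcd/clientv3/concurrency"
)

// lazyLock defers session creation until the lock is actually acquired, so a
// standby's per-request Leader() checks never allocate a session at all.
type lazyLock struct {
    client *clientv3.Client
    key    string

    mu      sync.Mutex
    session *concurrency.Session
    etcdMu  *concurrency.Mutex
}

// initSession creates the concurrency.Session on first use only.
func (l *lazyLock) initSession() error {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.session != nil {
        return nil
    }
    session, err := concurrency.NewSession(l.client)
    if err != nil {
        return err
    }
    l.session = session
    l.etcdMu = concurrency.NewMutex(session, l.key)
    return nil
}

// Lock acquires the etcd mutex, creating the session lazily.
func (l *lazyLock) Lock(ctx context.Context) error {
    if err := l.initSession(); err != nil {
        return err
    }
    return l.etcdMu.Lock(ctx)
}

// Unlock releases the mutex and closes the session so its lease is revoked.
func (l *lazyLock) Unlock(ctx context.Context) error {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.session == nil {
        return nil
    }
    if err := l.etcdMu.Unlock(ctx); err != nil {
        return err
    }
    err := l.session.Close()
    l.session, l.etcdMu = nil, nil
    return err
}

Value() could then be the plain Get shown in the previous sketch, so the standby's hot path never touches a session at all.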

@jefferai
Member

@xiang90 Yes please :-)
