Audit socket backend does not reconnect on error #2931

CVTJNII · 2017-06-28T21:54:02Z

Storage: consul (HA available)
Version: Vault v0.7.3
Version Sha: 0b20ae0

In experimenting with the socket audit backend I've noticed it isn't reconnecting on error. I caused a failover to a node where the listener was down, and Vault logged the following error, which is expected:

2017/06/28 20:23:31.306148 [ERROR] core: failed to create audit entry: path=socket/ error=dial tcp 172.17.0.1:6500: getsockopt: connection refused

However, after this the socket audit backend appears to just die. After starting the listener I see no logs, and no further socket errors are logged. Attempting to reconfigure the backend with an HTTP PUT to /v1/sys/audit/socket has no effect; I'm unable to get it to resume audit logging.

This is undesirable behavior: the backend should resume logging when the listener comes back up rather than fail indefinitely.

This is my complete backend config:

{
  "type": "socket",
  "options": {
    "address": "172.17.0.1:6500",
    "socket_type": "tcp",
    "log_raw": "false",
    "hmac_accessor": "false",
    "format": "json"
  }
}
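For anyone reproducing this, the PUT mentioned above can be sketched in Go roughly as follows. This is only an illustration: the Vault address and token are placeholders, and the helper name `newEnableSocketAuditRequest` is made up for this example. It builds the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// auditRequest mirrors the JSON body shown above.
type auditRequest struct {
	Type    string            `json:"type"`
	Options map[string]string `json:"options"`
}

// newEnableSocketAuditRequest (hypothetical helper) builds, but does not send,
// a PUT to /v1/sys/audit/socket. vaultAddr and token are caller-supplied
// placeholders.
func newEnableSocketAuditRequest(vaultAddr, token string) (*http.Request, error) {
	body, err := json.Marshal(auditRequest{
		Type: "socket",
		Options: map[string]string{
			"address":       "172.17.0.1:6500",
			"socket_type":   "tcp",
			"log_raw":       "false",
			"hmac_accessor": "false",
			"format":        "json",
		},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, vaultAddr+"/v1/sys/audit/socket", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Vault-Token", token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder address and token; substitute real values before sending.
	req, err := newEnableSocketAuditRequest("http://127.0.0.1:8200", "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the returned request to `http.DefaultClient.Do` would actually send it.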

CVTJNII · 2017-06-28T21:58:14Z

UDP sockets do not exhibit this behavior, which is expected as they're connectionless.
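The difference can be seen outside Vault with a small Go sketch. It assumes a loopback port that has just been closed (so nothing is listening on it); the helper names `closedLocalAddr` and `dialErr` are invented for the example. Dialing TCP to such a port should fail immediately, while "dialing" UDP only sets a default destination and does not contact the peer.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// closedLocalAddr returns a loopback address with no TCP listener on it,
// obtained by opening a listener on an ephemeral port and closing it again.
func closedLocalAddr() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	return addr, ln.Close()
}

// dialErr dials addr over the given network and returns the dial error, if any.
func dialErr(network, addr string) error {
	conn, err := net.DialTimeout(network, addr, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	addr, err := closedLocalAddr()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// TCP needs a handshake, so a missing listener surfaces at dial time.
	fmt.Println("tcp dial error:", dialErr("tcp", addr))
	// UDP has no handshake, so the dial succeeds with nothing listening.
	fmt.Println("udp dial error:", dialErr("udp", addr))
}
```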

jefferai · 2017-06-28T22:12:29Z

The backend already does try to reconnect; take a look at https://github.com/hashicorp/vault/blob/master/builtin/audit/socket/backend.go#L146

However, you may be hitting an error at https://github.com/hashicorp/vault/blob/master/builtin/audit/socket/backend.go#L75 -- in that case the backend will fail to come up. If you have a different backend configured that comes up successfully Vault will still use that backend; if you have only the one, Vault will actually bail from taking over active duty.

This is in line with Vault working as long as any audit backend can write, and it's unclear what to do otherwise -- if an audit backend fails to come up, trying over and over again is often not the right approach and can simply make any underlying problems worse.

I believe this is the error you're seeing, since you said it happened after you failed over Vault. In that case you have two options: add a second audit backend for redundancy, which will hopefully come up and still log, or fail over again, which should happen automatically if it's the only audit backend and the problem is on backend setup.
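The "any one audit backend must be able to write" model described above can be sketched in Go as follows. This is not Vault's code: `auditBackend`, `fakeBackend`, and `logToAll` are hypothetical names, and treating an empty backend list as success is an assumption of the sketch.

```go
package main

import "fmt"

// auditBackend is a hypothetical stand-in for an audit backend.
type auditBackend interface {
	LogEntry(entry string) error
}

// fakeBackend either accepts or rejects every entry, for demonstration.
type fakeBackend struct {
	name string
	fail bool
}

func (f fakeBackend) LogEntry(entry string) error {
	if f.fail {
		return fmt.Errorf("%s: connection refused", f.name)
	}
	return nil
}

// logToAll offers the entry to every backend and succeeds if at least one
// of them accepted it. With no backends configured it reports success
// (an assumption of this sketch).
func logToAll(backends []auditBackend, entry string) error {
	if len(backends) == 0 {
		return nil
	}
	var errs []error
	succeeded := false
	for _, b := range backends {
		if err := b.LogEntry(entry); err != nil {
			errs = append(errs, err)
			continue
		}
		succeeded = true
	}
	if !succeeded {
		return fmt.Errorf("no audit backend accepted the entry: %v", errs)
	}
	return nil
}

func main() {
	backends := []auditBackend{
		fakeBackend{name: "socket", fail: true},
		fakeBackend{name: "file", fail: false},
	}
	// The file backend succeeds, so the failing socket backend is tolerated.
	fmt.Println("with file backend:", logToAll(backends, "entry"))
	// With only the failing socket backend, the entry is rejected.
	fmt.Println("socket only:", logToAll(backends[:1], "entry"))
}
```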

CVTJNII · 2017-06-28T22:21:08Z

I do also have the file backend configured, which was working. However, I wish to have the file and socket backends up: file for reliable but difficult to access logs, and socket for less than reliable but easy to access logs. (I say less than reliable by the nature of networks and depending on external services.)

So the socket backend must be up when the initial connection is made or it is simply not used? Do I understand that correctly? I consider that bad behavior and would appreciate an option to always have it retry, even on initial errors. In my opinion this is a race: if Vault happens to fail over at the same time the listener is down for some reason, then Vault will never log to the socket without manual intervention, which is undesirable behavior in my environment.

EDIT: I did confirm it will reconnect if the initial connection is successful. As mentioned above I'd like to see an option to have it try and reconnect if the initial connection fails, in case that's something transient. Having a retry on an interval is much better in my opinion than being down indefinitely.

jefferai · 2017-06-28T22:49:58Z

> So the socket backend must be up when the initial connection is made or it is simply not used? Do I understand that correctly?

Yes. Any backend must work when being set up or it causes Vault post-unseal to fail. Audit backends are a special case as of a few versions ago though, when we changed it to any one must come up. The arguments in favor were persuasive, and it matches the model of any one backend must successfully log.

I'm hesitant to have a system whereby something causes the backend to try, try again for some period of time or number of tries.

Possibly the right thing to do is actually remove the initial connection when the backend is created. The first request would always error out but the error would be swallowed by a reconnect and retry, if successful.

jefferai added a commit that referenced this issue Jun 28, 2017

a666234: Don't dial on backend startup; retry dials at log time so that transient network failures are worked around. Also, during a reconnect always close the existing connection. Fixes #2931

jefferai mentioned this issue Jun 28, 2017

Don't dial on socket audit backend startup #2934

Merged

jefferai closed this as completed in #2934 Jul 6, 2017

jefferai added a commit that referenced this issue Jul 6, 2017

4efff56 (#2934): Don't dial on backend startup; retry dials at log time so that transient network failures are worked around. Also, during a reconnect always close the existing connection. Fixes #2931
