Feature request: Add metadata to vault client requests #2475
@stevenscg Hey, so Nomad already adds metadata to the token which is shown in the audit log. I just ran a simple test and grabbed this from the audit log:
You can see the client unwrap and renew, and the metadata is shown. So you should be able to search the audit log via the client_token to figure out which allocation/node/task. Let me know if that makes sense! Going to close since we already do this!
Hey Alex. That use case looks great! Exactly what I'd want to see.
However, my case was a failed vault request, which is exactly when we want the info the most.
I think that because the request didn't succeed, we lose all of our visibility into the metadata.
- Chris
… On Mar 26, 2017, at 5:04 PM, Alex Dadgar wrote:
Closed #2475.
@dadgar Thanks again. Do you know if metadata is logged by vault on a failed request? This data is exactly what I'd want for debugging:

"metadata": {
  "AllocationID": "c93adf67-d8fe-40c7-433f-2b78de6b8b93",
  "NodeID": "564daec4-315a-5f3c-864a-fc195db17cdf",
  "Task": "redis"
}
@stevenscg So on a failure nothing is logged, to avoid a DDoS attack against Vault. What you can do is add a new audit backend (https://www.vaultproject.io/docs/audit/file.html) and then set log_raw=true and hmac_accessor=false while you debug. I would then find the token that is failing and see what its capabilities are. The ability to renew self is in the default capabilities, so I wonder if during testing you may have changed that. After you're done debugging/revoking the token, I would disable the audit backend with those loosened security settings.
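For reference, a rough sketch of that workaround using the Vault CLI of that era (the log path is a placeholder, and newer Vault versions spell these commands `vault audit enable` / `vault audit disable`):

```sh
# Enable a temporary file audit backend with the relaxed settings suggested above.
# log_raw=true and hmac_accessor=false write sensitive values in the clear,
# so keep this only for the duration of the debugging session.
vault audit-enable file file_path=/var/log/vault_audit_debug.log \
    log_raw=true hmac_accessor=false

# ...inspect the failing token's entries in /var/log/vault_audit_debug.log...

# Remove the loosened backend once debugging is finished.
vault audit-disable file
```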
Thanks @dadgar! Makes sense that Vault is doing that to minimize payload size and DDoS potential. I fired up the file audit backend earlier today and have been watching for the issue. This is a good workaround for my test environment, but I think I'd have a hard time feeling comfortable doing it in production.
@dadgar and team: I can move this to the mailing list if it's better, but I was able to use the file audit log to follow the token, and we have the context from the discussion above. I'm not sure if there's a real issue here or not... TL;DR: is it expected for Nomad to try to renew-self a token for a task that has already completed? I believe all the tasks are running correctly because they complete before the token renews. This same task can run up to 1-2 hours in certain conditions, which might create a problem. These are dispatched jobs, so it could be some case that's different from the redis example, etc.

Timestamps of the relevant audit log entries:
- 2017-03-27T15:30:00Z
- 2017-03-27T15:30:00Z
- 2017-03-27T15:30:01Z
- 2017-03-27T20:45:01Z
- 2017-03-27T21:31:17Z

Separately, I see from the nomad logs that the task completed successfully at 15:30:01 and was GC'd around 19:45:
@stevenscg As for the questions under 2017-03-27T15:30:00Z: the 30-second increment doesn't really matter, since periodic tokens are extended to the period duration. Some questions for you:
This does not appear to be a one-off, but I don't think it is happening with every invocation of the same dispatch job. I may have to stop the automatic execution of this job, dispatch one or more manually, and wait out the renewals. There was another occurrence of it overnight (alloc completed Mar 28 09:10:06, renew-self failed Mar 28 09:56:45).
Using the overnight occurrence for alloc 5487d09f-1bd9-1868-e3cb-1f67c5c6710c, the nomad worker logs look like this:
Nothing unique or special about these allocations that I am aware of. There are only 2 jobs running on the cluster:
Both jobs are using the raw_exec driver. QQ: Is there anything unusual about the "Not restarting" entries in the logs on question 2? My plan for today is to:
Update - I was able to observe the issue in the same environment while running a single dispatch of the task in isolation on the cluster.

Mar 28 13:47:54 - Dispatch request over HTTP

The time it took to call renew-self (~25-30 minutes) looks correct; the nomad-cluster role period is currently set to 1800. I'm going to leave this running as-is for now and see if more renewals are attempted.
@stevenscg Thanks so much for reproducing in an isolated way. I am going to play with this now and see if I can get to the bottom of it!
@dadgar No problem at all. I'm super curious myself.
This PR fixes an oversight in which the client would attempt to renew a token even after the task exits. Fixes #2475
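For illustration, a minimal sketch of the behaviour this fix describes, stopping token renewal once the task has exited; the names, channel, and intervals here are hypothetical, not the actual Nomad client code:

```go
package main

import (
	"fmt"
	"time"
)

// renewToken stands in for the client's call to Vault's
// auth/token/renew-self endpoint.
func renewToken() {
	fmt.Println("renewing token at", time.Now().Format(time.RFC3339))
}

// renewLoop renews the token on a fixed interval, but returns as soon
// as taskDone is closed, so no renewal is attempted for a task that
// has already exited.
func renewLoop(interval time.Duration, taskDone <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-taskDone:
			return // task exited; stop renewing its token
		case <-ticker.C:
			renewToken()
		}
	}
}

func main() {
	taskDone := make(chan struct{})
	go renewLoop(50*time.Millisecond, taskDone)

	time.Sleep(120 * time.Millisecond) // the task "runs" briefly
	close(taskDone)                    // the task exits; renewals must stop
	time.Sleep(100 * time.Millisecond) // no further renewals are expected here
}
```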
@stevenscg Well don't look too closely at the fix haha it is rather sad.
@dadgar 🙌 I won't. Really thought I was losing it for a while there. /me fires up master later today. ;-)
@stevenscg We are gonna cut 0.5.6 RC1 today so you can save yourself some time 👍
Oh man, that's fantastic news. hehe |
Short-lived containers (especially those < 1 second) often do not have their logs sent to Nomad. This PR adjusts the nomad docker driver and docker logging driver to:
1. Enable docklog to run after a container has stopped (up to some grace period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:
1. docklog is killed by the handle as soon as the task finishes, which means fast-running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has stopped

In order to do this, we need to know _whether_ we have read logs for the current container in order to apply a grace period. We add a copier to the fifo streams which sets an atomic flag, letting us know whether we need to retry reading the logs and use a grace period, or if we can quit early. Fixes hashicorp#2475, hashicorp#6931.

Always wait to read from logs before exiting. Store the number of bytes read vs a simple counter.
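As a rough sketch of the byte-counting copier idea described above (the type and function names are hypothetical, not the actual docklog code), assuming the goal is simply to know whether any log bytes have been read before deciding to retry within the grace period:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync/atomic"
)

// countingReader wraps a log stream and atomically records how many
// bytes have been read from it, so the collector can tell whether it
// still needs to retry within the grace period after the container stops.
type countingReader struct {
	r io.Reader
	n int64 // bytes read so far, updated atomically
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	atomic.AddInt64(&c.n, int64(n))
	return n, err
}

// BytesRead reports how much has been read from the wrapped stream.
func (c *countingReader) BytesRead() int64 {
	return atomic.LoadInt64(&c.n)
}

func main() {
	// Pretend this buffer is the container's stdout fifo.
	src := bytes.NewBufferString("container stdout: hello\n")
	cr := &countingReader{r: src}

	if _, err := io.Copy(io.Discard, cr); err != nil {
		fmt.Println("copy error:", err)
	}

	// If BytesRead were still zero here, a collector following this
	// approach would keep retrying until the grace period expired
	// instead of quitting early.
	fmt.Println("bytes read:", cr.BytesRead())
}
```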
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I ran into a situation where a nomad worker node was failing on a vault request to auth/token/renew-self, per the logs on the worker. The failed request is not the issue here. Based on the sample request/response shown below from the vault server logs, I can't tell if this issue is originating from a particular job or allocation or from the nomad agent itself.

I think that the addition of some nomad-specific metadata to the vault requests could make debugging issues like this much easier. I haven't found any workarounds or similar feature requests.

JobID, AllocID, and NodeID all seem useful in this scenario, but I think other users with more experience may have better suggestions. If the "fixed" metadata concept mentioned above is not agreeable, a more flexible alternative may be to allow the user/operator to add the metadata to the vault request dynamically, via either a) the job specification or b) the nomad configuration file, using the existing nomad variable interpolation.

Example dynamic vault metadata with interpolation in a nomad job:
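A hypothetical sketch of what such a stanza could look like, assuming a new `metadata` block inside the existing `vault` stanza (the `metadata` block is the proposed addition, not a current Nomad feature; the interpolated values are standard Nomad runtime and node variables):

```hcl
job "example" {
  group "workers" {
    task "redis" {
      driver = "raw_exec"

      vault {
        policies = ["myapp"]

        # Hypothetical metadata block proposed by this issue; the keys
        # would be attached to the Vault token/request. The values use
        # Nomad's existing runtime and node interpolation.
        metadata {
          JobID   = "${NOMAD_JOB_NAME}"
          AllocID = "${NOMAD_ALLOC_ID}"
          NodeID  = "${node.unique.id}"
        }
      }
    }
  }
}
```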
One possible hurdle with this proposal is that Vault's metadata seems to be applied to (and available via) generated tokens, and not at the request level where we would like to associate the data. Vault PR hashicorp/vault#2321 added a headers object that could alternatively serve as a transport for the values proposed here.

Sample vault request/response: