Feature request: Add metadata to vault client requests #2475

Closed
stevenscg opened this issue Mar 24, 2017 · 16 comments · Fixed by #2495

Comments

@stevenscg

I ran into a situation where a Nomad worker node was failing on a Vault request to auth/token/renew-self, per the logs on the worker. The failed request itself is not the issue here.

Based on the sample request/response from the Vault server logs shown below, I can't tell whether this issue originates from a particular job, a particular allocation, or the Nomad agent itself.

I think that the addition of some nomad-specific metadata to the vault requests could make debugging issues like this much easier. I haven't found any workarounds or similar feature requests.

JobID, AllocID, and NodeID all seem useful in this scenario, but I think other users with more experience may have better suggestions.

If the "fixed" metadata concept mentioned above is not agreeable, a more flexible alternative may be to allow the user/operator to add the metadata to the vault request dynamically via either a) the job specification or b) the nomad configuration file using the existing nomad variable interpolation.

Example dynamic vault metadata with interpolation in a nomad job:

vault {
  metadata {
    alloc_id = "${NOMAD_ALLOC_ID}"
  }
}

One possible hurdle with this proposal is that Vault's metadata seems to be applied to (and available via) generated tokens rather than at the request level, where we would like to associate the data. Vault PR hashicorp/vault#2321 added a headers object that could alternatively serve as a transport for the values proposed here.
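As a rough sketch of that alternative (the header name below is purely hypothetical), Vault's audited-headers configuration could be used to surface such a header in the audit log:

# X-Nomad-Alloc-ID is a hypothetical header Nomad could send with each request
$ vault write sys/config/auditing/request-headers/X-Nomad-Alloc-ID hmac=false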

Sample vault request/response:

Mar 22 20:51:53 ip-10-101-26-78 vault[5303]: {
    "time": "2017-03-22T20:51:53Z",
    "type": "request",
    "auth": {
        "client_token": "",
        "accessor": "",
        "display_name": "",
        "policies": null,
        "metadata": null
    },
    "request": {
        "id": "c297b154-5e11-59d1-62b2-ace7cbdddb05",
        "operation": "update",
        "client_token": "hmac-sha256:a3043162475304ea5e19b165fe245d5dccca3898e872408ca91baf8b6e558da2",
        "client_token_accessor": "",
        "path": "auth/token/renew-self",
        "data": {
            "increment": "hmac-sha256:43a3ec91d41913303b18a0c0aad2c14610233d892172f1fb2f0ed792e4483418"
        },
        "remote_address": "{NOMAD-WORKER-IP}",
        "wrap_ttl": 0,
        "headers": {}
    },
    "error": "permission denied"
}

Mar 22 20:51:53 ip-10-101-26-78 vault[5303]: {
    "time": "2017-03-22T20:51:53Z",
    "type": "response",
    "auth": {
        "client_token": "",
        "accessor": "",
        "display_name": "",
        "policies": null,
        "metadata": null
    },
    "request": {
        "id": "c297b154-5e11-59d1-62b2-ace7cbdddb05",
        "operation": "update",
        "client_token": "hmac-sha256:a3043162475304ea5e19b165fe245d5dccca3898e872408ca91baf8b6e558da2",
        "client_token_accessor": "",
        "path": "auth/token/renew-self",
        "data": {
            "increment": "hmac-sha256:43a3ec91d41913303b18a0c0aad2c14610233d892172f1fb2f0ed792e4483418"
        },
        "remote_address": "{NOMAD-WORKER-IP}",
        "wrap_ttl": 0,
        "headers": {}
    },
    "response": {
        "data": {
            "error": "hmac-sha256:84fc1708e17a781463a448791459e8313dd75295791c08b0f34dfaf07d9490b1"
        }
    },
    "error": "1 error occurred:\n\n* permission denied"
}
$ nomad version
Nomad v0.5.5

$ vault version
Vault v0.7.0 ('614deacfca3f3b7162bbf30a36d6fc7362cd47f0')
@dadgar
Contributor

dadgar commented Mar 26, 2017

@stevenscg Hey, so Nomad already adds metadata to the token, which is shown in the audit log. I just ran a simple test and grabbed this from the audit log:

{
  "time": "2017-03-26T21:49:29Z",
  "type": "response",
  "auth": {
    "client_token": "",
    "accessor": "",
    "display_name": "",
    "policies": [
      "response-wrapping"
    ],
    "metadata": null
  },
  "request": {
    "id": "b9645421-19a5-1095-0fb7-bd6388b2f99d",
    "operation": "update",
    "client_token": "hmac-sha256:73538f2db50e5dcf6add57aac21669744bf040b60facee79d6b2c410cddca3ed",
    "client_token_accessor": "hmac-sha256:a920abf9f18c97bfa889b78fe36f8a789d6f8ed27e2cf44c42207df45c0d4b2b",
    "path": "sys/wrapping/unwrap",
    "data": null,
    "remote_address": "127.0.0.1",
    "wrap_ttl": 0,
    "headers": {}
  },
  "response": {
    "auth": {
      "client_token": "hmac-sha256:de54c35bd45be5e7d58b3af28fb669b6ea397bf913cc5996e7f0f6e877571ab9",
      "accessor": "hmac-sha256:6a83b30ec43f6ee9373eba75b730fb9bb89bbc24b7c8dc66a160ec9972783b99",
      "display_name": "",
      "policies": [
        "default"
      ],
      "metadata": {
        "AllocationID": "c93adf67-d8fe-40c7-433f-2b78de6b8b93",
        "NodeID": "564daec4-315a-5f3c-864a-fc195db17cdf",
        "Task": "redis"
      }
    }
  },
  "error": ""
}
{
  "time": "2017-03-26T21:49:29Z",
  "type": "request",
  "auth": {
    "client_token": "",
    "accessor": "",
    "display_name": "token-c93adf67-d8fe-40c7-433f-2b78de6b8b93-redis",
    "policies": [
      "default"
    ],
    "metadata": {
      "AllocationID": "c93adf67-d8fe-40c7-433f-2b78de6b8b93",
      "NodeID": "564daec4-315a-5f3c-864a-fc195db17cdf",
      "Task": "redis"
    }
  },
  "request": {
    "id": "2fa48443-82b3-864d-70ce-028236ecf81f",
    "operation": "update",
    "client_token": "hmac-sha256:de54c35bd45be5e7d58b3af28fb669b6ea397bf913cc5996e7f0f6e877571ab9",
    "client_token_accessor": "hmac-sha256:6a83b30ec43f6ee9373eba75b730fb9bb89bbc24b7c8dc66a160ec9972783b99",
    "path": "auth/token/renew-self",
    "data": {
      "increment": "hmac-sha256:89e51347e74f08728c47fa1efc425c5d1e3698abe8dac6fd3b819b9e5ddb0604"
    },
    "remote_address": "127.0.0.1",
    "wrap_ttl": 0,
    "headers": {}
  },
  "error": ""
}

You can see the client unwrap and renew, and the metadata is shown. So you should be able to search the audit log by the client_token to figure out which allocation/node/task a request came from.
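As a minimal sketch of that lookup, assuming a file audit backend writing JSON lines to /var/log/vault_audit.log (the path and placeholder token are assumptions):

# replace <hmac-of-client-token> with the hashed token value from the client's audit entries
$ grep '<hmac-of-client-token>' /var/log/vault_audit.log | jq '.auth.metadata'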

Let me know if that makes sense! Going to close since we already do this!

@dadgar dadgar closed this as completed Mar 26, 2017
@stevenscg
Author

stevenscg commented Mar 27, 2017 via email

@stevenscg
Author

@dadgar Thanks again. Do you know if metadata is logged by vault on a failed request?

This data is exactly what I'd want for debugging:

      "metadata": {
        "AllocationID": "c93adf67-d8fe-40c7-433f-2b78de6b8b93",
        "NodeID": "564daec4-315a-5f3c-864a-fc195db17cdf",
        "Task": "redis"
      }

@dadgar
Contributor

dadgar commented Mar 27, 2017

@stevenscg So on a failure nothing is logged, to avoid a DDOS attack against Vault. What you can do is add a new audit backend: https://www.vaultproject.io/docs/audit/file.html and then set log_raw=true and hmac_accessor=false while you debug.
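A minimal sketch of that setup, using the Vault 0.7-era CLI (the log path is an assumption):

$ vault audit-enable file file_path=/var/log/vault_audit_debug.log log_raw=true hmac_accessor=false
# ... reproduce the failing renew-self and inspect the raw entries ...
$ vault audit-disable file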

I would then find the token that is failing and see what its capabilities are. The ability to renew-self is part of the default capabilities, so I wonder if during testing you may have changed that. After you're done debugging/revoking the token, I would disable the audit backend with those loosened security settings.

@stevenscg
Author

Thanks @dadgar! Makes sense that Vault is doing that to minimize payload size and DDOS potential.

I fired up the file audit backend earlier today and have been watching for the issue. This is a good workaround for my test environment, but I think I'd have a hard time feeling comfortable doing it in production.

@stevenscg
Author

@dadgar and team:

I can move this to the mailing list if it's better, but I was able to use the file audit log to follow the token and we have the context from the discussion above. I'm not sure if there's a real issue here or not...

TL;DR: Is it expected for Nomad to try renew-self on a token at 20:45 for a task that ended around 15:30 and was GC'd around 19:45?

I believe all the tasks are running correctly because they complete before the token renews.

This same task can run up to 1-2 hours in certain conditions, which might create a problem.

These are dispatched jobs, so it could be some case that's different from the redis example, etc.

====

2017-03-27T15:30:00Z

  • Good unwrap from nomad-cluster role by the leader nomad server
  • Expected policies are present, including "default"

2017-03-27T15:30:00Z

  • Good "update" to auth/token/renew-self
  • This is sending data.increment = 30 (seconds)
  • Is that increment long enough?
  • What if a task will run longer?
  • How is it determined or set?
  • There are no more renew-self calls for 5-6 hours for this allocation
  • I have nomad-cluster token role ttl lowered right now to 43200 (6 hours) for testing

2017-03-27T15:30:01Z

  • Activity on the transit backend, etc.; no issues.
  • Task completes successfully (per nomad logs below)

2017-03-27T20:45:01Z

  • Failed "update" to auth/token/renew-self
  • "error":"permission denied"
  • client_token_accessor = "" (is that expected?)

2017-03-27T21:31:17Z

  • I try to look up the token with a root token (see the sketch after this list) and get the following:
  • "error":"bad token"

Separately, I see from the Nomad logs that the task completed successfully at 15:30:01 and was GC'd around 19:45:

Mar 27 15:30:01 ip-10-101-27-196 nomad: 2017/03/27 15:30:01.947844 [INFO] client: task "scan" for alloc "c062a967-3cef-c10a-7b6a-9e0318c84fd5" completed successfully
Mar 27 15:30:01 ip-10-101-27-196 nomad[1139]: client: task "scan" for alloc "c062a967-3cef-c10a-7b6a-9e0318c84fd5" completed successfully
Mar 27 15:30:01 ip-10-101-27-196 nomad[1139]: client: Not restarting task: scan for alloc: c062a967-3cef-c10a-7b6a-9e0318c84fd5
Mar 27 15:30:01 ip-10-101-27-196 nomad[1139]: client: marking allocation c062a967-3cef-c10a-7b6a-9e0318c84fd5 for GC
Mar 27 15:30:01 ip-10-101-27-196 nomad: 2017/03/27 15:30:01.947888 [INFO] client: Not restarting task: scan for alloc: c062a967-3cef-c10a-7b6a-9e0318c84fd5
Mar 27 15:30:01 ip-10-101-27-196 nomad: 2017/03/27 15:30:01.948128 [INFO] client: marking allocation c062a967-3cef-c10a-7b6a-9e0318c84fd5 for GC
....
Mar 27 19:45:11 ip-10-101-27-196 nomad[1139]: client: marking allocation c062a967-3cef-c10a-7b6a-9e0318c84fd5 for GC
Mar 27 19:45:11 ip-10-101-27-196 nomad: 2017/03/27 19:45:11.401234 [INFO] client: marking allocation c062a967-3cef-c10a-7b6a-9e0318c84fd5 for GC
Mar 27 19:45:11 ip-10-101-27-196 nomad[1139]: client: garbage collecting allocation "c062a967-3cef-c10a-7b6a-9e0318c84fd5"
Mar 27 19:45:11 ip-10-101-27-196 nomad: 2017/03/27 19:45:11.404510 [INFO] client: garbage collecting allocation "c062a967-3cef-c10a-7b6a-9e0318c84fd5"

@dadgar
Contributor

dadgar commented Mar 27, 2017

@stevenscg As for the questions under 2017-03-27T15:30:00Z: the 30-second increment doesn't really matter, since periodic tokens are extended to the period duration on every renewal.
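For illustration, a rough sketch of that behavior with the CLI (role name and period taken from this thread; exact command syntax may differ by Vault version):

$ vault write auth/token/roles/nomad-cluster period=1800
$ vault token-renew <task-token> 30   # renewed TTL becomes the full 1800s period, not 30s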

Some questions for you:

  1. Is this happening often, or is it a one-off thing you noticed?
  2. What happened in the client logs when you see permission denied in the Vault audit logs?
  3. Anything unique about the allocation's lifetime? Nomad restarted, etc.?

@stevenscg
Author

@dadgar

  1. Is this happening often, or is it a one-off thing you noticed?

This does not appear to be a one-off, but it also doesn't seem to happen on every invocation of the same dispatch job. I may have to stop the automatic execution of this job, dispatch one or more manually, and wait out the renewals.

There was another occurrence of it overnight (alloc completed Mar 28 09:10:06, renew-self failed Mar 28 09:56:45).

  2. What happened in the client logs when you see permission denied in the Vault audit logs?

Using the overnight occurrence for alloc 5487d09f-1bd9-1868-e3cb-1f67c5c6710c, the nomad worker logs look like this:

Mar 28 09:10:06 ip-10-101-27-196 nomad: 2017/03/28 09:10:06.887384 [INFO] client: task "scan" for alloc "5487d09f-1bd9-1868-e3cb-1f67c5c6710c" completed successfully
Mar 28 09:10:06 ip-10-101-27-196 nomad[1139]: client: task "scan" for alloc "5487d09f-1bd9-1868-e3cb-1f67c5c6710c" completed successfully
Mar 28 09:10:06 ip-10-101-27-196 nomad[1139]: client: Not restarting task: scan for alloc: 5487d09f-1bd9-1868-e3cb-1f67c5c6710c
Mar 28 09:10:06 ip-10-101-27-196 nomad[1139]: client: marking allocation 5487d09f-1bd9-1868-e3cb-1f67c5c6710c for GC
Mar 28 09:10:06 ip-10-101-27-196 nomad: 2017/03/28 09:10:06.887425 [INFO] client: Not restarting task: scan for alloc: 5487d09f-1bd9-1868-e3cb-1f67c5c6710c
Mar 28 09:10:06 ip-10-101-27-196 nomad: 2017/03/28 09:10:06.887680 [INFO] client: marking allocation 5487d09f-1bd9-1868-e3cb-1f67c5c6710c for GC
....
Mar 28 09:56:45 ip-10-101-27-196 nomad[1139]: client.vault: renewal of token failed: failed to renew the vault token: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/auth/token/renew-self#012Code: 403. Errors:#012#012* permission denied
Mar 28 09:56:45 ip-10-101-27-196 nomad: 2017/03/28 09:56:45.533419 [ERR] client.vault: renewal of token failed: failed to renew the vault token: Error making API request.
Mar 28 09:56:45 ip-10-101-27-196 nomad: URL: PUT https://active.vault.service.consul:8200/v1/auth/token/renew-self
Mar 28 09:56:45 ip-10-101-27-196 nomad: Code: 403. Errors:
Mar 28 09:56:45 ip-10-101-27-196 nomad: * permission denied
  3. Anything unique about the allocation's lifetime? Nomad restarted, etc.?

Nothing unique or special about these allocations that I am aware of.
Nomad was up and running very well.
Vault was up the entire time.

There are only 2 jobs running on the cluster:

  • 1 periodic job ("feed") that runs every 10 minutes
  • 1 dispatch job ("scan") with instances that are dispatched by the above "feed" job

Both jobs are using the raw_exec driver.
The "scan" job only runs about 4 times per day right now due to the business logic.

QQ: Is there anything unusual about the "Not restarting" entries in the logs under question 2?
I think my script is exiting properly. There are no issues shown and I see the expected output in the alloc logs (both stdout and stderr).

My plan for today is to:

  • Shorten all the vault TTLs, stop the "feed" job entirely, set nomad logs to debug, manually dispatch 1 of the "scan" jobs, and then wait out some renewals.

  • If the isolated run works, I can try dispatching 2 or more at around the same time (off-by-one in revoking tokens and/or marking allocations for GC? seems unlikely though).

  • If we still haven't learned anything, I'll try to reproduce it outside of this particular environment.

@stevenscg
Author

Update - I was able to observe the issue in the same environment while running a single dispatch of the task in isolation on the cluster.

Mar 28 13:47:54 - Dispatch request over http
Mar 28 13:51:12 - task "scan" for alloc "a2dd6979..." completed successfully
Mar 28 14:12:02 - renew-self failed

The time it took to call renew-self (~25-30 minutes) looks correct; the nomad-cluster role period is currently set to 1800.

I'm going to leave this running as is for now and see if more renewals are attempted.

Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: http: Request /v1/job/scan/dispatch (12.588376ms)
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.112112 [DEBUG] http: Request /v1/evaluation/4189d4ae-f619-2e82-c0b3-c4d57cc1fb44 (711.146µs)
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: http: Request /v1/evaluation/4189d4ae-f619-2e82-c0b3-c4d57cc1fb44 (711.146µs)
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.113605 [DEBUG] http: Request /v1/evaluation/4189d4ae-f619-2e82-c0b3-c4d57cc1fb44/allocations (759.979µs)
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: http: Request /v1/evaluation/4189d4ae-f619-2e82-c0b3-c4d57cc1fb44/allocations (759.979µs)
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.129395 [DEBUG] client: updated allocations at index 487871 (total 1) (pulled 1) (filtered 0)
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: updated allocations at index 487871 (total 1) (pulled 1) (filtered 0)
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.129474 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: starting task runners for alloc 'a2dd6979-304c-606c-bbba-d3d2db9c40ae'
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: starting task context for 'scan' (alloc 'a2dd6979-304c-606c-bbba-d3d2db9c40ae')
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: waiting for Vault token for task scan in alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae"
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.131906 [DEBUG] client: starting task runners for alloc 'a2dd6979-304c-606c-bbba-d3d2db9c40ae'
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.132033 [DEBUG] client: starting task context for 'scan' (alloc 'a2dd6979-304c-606c-bbba-d3d2db9c40ae')
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.133224 [DEBUG] client: waiting for Vault token for task scan in alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae"
Mar 28 13:47:54 ip-10-101-27-196 nomad: 2017/03/28 13:47:54.277866 [DEBUG] client: retrieved Vault token for task scan in alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae"
Mar 28 13:47:54 ip-10-101-27-196 nomad[6321]: client: retrieved Vault token for task scan in alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae"
....
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12 [DEBUG] plugin: /opt/nomad/bin/nomad: plugin process exited
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: plugin: /opt/nomad/bin/nomad: plugin process exited
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: client: task "scan" for alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae" completed successfully
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: client: Not restarting task: scan for alloc: a2dd6979-304c-606c-bbba-d3d2db9c40ae
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: client: marking allocation a2dd6979-304c-606c-bbba-d3d2db9c40ae for GC
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12.550652 [INFO] client: task "scan" for alloc "a2dd6979-304c-606c-bbba-d3d2db9c40ae" completed successfully
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12.550676 [INFO] client: Not restarting task: scan for alloc: a2dd6979-304c-606c-bbba-d3d2db9c40ae
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12.550896 [INFO] client: marking allocation a2dd6979-304c-606c-bbba-d3d2db9c40ae for GC
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12.756589 [DEBUG] client: updated allocations at index 487879 (total 1) (pulled 0) (filtered 1)
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: client: updated allocations at index 487879 (total 1) (pulled 0) (filtered 1)
Mar 28 13:51:12 ip-10-101-27-196 nomad[6321]: client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
Mar 28 13:51:12 ip-10-101-27-196 nomad: 2017/03/28 13:51:12.756693 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
....
Mar 28 14:12:02 ip-10-101-27-196 nomad: 2017/03/28 14:12:02.308907 [ERR] client.vault: renewal of token failed: failed to renew the vault token: Error making API request.
Mar 28 14:12:02 ip-10-101-27-196 nomad: URL: PUT https://active.vault.service.consul:8200/v1/auth/token/renew-self
Mar 28 14:12:02 ip-10-101-27-196 nomad: Code: 403. Errors:
Mar 28 14:12:02 ip-10-101-27-196 nomad: * permission denied
Mar 28 14:12:02 ip-10-101-27-196 nomad[6321]: client.vault: renewal of token failed: failed to renew the vault token: Error making API request.#012#012URL: PUT https://active.vault.service.consul:8200/v1/auth/token/renew-self#012Code: 403. Errors:#012#012* permission denied
Mar 28 14:12:09 ip-10-101-27-196 nomad: 2017/03/28 14:12:09.953472 [DEBUG] http: Request /v1/agent/servers (177.597µs)
Mar 28 14:12:09 ip-10-101-27-196 nomad[6321]: http: Request /v1/agent/servers (177.597µs)
Mar 28 14:12:11 ip-10-101-27-196 nomad: 2017/03/28 14:12:11.060475 [DEBUG] client: updated allocations at index 487879 (total 1) (pulled 0) (filtered 1)
Mar 28 14:12:11 ip-10-101-27-196 nomad[6321]: client: updated allocations at index 487879 (total 1) (pulled 0) (filtered 1)
Mar 28 14:12:11 ip-10-101-27-196 nomad[6321]: client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
Mar 28 14:12:11 ip-10-101-27-196 nomad: 2017/03/28 14:12:11.060620 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
...

@dadgar
Contributor

dadgar commented Mar 28, 2017

@stevenscg Thanks so much for reproducing in an isolated way. I am going to play with this now and see if I can get to the bottom of it!

@stevenscg
Author

@dadgar No problem at all. I'm super curious myself.

dadgar added a commit that referenced this issue Mar 28, 2017
This PR fixes an oversight in which the client would attempt to renew a
token even after the task exits.

Fixes #2475
@dadgar
Contributor

dadgar commented Mar 28, 2017

@stevenscg Well don't look too closely at the fix haha it is rather sad.

@stevenscg
Author

@dadgar 🙌 I won't. Really thought I was losing it for a while there.

/me fires up master later today. ;-)

@dadgar
Contributor

dadgar commented Mar 28, 2017

@stevenscg We are gonna cut 0.5.6 RC1 today so you can save yourself some time 👍

@stevenscg
Author

Oh man, that's fantastic news. hehe

tonyhb added a commit to inngest/nomad that referenced this issue Jan 26, 2022
Short-lived containers (especially those < 1 second) often do not have
their logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace
   period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handler as soon as the task finishes, which
   means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has
   stopped

In order to do this, we need to know _whether_ we have read logs for the
current container in order to apply a grace period.  We add a copier to
the fifo streams which sets an atomic flag, letting us know whether we
need to retry reading the logs and use a grace period or if we can quit
early.

Fixes hashicorp#2475, hashicorp#6931.

Always wait to read from logs before exiting

Store number of bytes read vs a simple counter
shoenig pushed a commit to tonyhb/nomad that referenced this issue Jun 9, 2022
tonyhb added a commit to inngest/nomad that referenced this issue Jul 4, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022