[0.9.3] Issue accessing allocation logs #5920
What happens if you run the CLI command on the client node where the allocation was scheduled (i.e., where the logs are still present on the file system)?
We experience the same behavior: the logs work intermittently.
Hey @cgbaker, from the node I was able to consistently get the logs:

$ NOMAD_ADDR=https://localhost:4646 NOMAD_SKIP_VERIFY=1 nomad logs -stderr $myAlloc

From my machine, pointing
We have similar problems with both
Here is the example that @onomojaku referenced: The alloc

NOMAD_CLIENT_1 > nomad alloc exec 3cdea42e bash
root@3cdea42e:/#

Here are the logs:

NOMAD_CLIENT_1 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 15:04:37 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:37.919Z [INFO ] client: task exec session starting: exec_id=0c3684e1-d128-237f-770a-c2213644878c alloc_id=3cdea42e-c682-c292-06f4-db96c0d8ca2d task=rabbitmq command=[bash] tty=true access_token_name= access_token_id=
Jul 10 15:04:40 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:40.028Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=2.108800229s

However, the command

NOMAD_CLIENT_2 > nomad alloc exec 3cdea42e bash
failed to exec into task: write tcp NOMAD_SERVER_1:43064->NOMAD_SERVER_2:4647: write: connection reset by peer

Here are the logs:

NOMAD_CLIENT_2 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true error="write tcp NOMAD_SERVER_IP_1:57470->NOMAD_SERVER_IP_2:4647: write: connection reset by peer" code=500
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=16.958344ms

Of note, creating a websocket with

NOMAD_CLIENT_2 > curl -X GET \
--include \
--no-buffer \
--verbose \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-Websocket-Version: 13" \
--header "Sec-WebSocket-Key: foo" \
--cacert /etc/pki/cluster/nomad-ca.pem \
--cert /etc/pki/cluster/nomad.cli.pem \
--key /etc/pki/cluster/nomad.cli-key.pem \
"https://localhost:4646/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true"

succeeds, after which the
FWIW, we're seeing the same thing after upgrading to 0.9.3 from 0.9.1.
Thank you so much for raising the issue. I pinpointed the problem to a bug in server-to-server streaming RPC forwarding when TLS is enabled. I'm aiming to have a PR with a fix soon, targeting 0.9.4.
Awesome!
You guys do great work and are incredibly responsive. Thanks much!
Rod
Glad to hear that this is a bug and that you were able to pinpoint it. Thank you for the responsiveness; looking forward to upgrading to 0.9.4! Cheers
Tested with a dev binary, and it indeed solved our issue. Thank you :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Hey everyone,
I'll preface this by saying that the problem seems intermittent: sometimes you can run nomad logs on an allocation, and sometimes the call hangs. It happens on running, failed, and completed allocations alike.

When calling nomad alloc logs on an allocation, the call hangs. This also happens in the Web UI, where both the direct call to the node and the call through the master fail.

One thing to note is that the logs are available on the clients' filesystems and can be read without any problems.
I'm not sure whether this is a question or a bug report, as I can't reproduce the issue on another version of Nomad (0.9.1).
Reproduction steps
Job file (if appropriate)
It doesn't seem to depend on the job, since sometimes it works for one allocation of a job and not for another.
Nomad Shared config
Nomad Server config
Nomad Client Config
Nomad Client logs (if appropriate)
I couldn't find anything meaningful in the Nomad logs.
In the browser I see:
The first one is expected: the client nodes don't have a certificate that corresponds to their IP.
The second call, made to the master, which then proxies it to the client, is the one that is flaky for us. Sometimes refreshing the UI or making another nomad logs call is enough to get the logs; other times the call just hangs (whether made through curl or nomad logs).
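To put a number on the intermittent hang described above, here is a minimal sketch, assuming a POSIX shell and GNU coreutils timeout (which exits with status 124 when it kills the command). It runs a given command repeatedly with a per-attempt time limit and prints how many attempts hit that limit. The helper name count_hangs and the attempt count and limit values are hypothetical and not part of Nomad.

```shell
# Hypothetical helper (not part of Nomad): run a command repeatedly with a
# per-attempt time limit and print how many attempts timed out (i.e. hung).
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
count_hangs() {
  attempts=$1   # how many times to run the command
  limit=$2      # per-attempt time limit, in seconds
  shift 2       # the remaining arguments are the command to run
  hangs=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    timeout "$limit" "$@" >/dev/null 2>&1
    if [ "$?" -eq 124 ]; then
      hangs=$((hangs + 1))
    fi
    i=$((i + 1))
  done
  echo "$hangs"
}
```

For example, count_hangs 20 10 nomad alloc logs -stderr "$ALLOC_ID" (ALLOC_ID being a placeholder) could be run once with NOMAD_ADDR pointing at a server and once on the client node itself, to compare the forwarded path with the direct one.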
I've been trying to figure this out for a few days now and can't find anything that indicates what's happening.
Thanks