Dispatch API returning 500 only for some Payloads #11385

Closed
Talon876 opened this issue Oct 25, 2021 · 4 comments · Fixed by #11396

@Talon876

Nomad version

1.1.4

Operating system and Environment details

debian:bullseye-slim docker container on Ubuntu 20.04 base OS

Issue

When dispatching a job using the Dispatch Job API, some payloads cause a 500 error with the response body `rpc error: EOF`. The docs indicate that the base64 `Payload` string in the request body must be <= 16384 bytes, but these payloads are all under that limit.

Reproduction steps

  1. Create a parameterized job, test-job, and attempt to dispatch jobs using the HTTP API

Example of a decoded payload that causes the 500 response:

{
  "function": "export_client_bundle",
  "call_id": "export_client_bundle-c198f2f992eca4397a1aeb7d5977a495",
  "args": {
    "options": {
      "artifactId": "artifact_2021-10-24.00:38:47-full"
    }
  },
  "nomad_job": "test-job",
  "project": "talon",
  "data_source": "PROD",
  "user": "talon",
  "cluster": "talon",
  "reuse": false,
  "nomadToken": "--redacted--"
}

When encoded, this is the exact payload sent to nomad:

{
  "Payload": "eyJmdW5jdGlvbiI6ICJleHBvcnRfY2xpZW50X2J1bmRsZSIsICJjYWxsX2lkIjogImV4cG9ydF9jbGllbnRfYnVuZGxlLWMxOThmMmY5OTJlY2E0Mzk3YTFhZWI3ZDU5NzdhNDk1IiwgImFyZ3MiOiB7Im9wdGlvbnMiOiB7ImFydGlmYWN0SWQiOiAiYXJ0aWZhY3RfMjAyMS0xMC0yNC4wMDozODo0Ny1mdWxsIn19LCAibm9tYWRfam9iIjogInRlc3Qtam9iIiwgInByb2plY3QiOiAidGFsb24iLCAiZGF0YV9zb3VyY2UiOiAiUFJPRCIsICJ1c2VyIjogInRhbG9uIiwgImNsdXN0ZXIiOiAidGFsb24iLCAicmV1c2UiOiBmYWxzZSwgIm5vbWFkVG9rZW4iOiAiLS1yZWRhY3RlZC0tIn0="
}
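For readers reproducing this, the encoding step can be sketched as follows. This is a minimal illustration, not the reporter's actual client: `build_dispatch_body` and `PAYLOAD_LIMIT` are hypothetical names, and the size check is applied to the encoded string because that is the stricter reading of the limit quoted above.

```python
import base64
import json

# Limit quoted in the issue. Whether it applies to the encoded or the decoded
# form is an assumption here, so the check uses the longer, encoded string.
PAYLOAD_LIMIT = 16384


def build_dispatch_body(payload: dict) -> dict:
    """Serialize a payload to JSON, base64-encode it, and wrap it for dispatch."""
    raw = json.dumps(payload).encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    if len(encoded) > PAYLOAD_LIMIT:
        raise ValueError(
            f"encoded payload is {len(encoded)} chars, limit is {PAYLOAD_LIMIT}"
        )
    return {"Payload": encoded}


# A trimmed-down payload using field names from the issue.
example = {"function": "export_client_bundle", "nomad_job": "test-job", "reuse": False}
body = build_dispatch_body(example)
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be sent as the JSON request body to the job's dispatch endpoint.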

With a small change to the payload, it will return a 200 and submit the job:

{
  "function": "export_client_bundle",
  "call_id": "export_client_bundle-ec5f6840a550c773939e47128d66f9ac",
  "args": {
    "options": {
      "artifactId": "111111111111111111111111111111111"
    }
  },
  "nomad_job": "test-job",
  "project": "talon",
  "data_source": "PROD",
  "user": "talon",
  "cluster": "talon",
  "reuse": false,
  "nomadToken": "--redacted--"
}

Raw request body sent to nomad:

{
  "Payload": "eyJmdW5jdGlvbiI6ICJleHBvcnRfY2xpZW50X2J1bmRsZSIsICJjYWxsX2lkIjogImV4cG9ydF9jbGllbnRfYnVuZGxlLWVjNWY2ODQwYTU1MGM3NzM5MzllNDcxMjhkNjZmOWFjIiwgImFyZ3MiOiB7Im9wdGlvbnMiOiB7ImFydGlmYWN0SWQiOiAiMTExMTExMTExMTExMTExMTExMTExMTExMTExMTExMTExIn19LCAibm9tYWRfam9iIjogInRlc3Qtam9iIiwgInByb2plY3QiOiAidGFsb24iLCAiZGF0YV9zb3VyY2UiOiAiUFJPRCIsICJ1c2VyIjogInRhbG9uIiwgImNsdXN0ZXIiOiAidGFsb24iLCAicmV1c2UiOiBmYWxzZSwgIm5vbWFkVG9rZW4iOiAiLS1yZWRhY3RlZC0tIn0="
}
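To narrow down which fields matter, it can help to diff the failing and passing payloads mechanically. The sketch below is one way to do that; `flatten` and `changed_fields` are hypothetical helpers written for this illustration, and the two dictionaries are copied from the decoded payloads above.

```python
_MISSING = object()  # sentinel so a key absent on one side counts as a change


def flatten(value, prefix=""):
    """Flatten nested dicts into {"a.b.c": leaf_value} pairs."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


def changed_fields(a, b):
    """Return the sorted dotted paths whose values differ between a and b."""
    fa, fb = flatten(a), flatten(b)
    keys = fa.keys() | fb.keys()
    return sorted(k for k in keys if fa.get(k, _MISSING) != fb.get(k, _MISSING))


common = {
    "function": "export_client_bundle",
    "nomad_job": "test-job",
    "project": "talon",
    "data_source": "PROD",
    "user": "talon",
    "cluster": "talon",
    "reuse": False,
    "nomadToken": "--redacted--",
}
failing = dict(
    common,
    call_id="export_client_bundle-c198f2f992eca4397a1aeb7d5977a495",
    args={"options": {"artifactId": "artifact_2021-10-24.00:38:47-full"}},
)
passing = dict(
    common,
    call_id="export_client_bundle-ec5f6840a550c773939e47128d66f9ac",
    args={"options": {"artifactId": "111111111111111111111111111111111"}},
)

print(changed_fields(failing, passing))  # ['args.options.artifactId', 'call_id']
```

Only `call_id` and `args.options.artifactId` differ, so the failure depends on the payload's content rather than on its structure.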

Expected Result

Receive a 200 response for both payloads and the jobs to be submitted.

Actual Result

Receive a 500 response with `rpc error: EOF` for one of the payloads, and no job is submitted.

@notnoop
Contributor

notnoop commented Oct 26, 2021

I'm afraid that trying the payload you included here against my sample parameterized job didn't trigger the condition. I would expect to see a panic stack trace or some relevant log lines on the leader at the time of the 500 responses. Can you check the logs and see if any ERROR lines or panics were logged? Thanks!

@Talon876
Author

Here are the logs from when it happened:

nomad-logs.txt

Thanks for looking!

@notnoop
Contributor

notnoop commented Oct 26, 2021

Wow - looks like a bug in the Snappy library we use to compress the payload, in https://github.com/hashicorp/nomad/blob/v1.1.4/nomad/job_endpoint.go#L1949-L1950 !

runtime.throw({0x2336d8d, 0x5})
        runtime/panic.go:1198 +0x54 fp=0x40029f6930 sp=0x40029f6900 pc=0x43c024
runtime.sigpanic()
        runtime/signal_unix.go:742 +0x1e4 fp=0x40029f6970 sp=0x40029f6930 pc=0x455274
github.com/golang/snappy.encodeBlock({0x40001d2202, 0x1f4, 0x1f4}, {0x400095a9c0, 0x193, 0x195})
        github.com/golang/[email protected]/encode_arm64.s:666 +0x354 fp=0x40029fea10 sp=0x40029f6980 pc=0xf56f44
github.com/golang/snappy.Encode({0x0, 0x0, 0x0}, {0x400095a9c0, 0x193, 0x195})
        github.com/golang/[email protected]/encode.go:39 +0x1f0 fp=0x40029feaa0 sp=0x40029fea10 pc=0xf56500
github.com/hashicorp/nomad/nomad.(*Job).Dispatch(0x40009b3ea0, 0x40024c6e80, 0x4001b5f840)
        github.com/hashicorp/nomad/nomad/job_endpoint.go:1950 +0xbf4 fp=0x40029feed0 sp=0x40029feaa0 pc=0x1538614

I see some fixes in the upstream library in https://github.com/golang/snappy/commits/master/encode_arm64.s . I will try to reproduce the issue with arm64 hosts and follow up.

notnoop pushed a commit that referenced this issue Oct 27, 2021
…11396)

Pick up golang/snappy#56 to fix panics on arm64 architectures. tl;dr: Go 1.16 changed the `memmove` implementation for arm64 so that it uses additional CPU registers, which snappy's assembly implementation wasn't preserving.

Other projects have experienced this issue as well; searching for `encode_arm64.s:666` on your favorite search engine will reveal some. Vault updated the dependency earlier this August: hashicorp/vault#12371 .

I believe this issue affects Nomad 1.2.x and 1.1.x. Nomad 1.0.x uses Go 1.15 and isn't affected. However, backporting the change to 1.0.x should be harmless.

Fixed #11385 .
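
The conditions described in the commit message above can be summarized as a small predicate. This is only a sketch of that reasoning: `likely_affected` and its parameters are hypothetical, the version string is assumed to look like `1.16.5`, and whether a given snappy build contains the arm64 fix has to be determined separately.

```python
def parse_major_minor(version: str) -> tuple:
    """Turn a version string such as "1.16.5" into (1, 16)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def likely_affected(goarch: str, go_version: str, snappy_has_arm64_fix: bool) -> bool:
    """Apply the three conditions from the commit message.

    A binary is treated as affected when it runs on arm64, was built with
    Go 1.16 or newer, and links a snappy version without the arm64 fix.
    """
    return (
        goarch == "arm64"
        and parse_major_minor(go_version) >= (1, 16)
        and not snappy_has_arm64_fix
    )


print(likely_affected("arm64", "1.16.5", snappy_has_arm64_fix=False))  # True
print(likely_affected("arm64", "1.15.8", snappy_has_arm64_fix=False))  # False
print(likely_affected("amd64", "1.17.2", snappy_has_arm64_fix=False))  # False
```

This also fits the earlier failed reproduction attempt, which would not trigger the panic if it ran on a non-arm64 host.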
lgfa29 pushed a commit that referenced this issue Nov 15, 2021
…11396)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022