Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nomad logs no longer works without NOMAD_REGION set #24635

Closed
HerbCSO opened this issue Dec 9, 2024 · 4 comments · Fixed by #24644
Closed

nomad logs no longer works without NOMAD_REGION set #24635

HerbCSO opened this issue Dec 9, 2024 · 4 comments · Fixed by #24644

Comments

@HerbCSO
Copy link

HerbCSO commented Dec 9, 2024

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf10

Operating system and Environment details

MacOS CLI, Linux server

Issue

Up to and including Nomad CLI 1.7.7 I was able to run nomad alloc logs deadbeef without NOMAD_REGION being set. Since 1.8.0 NOMAD_REGION is required, else I just get a Unexpected response code: 500 (No path to region). So I have to either set NOMAD_REGION or run nomad alloc logs -region myregion deadbeef.

From the docs it reads to me like this is unintentional behavior. It says Defaults to the Agent's local region. so I would assume I wouldn't need to specify this parameter. I didn't find anything in the release notes that specifies a necessary config change for this, so please let me know if I missed something there.

Also, this sounds a bit like the issue described in hashicorp/nomad-pack#374, but I'm not sure if this is really the same thing.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

Reproduction steps

Run nomad alloc logs using a version prior to 1.8.0 - works
Run nomad alloc logs using a version prior >= 1.8.0 - produces 500 error

Expected Result

No 500 error

Actual Result

500 error

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

@tgross
Copy link
Member

tgross commented Dec 9, 2024

Hi @HerbCSO! It looks like you're running into the same issue described here: #24609 (comment). The alloc logs RPC uses the exact same paths as the alloc exec command described there. I'd ask if you can answer some of the same questions I've asked there.

For what it's worth, I haven't been able to reproduce this behavior yet.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

This is an interesting clue! The CLI is 1.8.0+ and the agents are older?

@HerbCSO
Copy link
Author

HerbCSO commented Dec 9, 2024

I'd ask if you can answer some of the same questions I've asked there.

I don't have full access to the cluster (specifically no API access), however nomad server members reports all 5 servers as alive, running version 1.6.3, one is the leader, all running raft version 3. Also nomad node status shows all nodes as eligible and ready status.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

This is an interesting clue! The CLI is 1.8.0+ and the agents are older?

Correct. And using CLI versions 1.6.2 or 1.7.7 (basically any version < 1.8.0) from the same machine I was able to pull logs without specifying the region.

@tgross
Copy link
Member

tgross commented Dec 10, 2024

Thanks @HerbCSO. I was able to reproduce and the reproducer was that the requests were going to a non-leader. The HTTP API request arrives at the server with the region set to global, so the logic we have to use the agent's region is not kicking in. I've used git bisect from 1.7.6 to main and determined that this b7419bc is the culprit. But I haven't figured out how that's breaking things yet.

(Edit: weirdly I think this may have revealed an existing bug, rather than being the bug itself. https://github.com/hashicorp/nomad/blob/v1.9.3/api/api.go#L612 doesn't match the behavior we have anywhere else.)

(Edit2: yeah this bug dates back to 4d3b75d. When making direct-to-client requests as in nomad alloc logs, we need to set the region for the mTLS cert but that's not always going to be global. We should be figuring out the real region in the command and then setting it in the API client we configure, which I suspect was happening accidentally before b7419bc so long as another request happened first.)

tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address. In #23785 we fixed a bug where if the
configuration was used across multiple clients that mutation would happen
multiple times and the address would be incorrectly parsed.

When making `alloc log` or `alloc exec` calls to a region where the region is
not "global", we create a new client from the same configuration and then set
the address. But in this case we copy the private `url` field and that causes
the URL parsing to be skipped for the new client. This results in the region
always being set to the string literal `global` (because of mTLS handling code
introduced all the way back in 4d3b75d), which fails with an error "no path
to region" when the cluster isn't non-global and requests are sent to a
non-leader.

The "right" way of fixing this would be for `ClientConfig` not to change the
region to global in the first place, but as this is a public API and extremely
longstanding behavior, it could potentially be a breaking change for some
downstream consumers. Instead, we'll avoid copying the private `url` field so
that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
@tgross
Copy link
Member

tgross commented Dec 10, 2024

Fix up here for review: #24644

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 10, 2024
@tgross tgross added this to the 1.9.x milestone Dec 10, 2024
tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 16, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 16, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Dec 16, 2024
tgross added a commit that referenced this issue Dec 16, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 16, 2024
…ddress (#24644) (#24682)

In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d

Co-authored-by: Tim Gross <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

2 participants