Reserved port duplicates/conflicts block all allocations #13505

Closed
groggemans opened this issue Jun 28, 2022 · 6 comments · Fixed by #13651
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/client type/bug

Comments

@groggemans
Contributor

Nomad version

Nomad v1.3.1 (2b054e3)

Operating system and Environment details

Linux

Issue

Any duplicate port reservations or conflicts within the agent config will result in all job allocations failing with a port conflict. Even jobs that don't have any port specification will fail.

Whether this is a bug is debatable, since it only affects fairly specific setups: multiple aliases for the same network, or reserved port ranges defined both globally and on a specific host_network. You could argue that users should define a port reservation in only one place (on a single network alias or at the global level) and keep only network-specific ports in the alias config. That feels rather unnatural to me, though, and if it is the intended behavior, the documentation should carry a prominent warning about it.

Reproduction steps

Either have a global port reservation overlap a host_network reservation or have multiple aliases for the same address range reserve the same ports.

Expected Result

In the past the port reservations merged and didn't cause any conflicts.

Actual Result

Allocations fail because of port conflicts, even if the job isn't reserving any ports

Job file (if appropriate)

Any job will do; I've tested locally with the example job (nomad job init).

Agent config snippet

    host_network "LAN1" {
        cidr = "192.168.0.0/24"
        reserved_ports = "22"
    }
    host_network "default" { 
        cidr = "192.168.0.0/24"
        reserved_ports = "22"
    }

The conflict also happens if there's a reserved block including a port which is also declared in one of the host_network definitions.

Nomad Server logs (if appropriate)

  • WARNING: Failed to place all allocations.
    Task Group "cache" (failed to place 1 allocation):
    • Resources exhausted on 1 nodes
    • Class "gp_compute" exhausted on 1 nodes
    • Dimension "network: port collision" exhausted on 1 nodes
@jrasell
Member

jrasell commented Jun 29, 2022

Hi @groggemans and thanks for raising this. I was able to reproduce this locally using Nomad v1.3.1 and the below setup for future readers.

Running the agent in dev mode using the following additional config snippet via nomad agent -dev -config=test.hcl:

client {
  host_network "LAN1" {
    cidr = "192.168.0.0/24"
    reserved_ports = "22"
  }
  host_network "default" {
    cidr = "192.168.0.0/24"
    reserved_ports = "22"
  }
}

Attempting to run the following minimal jobspec via nomad job run example.nomad:

job "example" {
  datacenters = ["dc1"]
  group "cache" {
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
    }
  }
}

@jrasell jrasell added theme/client stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jun 29, 2022
@jrasell
Member

jrasell commented Jun 29, 2022

@groggemans would you be able to expand a little on the use cases you have mentioned in the below quote?

there are only very specific use cases where you have multiple aliases for the same network or have reserved port ranges defined on a global and host_network specific level

@schmichael
Member

Reproduced with the following agent configuration in networks.hcl:

client {
  reserved {
    reserved_ports = "22"
  }

  host_network "eth" {
    cidr = "192.168.0.0/16"
    reserved_ports = "22"
  }
}

Simply run with nomad agent -dev -config networks.hcl and try to run any job. The node will have an invalid NetworkIndex and be ineligible for placements (the https://nomadproject.io/s/port-plan-failure bug)

I think there are 2 possible fixes here:

  1. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides.
  2. Fail to start the client agent if both are set, with an error message instructing users to set one or the other.

I think the 1st option is the friendliest here, and I've done brief testing to see that both options seem viable. If there aren't any other ideas I'll probably push a PR tomorrow to do the 1st.
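
To make the two options concrete, here is a sketch of an agent configuration that sets both values, annotated with how each option would treat it according to the descriptions above. The port numbers are made up for illustration, and the comments describe the proposals as worded in this thread, not confirmed Nomad behavior.

```hcl
client {
  reserved {
    # Global reservation (illustrative values).
    reserved_ports = "22,8000-8100"
  }

  host_network "eth" {
    cidr = "192.168.0.0/16"

    # Per-network reservation that repeats port 22 (illustrative values).
    reserved_ports = "22,9000"
  }
}

# Option 1 (global value as default, host_network value as override):
#   addresses in "eth" would reserve only 22 and 9000, because the
#   host_network setting replaces the global one for that network.
#
# Option 2 (refuse to start):
#   the agent would exit with an error, because reserved_ports is set both
#   globally and on a host_network.
```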

@groggemans
Contributor Author

groggemans commented Jul 7, 2022

I agree that just failing to start might not be the ideal solution and would also prefer the first option in that case.

However, the 1st option doesn't address the case where multiple aliases for the same network define the same ports.
It would be nice if those were merged instead of causing conflicts.

The specific use case I see for multiple aliases for the same network is a cluster with mixed network configurations, where you might not want to adapt all the jobs. On certain nodes there might be separate management and LAN networks, while on others there is only a single interface. From a configuration standpoint, that single interface is then identified as both LAN and MGMT on the second kind of node, while the first kind has actual separate interfaces.

I have a specific use case like that in my local cluster, where all/most machines have the same networks defined, but not all use the same network config, because some are on Wi-Fi or don't have multiple interfaces/VLANs, etc.

My use case is probably better solved by being able to attach "tags" to interfaces and being able to filter on those. But the current alias system is able to offer a similar experience although with a bit more config. So it's probably not worth the development effort at this point.
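
The alias setup described above can be sketched as two agent configurations. The network names "lan" and "mgmt" and all CIDR ranges are hypothetical and only serve to illustrate the shape of the config.

```hcl
# --- Agent config on a node with separate LAN and management interfaces ---
client {
  host_network "lan" {
    cidr           = "192.168.0.0/24" # hypothetical LAN range
    reserved_ports = "22"
  }

  host_network "mgmt" {
    cidr           = "10.0.0.0/24" # hypothetical management range
    reserved_ports = "22"
  }
}

# --- Agent config on a node with a single interface ---
# Both aliases point at the same range so that jobs referring to either
# name can still be placed here. On v1.3.1 this is the shape of config
# that triggers the port collision reported in this issue.
client {
  host_network "lan" {
    cidr           = "192.168.0.0/24"
    reserved_ports = "22"
  }

  host_network "mgmt" {
    cidr           = "192.168.0.0/24"
    reserved_ports = "22"
  }
}
```

With both aliases defined on every node, a jobspec can keep a single `host_network = "mgmt"` setting in its port block regardless of which kind of node it lands on.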

schmichael added a commit that referenced this issue Jul 8, 2022
schmichael added a commit that referenced this issue Jul 12, 2022
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

  • Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
  • Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
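
As an illustration of "merging down", the sketch below reuses the reproduction config from this issue with one extra, made-up port (8080). The comment states the effective reservation one would expect from the commit message; it is a reading of that description, not output captured from a fixed agent.

```hcl
client {
  reserved {
    # Global reservation, merged "down" into every host_network.
    reserved_ports = "22"
  }

  host_network "eth" {
    cidr = "192.168.0.0/16"

    # Repeating 22 here should no longer collide with the global value;
    # 8080 is an illustrative extra port specific to this network.
    reserved_ports = "22,8080"
  }
}

# Expected effective reservation on addresses in "eth": 22 and 8080
# (the union of the global and the host_network values).
```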
@schmichael
Member

@groggemans the fix will be going out in 1.3, 1.2, and 1.1 soon! The way the agent configuration maps to reserved ports on IPs is still a little harder to observe than I'd like, but we have plans for more networking work in the future! Feel free to open new issues with any bugs or ideas you might have.

lgfa29 pushed a commit that referenced this issue Jul 13, 2022
lgfa29 pushed a commit that referenced this issue Jul 13, 2022

Co-authored-by: Michael Schurter <[email protected]>
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022
3 participants