Improve consul integration user experience. #490
Comments
@F21 The Nomad client keeps retrying its connection to the Consul agent. The moment it connects to the agent, it syncs all the service definitions of tasks running on that node. Also, I am wondering why the Consul servers need to be run as system jobs? It makes perfect sense to run the Consul agents as system jobs. We probably shouldn't need to use the distinct host constraint when the job uses the system scheduler.
@diptanu I agree that the Consul servers probably won't need to use the system scheduler. Having said that, my initial rationale was to define the servers and agents together in a single system job. However, even after setting the consul-server group's count to 1 and adding a distinct_hosts constraint, I still get port collisions. This is the job I am currently using:
job "consul" {
# Job should run in the US region
region = "global"
# Spread tasks between us-west-1 and us-east-1
datacenters = ["dc1"]
# run this job globally
type = "system"
# Rolling updates should be sequential
update {
stagger = "30s"
max_parallel = 1
}
constraint{
distinct_hosts = "true"
}
group "consul-server" {
count = 1
# Create a web front end using a docker image
task "consul-server" {
driver = "docker"
config {
image = "f21global/consul"
network_mode = "host"
args = ["agent", "-server", "-bootstrap-expect", "1", "-data-dir", "/tmp/consul"]
}
resources {
cpu = 500
memory = 64
network {
# Request for a static port
port "consul_8300" {
static = 8300
}
port "consul_8301" {
static = 8301
}
port "consul_8302" {
static = 8302
}
port "consul_8400" {
static = 8400
}
port "consul_8500" {
static = 8500
}
port "consul_8600" {
static = 8600
}
}
}
}
}
group "consul-agent" {
# Create a web front end using a docker image
task "consul-agent" {
driver = "docker"
config {
image = "f21global/consul"
network_mode = "host"
args = ["agent", "-data-dir", "/tmp/consul", "-node=agent-twi"]
}
resources {
cpu = 500
memory = 64
network {
# Request for a static port
port "consul_8300" {
static = 8300
}
port "consul_8301" {
static = 8301
}
port "consul_8302" {
static = 8302
}
port "consul_8400" {
static = 8400
}
port "consul_8500" {
static = 8500
}
port "consul_8600" {
static = 8600
}
}
}
}
}
}
@F21 It looks like the agent and server are getting scheduled on the same machine; that's why you're getting the port collision. distinct_hosts at the job level just means that all the task groups are going to run on distinct machines. But a system job would still run on every single machine, and that's why the consul agent is getting scheduled alongside the consul server. We might need a way to exclude system jobs from running on machines with certain labels.
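One way such an exclusion could be expressed, sketched here on the assumption that the machines running Consul servers set node_class = "consul-server" in their Nomad client config (the job, image, and class name below are illustrative, not something from this thread), is a constraint against ${node.class}:

# Illustrative sketch only: a system job for the Consul agent that skips nodes
# whose Nomad client config sets node_class = "consul-server".
job "consul-agent" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "system"

  # System jobs honour constraints, so nodes of the "consul-server" class are excluded.
  constraint {
    attribute = "${node.class}"
    operator  = "!="
    value     = "consul-server"
  }

  group "consul-agent" {
    task "consul-agent" {
      driver = "docker"
      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-data-dir", "/tmp/consul"]
      }
      resources {
        cpu    = 500
        memory = 64
      }
    }
  }
}

Since ${node.class} defaults to an empty string, nodes that never set a class should also pass the != check and still receive the agent.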
I'm glad I'm not the only one running into this problem. There really should be a recommended way in the docs to run Consul, because having that service running is crucial for a production-ready cluster.
I tried splitting up consul-server and consul-agent into different service jobs (instead of system) to let them run independently, but that doesn't appear to be the right solution (or I missed something in the config). The Nomad node running consul-server keeps restarting the service.
And the node running consul-agent restarts a couple of times and then gets stuck.
I'm still digging into it, but just wanted to echo the need for a recommended way to run Consul on Nomad.
FWIW, I have avoided these issues by making the Consul network/service the thing that is set up and completed first, and which Nomad then uses. CM manages consul and nomad on all nodes, and init for each node works out the details of forming/joining a cluster with consensus and a leader. No chicken/egg issues here.
Has any progress been made on getting Consul running on Nomad? While it's possible to run Consul by itself alongside Nomad, doing so poses a few problems (this is assuming we only have 1 datacenter and want to have 3 consul servers, with the rest of the machines running the consul client).
@F21 You can definitely run Consul servers with Nomad. What is not possible today is to run both the Consul servers (using the service scheduler) and clients (using the system scheduler) via Nomad. The reason is that the system scheduler currently schedules all the system jobs on all the machines in a Nomad cluster; there is no way for the system scheduler to exclude running the Consul clients on the machines where Nomad is running the Consul servers. And on the same machine, the client and server can't run simultaneously because of port collisions. But if you just want to run the Consul servers on Nomad, it's definitely possible, and as you said, you could use Atlas's auto-join functionality to have the Consul servers find each other when they are dynamically scheduled by Nomad.
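A rough sketch of that server-only approach, reusing the image from this thread and the Atlas auto-join flags Consul shipped at the time (-atlas, -atlas-join, -atlas-token); the Atlas infrastructure name and token below are placeholders, and the details are illustrative rather than a recommendation from this thread:

# Illustrative only: three Consul servers as a service job, discovering each
# other through Atlas auto-join instead of hard-coded start-join addresses.
job "consul-server" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service"

  group "consul-server" {
    count = 3

    # Keep each server on its own machine.
    constraint {
      distinct_hosts = "true"
    }

    task "consul-server" {
      driver = "docker"
      config {
        image        = "f21global/consul"
        network_mode = "host"
        args = [
          "agent", "-server", "-bootstrap-expect", "3",
          "-data-dir", "/tmp/consul",
          "-atlas", "myorg/consul",
          "-atlas-join",
          "-atlas-token", "REPLACE_WITH_ATLAS_TOKEN"
        ]
      }
      resources {
        cpu    = 500
        memory = 128
      }
    }
  }
}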
@diptanu What are the Nomad team's plans to fully bring Consul scheduling to Nomad?
In terms of the service scheduler colliding with the system scheduler, maybe a key could be introduced to exclude system jobs from certain nodes. Or maybe the metadata feature can be improved so that it's exposed at the node level. For example, a task could add a piece of metadata that other jobs can then constrain against.
I've recently built a bunch of docker images to run HDFS, and having this feature would also be very useful. For example, I want to run the namenodes on distinct nodes. I also want to use the system scheduler to schedule datanodes on all nodes in the cluster except for nodes where the namenodes are running.
Another possible way to deal with Consul discovery without using Atlas would be to exploit the
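A sketch of how the node-level metadata idea could look, assuming every Nomad client sets a meta key such as hdfs_role in its client config (the key, values, image, and job below are hypothetical):

# Hypothetical sketch: a system job for HDFS datanodes that skips any node whose
# Nomad client config sets meta { hdfs_role = "namenode" }. Assumes every client
# sets hdfs_role (e.g. "namenode" or "worker") so the attribute always resolves.
job "hdfs-datanode" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "system"

  # Do not place datanodes on machines flagged as namenodes.
  constraint {
    attribute = "${meta.hdfs_role}"
    operator  = "!="
    value     = "namenode"
  }

  group "datanode" {
    task "datanode" {
      driver = "docker"
      config {
        # Placeholder name for the HDFS datanode image mentioned above.
        image        = "example/hdfs-datanode"
        network_mode = "host"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}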
Has any progress been made to run both consul agents and servers on nomad?
@F21, from what I can tell, it's not impossible, it just takes work. With that said, I have had a lot of success with consul as the primary core service that runs outside nomad, and I would recommend considering this route too.
@ketzacoatl That's what I am currently doing with a virtualized test cluster. However, if you have, say, 3 nodes running nomad servers and consul servers, how are you recovering if one of those nodes goes down or experiences a hardware failure?
I have my nomad servers and consul leaders running together on one auto-scaling group, 3 to 5 servers. If one node goes down, AWS follows the auto-scaling group setup and creates a replacement for the node(s) that are not present.
Ah, that makes sense. I am not using AWS but will probably be running on a set of dedicated servers and a public cloud provider without auto-scaling, so a machine going down will need manual intervention.
@ketzacoatl How are your ASG instances joining the cluster when started? That is, how do you "know" the other servers to join to?
@memelet, for the consul leaders themselves, I use some AWS hackery - Terraform creates the ASG and puts it in the smallest subnet possible (limiting the IP range). We have 2 AZs for failover on the ASG, so there are two subnets, and the list of "possible IPs" is computed (e.g., those that the leaders might actually have, but we don't know which, because it's an ASG), and that list is used to create a DNS entry for "all leader IPs". Note that the list of possible IPs is huge (~25) compared to the number of leader nodes (3 - 5). Consul agents can then be pointed at that DNS record for the leaders, and configured with
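A rough Terraform-style sketch of that "all possible leader IPs" record (assuming a reasonably recent Terraform with cidrhost/range and Route53; every name, zone ID, and CIDR below is a placeholder, not a value from this thread):

# Hypothetical sketch of the DNS trick described above: publish every IP the
# ASG instances could possibly receive as a single "all leader IPs" record.
resource "aws_route53_record" "consul_leaders" {
  zone_id = "Z0123456789EXAMPLE"            # placeholder private hosted zone ID
  name    = "consul-leaders.internal.example.com"
  type    = "A"
  ttl     = 60

  # One small /28 subnet keeps the set of possible IPs manageable; AWS reserves
  # the first few host addresses, so usable hosts start at .4. With two AZ
  # subnets you would concat() a second list like this one.
  records = [for i in range(4, 15) : cidrhost("10.0.1.0/28", i)]
}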
Why not run Consul with Nomad by default? Every node running Nomad could run Consul as well.
Running the two on the same hosts is trivial. The documentation for each app is clear on how to configure and run the software. There is a learning curve to understanding all the details you need to master in order to be effective.
@ketzacoatl wow, thank you for describing your AWS hackery! Didn't think that we could simply brute-force finding a leader in some subnet 👍
You could also use Lambda to update a DNS record when nodes in the ASG change - see https://objectpartners.com/2015/07/07/aws-tricks-updating-route53-dns-for-autoscalinggroup-using-lambda/ for an example.
Hey, I am going to close this since we recommend running Consul outside of Nomad.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I spent most of my morning trying to build a test Nomad + Consul cluster using Vagrant.
I am finding that, as it stands, Consul is very difficult to run on top of Nomad. I am sure the issues outlined here will spawn child issues that are more specific, but I think having a general issue will help provide discussion to improve the user experience before we break it down into specific tasks.
Here is a quick background of my investigations to narrow down the scope:
- I built a docker image for Consul (f21global/consul on docker hub).
- Both the Consul server and agent tasks run from this image (f21global/consul).
- Nomad's Consul integration expects a Consul agent to be reachable on localhost:8500.

This is currently the nomad task config I am using (the job file shown earlier in this thread).

Problems I ran into:
- Setting distinct_hosts causes nomad to panic (Setting distinct_hosts to a boolean causes panic #489); see the constraint sketch after this list.
- The servers and agents need each other's addresses for start-join and start-join-wan, which are only known after scheduling, so I would have to edit the consul.nomad file with the ip address and then send it to nomad as an update.
- There is no way to express a constraint such as "do not run on nodes that already have the consul-server task running".
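For the distinct_hosts panic mentioned above, a minimal sketch of the constraint written in the operator form documented by later Nomad releases, which sidesteps the boolean-versus-string ambiguity behind #489:

# Sketch of the distinct_hosts constraint in operator form; equivalent in intent
# to the distinct_hosts = "true" shorthand used in the job file above.
constraint {
  operator = "distinct_hosts"
  value    = "true"
}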