
Cannot change ingress container from http to tcp (or vice versa) when using Consul Service Mesh #14802

Open
brian-athinkingape opened this issue Oct 4, 2022 · 3 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/consul/connect Consul Connect integration theme/networking type/bug

Comments

@brian-athinkingape
Contributor

Nomad version

Nomad v1.3.5 (1359c25)

Operating system and Environment details

Ubuntu 22.04 on AWS (on a fresh EC2 instance), amd64

Consul v1.13.2
Revision 0e046bbb
Build Date 2022-09-20T20:30:07Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Docker version 20.10.18, build b40c2f6

Issue

If I run an ingress container with the http protocol, I'm unable to edit it to use tcp, even after I stop the job. Even if I run nomad system gc and nomad system reconcile summaries, it still doesn't work. I'm also unable to edit the Consul service-defaults config to change the protocol (the consul config write under Actual Result below fails with the same error).

If I swap all instances of http and tcp, I get the same errors.
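
For reference, the protocol Consul currently has on record for the gateway and the proxy defaults can be inspected directly (assuming the default local dev agent):

$ consul config read -kind ingress-gateway -name test-ingress
$ consul config read -kind proxy-defaults -name global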

Reproduction steps

  1. Start Nomad and Consul in dev mode:
consul agent -dev
sudo nomad agent -dev-connect
  2. Set up Consul to use http as the default protocol (using the proxy-defaults.hcl file below):
consul config write proxy-defaults.hcl
  3. Run the first job file:
nomad job run job1.nomad
  4. After the job has started, stop it:
nomad job stop job1
  5. Once the job has stopped successfully, run the second job file:
nomad job run job2.nomad

Expected Result

I should be able to run job2 as normal.

Actual Result

$ nomad job run job2.nomad 
Error submitting job: Unexpected response code: 500 (Unexpected response code: 500 (service "test-upstream" has protocol "http", which does not match defined listener protocol "tcp"))
$ consul config write service-defaults.hcl 
Error writing config entry service-defaults/test-upstream: Unexpected response code: 500 (service "test-upstream" has protocol "tcp", which does not match defined listener protocol "http")

Job file (if appropriate)

proxy-defaults.hcl

Kind      = "proxy-defaults"
Name      = "global"
Config {
  protocol = "http"
}

service-defaults.hcl

Kind     = "service-defaults"
Name     = "test-upstream"
Protocol = "tcp"

job1.nomad:

job "job1" {
    region = "global"
    datacenters = ["dc1"]
    type = "system"

    group "group1" {
        network {
            mode = "bridge"
            port "default" {
                static = 12345
                to = 12345
            }
        }
        service {
            name = "test-ingress"
            port = "12345"
            connect {
                gateway {
                    proxy {
                        connect_timeout = "5s"
                    }
                    ingress {
                        listener {
                            port = 12345
                            protocol = "http"
                            service {
                                name = "test-upstream"
                                hosts = ["*"]
                            }
                        }
                    }
                }
            }
        }
    }
}

job2.nomad:

job "job2" {
    region = "global"
    datacenters = ["dc1"]
    type = "system"

    group "group2" {
        network {
            mode = "bridge"
            port "default" {
                static = 12345
                to = 12345
            }
        }
        service {
            name = "test-ingress"
            port = "12345"
            connect {
                gateway {
                    proxy {
                        connect_timeout = "5s"
                    }
                    ingress {
                        listener {
                            port = 12345
                            protocol = "tcp"
                            service {
                                name = "test-upstream"
                            }
                        }
                    }
                }
            }
        }
    }
}
@tgross tgross self-assigned this Oct 5, 2022
@tgross tgross added the theme/consul/connect Consul Connect integration label Oct 5, 2022
@tgross
Member

tgross commented Oct 5, 2022

Hi @brian-athinkingape! I was able to reproduce exactly the behavior you're seeing. Thank you so much for providing a solid minimal example; it really helps a lot! The tl;dr is that you've hit a known design issue between Consul and Nomad around gateways, which is described by my colleague @shoenig in #8647 (comment).

There's a workaround roughly described in hashicorp/consul#10308 (comment). I'm going to show that workaround first and then get into the nitty-gritty of why this is happening below.

Workaround

Read the current kind=ingress-gateway config to a file, and then remove the listener:

$ consul config read -kind ingress-gateway -name test-ingress > ./ingress.json

Transform this into:

{
    "Kind": "ingress-gateway",
    "Name": "test-ingress",
    "TLS": {
        "Enabled": false
    }
}
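
If you'd rather script that edit than do it by hand, a jq filter can produce the stripped entry (a sketch, assuming jq is available and the exported entry has top-level Kind, Name, and TLS fields):

$ jq '{Kind, Name, TLS: {Enabled: .TLS.Enabled}}' ./ingress.json > ./ingress-stripped.json

If you go this route, substitute ./ingress-stripped.json for ./ingress.json in the write below.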

Write the new config and delete the kind=proxy-defaults config:

$ consul config write ./ingress.json
Config entry written: ingress-gateway/test-ingress

$ consul config delete -kind proxy-defaults -name global
Config entry deleted: proxy-defaults/global

Now the second job works:

$ nomad job run ./job2.nomad
==> 2022-10-05T11:13:04-04:00: Monitoring evaluation "7ecf8803"
    2022-10-05T11:13:04-04:00: Evaluation triggered by job "job2"
    2022-10-05T11:13:04-04:00: Allocation "fdef5d8f" created: node "35be55c7", group "group2"
    2022-10-05T11:13:04-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-05T11:13:04-04:00: Evaluation "7ecf8803" finished with status "complete"

Reproduction

Running job2 hits the error you reported:

$ nomad job run ./job2.nomad
Error submitting job: Unexpected response code: 500 (Unexpected response code: 500 (service "test-upstream" has protocol "http", which does not match defined listener protocol "tcp"))

A clue to what's going on is that job2 isn't registered at all, which means the failure happens during initial job submission and not as part of allocation setup after we've scheduled the workload. That narrows the behavior down to this block (job_endpoint.go#L249-L272) in the Job.Register RPC, which writes a configuration entry to Consul. (I'm also seeing that Sentinel policy enforcement happens after we've done that, which seems backwards, but I'll address that elsewhere.)
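
You can confirm that it's Consul's config-entry validation rejecting the write, independent of Nomad, by submitting the equivalent ingress entry by hand (a sketch; ingress-tcp.json is a file made up for illustration, containing the tcp listener that job2 asks for):

$ cat > ./ingress-tcp.json <<'EOF'
{
    "Kind": "ingress-gateway",
    "Name": "test-ingress",
    "Listeners": [
        {
            "Port": 12345,
            "Protocol": "tcp",
            "Services": [{ "Name": "test-upstream" }]
        }
    ]
}
EOF
$ consul config write ./ingress-tcp.json

This should fail with the same protocol-mismatch 500 that Nomad surfaces from the Job.Register RPC.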

I was a little confused about why we'd be doing this in the job-register code path at all rather than on the client node after an allocation is placed, but some digging turned up this comment (#8647 (comment)) from my colleague @shoenig, which discusses the "multi-writer" problem we have here. Ultimately, Consul owns the configuration entry and it's global, so multiple Nomad clusters could be writing to it.

One way to imagine the problem is to consider what would happen if you ran both job1 and job2 at the same time! We wouldn't have any way of updating Consul correctly in this case.

So ultimately this issue is a duplicate of #8647 and something we need to fix, which I realize isn't very satisfying in the short term.

A challenging part of figuring out what to do as an operator is that the Consul CLI and UI aren't super clear about the data you need. The ingress gateway isn't exposed in the consul catalog CLI at all, so it took me a little while to find hashicorp/consul#10308 and develop the workaround described above.
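
For anyone else hunting for this state, the gateway's configuration is discoverable through the config subcommands rather than the catalog (assuming the dev agent from the repro):

$ consul config list -kind ingress-gateway
test-ingress
$ consul config read -kind ingress-gateway -name test-ingress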

@tgross
Member

tgross commented Oct 5, 2022

Although this is technically a duplicate, there could be unique bits to it. I'm going to keep this open, mark it for roadmapping, and crosslink to it from #8647.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/networking labels Oct 5, 2022
@tgross tgross removed their assignment Oct 5, 2022
@brian-athinkingape
Contributor Author

Thanks! We used the workaround to fix this on our production system for now; looking forward to when it can be resolved properly.

Projects
Status: Needs Roadmapping