
Cannot change ingress container from http to tcp (or vice versa) when using Consul Service Mesh #14802

Open
brian-athinkingape opened this issue Oct 4, 2022 · 3 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/consul/connect Consul Connect integration theme/networking type/bug

Comments

@brian-athinkingape
Contributor

Nomad version

Nomad v1.3.5 (1359c25)

Operating system and Environment details

Ubuntu 22.04 on AWS (on a fresh EC2 instance), amd64

Consul v1.13.2
Revision 0e046bbb
Build Date 2022-09-20T20:30:07Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Docker version 20.10.18, build b40c2f6

Issue

If I run an ingress container with the http protocol, I'm unable to edit it to use tcp, even after I stop the job. Even if I run nomad system gc and nomad system reconcile summaries, it still doesn't work. I'm also unable to edit the Consul service-defaults config to change the protocol (the consul config write under Actual Result below fails with the same error).

If I swap all instances of http and tcp, I get the same errors.
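
For reference, the protocol Consul currently has on record for the gateway and the proxy defaults can be inspected directly (assuming the default local dev agent):

$ consul config read -kind ingress-gateway -name test-ingress
$ consul config read -kind proxy-defaults -name global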

Reproduction steps

  1. Start Nomad and Consul in dev mode:
consul agent -dev
sudo nomad agent -dev-connect
  2. Set up Consul to use http as the default protocol (using the proxy-defaults.hcl file below):
consul config write proxy-defaults.hcl
  3. Run the first job file:
nomad job run job1.nomad
  4. After the job has started, stop it:
nomad job stop job1
  5. Once the job has stopped successfully, run the second job file:
nomad job run job2.nomad

Expected Result

I should be able to run job2 as normal.

Actual Result

$ nomad job run job2.nomad 
Error submitting job: Unexpected response code: 500 (Unexpected response code: 500 (service "test-upstream" has protocol "http", which does not match defined listener protocol "tcp"))
$ consul config write service-defaults.hcl 
Error writing config entry service-defaults/test-upstream: Unexpected response code: 500 (service "test-upstream" has protocol "tcp", which does not match defined listener protocol "http")

Job file (if appropriate)

proxy-defaults.hcl

Kind      = "proxy-defaults"
Name      = "global"
Config {
  protocol = "http"
}

service-defaults.hcl

Kind     = "service-defaults"
Name     = "test-upstream"
Protocol = "tcp"

job1.nomad:

job "job1" {
    region = "global"
    datacenters = ["dc1"]
    type = "system"

    group "group1" {
        network {
            mode = "bridge"
            port "default" {
                static = 12345
                to = 12345
            }
        }
        service {
            name = "test-ingress"
            port = "12345"
            connect {
                gateway {
                    proxy {
                        connect_timeout = "5s"
                    }
                    ingress {
                        listener {
                            port = 12345
                            protocol = "http"
                            service {
                                name = "test-upstream"
                                hosts = ["*"]
                            }
                        }
                    }
                }
            }
        }
    }
}

job2.nomad:

job "job2" {
    region = "global"
    datacenters = ["dc1"]
    type = "system"

    group "group2" {
        network {
            mode = "bridge"
            port "default" {
                static = 12345
                to = 12345
            }
        }
        service {
            name = "test-ingress"
            port = "12345"
            connect {
                gateway {
                    proxy {
                        connect_timeout = "5s"
                    }
                    ingress {
                        listener {
                            port = 12345
                            protocol = "tcp"
                            service {
                                name = "test-upstream"
                            }
                        }
                    }
                }
            }
        }
    }
}
@tgross tgross self-assigned this Oct 5, 2022
@tgross tgross added the theme/consul/connect Consul Connect integration label Oct 5, 2022
@tgross
Member

tgross commented Oct 5, 2022

Hi @brian-athinkingape! I was able to reproduce exactly the behavior you're seeing. Thank you so much for providing a solid minimal example; it really helps a lot! The tl;dr is that you've hit a known design issue between Consul and Nomad around gateways, which is described by my colleague @shoenig in #8647 (comment).

There's a workaround roughly described in hashicorp/consul#10308 (comment). I'm going to show that workaround first and then get into the nitty-gritty of why this is happening below.

Workaround

Read the current kind=ingress-gateway config to a file, and then remove the listener:

$ consul config read -kind ingress-gateway -name test-ingress > ./ingress.json

Transform this into:

{
    "Kind": "ingress-gateway",
    "Name": "test-ingress",
    "TLS": {
        "Enabled": false
    }
}
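
If you'd rather script that edit than do it by hand, a jq filter can produce the stripped entry (a sketch, assuming jq is available and the exported entry has top-level Kind, Name, and TLS fields):

$ jq '{Kind, Name, TLS: {Enabled: .TLS.Enabled}}' ./ingress.json > ./ingress-stripped.json

If you go this route, substitute ./ingress-stripped.json for ./ingress.json in the write below.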

Write the new config and delete the kind=proxy-defaults config:

$ consul config write ./ingress.json
Config entry written: ingress-gateway/test-ingress

$ consul config delete -kind proxy-defaults -name global
Config entry deleted: proxy-defaults/global

Now the second job works:

$ nomad job run ./job2.nomad
==> 2022-10-05T11:13:04-04:00: Monitoring evaluation "7ecf8803"
    2022-10-05T11:13:04-04:00: Evaluation triggered by job "job2"
    2022-10-05T11:13:04-04:00: Allocation "fdef5d8f" created: node "35be55c7", group "group2"
    2022-10-05T11:13:04-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-05T11:13:04-04:00: Evaluation "7ecf8803" finished with status "complete"

Reproduction

Running job2 hits the error you reported:

$ nomad job run ./job2.nomad
Error submitting job: Unexpected response code: 500 (Unexpected response code: 500 (service "test-upstream" has protocol "http", which does not match defined listener protocol "tcp"))

A clue to what's going on is that job2 isn't registered at all, which means the failure happens during initial job submission and not as part of allocation setup after we've scheduled the workload. That narrows the behavior down to this block (job_endpoint.go#L249-L272) in the Job.Register RPC, which writes a configuration entry to Consul. (I'm also seeing that Sentinel policy enforcement happens after we've done that, which seems backwards, but I'll address that elsewhere.)
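
You can confirm that it's Consul's config-entry validation rejecting the write, independent of Nomad, by submitting the equivalent ingress entry by hand (a sketch; ingress-tcp.json is a file made up for illustration, containing the tcp listener that job2 asks for):

$ cat > ./ingress-tcp.json <<'EOF'
{
    "Kind": "ingress-gateway",
    "Name": "test-ingress",
    "Listeners": [
        {
            "Port": 12345,
            "Protocol": "tcp",
            "Services": [{ "Name": "test-upstream" }]
        }
    ]
}
EOF
$ consul config write ./ingress-tcp.json

This should fail with the same protocol-mismatch 500 that Nomad surfaces from the Job.Register RPC.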

I was a little confused about why we'd be doing this in the job-register code path at all rather than on the client node after an allocation is placed, but some digging turned up this comment (#8647 (comment)) from my colleague @shoenig, which discusses the "multi-writer" problem we have here. Ultimately, Consul owns the configuration entry and it's global, so multiple Nomad clusters could be writing to it.

One way to imagine the problem is to consider what would happen if you ran both job1 and job2 at the same time! We wouldn't have any way of updating Consul correctly in this case.

So ultimately this issue is a duplicate of #8647 and something we need to fix, which I realize isn't very satisfying in the short term.

A challenging part of figuring out what to do as an operator is that the Consul CLI and UI aren't super clear about the data you need. The ingress gateway isn't exposed in the consul catalog CLI at all, so it took me a little while to find hashicorp/consul#10308 and develop the workaround described above.
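
For anyone else hunting for this state, the gateway's configuration is discoverable through the config subcommands rather than the catalog (assuming the dev agent from the repro):

$ consul config list -kind ingress-gateway
test-ingress
$ consul config read -kind ingress-gateway -name test-ingress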

@tgross
Member

tgross commented Oct 5, 2022

Although this is technically a duplicate, there could be unique bits to it. I'm going to keep this open, mark it for roadmapping, and crosslink to it from #8647.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/networking labels Oct 5, 2022
@tgross tgross removed their assignment Oct 5, 2022
@brian-athinkingape
Contributor Author

Thanks! We used the workaround to fix this on our production system for now; looking forward to when it can be resolved properly.

Projects
Status: Needs Roadmapping