Setting envoy admin IP to anything besides 127.0.0.1 breaks prometheus metrics #10747

Closed
alexdulin opened this issue Aug 1, 2021 · 3 comments · Fixed by #10757
Assignees: blake
Labels: good first issue (A well-defined bug or improvement with sufficient context which should be approachable for new contributors), theme/consul-nomad (Consul & Nomad shared usability), type/bug (Feature does not function as expected)

Comments

alexdulin commented Aug 1, 2021

Overview of the Issue

Running an Envoy proxy with consul connect envoy and setting -admin-bind to an IP other than 127.0.0.1 breaks Prometheus metrics, because the self_admin cluster does not receive the correct IP for the admin listener: it always points at 127.0.0.1, regardless of what was passed to consul connect envoy. This makes it impossible to bind the admin listener to any IP other than 127.0.0.1 and still scrape Prometheus metrics correctly.

My guess is that this happens because the IP is hard-coded into the bootstrap configuration that the command generates, so it cannot be changed regardless of what the admin bind flag was set to.

This problem was discovered due to a recent "bug fix" in Nomad that causes the admin listener for Envoy sidecars to bind to 127.0.0.2 instead of 127.0.0.1: hashicorp/nomad#10883. That change makes it impossible to use Nomad 1.1.3 and collect Prometheus metrics from Envoy.
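
One way to check this guess without even running Envoy (not something from the original report, and assuming jq is installed) is to dump the generated bootstrap configuration with the -bootstrap flag and look at the self_admin cluster; the jq path below assumes the cluster is defined under static_resources.clusters in the bootstrap JSON. Note that the self_admin cluster only appears once something like envoy_prometheus_bind_addr is configured, so write the proxy-defaults entry from the reproduction steps first.

# Same flags as the reproduction steps below, plus -bootstrap, which prints
# the generated bootstrap config instead of exec'ing Envoy.
consul connect envoy \
  -admin-bind=127.0.0.2:19002 \
  -address=127.0.0.1:19001 \
  -gateway=mesh \
  -register \
  -bootstrap \
  | jq '.static_resources.clusters[] | select(.name == "self_admin")'

If the guess above is correct, the endpoint address printed for self_admin stays 127.0.0.1 no matter what -admin-bind is set to.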

Reproduction Steps

  1. Start a local consul agent
consul agent -dev
  2. In a second terminal, run the following:
/bin/cat <<"EOM" | consul config write -
Kind = "proxy-defaults"
Name = "global"
Config {
  protocol = "http"
  envoy_prometheus_bind_addr = "0.0.0.0:9114"
}
EOM

consul connect envoy \
  -admin-bind=127.0.0.2:19002 \
  -address=127.0.0.1:19001 \
  -gateway=mesh \
  -register
  3. In a third terminal, get the listeners on the envoy proxy with curl -s 127.0.0.2:19002/listeners. This should show that a prometheus listener was registered, with output like the following:
envoy_prometheus_metrics_listener::0.0.0.0:9114
default:127.0.0.1:19001::127.0.0.1:19001
  4. However, the upstream cluster for self_admin will have the wrong IP of 127.0.0.1, not 127.0.0.2. Running curl -s 127.0.0.2:19002/clusters | grep self_admin | sort confirms this with output like the following (see also the config_dump check after these steps):
self_admin::127.0.0.1:19002::canary::false
self_admin::127.0.0.1:19002::cx_active::0
self_admin::127.0.0.1:19002::cx_connect_fail::0
self_admin::127.0.0.1:19002::cx_total::0
self_admin::127.0.0.1:19002::health_flags::healthy
self_admin::127.0.0.1:19002::hostname::
self_admin::127.0.0.1:19002::local_origin_success_rate::-1.0
self_admin::127.0.0.1:19002::priority::0
self_admin::127.0.0.1:19002::region::
self_admin::127.0.0.1:19002::rq_active::0
self_admin::127.0.0.1:19002::rq_error::0
self_admin::127.0.0.1:19002::rq_success::0
self_admin::127.0.0.1:19002::rq_timeout::0
self_admin::127.0.0.1:19002::rq_total::0
self_admin::127.0.0.1:19002::sub_zone::
self_admin::127.0.0.1:19002::success_rate::-1.0
self_admin::127.0.0.1:19002::weight::1
self_admin::127.0.0.1:19002::zone::
  5. Consequently, curling the prometheus listener with curl -s localhost:9114/metrics results in a 503:
upstream connect error or disconnect/reset before headers. reset reason: connection failure
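
As an additional check referenced in step 4 (not part of the original report, and assuming jq is installed), the address Envoy was configured with for self_admin can be read back from the admin API's config_dump endpoint; the filter below assumes the cluster shows up under static_clusters in the clusters config dump:

# Inspect the statically configured self_admin cluster via the running
# proxy's admin API.
curl -s 127.0.0.2:19002/config_dump \
  | jq '.configs[]
        | select(."@type" | endswith("ClustersConfigDump"))
        | .static_clusters[]
        | select(.cluster.name == "self_admin")
        | .cluster'

On an affected build this should show the same 127.0.0.1:19002 endpoint as the /clusters output in step 4, confirming the wrong address comes from the bootstrap configuration rather than from anything Envoy does at runtime.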

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = db839f18
	version = 1.10.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 127.0.0.1:8300
	server = true
raft:
	applied_index = 77
	commit_index = 77
	fsm_pending = 0
	last_contact = 0
	last_log_index = 77
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:1c8a1e81-16d4-86a6-bd21-2af1a0a4de76 Address:127.0.0.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 131
	max_procs = 8
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

envoy --version

envoy  version: 98c1c9e9a40804b93b074badad1cdf284b47d58b/1.18.3/clean-getenvoy-b76c773-envoy/RELEASE/BoringSSL

jkirschner-hashicorp (Contributor) commented

@alexdulin: thank you for raising this issue with such detail, including diagnosing the likely root cause! Your efforts will make this issue much easier to address.

Assuming that hard-coded IP address you linked is the root cause, the bind address should be available in that context as args.AdminBindAddress.

Message to all:
We believe this is a "good first issue" for either a community member or a member of the HashiCorp maintainer team. If you are a community member and want to contribute the fix, let us know!

jkirschner-hashicorp added the theme/consul-nomad, type/bug, and good first issue labels on Aug 2, 2021
blake added a commit that referenced this issue Aug 2, 2021
Configure the self_admin cluster to use the admin bind address
provided when starting Envoy.

Fixes #10747
blake self-assigned this on Aug 2, 2021
blake and hc-github-team-consul-core added further commits that referenced this issue on Aug 4 and Aug 10, 2021, all with the same message.

blake (Member) commented Aug 10, 2021

Hi @alexdulin, thanks for reporting this issue. A fix has been merged and will be available in the next Consul 1.8, 1.9, and 1.10 patch releases toward the end of the month.

alexdulin (Author) commented

Thanks for getting a fix in and following up on it. Much appreciated! Looking forward to the new release.

blake added a commit that referenced this issue on Aug 11, 2021, with the same message.