Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exporter/loadbalancingexporter: won't start unless a Port is explicitly given #10321

Closed
jvilhuber opened this issue May 25, 2022 · 6 comments
Closed
Labels
bug Something isn't working exporter/loadbalancing priority:p3 Lowest

Comments

@jvilhuber
Copy link
Contributor

Describe the bug
The documentation indicates that the default 4317 port is used for the DNS Resolver, in the case when no Port is given, however I have not been able to get it to work. Whenever I run without a port I get a warning:

{"level":"warn","ts":1653367580.393693,"caller":"zapgrpc/zapgrpc.go:191","msg":"[core] [Channel #13 SubChannel #14] grpc: addrConn.createTransport failed to connect to {\n  \"Addr\": \"10.1.0.79:4317\",\n  \"ServerName\": \"10.1.0.79:4317\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}. Err: connection error: desc = \"transport: Error while dialing dial tcp 10.1.0.79:4317: operation was canceled\"","grpc_log":true}

and no traces are processed. It seems it SHOULD work, since the default port is added here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L128
(and a non-default port is added here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/resolver_dns.go#L144-L146 )

Steps to reproduce
Configure a dns resolver with only a hostname, then run the exporter. See example configs below.

What did you expect to see?
Normal startup, no warning.

What did you see instead?
See warning message above.

What version did you use?
0.51.0

What config did you use?

"exporters":
  "loadbalancing":
    "protocol":
      "otlp":
        "compression": "snappy"
        "retry_on_failure":
          "enabled": true
          "initial_interval": "5s"
          "max_elapsed_time": "300s"
          "max_interval": "30s"
        "sending_queue":
          "enabled": true
          "num_consumers": 10
          "queue_size": 5000
        "timeout": "2s"
        "tls":
          "insecure": true
    "resolver":
      "dns":
        "hostname": "otel-collector-backend.example.com"
"extensions":
  "zpages":
    "endpoint": ":55690"
"processors":
"receivers":
  "jaeger":
    "protocols":
      "grpc": {}
      "thrift_binary": {}
      "thrift_compact": {}
      "thrift_http": {}
  "opencensus":
    "endpoint": "0.0.0.0:55678"
  "otlp":
    "protocols":
      "grpc": {}
      "http": {}
  "zipkin": {}
"service":
  "extensions":
  - "zpages"
  "pipelines":
    "traces/0":
      "exporters":
      - "loadbalancing"
      "processors":
      "receivers":
      - "jaeger"
  "telemetry":
    "logs":
      "encoding": "json"
      "level": "debug"
    "metrics":
      "address": "0.0.0.0:8890"
      "level": "normal"

Environment
tilt (using docker-for-desktop) on a mac, as well as running from goland development environment locally on the mac.

Additional context
I noticed the default port is defined in the code, and added here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L128

Other places where the endpoint is created (with or without port) https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/resolver_dns.go#L144-L146

@jvilhuber
Copy link
Contributor Author

jvilhuber commented May 25, 2022

I added some temporary logs to these two methods:

		// add the missing exporters first
		lb.addMissingExporters(ctx, resolved)
		lb.removeExtraExporters(ctx, resolved)

(I added the "adding endpoint" and "removing endpoint" logs; see below)

When using:

    "resolver":
      "dns":
        "hostname": "otel-collector-backend.example.com"

I see that the endpoint gets added and then immediately removed:

{"level":"info","ts":1653467919.549248,"caller":"loadbalancingexporter/loadbalancer.go:129","msg":"adding endpoint","kind":"exporter","name":"loadbalancing","endpoint":"127.0.0.1:4317"}
{"level":"info","ts":1653467919.54941,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel created","grpc_log":true}
{"level":"info","ts":1653467919.549463,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] original dial target is: \"127.0.0.1:4317\"","grpc_log":true}
{"level":"info","ts":1653467919.549481,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] dial target \"127.0.0.1:4317\" parse failed: parse \"127.0.0.1:4317\": first path segment in URL cannot contain colon","grpc_log":true}
{"level":"info","ts":1653467919.549488,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] fallback to scheme \"passthrough\"","grpc_log":true}
{"level":"info","ts":1653467919.549509,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:127.0.0.1:4317 URL:{Scheme:passthrough Opaque: User: Host: Path:/127.0.0.1:4317 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}","grpc_log":true}
{"level":"info","ts":1653467919.5495172,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel authority set to \"127.0.0.1:4317\"","grpc_log":true}
{"level":"info","ts":1653467919.5496562,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Resolver state updated: {\n  \"Addresses\": [\n    {\n      \"Addr\": \"127.0.0.1:4317\",\n      \"ServerName\": \"\",\n      \"Attributes\": null,\n      \"BalancerAttributes\": null,\n      \"Type\": 0,\n      \"Metadata\": null\n    }\n  ],\n  \"ServiceConfig\": null,\n  \"Attributes\": null\n} (resolver returned new addresses)","grpc_log":true}
{"level":"info","ts":1653467919.5497742,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel switches to new LB policy \"pick_first\"","grpc_log":true}
{"level":"info","ts":1653467919.549796,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel created","grpc_log":true}
{"level":"info","ts":1653467919.549841,"caller":"loadbalancingexporter/loadbalancer.go:157","msg":"removing endpoint","kind":"exporter","name":"loadbalancing","endpoint":"127.0.0.1:4317"}
{"level":"info","ts":1653467919.54988,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel Connectivity change to SHUTDOWN","grpc_log":true}
{"level":"info","ts":1653467919.549824,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING","grpc_log":true}
{"level":"info","ts":1653467919.549928,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel picks a new address \"127.0.0.1:4317\" to connect","grpc_log":true}
{"level":"info","ts":1653467919.549952,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN","grpc_log":true}
{"level":"info","ts":1653467919.549964,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel deleted","grpc_log":true}
{"level":"info","ts":1653467919.549969,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel deleted","grpc_log":true}
{"level":"debug","ts":1653467919.549978,"caller":"loadbalancingexporter/resolver_dns.go:85","msg":"DNS resolver started","kind":"exporter","name":"loadbalancing","resolver":"dns","hostname":"otel-collector-backend.example.com","port":"","interval":5,"timeout":1}
{"level":"info","ts":1653467919.549985,"caller":"builder/exporters_builder.go:48","msg":"Exporter started.","kind":"exporter","name":"loadbalancing"}
{"level":"info","ts":1653467919.54999,"caller":"service/service.go:86","msg":"Starting processors..."}
{"level":"info","ts":1653467919.549995,"caller":"builder/pipelines_builder.go:54","msg":"Pipeline is starting...","kind":"pipeline","name":"traces/0"}
{"level":"info","ts":1653467919.5500002,"caller":"builder/pipelines_builder.go:65","msg":"Pipeline is started.","kind":"pipeline","name":"traces/0"}
{"level":"info","ts":1653467919.550003,"caller":"service/service.go:91","msg":"Starting receivers..."}
{"level":"info","ts":1653467919.550007,"caller":"builder/receivers_builder.go:68","msg":"Receiver is starting...","kind":"receiver","name":"jaeger"}
{"level":"warn","ts":1653467919.55024,"caller":"zapgrpc/zapgrpc.go:191","msg":"[core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {\n  \"Addr\": \"127.0.0.1:4317\",\n  \"ServerName\": \"127.0.0.1:4317\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}. Err: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:4317: operation was canceled\"","grpc_log":true}
{"level":"info","ts":1653467919.550455,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Server #3] Server created","grpc_log":true}

Whereas if I start with a port explicitly defined

    "resolver":
      "dns":
        "hostname": "otel-collector-backend.example.com"
        "port": "4317"

I get:

{"level":"info","ts":1653467839.842873,"caller":"loadbalancingexporter/loadbalancer.go:129","msg":"adding endpoint","kind":"exporter","name":"loadbalancing","endpoint":"127.0.0.1:4317"}
{"level":"info","ts":1653467839.843035,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel created","grpc_log":true}
{"level":"info","ts":1653467839.8430789,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] original dial target is: \"127.0.0.1:4317\"","grpc_log":true}
{"level":"info","ts":1653467839.843092,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] dial target \"127.0.0.1:4317\" parse failed: parse \"127.0.0.1:4317\": first path segment in URL cannot contain colon","grpc_log":true}
{"level":"info","ts":1653467839.8430989,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] fallback to scheme \"passthrough\"","grpc_log":true}
{"level":"info","ts":1653467839.8431191,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:127.0.0.1:4317 URL:{Scheme:passthrough Opaque: User: Host: Path:/127.0.0.1:4317 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}","grpc_log":true}
{"level":"info","ts":1653467839.843128,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel authority set to \"127.0.0.1:4317\"","grpc_log":true}
{"level":"info","ts":1653467839.843216,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Resolver state updated: {\n  \"Addresses\": [\n    {\n      \"Addr\": \"127.0.0.1:4317\",\n      \"ServerName\": \"\",\n      \"Attributes\": null,\n      \"BalancerAttributes\": null,\n      \"Type\": 0,\n      \"Metadata\": null\n    }\n  ],\n  \"ServiceConfig\": null,\n  \"Attributes\": null\n} (resolver returned new addresses)","grpc_log":true}
{"level":"info","ts":1653467839.8433,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel switches to new LB policy \"pick_first\"","grpc_log":true}
{"level":"info","ts":1653467839.8433201,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel created","grpc_log":true}
{"level":"debug","ts":1653467839.8433979,"caller":"loadbalancingexporter/resolver_dns.go:85","msg":"DNS resolver started","kind":"exporter","name":"loadbalancing","resolver":"dns","hostname":"otel-collector-backend.example.com","port":"4317","interval":5,"timeout":1}
{"level":"info","ts":1653467839.843416,"caller":"builder/exporters_builder.go:48","msg":"Exporter started.","kind":"exporter","name":"loadbalancing"}
{"level":"info","ts":1653467839.843356,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING","grpc_log":true}
{"level":"info","ts":1653467839.843422,"caller":"service/service.go:86","msg":"Starting processors..."}
{"level":"info","ts":1653467839.8434272,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1 SubChannel #2] Subchannel picks a new address \"127.0.0.1:4317\" to connect","grpc_log":true}
{"level":"info","ts":1653467839.843445,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] pickfirstBalancer: UpdateSubConnState: 0x1400090f110, {CONNECTING <nil>}","grpc_log":true}
{"level":"info","ts":1653467839.843458,"caller":"zapgrpc/zapgrpc.go:174","msg":"[core] [Channel #1] Channel Connectivity change to CONNECTING","grpc_log":true}
{"level":"info","ts":1653467839.8434281,"caller":"builder/pipelines_builder.go:54","msg":"Pipeline is starting...","kind":"pipeline","name":"traces/0"}
{"level":"info","ts":1653467839.843466,"caller":"builder/pipelines_builder.go:65","msg":"Pipeline is started.","kind":"pipeline","name":"traces/0"}
{"level":"info","ts":1653467839.843471,"caller":"service/service.go:91","msg":"Starting receivers..."}

@jvilhuber
Copy link
Contributor Author

Ah I see the problem: When endpoints are resolved in

func (r *dnsResolver) resolve(ctx context.Context) ([]string, error) {

we create the endpoint-strings WITH the port ONLY if it was explicitly given:

		// if a port is specified in the configuration, add it
		if r.port != "" {
			backend = fmt.Sprintf("%s:%s", backend, r.port)
		}

In my test (which resolved to "127.0.0.1", without the explicit port, the endpoint string is "127.0.0.1".

But in

func (lb *loadBalancerImp) addMissingExporters(ctx context.Context, endpoints []string) {
	for _, endpoint := range endpoints {
		endpoint = endpointWithPort(endpoint)

we create the endpoint string WITH the port (if it's not already set in the previous code-snippet). This leaves us with a disparity when we get to

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
	for existing := range lb.exporters {
		if !endpointFound(existing, endpoints) {
			lb.logger.Info("removing endpoint", zap.String("endpoint", existing))
			lb.exporters[existing].Shutdown(ctx)
			delete(lb.exporters, existing)
		}
	}
}

The existing endpoint will be "127.0.0.1:4317" (because we used endpointWithPort() to create it) but the passed in list of endpoints (which comes from the resolve() function will have "127.0.0.1" because we didn't explicitly set a port and so the resolve() didn't set it.

Sorry if that's a bit convoluted.

@jpkrohling
Copy link
Member

Great detective work! Would you be able to come up with a unit test demonstrating the problem, and potentially with the fix as well?

@jvilhuber
Copy link
Contributor Author

I think this bug is what this PR is about: #10124

@jpkrohling
Copy link
Member

Looks like those are different problems.

@jvilhuber
Copy link
Contributor Author

Yea maybe I'm misremembering the various items I have pending (waiting on the CLA to really make progress; sorry about that)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working exporter/loadbalancing priority:p3 Lowest
Projects
None yet
Development

No branches or pull requests

4 participants