distribution of traces/spans amongst collectors #1678

Closed
prana24 opened this issue Jul 23, 2019 · 47 comments · Fixed by #3422
@prana24

prana24 commented Jul 23, 2019

Requirement - what kind of business use case are you trying to solve?

Are the collectors load balanced?

Problem - what in Jaeger blocks you from solving the requirement?

We have our Jaeger setup working with Elasticsearch configured as the backend. Currently we have two collector replicas. There are 5-10 services that send traces to the collectors (the number of services keeps changing). I see that the collectors are not evenly loaded with traffic: one collector reaches its maximum queue usage while the other is hardly using 20-30% of its capacity. This causes drops from the collector that is loaded to capacity.
Can we load balance the traffic (spans) between the two collectors? I am not sure if there is a config option that I am missing.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Any open questions to address

@jpkrohling
Contributor

If you are using the Jaeger Agent, you can configure them to use gRPC instead of Thrift (--reporter.type=grpc). You can then either pass a static list of collectors, or use gRPC's notation for discovering the servers (--reporter.grpc.host-port=dns:///service-name:14250).

If your tracers are connecting directly to the collector, only TChannel is supported at the moment, and it's not possible to load balance individual requests.
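
For reference, a minimal sketch of what the agent invocation could look like with those flags (the collector hostname below is just a placeholder):

jaeger-agent \
  --reporter.type=grpc \
  --reporter.grpc.host-port=dns:///jaeger-collector.example.svc.cluster.local:14250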

@prana24
Author

prana24 commented Jul 23, 2019

We are using the agent; I will check the configuration. As of now we are managing the Jaeger setup, and there are different teams that are just bombarding it with traces and spans. Is there anything we can do at the collector level?

@yurishkuro
Member

@prana24 at Uber we recommend all teams to use an internal wrapper for the Jaeger client libraries, which makes sure that production services always use a remote sampler that pulls sampling strategies from the backend. This way you can create configuration on the collectors controlling how much each service should sample.

If you have no control over the clients, the brute-force solution is to implement downsampling in the collector (which we do at Uber, but at this point as more of a safety measure). Downsampling is consistently based on trace ID hash, so you don't get partial traces, but downsampling affects all users equally, not just the offending service.

Another approach is throttling clients doing sampling, but it's not currently implemented (#1676).

The best solution imo is tail-based sampling, which Jaeger does not support yet directly, but you can get it with OpenCensus Service.
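
As a sketch of what that collector-side configuration could look like: the collector can serve per-service sampling strategies from a JSON file passed via --sampling.strategies-file (the service names and rates below are made up):

{
  "service_strategies": [
    { "service": "noisy-service", "type": "probabilistic", "param": 0.01 },
    { "service": "checkout-service", "type": "probabilistic", "param": 0.5 }
  ],
  "default_strategy": { "type": "probabilistic", "param": 0.1 }
}

Clients configured with the remote sampler then pull these strategies from the backend, as described above.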

@prana24
Author

prana24 commented Aug 2, 2019

We were using jaeger-agent 1.8.x; I see gRPC was probably not enabled in that version. I am upgrading the agent to the latest (1.13.x). My collector is still 1.9.x; is that version OK, or should I upgrade it as well?

@jpkrohling
Contributor

If you can, keep both the collector and the agent at the same version.

@prana24
Author

prana24 commented Aug 2, 2019

Thank you @jpkrohling, I have done that. I have a basic question about dns:///<service_name>:14250: what is <service_name> here? Is it the same name that we get from kubectl get service for the collector service?

@jpkrohling
Contributor

It's the DNS name under which the service can be reached. In Kubernetes, this is typically service_name.namespace.svc.cluster.local, but depending on the cluster configuration, you might be able to use only the service name as the hostname, if both the client and the agent/collector are in the same namespace.

If you are using Kubernetes, I recommend taking a look at the jaeger-operator. Even if you decide not to use it for production, you might benefit from seeing how it deploys Jaeger.
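
For example, with a collector service named jaeger-collector in an observability namespace (both names are just illustrations), the flag could take either of these forms, depending on whether the agent runs in the same namespace:

--reporter.grpc.host-port=dns:///jaeger-collector:14250
--reporter.grpc.host-port=dns:///jaeger-collector.observability.svc.cluster.local:14250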

@prana24
Author

prana24 commented Aug 5, 2019

Sure, thank you. I am taking a look.

@prana24
Author

prana24 commented Aug 6, 2019

Hi,
I have made changes to my agent .yaml, but somehow it still sends traffic to only one of the collectors. It looks like it is not able to do the DNS lookup. Adding my agent.yaml and the agent log here for reference.

2019/08/06 11:19:50 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
{"level":"info","ts":1565090391.0357707,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1565090391.0362382,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1565090391.0362995,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14271}
{"level":"info","ts":1565090391.0363176,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14271,"health-status":"unavailable"}
{"level":"info","ts":1565090391.0373657,"caller":"grpc/builder.go:75","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1565090391.041124,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"dns\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.067472,"caller":"agent/main.go:74","msg":"Starting agent"}
{"level":"info","ts":1565090391.0675416,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1565090391.0675752,"caller":"app/agent.go:68","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1565090391.0754118,"caller":"dns/dns_resolver.go:264","msg":"grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119492,"caller":"dns/dns_resolver.go:289","msg":"grpc: failed dns TXT record lookup due to lookup _grpc_config.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1195319,"caller":"grpc/resolver_conn_wrapper.go:140","msg":"ccResolverWrapper: got new service config: ","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119683,"caller":"grpc/resolver_conn_wrapper.go:126","msg":"ccResolverWrapper: sending new addresses to cc: [{192.168.172.54:14250 0  <nil>}]","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197627,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{192.168.172.54:14250 0  <nil>}] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241896,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[{192.168.172.54:14250 0  <nil>}:0xc00018d560]","system":"grpc","grpc_log":true}

Also pasting the agent.yaml here:

# Source: jaeger-client-mon/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jaeger-app-1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jaeger-app-1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jaeger-app-1
    spec:
      containers:
      - image: docker.artifactory.prod.adnxs.net/jaeger_client_1
        name: jaeger-app-1
        ports:
        - containerPort: 8080
      - image: docker.artifactory.prod.mycompany.net/jaegertracing/jaeger-agent:1.13.1-1-b8a6d4ea680063ab03575e864f233841cfcb45cb58a9c5ddde2e287844c1b679
        name: jaeger-agent-1
        #args: ["--collector.host-port=jaeger-collector-dev.sampling.svc:14267"]
        args: ["--reporter.grpc.host-port=dns:///jaeger-collector-dev.sampling.svc.cluster.local:14250"]
        ports:
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP

Any idea what is wrong here?

@jpkrohling
Contributor

Nothing seems wrong there: gRPC tried to load some extra configuration via DNS but couldn't find anything "extra". As you can see in the following log entries, the connection with the collector was established and is ready:

{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}

So, looks like it's working ;-)

@prana24
Author

prana24 commented Aug 6, 2019

It is working, but I was expecting the agent to send traces/spans to both collectors; currently it is sending to only one. What I mean is that it is not load balanced. Am I missing something here?

@jpkrohling
Contributor

You might not see round-robin load balancing, as gRPC will reuse the same pipe for multiple requests, but one easy way to check that it's working as expected is by killing one of the collectors. If the agent switches over to the remaining collector, the load balancing is working.

@prana24
Author

prana24 commented Aug 6, 2019

Oops!! That is failover, right? That is not load balancing. I want to avoid a situation like this. I have added Grafana images here, where collector1 reaches max capacity and collector2 is sitting idle, and because of this we see span drops. (Of course, the implementation contains TChannel communication between agent and collector.) So, as advised above in this issue, I am adding gRPC, but somehow I still do not see spans being load balanced between both collectors.
[Grafana screenshots: collector_load, collector_span_drop]

Let me know if I am doing anything wrong here.

@jpkrohling
Contributor

I just checked the gRPC docs, and it seems that it should indeed be doing round-robin balancing:

It is worth noting that load-balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we still want them to be load-balanced across all servers.

Source: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

of course the implementation contains tchannel communication between agent and collector

What do you mean here? The communication between Agent and Collector should be via gRPC, not via TChannel.

@jpkrohling
Contributor

@jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here?

@prana24
Author

prana24 commented Aug 6, 2019

Just to clear up the confusion: the Grafana images I posted show the production problem I want to solve (agent and collector running 1.8.x with TChannel).
Since it was recommended that using gRPC with the latest version (1.13.1) should give us load-balanced traces/spans, I am trying the same in our dev environment to see if the traffic is really load balanced. But somehow all the spans are going to one collector.
I am just concerned about how I can get my traffic load balanced, so that all the collectors do the work and there are minimal drops.

@jpkrohling
Contributor

I'm confused now: you are seeing load balanced traffic in production, but not on your dev environment?

@prana24
Author

prana24 commented Aug 6, 2019

My production version is 1.8, communicating over TChannel, and the Grafana images are from the production environment; they show that the load is unbalanced and that there are drops.

I want to check whether moving to 1.13.x with gRPC can solve the problem in production, and that is why I am trying 1.13.1 + gRPC in dev (the agent.yaml and log I shared).
But in the dev environment I also do not see the traffic being load balanced.

@jkandasa
Member

jkandasa commented Aug 6, 2019

@jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here?

@jpkrohling In OpenShift, we create an additional service (jaegerqe-collector-headless) for the collector for gRPC load balancing, with ClusterIP: None.
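
A quick way to check the difference (service and namespace names below are placeholders): a headless service should resolve to the individual collector pod IPs instead of a single cluster IP, for example:

kubectl run -it --rm dns-check --image=busybox:1.28 --restart=Never -- nslookup jaegerqe-collector-headless.<namespace>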

@jpkrohling
Contributor

That's probably the trick that @prana24 is missing! Thanks @jkandasa!

@prana24
Author

prana24 commented Aug 6, 2019

Wow!! I can't wait. @jkandasa, can you give me more information? Where do I get it? Any reference and topology?

@jkandasa
Member

jkandasa commented Aug 6, 2019

@prana24 AFAIK, there is no specific example for creating a collector headless service. @objectiser can guide you better here.
jaeger-operator creates a headless service by default. Reference: the jaeger-operator code.

I just copied/modified the collector service YAML from the service file generated by jaeger-operator.
I hope this will work (not tested).
The important line is spec.clusterIP: None.
You may add it to your existing service and test. If you create a new service named jaeger-collector-headless, do not forget to change it on your agent.

apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector-headless
  labels:
    app: jaeger
    jaeger-infra: collector-service
spec:
  clusterIP: None
  ports:
    - name: jaeger-collector-grpc
      port: 14250
      protocol: TCP
      targetPort: 14250
  selector:
    jaeger-infra: collector-pod
  type: ClusterIP
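
The agent would then point at that headless service instead of the regular one, e.g. (the namespace is a placeholder):

--reporter.grpc.host-port=dns:///jaeger-collector-headless.<namespace>.svc.cluster.local:14250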

@prana24
Author

prana24 commented Aug 7, 2019

Thanks @jkandasa, I will give it a shot today.

@piwenzi

piwenzi commented Nov 20, 2019

I have the same problem!!!

The error is:
root@ubuntu-165:~/jaeger# kubectl logs productpage-v1-787bcf4b68-j88qj jaeger-agent | grep addrConn.createTransport
{"level":"info","ts":1574242860.4926956,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1574242861.493872,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}

but the network is fine:
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector-headless.kube-system
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: my-jaeger-collector-headless.kube-system
Address 1: 10.33.36.204 10-33-36-204.my-jaeger-collector.kube-system.svc.cluster.local
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector.kube-system.svc.cluster.local

@jpkrohling
Contributor

@pujunYang could you please share your my-jaeger-collector-headless definition? kubectl get service my-jaeger-collector-headless -o yaml should do the trick. How are you setting it up? Is it via the Operator?

@piwenzi

piwenzi commented Nov 20, 2019

@jpkrohling Yes, Jaeger is started via the Operator.

kubectl get service my-jaeger-collector-headless -n kube-system  -o yaml 
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "false"
  creationTimestamp: "2019-11-20T09:40:13Z"
  labels:
    app: jaeger
    app.kubernetes.io/component: service-collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  name: my-jaeger-collector-headless
  namespace: kube-system
  ownerReferences:
  - apiVersion: jaegertracing.io/v1
    controller: true
    kind: Jaeger
    name: my-jaeger
    uid: c00e3485-0b79-11ea-ab62-5254006535e0
  resourceVersion: "1933"
  selfLink: /api/v1/namespaces/kube-system/services/my-jaeger-collector-headless
  uid: c0869851-0b79-11ea-ab62-5254006535e0
spec:
  clusterIP: None
  ports:
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: grpc
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: c-tchan-trft
    port: 14267
    protocol: TCP
    targetPort: 14267
  - name: c-binary-trft
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

@piwenzi

piwenzi commented Nov 20, 2019

@jpkrohling Here is the Jaeger.yaml:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
  namespace: kube-system
spec:
  strategy: production # <1>
  allInOne:
    image: jaegertracing/all-in-one:latest # <2>
    options: # <3>
      log-level: debug # <4>
  storage:
    type: elasticsearch # <5>
    options: # <6>
      es: # <7>
        server-urls: http://elasticsearch-logging:9200
        tls:
          skip-host-verify: true
  ingress:
    enabled: false # <8>
  agent:
    strategy: DaemonSet # <9>
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: "" # <10>

@parberge

parberge commented Oct 2, 2020

I have another concern about load balancing.

We use "--reporter.grpc.host-port=dns:///jaeger-collector-gRPC.service.consul:14250" to get the list of collectors, which is working fine. All collectors receive spans.

The problem:
If we scale out the collectors, the agents never get a new list.
This also means that if one or more collectors are removed or go offline, the list of collectors on the agents remains the same.

It seems it only resolves the list when the agent starts?
Or am I missing something?

@jpkrohling
Contributor

Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0 which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of clients only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter.

@jpkrohling
Contributor

I'm closing, as I think this has been answered some time ago, but feel free to reopen if there are still questions.

@parberge

parberge commented Oct 2, 2020

Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0 which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of clients only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter.

Not 1.20 that's for sure.

Will test and create an issue if the problem remains. Thanks.

@Kerptastic

Hey guys - I am seeing some of the same issues. On v1.19 right now, so I will try to do the upgrade. Much like some of the other folks are seeing: I am using the jaeger-operator, have the HPA set up for a min/max of 2/10, and when CPU gets hammered during our bot/soak tests the collectors scale up as expected, but the agents continue to fire spans down their already-existing connections. So effectively, it feels like more of a fault-tolerance setup than a high-availability one. @parberge, before I go super deep, did you see any positive changes with v1.20.0?

@parberge

parberge commented Oct 15, 2020 via email

@jpkrohling
Contributor

Feel free to reopen this issue if you see the same problem happening on v1.20.

@Kerptastic

Hi folks, unfortunately I see the same behavior as before. Running the 1.20.0 collector and agent (see below) via jaeger-operator.

We execute our bots firing off spans/traces and can observe the following in our graphs. You can see below that we scaled to 4 collector instances, but the agents have no knowledge that they should reconnect and continue to saturate the collectors they are already connected to. The situation makes sense - I'm not missing a configuration anywhere that would make the collectors notify the agents that they should reconnect when dropping spans?

[screenshot: collector metrics after scaling to 4 instances]

Containers:
  jaeger-collector:
    Container ID:  docker://98c391498dc8cdb605f5d001350c2451d16800470a7e03c096cab7b808ff7b95
    Image:         jaegertracing/jaeger-collector:1.20.0

...

Containers:
  jaeger-agent-daemonset:
    Container ID:  docker://fa6d25021bc5531053b998789d34fcdb0520d2a8f7907125951c44feeb10ffa4
    Image:         jaegertracing/jaeger-agent:1.20.0

@jpkrohling will try to reopen, need to figure out how =)

@jpkrohling jpkrohling reopened this Oct 15, 2020
@jpkrohling
Contributor

I'll check what we can do, but I think the gRPC client might need some time to update the list of backends. In earlier versions, it would update only if all known backends were failing.

@Kerptastic

Kerptastic commented Oct 15, 2020

OK - I am letting this soak. This may be something unique to how our bots are running, as they are spun up asynchronously in a single service, so it would make sense that they send traffic to a single agent and thus overload the collector it's connected to. Is the agent designed to have a single connection to a collector at a given point in time? If that's the case, this MAY be OK for us in production, when the bots are replaced with real traffic that is load balanced across our edge service, thus distributing across the agents more naturally.

@jpkrohling
Contributor

Is the agent designed to have a single connection to a collector at a given point in time?

I'd have to double-check with the gRPC client load balancer documentation, but I think that's indeed the case. The agent has a list of backends, but will only failover once its "current" backend fails.

@jpkrohling
Contributor

@jkandasa do you remember from your load-tests what's the expected behavior here?

@JMCFTW

JMCFTW commented Oct 27, 2021

I encountered the same issue when using the OpenTelemetry Collector with a headless Jaeger gRPC collector service in Kubernetes.
The OpenTelemetry Collector never gets a new list of Jaeger collectors after the Jaeger collector scales out/down.

Here are my configurations:

OpenTelemetry Collector:

exporters:
  jaeger:
    endpoint: "dns:///jaeger-collector-svc.monitoring.svc.cluster.local:14250"
    balancer_name: "round_robin"
    insecure: true
  logging:
    loglevel: info
extensions:
  health_check: {}
processors:
  batch: {}
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
  jaeger:
    protocols:
      thrift_compact: {}
      thrift_http: {}
service:
  extensions:
    - health_check
  pipelines:
    traces:
      exporters:
        - logging
        - jaeger
      processors:
        - batch
      receivers:
        - otlp
        - jaeger

Jaeger collector headless service in Kubernetes:

apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: jaeger-collector
  name: jaeger-collector-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: admin
    port: 14269
    protocol: TCP
    targetPort: 14269
  - name: receive-span-from-jaeger-agent
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: receive-span-from-jaeger-client
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger-collector
  sessionAffinity: None
  type: ClusterIP

@jpkrohling
Contributor

@JMCFTW, this is something to be checked and handled at the OpenTelemetry Collector side of things. I just created an issue there (open-telemetry/opentelemetry-collector#4274) and assigned it to myself.

@JMCFTW

JMCFTW commented Oct 27, 2021

@JMCFTW, this is something to be checked and handled at the OpenTelemetry Collector side of things. I just created an issue there (open-telemetry/opentelemetry-collector#4274) and assigned it to myself.

Hi @jpkrohling,

Thanks for referencing this issue in the OpenTelemetry Collector repository.

I'm not sure whether this issue can be handled in the OpenTelemetry Collector, because it seems like the gRPC client parameters don't have an option that lets the client (OpenTelemetry Collector) re-resolve the DNS name after the server (Jaeger collector) auto-scales out/down.

So, in my opinion, a possible workaround is to make the gRPC server parameter MaxConnectionAge configurable in the Jaeger collector? But I don't know whether that's a good or a bad idea.

I haven't investigated this issue for very long, so please feel free to correct me if I'm wrong or have misunderstood something.
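
For context, a rough sketch (not Jaeger's actual code) of what setting MaxConnectionAge on a gRPC server looks like in Go; the durations are arbitrary and the collector's real wiring would differ:

package main

import (
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

func main() {
    lis, err := net.Listen("tcp", ":14250")
    if err != nil {
        log.Fatal(err)
    }

    // Closing connections after a maximum age (plus a grace period for
    // in-flight RPCs) forces clients to reconnect and re-resolve DNS,
    // so newly scaled-up collector instances eventually receive traffic.
    srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge:      2 * time.Minute,
        MaxConnectionAgeGrace: 30 * time.Second,
    }))

    // ... register the collector's gRPC services here ...

    if err := srv.Serve(lis); err != nil {
        log.Fatal(err)
    }
}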

@jpkrohling
Contributor

That's a good hint, thanks! I think I faced a similar issue before, and if a fix is needed here on the Jaeger side of things, I'll fix it here.

@jpkrohling
Contributor

Just as a status update: I'm able to reproduce this. Reading some source code from gRPC Go, I was expecting the DNS resolution to happen every 30s, adding the new backends to the list and making them available as subchannels, but it looks like that's not happening. I'll check a couple of things, and if they don't work, I'll give the MaxConnectionAge suggestion a try.

The following screenshot shows a situation that started with 10 replicas and later scaled to 20 replicas, expecting the new ones to eventually start receiving traffic.

The disparity of two of the numbers is because I had a wrong configuration. The remaining 8 similar numbers are after adjusting the config to take advantage of both. This is the config used:

receivers:
  otlp:
    protocols: 
      grpc:

exporters:
  jaeger:
    tls:
      insecure: true
    endpoint: dns:///simple-prod-collector-headless:14250
    balancer_name: round_robin

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]

[screenshot: span counts per collector instance after scaling from 10 to 20 replicas]

@jpkrohling jpkrohling self-assigned this Dec 2, 2021
@jpkrohling
Contributor

I got some time to do some extra experiments, and I agree that setting the MaxConnectionAge would be a good solution. The OpenTelemetry Collector was able to send all the spans to Jaeger Collector, despite the scaling events:

otelcol_receiver_accepted_spans{receiver="otlp",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest",transport="grpc"} 2.009878e+06

otelcol_exporter_send_failed_requests{service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 4293
otelcol_exporter_send_failed_spans{exporter="jaeger",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 0
otelcol_exporter_sent_spans{exporter="jaeger",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 2.009878e+06

With MaxConnectionAge set, here's what the rate of spans per instance looks like:
[screenshot: spans per minute per collector instance, before and after scaling]

In the image above, you can see that we had a few nodes at first, ingesting around 20 spans per minute. Then, new nodes appeared, so that each node now takes care of 10 spans per minute.

I'll create a PR adding both MaxConnectionAge and MaxConnectionAgeGrace as CLI options.
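
Once those options land, using them would presumably look something like this (the exact flag names here are an assumption, so check the collector's --help output; the durations are arbitrary):

jaeger-collector \
  --collector.grpc-server.max-connection-age=2m \
  --collector.grpc-server.max-connection-age-grace=30s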

@JMCFTW

JMCFTW commented Dec 3, 2021

Hi @jpkrohling, thanks for adding the flags!

So, according to RELEASE.md, this change will be released on 5 January 2022, right?

@jpkrohling
Contributor

Correct. If you want to test this change before that, I can tag and generate a container image based on the current main.
