This repository has been archived by the owner on May 18, 2020. It is now read-only.

Closes #74 - Changed Kubernetes Template to use DaemonSet #75

Merged

Conversation

jpkrohling
Collaborator

Signed-off-by: Juraci Paixão Kröhling [email protected]

@jpkrohling
Collaborator Author

The tests are failing locally, due to this error:

[ERROR] io.jaegertracing.kubernetes.CassandraETest  Time elapsed: 6.276 s  <<< ERROR!
java.lang.RuntimeException: io.fabric8.kubernetes.clnt.v3_1.KubernetesClientException: Failure executing: POST at: https://192.168.39.71:8443/apis/extensions/v1/namespaces/itest-cc9d5c47/daemonsets. Message: the server could not find the requested resource. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server could not find the requested resource, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).
Caused by: io.fabric8.kubernetes.clnt.v3_1.KubernetesClientException: Failure executing: POST at: https://192.168.39.71:8443/apis/extensions/v1/namespaces/itest-cc9d5c47/daemonsets. Message: the server could not find the requested resource. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server could not find the requested resource, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).

I assume this is some incompatibility between the plugin and the Kubernetes version. I did some manual tests with both Cassandra and Elasticsearch, and both work.


The application I used for this sample was an adaptation of OpenShift's hello-openshift and can be found here:

https://github.com/jpkrohling/origin/tree/JPK-AddedJaegerTracingToHelloWorld/examples/hello-openshift

To manually test using this application, follow the installation instructions in the readme, replacing the remote URLs with local paths (e.g. kubectl create -f production/configmap.yml instead of kubectl create -f https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/master/production/configmap.yml).

Once Jaeger is installed, add an application, such as:

kubectl create -f https://raw.githubusercontent.com/jpkrohling/origin/JPK-AddedJaegerTracingToHelloWorld/examples/hello-openshift/hello-openshift.yaml

A pod like hello-openshift-deployment-6bb7f5c687-d54lx should be created. Its logs should look like this:

2018/03/19 17:37:35 Initializing logging reporter
serving on 8080
2018/03/19 17:37:35 Jaeger tracer initialized
serving on 8888
Servicing request.
2018/03/19 17:37:37 Reporting span 38ab873eb8f039a8:38ab873eb8f039a8:0:1
Servicing request.
2018/03/19 17:37:47 Reporting span 6ff4ff825093f3db:6ff4ff825093f3db:0:1
Servicing request.

At this point, you should see traces on Jaeger.

@yurishkuro
Member

what is "SWS-326"?

@jpkrohling force-pushed the SWS-326-SwitchToDaemonSets branch from abccf95 to ccedc6d on March 20, 2018 07:43
@jpkrohling changed the title from "SWS-326 - Changed Kubernetes Template to use DaemonSet" to "Closes #74 - Changed Kubernetes Template to use DaemonSet" on March 20, 2018
@jpkrohling
Collaborator Author

jpkrohling commented Mar 20, 2018

Sorry, my bad. SWS-326 is the JIRA issue tracking my activity. I changed the commit and the PR title to refer to the issue in this repo instead.

@pieterlange left a comment

This needs a reference to the host IP (localhost is local to the pod scope)

env:
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

@@ -64,18 +65,24 @@ Once everything is ready, `kubectl get service jaeger-query` tells you where to

### Deploying the agent as sidecar


No longer deploying as sidecar by default

Collaborator Author

Not by default, but providing instructions on how to deploy the agent as a sidecar is still useful.
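For illustration, a minimal sketch of what such a sidecar setup could look like (the application name, image, and port are placeholders; the collector address follows the jaeger-collector:14267 convention seen elsewhere in this thread, and agent flag names may differ between versions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: example/myapp:latest        # placeholder application image
        ports:
        - containerPort: 8080
      # the agent runs in the same pod, so the app reaches it on localhost
      - name: jaeger-agent
        image: jaegertracing/jaeger-agent
        args: ["--collector.host-port=jaeger-collector:14267"]
        ports:
        - containerPort: 6831
          protocol: UDP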


Ah right.

There should be some note for correct agent discovery from the apps though, as localhost (the node) is not available on the pod scope.

Collaborator Author

I'm not quite sure I understand what you mean, but when using a sidecar, localhost is indeed correct.

labels:
app: jaeger
jaeger-infra: agent-instance
spec:


Needs a hostNetwork: true in here somewhere.

Collaborator Author

Why's that? Do we want the agent to receive spans from outside of the Kubernetes cluster?


Also need to make sure the agent listens on the interface IP, not localhost (again, because you can't route to localhost from the pod)

Collaborator Author

I'm missing something here. I'm under the assumption that the target applications are not going to send spans to the agent using localhost as its address (as opposed to how it happens with a sidecar or bare metal deployment). Rather, they would send spans to a known address, like this:

https://github.com/jpkrohling/origin/blob/JPK-AddedJaegerTracingToHelloWorld/examples/hello-openshift/hello_openshift.go#L43

- name: jaeger-configuration-volume
mountPath: /conf
ports:
- containerPort: 5775


For clarity, you can also explicitly add hostPorts in here.
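For example, roughly like this (a sketch only; the list reflects the standard agent ports and should match whatever the template actually exposes):

ports:
- containerPort: 5775
  protocol: UDP
  hostPort: 5775
- containerPort: 6831
  protocol: UDP
  hostPort: 6831
- containerPort: 6832
  protocol: UDP
  hostPort: 6832
- containerPort: 5778
  protocol: TCP
  hostPort: 5778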

- key: agent
path: agent.yaml
name: jaeger-configuration-volume
- apiVersion: v1


This service is unnecessary (as the pods on each node send their UDP reports to the host IP)

Collaborator Author

True, but one benefit of having a service is that the instrumented application can refer to the agent via the hostname jaeger-agent and not care about whether it's being deployed as a daemonset or regular pod.
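For context, the service in question would look roughly like this (a sketch; the selector and port are inferred from the labels and ports discussed in this thread, not copied from the template):

apiVersion: v1
kind: Service
metadata:
  name: jaeger-agent
spec:
  selector:
    app: jaeger
    jaeger-infra: agent-instance
  ports:
  - name: agent-compact
    port: 6831
    protocol: UDP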


This means traffic may be routed to different nodes; the whole point of running the DaemonSet is that each pod can route to the (single) jaeger-agent on the node it's scheduled on.

Member

To reinforce @pieterlange's comment: this means traffic will mostly be routed to different nodes.

@pieterlange

Collapsing the discussion into a single comment for clarity: the discussion in #74 centers on (UDP) traffic routing in the case of daemonset deployments. The goal here is to make sure each pod that submits jaeger reports can submit them to the jaeger agent running on the same node.

In the PR you use a service for routing the UDP traffic to the jaeger-agents. That works, is convenient, and is briefly described in the configuration (we can use a "well-known" name for agent service discovery), but it does not achieve the desired effect (local-node routing).

As such, we have to work around this by binding the daemonset pods to the host network (exposing the jaeger-agent on the node IP) and adding an environment variable containing the host IP to apps that want to submit Jaeger reports (see the snippet above).

@jpkrohling
Collaborator Author

The goal here is to make sure each pod that submits jaeger reports can submit them to the jaeger agent running on the same node.

Would the client need to know the IP where the DaemonSet is running? Or would this act as localhost from the perspective of the client? If the client needs to know the IP, I'd then rather use the service name, as it would provide a way for the client to connect to a more distant agent, in case the local agent is down. Also, I would have guessed that Kubernetes would route connections to the "closest" DaemonSet: is that not the case?

@pieterlange

pieterlange commented Mar 20, 2018

Would the client need to know the IP where the DaemonSet is running?

Yes, but the IP can be dynamically added to the environment using this stanza:

env:
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

Or would this act as localhost from the perspective of the client?

In practice it is the local host, but we need to connect to the host IP address because localhost means something else in the pod's context (there it means the pod itself; that is why it works for sidecars).

If the client needs to know the IP, I'd then rather use the service name, as it would provide a way for the client to connect to a more distant agent, in case the local agent is down.

Giving the IP is relatively easy; we just need to make sure the environment variable is used by whatever submits data over UDP to the agent. Since we can't trust delivery over UDP, we should make sure this data goes to the agent on the same node, as every node is running an agent anyway.
You're right that Kubernetes would make sure the Service only fronts working instances of the agent, but I think this is the wrong tradeoff to make here.
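Concretely, the consuming application's container spec would carry the host IP roughly like this (a sketch; the container name and image are placeholders, and it assumes the application reads JAEGER_AGENT_HOST when configuring its tracer):

containers:
- name: hello-openshift
  image: example/hello-openshift:latest   # placeholder image
  env:
  - name: JAEGER_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  # the application is expected to read JAEGER_AGENT_HOST when setting up its tracer/reporter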

Also, I would have guessed that Kubernetes would route connections to the "closest" DaemonSet: is that not the case?

Maybe in the future, but not right now.

@jpkrohling
Collaborator Author

I just tried removing the service and adding hostNetwork: true to the DaemonSet, but then it has problems accessing the jaeger-collector:

{"level":"info","ts":1521556402.228892,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"jaeger-collector:14267"}
{"level":"error","ts":1521556402.230131,"caller":"peerlistmgr/peer_list_mgr.go:171","msg":"Unable to connect","host:port":"jaeger-collector:14267","connCheckTimeout":0.25,"error":"dial tcp: lookup jaeger-collector on 192.168.122.1:53: no such host","stacktrace":"github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).ensureConnections\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:171\ngithub.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr.(*PeerListManager).maintainConnections\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go:101"}

I guess that's because the agent and collector are not on the same network anymore.

@ledor473
Member

ledor473 commented Mar 20, 2018

I don't think we need the hostNetwork: true when using the fieldPath: status.hostIP

@pieterlange

pieterlange commented Mar 20, 2018

You need to set either hostNetwork: true or specify hostPorts in the ports: {} section of the agent spec.

My oversight here was that if you use hostNetwork: true you should also set dnsPolicy: ClusterFirstWithHostNet on the podSpec for the agent.

On second consideration, it's maybe cleaner not to use hostNetwork: true but to expose the service on the node IP via hostPort instead. That configuration was broken for a while for a bunch of CNIs, so force of habit automatically guided me towards hostNetwork.

The environment configuration is needed to connect to the agent from the applications.
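Putting this together, a sketch of the relevant fields on the agent's pod spec (only the fields under discussion are shown; with option 1 you need the dnsPolicy setting, with option 2 you keep the default pod networking):

spec:
  template:
    spec:
      # option 1: share the node's network namespace
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet   # keeps cluster names like jaeger-collector resolvable
      containers:
      - name: jaeger-agent
        image: jaegertracing/jaeger-agent
        ports:
        - containerPort: 6831
          protocol: UDP
          hostPort: 6831                   # option 2: expose the port on the node IP instead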

@jpkrohling force-pushed the SWS-326-SwitchToDaemonSets branch from ccedc6d to 35a7cb7 on March 20, 2018 15:49
@jpkrohling
Collaborator Author

Thanks, adding dnsPolicy: ClusterFirstWithHostNet did the trick. I just updated this PR to incorporate the changes you mentioned (hostNetwork/dnsPolicy). The sample I used for testing was also updated: https://github.com/jpkrohling/origin/blob/JPK-AddedJaegerTracingToHelloWorld/examples/hello-openshift/hello_openshift.go

@jpkrohling
Collaborator Author

@pieterlange, @ledor473, if this looks good to you, I'll prepare to merge this by Friday.

@pavolloffay
Member

I have a meta request. We include all objects in one file, whereas most projects define objects in separate files. Separate files have several advantages, one being that the agent can be deployed as a sidecar; when it's all defined in one file, users have to remove it from there first.

Could we do the same?

@pieterlange

pieterlange commented Mar 21, 2018

@pavolloffay Let's not dump that into this PR. Ideally this should be deployed through a Helm chart so you can just tweak a parameter to switch between deployment modes (a different beast altogether).

@jpkrohling LGTM

@jpkrohling
Collaborator Author

jpkrohling commented Mar 21, 2018

@pavolloffay, as the "owner" of the tests, do you know why the test is failing?

Caused by: io.fabric8.kubernetes.clnt.v3_1.KubernetesClientException: Failure executing: POST at: https://192.168.39.234:8443/apis/extensions/v1/namespaces/itest-23f563c3/daemonsets. Message: the server could not find the requested resource. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server could not find the requested resource, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).

Do I need to adjust the test somehow, or is it a fabric8 problem?

@pavolloffay
Member

@jpkrohling I don't know why it is failing.

@pieterlange

I don't know the test environment but it's quite possible that daemonsets aren't allowed/available in this type of environment.

@pavolloffay mentioned this pull request on Mar 22, 2018
@@ -24,8 +24,10 @@ items:
jaeger-infra: collector-deployment
spec:
replicas: 1
strategy:
type: Recreate
selector:
Member

  • Is this selector mandatory?
  • What does it do?

Collaborator Author

It is. Without the selector, this happens:

The Deployment "jaeger-collector" is invalid: 
* spec.selector: Required value
* spec.template.metadata.labels: Invalid value: map[string]string{"jaeger-infra":"collector-pod", "app":"jaeger"}: `selector` does not match template `labels`

It looks like it's not required when the version is set to extensions/v1beta1, so I assume this became a requirement when it moved out of beta.
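In other words, with apps/v1 the selector has to be spelled out and must match the pod template's labels, roughly like this (a sketch of only the relevant fields, using the labels from the error message above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
      jaeger-infra: collector-pod
  template:
    metadata:
      labels:
        app: jaeger
        jaeger-infra: collector-pod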

Collaborator Author

As this works with the beta version, I'll keep the template using that, as the test framework seems to require it.

Member

I also had a question about backwards compatibility. Will apps/v1 work on older k8s versions?

Collaborator Author

Probably not, but similarly, are we guaranteeing backwards compatibility? If so, up to which version?

I'd rather break it "now" (if anything), given that this feature has moved out of beta, than commit to keeping backwards compatibility with a beta version.

Collaborator Author

By the way: as the test framework seems to require this older notation, I'm reverting this part of the change, but the backwards compatibility question is a good one.

Member

I don't think we want to commit to anything right now, but k8s itself provides backwards compatibility for v1beta1, so we could leverage that without doing anything special.

Member

If we can support more versions basically for free, all the better.

@@ -24,8 +24,10 @@ items:
jaeger-infra: collector-deployment
spec:
replicas: 1
strategy:
type: Recreate
Member

Why did you remove recreate? Why do you keep it for query deployment then?

@pavolloffay
Member

Another thing on my mind: do we want to add a daemonset to the all-in-one template? At the moment there is an agent service there.

@jpkrohling
Collaborator Author

Do we want to add daemonset to all-in-one template?

Do you mean replacing the Deployment by a DaemonSet?

@pavolloffay
Member

Do you mean replacing the Deployment by a DaemonSet?

Maybe. For back compatibility we can keep the deployment for some time

@jpkrohling force-pushed the SWS-326-SwitchToDaemonSets branch from 35a7cb7 to ec38c01 on March 22, 2018 10:44
@jpkrohling force-pushed the SWS-326-SwitchToDaemonSets branch from ec38c01 to 68b1434 on March 22, 2018 10:45
@jpkrohling
Collaborator Author

Maybe. For back compatibility we can keep the deployment for some time

Let's discuss this in a new issue. I'd also like to get feedback from @pieterlange and @ledor473 before making this change.

@pavolloffay
Member

@pieterlange

I think it's OK to just support the latest stable release

@jpkrohling merged commit 68b1434 into jaegertracing:master on Mar 22, 2018