
Jaeger instance is not getting upgraded automatically #1242

Closed
chandu9333 opened this issue Oct 8, 2020 · 16 comments · Fixed by #1253

Labels
bug Something isn't working

Comments

@chandu9333

chandu9333 commented Oct 8, 2020

The Jaeger instance is not getting upgraded automatically when I upgrade the Jaeger Operator from 1.19.0 to 1.20.0.

My environment: GKE Cluster
BTW, I am using https://github.com/jaegertracing/jaeger-operator/blob/master/deploy/operator.yaml for the Operator installation
and https://github.com/jaegertracing/jaeger-operator/blob/master/deploy/examples/all-in-one-with-options.yaml for the Jaeger instance.

Is my understanding wrong that we need to upgrade the Jaeger instance independently once the Operator is upgraded?

Thanks

@jpkrohling
Contributor

Could you please share the operator logs upon start? The operator should indeed take care of upgrading all instances at previous versions. I also need the output for kubectl get jaegers, showing your non-upgraded Jaeger instance.

@jpkrohling jpkrohling added the needs-info We need some info from you! If not provided after a few weeks, we'll close this issue. label Oct 9, 2020
@chandu9333
Author

chandu9333 commented Oct 9, 2020

Operator logs:

time="2020-10-09T09:20:48Z" level=info msg=Versions arch=amd64 identity=myns.jaeger-operator jaeger=1.20.0 jaeger-operator=v1.20.0 operator-sdk=v0.18.2 os=linux version=go1.14.4

chandu9333@test:~$ kubectl -n myns get jaeger
NAME              STATUS    VERSION   STRATEGY   STORAGE   AGE
jaeger-instance   Running   1.19.2    allinone   memory    41m

@jpkrohling
Contributor

That's just the first line of the log. Is that all you see? There should be a lot more :-)

@chandu9333
Author

chandu9333 commented Oct 9, 2020

I can see the log below:

time="2020-10-09T09:20:48Z" level=info msg=Versions arch=amd64 identity=myns.jaeger-operator jaeger=1.20.0 jaeger-operator=v1.20.0 operator-sdk=v0.18.2 os=linux version=go1.14.4
I1009 09:20:49.291046       1 request.go:621] Throttling request took 1.033169816s, request: GET:https://<IP>:443/apis/charts.helm.k8s.io/v1alpha1?timeout=32s
time="2020-10-09T09:21:04Z" level=info msg="Consider running the operator in a cluster-wide scope for extra features"
time="2020-10-09T09:21:05Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
I1009 09:21:05.713191       1 request.go:621] Throttling request took 1.0431635s, request: GET:https://<IP>:443/apis/cert-manager.io/v1beta1?timeout=32s
time="2020-10-09T09:21:12Z" level=info msg="Auto-detected the platform" platform=kubernetes
time="2020-10-09T09:21:12Z" level=info msg="Auto-detected ingress api" ingress-api=networking
time="2020-10-09T09:21:12Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=no
time="2020-10-09T09:21:12Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2020-10-09T09:21:12Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
I1009 09:21:15.749306       1 request.go:621] Throttling request took 2.943649241s, request: GET:https://<IP>:443/apis/apm.k8s.elastic.co/v1?timeout=32s
time="2020-10-09T09:21:20Z" level=warning msg="could not generate and serve custom resource metrics" error="discovering resource information failed for Jaeger in jaegertracing.io/v1: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
time="2020-10-09T09:21:21Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
I1009 09:21:25.751451       1 request.go:621] Throttling request took 1.44745299s, request: GET:https://<IP>:443/apis/integreatly.org/v1alpha1?timeout=32s
I1009 09:21:35.775960       1 request.go:621] Throttling request took 3.297778484s, request: GET:https://<IP>:443/apis/custom.metrics.k8s.io/v1beta1?timeout=32s
time="2020-10-09T09:21:37Z" level=warning msg="could not create ServiceMonitor object" error="unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
time="2020-10-09T09:21:37Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
time="2020-10-09T09:48:48Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
time="2020-10-09T09:48:48Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
time="2020-10-09T09:49:22Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
time="2020-10-09T10:06:22Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"
time="2020-10-09T10:06:22Z" level=warning msg="failed to setup the Jaeger exporter" error="write udp 127.0.0.1:41127->127.0.0.1:6831: write: connection refused"

@jpkrohling
Contributor

Are you sure your Kubernetes cluster is in a sane state?

@chandu9333
Author

Yes, it is.

@objectiser
Contributor

@chandu9333 looking at the example you are using, it references a very old version of Jaeger:

image: jaegertracing/all-in-one:1.13

Could you either try with a more recent version (1.19), or remove the image line from the CR?
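
For illustration, the CR would then look roughly like this (a minimal sketch based on the example file; the instance name and options are just placeholders):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-instance
spec:
  strategy: allInOne
  allInOne:
    # no 'image:' line here, so the operator picks the version it bundles
    options:
      log-level: debug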

@chandu9333
Author

chandu9333 commented Oct 9, 2020

I changed image: jaegertracing/all-in-one:1.13 to image: jaegertracing/all-in-one:1.19

I observed a few things:

  1. I am able to upgrade the Operator by changing jaegertracing/jaeger-operator:1.19 to jaegertracing/jaeger-operator:1.20:
NAME                                     READY   STATUS      RESTARTS   AGE
jaeger-instance-7bff55675-6tc2v          1/1     Running     0          18m
jaeger-operator-7d5665c856-95wh5         1/1     Running     0          19m

As the above output shows, the jaeger-operator was upgraded first, then the instance got updated automatically.

  2. The Jaeger instance events show the 1.20.0 image being used when I run the kubectl describe pod jaeger-instance-7bff55675-6tc2v command:


Events:
  Type    Reason     Age    From                     Message
  ----    ------     ----   ----                     -------
  Normal  Scheduled  4m32s  default-scheduler        Successfully assigned my-ns/jaeger-instance-7bff55675-6tc2v to mycluster-gke
  Normal  Pulled     4m31s  kubelet, mycluster-gke   Container image "jaegertracing/all-in-one:1.20.0" already present on machine
  Normal  Created    4m31s  kubelet, mycluster-gke   Created container jaeger
  Normal  Started    4m31s  kubelet, mycluster-gke   Started container jaeger
  3. When I run the kubectl get jaeger command, the output still shows 1.19.2 only:
NAME              STATUS    VERSION   STRATEGY   STORAGE   AGE
jaeger-instance   Running   1.19.2    allinone   memory    50m

Do we need to make any changes to show the upgraded version in the kubectl get jaeger list?

Thanks

@jpkrohling
Contributor

Could you run the operator with --log-level=debug? Upon bootstrap, the Jaeger Operator will list all Jaeger instances that it manages and will attempt to upgrade all of them. I don't see anything about it in the log: this, and the other warnings in the log, are what made me think that your cluster isn't healthy.

The debug-level log entries will help us understand if the Jaeger Operator is even finding the Jaeger instances.
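
If it helps, the flag goes on the operator container in deploy/operator.yaml; roughly like this (a sketch only, the exact surrounding fields depend on the operator version):

containers:
  - name: jaeger-operator
    image: jaegertracing/jaeger-operator:1.20.0
    args: ["start", "--log-level=debug"]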

@chandu9333
Author

chandu9333 commented Oct 13, 2020

I have enabled --log-level=debug in the operator.yaml file and deployed the operator.
Please check my comments below.

Logs during the deployment using jaeger-operator version 1.19 (the CR points to the 1.19 image):

time="2020-10-13T07:49:42Z" level=info msg=Versions arch=amd64 identity=my-ns.jaeger-operator jaeger=1.19.2 jaeger-operator=v1.19.0 operator-sdk=v0.18.2 os=linux version=go1.14.4
I1013 07:49:43.220853       1 request.go:621] Throttling request took 1.025828713s, request: GET:https://IP:443/apis/enterprisesearch.k8s.elastic.co/v1beta1?timeout=32s
time="2020-10-13T07:50:07Z" level=info msg="Consider running the operator in a cluster-wide scope for extra features"
I1013 07:50:08.234133       1 request.go:621] Throttling request took 1.043145701s, request: GET:https://IP:443/apis/kafka.strimzi.io/v1alpha1?timeout=32s
time="2020-10-13T07:50:15Z" level=info msg="Auto-detected the platform" platform=kubernetes
time="2020-10-13T07:50:15Z" level=info msg="Auto-detected ingress api" ingress-api=networking
time="2020-10-13T07:50:15Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=no
time="2020-10-13T07:50:15Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2020-10-13T07:50:15Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
I1013 07:50:18.274961       1 request.go:621] Throttling request took 2.944503099s, request: GET:https://IP:443/apis/pxc.percona.com/v1-5-0?timeout=32s
time="2020-10-13T07:50:23Z" level=warning msg="could not generate and serve custom resource metrics" error="discovering resource information failed for Jaeger in jaegertracing.io/v1: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
I1013 07:50:28.277252       1 request.go:621] Throttling request took 1.44747271s, request: GET:https://IP:443/apis/internal.autoscaling.k8s.io/v1alpha1?timeout=32s
I1013 07:50:38.287855       1 request.go:621] Throttling request took 3.246205942s, request: GET:https://IP:443/apis/integreatly.org/v1alpha1?timeout=32s
time="2020-10-13T07:50:39Z" level=warning msg="could not create ServiceMonitor object" error="unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
time="2020-10-13T07:50:39Z" level=debug msg="Not running on OpenShift, so won't configure OAuthProxy imagestream."
time="2020-10-13T07:50:39Z" level=debug msg="Reconciling Jaeger" execution="2020-10-13 07:50:39.841876538 +0000 UTC" instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="Strategy chosen" instance=jaeger-instance namespace=my-ns strategy=allinone
time="2020-10-13T07:50:39Z" level=debug msg="Creating all-in-one deployment" instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="Assembling the UI configmap" instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="Assembling the Sampling configmap" instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="Assembling an all-in-one deployment" instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="skipping agent daemonset" instance=jaeger-instance namespace=my-ns strategy=
time="2020-10-13T07:50:39Z" level=debug msg="updating service account" account=jaeger-instance instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="updating config maps" configMap=jaeger-instance-ui-configuration instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="updating config maps" configMap=jaeger-instance-sampling-configuration instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:39Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-collector
time="2020-10-13T07:50:39Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-query
time="2020-10-13T07:50:40Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-agent
time="2020-10-13T07:50:40Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-collector-headless
time="2020-10-13T07:50:40Z" level=debug msg="updating deployment" deployment=jaeger-instance instance=jaeger-instance namespace=my-ns
time="2020-10-13T07:50:40Z" level=debug msg="Deployment has stabilized" desired=1 name=jaeger-instance namespace=my-ns ready=1
time="2020-10-13T07:50:40Z" level=debug msg="Reconciling Jaeger completed" execution="2020-10-13 07:50:39.841876538 +0000 UTC" instance=jaeger-instance namespace=my-ns

**Logs during the upgrade (using the 1.20 image, which again created a new operator pod):**

time="2020-10-13T07:59:45Z" level=info msg=Versions arch=amd64 identity=my-ns.jaeger-operator jaeger=1.20.0 jaeger-operator=v1.20.0 operator-sdk=v0.18.2 os=linux version=go1.14.4
I1013 07:59:46.201610       1 request.go:621] Throttling request took 1.032274139s, request: GET:https://IP:443/apis/admissionregistration.k8s.io/v1?timeout=32s
time="2020-10-13T07:59:53Z" level=info msg="Consider running the operator in a cluster-wide scope for extra features"
I1013 07:59:56.241410       1 request.go:621] Throttling request took 2.940179401s, request: GET:https://IP:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2020-10-13T08:00:01Z" level=info msg="Auto-detected the platform" platform=kubernetes
time="2020-10-13T08:00:01Z" level=info msg="Auto-detected ingress api" ingress-api=networking
time="2020-10-13T08:00:01Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=no
time="2020-10-13T08:00:01Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2020-10-13T08:00:01Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
I1013 08:00:06.248832       1 request.go:621] Throttling request took 1.397398195s, request: GET:https://IP:443/apis/apm.k8s.elastic.co/v1beta1?timeout=32s
time="2020-10-13T08:00:09Z" level=warning msg="could not generate and serve custom resource metrics" error="discovering resource information failed for Jaeger in jaegertracing.io/v1: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
I1013 08:00:16.251300       1 request.go:621] Throttling request took 3.296738372s, request: GET:https://IP:443/apis/cert-manager.io/v1beta1?timeout=32s
time="2020-10-13T08:00:25Z" level=warning msg="could not create ServiceMonitor object" error="unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
time="2020-10-13T08:00:25Z" level=debug msg="Not running on OpenShift, so won't configure OAuthProxy imagestream."
time="2020-10-13T08:00:25Z" level=debug msg="Reconciling Jaeger" execution="2020-10-13 08:00:25.949458463 +0000 UTC" instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:25Z" level=debug msg="Strategy chosen" instance=jaeger-instance namespace=my-ns strategy=allinone
time="2020-10-13T08:00:25Z" level=debug msg="Creating all-in-one deployment" instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:25Z" level=debug msg="Assembling the UI configmap" instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:25Z" level=debug msg="Assembling the Sampling configmap" instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:25Z" level=debug msg="Assembling an all-in-one deployment" instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:25Z" level=debug msg="skipping agent daemonset" instance=jaeger-instance namespace=my-ns strategy=
time="2020-10-13T08:00:26Z" level=debug msg="updating service account" account=jaeger-instance instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:26Z" level=debug msg="updating config maps" configMap=jaeger-instance-ui-configuration instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:26Z" level=debug msg="updating config maps" configMap=jaeger-instance-sampling-configuration instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:26Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-agent
time="2020-10-13T08:00:26Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-collector-headless
time="2020-10-13T08:00:26Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-collector
time="2020-10-13T08:00:26Z" level=debug msg="updating service" instance=jaeger-instance namespace=my-ns service=jaeger-instance-query
time="2020-10-13T08:00:26Z" level=debug msg="updating deployment" deployment=jaeger-instance instance=jaeger-instance namespace=my-ns
time="2020-10-13T08:00:26Z" level=debug msg="Deployment has stabilized" desired=1 name=jaeger-instance namespace=my-ns ready=1
time="2020-10-13T08:00:26Z" level=debug msg="Reconciling Jaeger completed" execution="2020-10-13 08:00:25.949458463 +0000 UTC" instance=jaeger-instance namespace=my-ns

Once the Operator is up and running, I can see the jaeger-operator and Jaeger versions as 1.20 in the pod (by exec'ing into the pod), as below.

The jaeger-operator version displays as below for the old deployment using the 1.19 image:

chandu9333@test:~$ kubectl -n my-ns exec -it **jaeger-operator-55fb49b65b-7lgsw** -- bash
bash-4.4$ jaeger-operator version
{"jaeger-operator":"v1.19.0","build-date":"2020-08-27T11:46:42Z","jaeger-version":"1.19.2","go-version":"go1.14.4","operator-sdk-version":"v0.18.2"}
bash-4.4$ command terminated with exit code 137

When I start the deployment again using the 1.20 image (operator.yaml file), the pod gets terminated and a new pod is re-created with the new version.
Below is the output once I exec into the operator pod:

chandu9333@test:~$ kubectl -n my-ns exec -it **jaeger-operator-79ffc9fc48-q6rs8** -- bash

bash-4.4$ jaeger-operator version
{"jaeger-operator":"v1.20.0","build-date":"2020-09-30T10:52:11Z","jaeger-version":"1.20.0","go-version":"go1.14.4","operator-sdk-version":"v0.18.2"}

After the upgrade, kubectl still shows the Jaeger version as 1.19.2:

chandu9333@test:~$ kubectl -n my-ns get jaeger
NAME              STATUS    VERSION   STRATEGY   STORAGE   AGE
jaeger-instance   Running   1.19.2    allinone   memory    100m

One more question (sorry for my layman terms, as I am new to k8s):

Why do pods get terminated and re-created when we update the image? Is this expected behavior?

Let's assume I have an operator and an all-in-one deployment ready with 1.19 and am able to see the traces.
After some days, if I want to upgrade to the latest version as soon as it is available, do I lose the data/spans/traces/mesh from the old version?

@jpkrohling
Contributor

Looks like the upgrade is indeed somehow broken; I was able to reproduce your case. @rubenvp8510, is it perhaps because of the semantic versioning changes? Could you investigate it?

@jpkrohling
Contributor

jpkrohling commented Oct 13, 2020

Why pods get terminated and re-created when we update image. Is this expected behavior?

This is the expected Kubernetes behavior: whenever a deployment changes, new pods are created to use the new configuration and the old pods are killed.
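
As a sketch, this is governed by the Deployment's rolling update strategy; the values below are the Kubernetes defaults, shown purely for illustration:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # extra new pods allowed above the desired count
      maxUnavailable: 25%   # old pods that may be unavailable during the rollout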

do I loose data/spans/traces/mesh which are from old version?

If you are using the in-memory storage, then yes. Otherwise, the collector should gracefully shut down, and Kubernetes will only shift traffic to the new pod once it's determined to be healthy.

@chandu9333
Author

chandu9333 commented Oct 13, 2020

Okay. So if I use the production/streaming strategy with Elasticsearch as backend storage, we don't lose any data after the upgrade, right?

@jpkrohling
Contributor

So if I use the production/streaming strategy with Elasticsearch as backend storage, we don't lose any data after the upgrade, right?

You won't lose any data that is already in the storage. You should also not lose any in-flight data while the old pod is shutting down and the new one is starting, but I wouldn't be surprised if a few spans were lost during this process.
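
For reference, a minimal sketch of such a CR (the instance name and Elasticsearch URL are placeholders, not a tested configuration):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200   # placeholder endpoint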

@chandu9333
Author

Thanks.
When can we expect the fix for the issue regarding the version?

@jpkrohling
Contributor

We might have it ready for the next release (1.21.0), which should be due in a month or so. But no promises.

@jpkrohling jpkrohling added bug Something isn't working and removed needs-info We need some info from you! If not provided after a few weeks, we'll close this issue. labels Oct 14, 2020