
Cannot create Collector: Webhook deadline exceeded #100

Closed · CyberHippo opened this issue Oct 21, 2020 · 41 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@CyberHippo
Contributor

Hi,

I cannot create the simplest OpenTelemetryCollector. I get the following error when I try to create it from STDIN:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": Post https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=30s: context deadline exceeded

The Opentelemetry Operator Controller Manager is up and running in the namespace opentelemetry-operator-system:

$ kubectl get po -n opentelemetry-operator-system                 
NAME                                                         READY   STATUS    RESTARTS   AGE
opentelemetry-operator-controller-manager-56f75fbb5d-qrdst   2/2     Running   0          17m

And the logs of both containers (manager and kube-rbac-proxy) do not show any error.

I installed the required resources using:

$ kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Is there something I am missing?

Thank you for your help!

@jpkrohling
Member

Do you have the cert-manager installed?
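
For reference, a quick way to verify that cert-manager is actually installed and healthy (assuming it was installed into its default cert-manager namespace; adjust if yours differs):

$ kubectl get pods -n cert-manager
$ kubectl get crd | grep cert-manager.io

All three cert-manager pods (controller, cainjector and webhook) should be Running before the operator is installed.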

@jpkrohling added the question (Further information is requested) label and removed needs-triage on Oct 21, 2020
@CyberHippo
Contributor Author

Hi @jpkrohling,

Yes it is installed and running.

$ kubectl get issuers.cert-manager.io -n opentelemetry-operator-system
NAME                                       READY   AGE
opentelemetry-operator-selfsigned-issuer   True    48s


$ kubectl get certificates.cert-manager.io -n opentelemetry-operator-system 
NAME                                  READY   SECRET                AGE
opentelemetry-operator-serving-cert   True    webhook-server-cert   60s

Here are some logs from the cert-manager:

$ kubectl logs cert-manager-9bf7fffdd-4d6wd -n cert-manager
I1021 15:33:14.069252       1 conditions.go:144] Found status change for Certificate "opentelemetry-operator-serving-cert" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2020-10-21 15:33:14.069241742 +0000 UTC m=+875584.981646563
I1021 15:33:14.151210       1 controller.go:135] cert-manager/controller/certificates "msg"="finished processing work item" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert" 
I1021 15:33:14.151251       1 controller.go:129] cert-manager/controller/certificates "msg"="syncing item" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert" 
I1021 15:33:14.151863       1 util.go:177] cert-manager/controller/certificates "msg"="certificate scheduled for renewal" "duration_until_renewal"="1439h59m58.848183422s" "related_resource_kind"="Secret" "related_resource_name"="webhook-server-cert" "related_resource_namespace"="opentelemetry-operator-system" "resource_kind"="Certificate" "resource_name"="opentelemetry-operator-serving-cert" "resource_namespace"="opentelemetry-operator-system" 
I1021 15:33:14.151907       1 sync.go:310] cert-manager/controller/certificates "msg"="certificate does not require re-issuance. certificate renewal scheduled near expiry time." "related_resource_kind"="CertificateRequest" "related_resource_name"="opentelemetry-operator-serving-cert-2532130321" "related_resource_namespace"="opentelemetry-operator-system" "resource_kind"="Certificate" "resource_name"="opentelemetry-operator-serving-cert" "resource_namespace"="opentelemetry-operator-system" 
I1021 15:33:14.152272       1 controller.go:135] cert-manager/controller/certificates "msg"="finished processing work item" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert" 

@jpkrohling self-assigned this on Oct 22, 2020
@jpkrohling
Member

Could you try killing the pod? Is it possible that the operator deployment started before the cert-manager was ready? Are you able to consistently reproduce in minikube? I just tried it out, and it seems to work for me:

$ kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v0.16.1/cert-manager.yaml
customresourcedefinition.apiextensions.k8s.io/certificaterequests.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/challenges.acme.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/clusterissuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/issuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/orders.acme.cert-manager.io created
namespace/cert-manager created
serviceaccount/cert-manager-cainjector created
serviceaccount/cert-manager created
serviceaccount/cert-manager-webhook created
clusterrole.rbac.authorization.k8s.io/cert-manager-cainjector created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-issuers created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-certificates created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-orders created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-challenges created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created
clusterrole.rbac.authorization.k8s.io/cert-manager-view created
clusterrole.rbac.authorization.k8s.io/cert-manager-edit created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-issuers created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-certificates created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-orders created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-challenges created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created
role.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
role.rbac.authorization.k8s.io/cert-manager:leaderelection created
role.rbac.authorization.k8s.io/cert-manager-webhook:dynamic-serving created
rolebinding.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager-webhook:dynamic-serving created
service/cert-manager created
service/cert-manager-webhook created
deployment.apps/cert-manager-cainjector created
deployment.apps/cert-manager created
deployment.apps/cert-manager-webhook created
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created

$ kubectl get pods -n cert-manager
cert-manager-cainjector-fc6c787db-lqdlb   1/1     Running             0          27s

$ kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
namespace/opentelemetry-operator-system created
customresourcedefinition.apiextensions.k8s.io/opentelemetrycollectors.opentelemetry.io created
role.rbac.authorization.k8s.io/opentelemetry-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/opentelemetry-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/opentelemetry-operator-proxy-role created
clusterrole.rbac.authorization.k8s.io/opentelemetry-operator-metrics-reader created
rolebinding.rbac.authorization.k8s.io/opentelemetry-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/opentelemetry-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/opentelemetry-operator-proxy-rolebinding created
service/opentelemetry-operator-controller-manager-metrics-service created
service/opentelemetry-operator-webhook-service created
deployment.apps/opentelemetry-operator-controller-manager created
certificate.cert-manager.io/opentelemetry-operator-serving-cert created
issuer.cert-manager.io/opentelemetry-operator-selfsigned-issuer created
mutatingwebhookconfiguration.admissionregistration.k8s.io/opentelemetry-operator-mutating-webhook-configuration created
validatingwebhookconfiguration.admissionregistration.k8s.io/opentelemetry-operator-validating-webhook-configuration created

$ kubectl get pods -n opentelemetry-operator-system
NAME                                                         READY   STATUS    RESTARTS   AGE
opentelemetry-operator-controller-manager-548b94f546-tmzpf   2/2     Running   0          18s

$ kubectl apply -f config/samples/core_v1alpha1_opentelemetrycollector.yaml 
opentelemetrycollector.opentelemetry.io/opentelemetrycollector-sample created

$ kubectl get pods -n default
NAME                                                       READY   STATUS              RESTARTS   AGE
opentelemetrycollector-sample-collector-5d9bbb498c-hf2r7   1/1     Running             0          10s

$ kubectl logs deployments/opentelemetrycollector-sample-collector | tail -n 1
{"level":"info","ts":1603351057.2240055,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}

@jpkrohling added the needs-info label and removed question (Further information is requested) on Oct 22, 2020
@CyberHippo
Contributor Author

Hi @jpkrohling,

I also tried to reproduce it on minikube and everything worked fine.
I am closing this issue until I have time for further investigation.

Thank you for your quick input!

@krak3n

krak3n commented Apr 12, 2021

I am hitting this as well, but I cannot replicate it in minikube; everything there works fine. As soon as I try this in our cluster, it times out. There is zero logging from the operator, so I have no idea what's going on 😭

@jpkrohling
Member

@krak3n, there's no logging in the operator because the problem is probably before the operator gets the chance to see the change. Do you have cert-manager installed? Anything suspicious when you run kubectl get events?
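
For example, using the namespace from the error message above (standard kubectl flags, shown here only as a starting point):

$ kubectl get events -n opentelemetry-operator-system --sort-by=.lastTimestamp
$ kubectl get events -A --field-selector type=Warning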

@krak3n

krak3n commented Apr 13, 2021

Yeah @jpkrohling, I thought as much, sorry for my cries of desperation lol. It is likely an issue with our cluster, permissions or something that's blocking the Post https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s webhook call 🤷

There is nothing in kubectl get events that looks related to this, so I'm kind of finding my way in the dark.

cert-manager is all running fine; it's been there for months and we recently updated it to the latest release, so I don't think the issue is there.

@jpkrohling
Member

And I assume the operator itself is also up and running, right? Are you able to expose the service via, say, kubectl port-forward and run a curl call directly?
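
A minimal sketch of that check, using the service and namespace names from the error above (curl's -k is needed because the webhook serves a certificate signed by the operator's self-signed issuer):

$ kubectl -n opentelemetry-operator-system port-forward service/opentelemetry-operator-webhook-service 9443:443
$ curl -k -X POST -H 'Content-Type: application/json' \
    'https://localhost:9443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s'

Any HTTP response here (even a 400 complaining about the request body) means the webhook server itself is up; a timeout only when the API server calls it points at networking between the control plane and the nodes.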

@krak3n

krak3n commented Apr 13, 2021

@jpkrohling yup, I opened a port-forward to the service with:

kubectl port-forward service/opentelemetry-operator-webhook-service 9443:443

Then I made a POST request to the same endpoint and it works fine:

http --verify=no POST https://localhost:9443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector\?timeout=10s
HTTP/1.1 200 OK
Content-Length: 128
Content-Type: text/plain; charset=utf-8
Date: Tue, 13 Apr 2021 10:06:52 GMT

{
    "response": {
        "allowed": false,
        "status": {
            "code": 400,
            "message": "contentType=, expected application/json",
            "metadata": {}
        },
        "uid": ""
    }
}

And I can see the error in the operator logs.

@jpkrohling
Member

Then either the TLS cert isn't being accepted by the Kubernetes API (the caller), or there's a networking issue. Not sure what I can do at this point to help you, though :-/
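
A quick way to rule out the TLS side is to check whether cert-manager's cainjector has patched the CA bundle into the webhook configuration (configuration name from the install output above; the webhook index may differ in your cluster). An empty result means the API server has no CA to trust for this webhook:

$ kubectl get mutatingwebhookconfiguration opentelemetry-operator-mutating-webhook-configuration \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40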

@krak3n

krak3n commented Apr 13, 2021

Yeah I am at a loss as well 🤷

The cert seems fine based on the state of the resources.

➜ kubectl get Issuer
NAME                                       READY   AGE
opentelemetry-operator-selfsigned-issuer   True    19h
➜ kubectl get Certificate
NAME                                  READY   SECRET                                                   AGE
opentelemetry-operator-serving-cert   True    opentelemetry-operator-controller-manager-service-cert   19h
➜ kubectl get CertificateRequest
NAME                                        APPROVED   DENIED   READY   ISSUER                                     REQUESTOR                                         AGE
opentelemetry-operator-serving-cert-nddc5   True                True    opentelemetry-operator-selfsigned-issuer   system:serviceaccount:cert-manager:cert-manager   19h

And if I do the exact same setup in minikube it's all fine. I'll keep digging.

@krak3n

krak3n commented Apr 13, 2021

@jpkrohling, so I managed to get it working by changing the failurePolicy from Fail to Ignore for both the opentelemetry-operator-mutating-webhook-configuration and opentelemetry-operator-validating-webhook-configuration webhooks. It took its time but worked, and now my Deployment is running.
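
For anyone wanting to try the same workaround, a sketch with kubectl patch (this assumes the collector webhook is the first entry in each configuration; adjust the index if yours differs). Note that this only masks the underlying problem: with Ignore, the API server silently skips the webhook whenever it cannot reach it, so defaulting, validation and sidecar injection will not happen:

$ kubectl patch mutatingwebhookconfiguration opentelemetry-operator-mutating-webhook-configuration \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
$ kubectl patch validatingwebhookconfiguration opentelemetry-operator-validating-webhook-configuration \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'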

@krak3n

krak3n commented Apr 13, 2021

Ok so here is the full set of events from the opentelemetry-operator-system:

18m         Normal    LeaderElection      configmap/9f7554c3.opentelemetry.io                               opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m_aa8d154f-a7da-4d9c-b43b-31f23cf4c560 became leader
18m         Normal    LeaderElection      lease/9f7554c3.opentelemetry.io                                   opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m_aa8d154f-a7da-4d9c-b43b-31f23cf4c560 became leader
18m         Normal    Scheduled           pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Successfully assigned opentelemetry-operator-system/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m to gke-staging-medium-da90c193-r7mp
18m         Warning   FailedMount         pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    MountVolume.SetUp failed for volume "cert" : secret "opentelemetry-operator-controller-manager-service-cert" not found
18m         Normal    Pulling             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Pulling image "quay.io/opentelemetry/opentelemetry-operator:0.23.0"
18m         Normal    Pulled              pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Successfully pulled image "quay.io/opentelemetry/opentelemetry-operator:0.23.0"
18m         Normal    Created             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Created container manager
18m         Normal    Started             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Started container manager
18m         Normal    Pulling             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Pulling image "gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0"
18m         Normal    Pulled              pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Successfully pulled image "gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0"
18m         Normal    Created             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Created container kube-rbac-proxy
18m         Normal    Started             pod/opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m    Started container kube-rbac-proxy
18m         Normal    SuccessfulCreate    replicaset/opentelemetry-operator-controller-manager-67c98bf7b5   Created pod: opentelemetry-operator-controller-manager-67c98bf7b5-jjm8m
18m         Normal    ScalingReplicaSet   deployment/opentelemetry-operator-controller-manager              Scaled up replica set opentelemetry-operator-controller-manager-67c98bf7b5 to 1
18m         Warning   BadConfig           certificaterequest/opentelemetry-operator-serving-cert-4mx96      Certificate will be issued with an empty Issuer DN, which contravenes RFC 5280 and could break some strict clients
18m         Normal    CertificateIssued   certificaterequest/opentelemetry-operator-serving-cert-4mx96      Certificate fetched from issuer successfully
18m         Normal    cert-manager.io     certificaterequest/opentelemetry-operator-serving-cert-4mx96      Certificate request has been approved by cert-manager.io
18m         Normal    Issuing             certificate/opentelemetry-operator-serving-cert                   Issuing certificate as Secret does not exist
18m         Normal    Generated           certificate/opentelemetry-operator-serving-cert                   Stored new private key in temporary Secret resource "opentelemetry-operator-serving-cert-qcwml"
18m         Normal    Requested           certificate/opentelemetry-operator-serving-cert                   Created new CertificateRequest resource "opentelemetry-operator-serving-cert-4mx96"
18m         Normal    Issuing             certificate/opentelemetry-operator-serving-cert                   The certificate has been successfully issued
13m         Normal    Scheduled           pod/otelcol-deployment-collector-75b4ccdf69-nxqlh                 Successfully assigned opentelemetry-operator-system/otelcol-deployment-collector-75b4ccdf69-nxqlh to gke-staging-medium-da90c193-ngf2
13m         Normal    Pulling             pod/otelcol-deployment-collector-75b4ccdf69-nxqlh                 Pulling image "otel/opentelemetry-collector:0.23.0"
13m         Normal    Pulled              pod/otelcol-deployment-collector-75b4ccdf69-nxqlh                 Successfully pulled image "otel/opentelemetry-collector:0.23.0"
13m         Normal    Created             pod/otelcol-deployment-collector-75b4ccdf69-nxqlh                 Created container otc-container
13m         Normal    Started             pod/otelcol-deployment-collector-75b4ccdf69-nxqlh                 Started container otc-container
13m         Normal    Scheduled           pod/otelcol-deployment-collector-75b4ccdf69-t5ffv                 Successfully assigned opentelemetry-operator-system/otelcol-deployment-collector-75b4ccdf69-t5ffv to gke-staging-medium-474d970c-mwr2
12m         Normal    Pulling             pod/otelcol-deployment-collector-75b4ccdf69-t5ffv                 Pulling image "otel/opentelemetry-collector:0.23.0"
12m         Normal    Pulled              pod/otelcol-deployment-collector-75b4ccdf69-t5ffv                 Successfully pulled image "otel/opentelemetry-collector:0.23.0"
12m         Normal    Created             pod/otelcol-deployment-collector-75b4ccdf69-t5ffv                 Created container otc-container
12m         Normal    Started             pod/otelcol-deployment-collector-75b4ccdf69-t5ffv                 Started container otc-container
13m         Normal    SuccessfulCreate    replicaset/otelcol-deployment-collector-75b4ccdf69                Created pod: otelcol-deployment-collector-75b4ccdf69-nxqlh
13m         Normal    SuccessfulCreate    replicaset/otelcol-deployment-collector-75b4ccdf69                Created pod: otelcol-deployment-collector-75b4ccdf69-t5ffv
13m         Normal    ScalingReplicaSet   deployment/otelcol-deployment-collector                           Scaled up replica set otelcol-deployment-collector-75b4ccdf69 to 2

So I am guessing the actual issue is the FailedMount of the certificate.
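
One way to check that theory is to compare the Secret name the Certificate writes with the one the deployment's cert volume mounts (resource names taken from the events above). If they differ, for example because manifests from different operator versions were mixed, the mount fails exactly like this; if they match and the FailedMount only appeared once at startup, the pod most likely just started before the certificate had been issued, and restarting it clears the warning:

$ kubectl -n opentelemetry-operator-system get certificate opentelemetry-operator-serving-cert \
    -o jsonpath='{.spec.secretName}'
$ kubectl -n opentelemetry-operator-system get deployment opentelemetry-operator-controller-manager \
    -o jsonpath='{.spec.template.spec.volumes[?(@.name=="cert")].secret.secretName}'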

@krak3n

krak3n commented Apr 13, 2021

@CyberHippo did you ever figure out what your issue was?

@CyberHippo
Contributor Author

@krak3n No I did not. I am now using a custom version of the otel-config.yaml for k8s which works perfectly fine.

@krak3n

krak3n commented Apr 13, 2021

Thanks @CyberHippo, that's a shame; I was really hoping to use sidecar injection. Oh well.

@jpkrohling reopened this on Apr 13, 2021
@jpkrohling
Member

jpkrohling commented Apr 13, 2021

It is in my interest to have you both using the operator. Please help me reproduce this issue so that I can fix it :-) If you can't consistently reproduce with minikube, would one of you be able to give me access to your cluster?

@krak3n

krak3n commented Apr 13, 2021

@jpkrohling it's a work cluster, though it is just staging, so it might be tricky; I'll run it past the team at stand-up tomorrow. I was going to try to set up my own GKE cluster and see if I can replicate it there rather than in minikube.

I'll do another fresh install on the cluster tomorrow morning and give you all the logs / events I can get my hands on.

@jpkrohling
Member

A long shot, but which version of the cert-manager are you using? Have you tried using their latest version?
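
If it helps, the running version can be read straight from the image tag (assuming the default deployment name and namespace from the cert-manager manifests):

$ kubectl -n cert-manager get deployment cert-manager \
    -o jsonpath='{.spec.template.spec.containers[0].image}'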

@krak3n

krak3n commented Apr 14, 2021

Morning @jpkrohling - we run cert-manager 1.3.0, which I believe is the latest. We've been running cert-manager for a while in the cluster and it is currently handling some Let's Encrypt certs as well.

I attempted a fresh install of the operator this morning using the example in the README.md and the timeout issue still occurs. Here are all the logs I could gather; let me know if you need anything else. (I am on the CNCF Slack (name: @Chris Reeves) if you would like to DM me.)

  • cert-manager
I0414 08:00:25.600333       1 conditions.go:182] Setting lastTransitionTime for Certificate "opentelemetry-operator-serving-cert" condition "Ready" to 2021-04-14 08:00:25.600318817 +0000 UTC m=+60613.733976819
I0414 08:00:25.600696       1 trigger_controller.go:189] cert-manager/controller/certificates-trigger "msg"="Certificate must be re-issued" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert" "message"="Issuing certificate as Secret does not exist" "reason"="DoesNotExist"
I0414 08:00:25.600716       1 conditions.go:182] Setting lastTransitionTime for Certificate "opentelemetry-operator-serving-cert" condition "Issuing" to 2021-04-14 08:00:25.600711772 +0000 UTC m=+60613.734369764
I0414 08:00:25.780804       1 conditions.go:95] Setting lastTransitionTime for Issuer "opentelemetry-operator-selfsigned-issuer" condition "Ready" to 2021-04-14 08:00:25.780794636 +0000 UTC m=+60613.914452625
E0414 08:00:25.797327       1 controller.go:158] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opentelemetry-operator-serving-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert"
I0414 08:00:25.799520       1 conditions.go:182] Setting lastTransitionTime for Certificate "opentelemetry-operator-serving-cert" condition "Ready" to 2021-04-14 08:00:25.799509253 +0000 UTC m=+60613.933167313
E0414 08:00:25.960836       1 controller.go:158] cert-manager/controller/certificates-key-manager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opentelemetry-operator-serving-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert"
I0414 08:00:26.013583       1 conditions.go:242] Setting lastTransitionTime for CertificateRequest "opentelemetry-operator-serving-cert-t88lc" condition "Approved" to 2021-04-14 08:00:26.013571101 +0000 UTC m=+60614.147229114
I0414 08:00:26.020670       1 conditions.go:242] Setting lastTransitionTime for CertificateRequest "opentelemetry-operator-serving-cert-t88lc" condition "Ready" to 2021-04-14 08:00:26.020659706 +0000 UTC m=+60614.154317694
E0414 08:00:26.071022       1 controller.go:158] cert-manager/controller/certificaterequests-approver "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificaterequests.cert-manager.io \"opentelemetry-operator-serving-cert-t88lc\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert-t88lc"
I0414 08:00:26.087278       1 conditions.go:171] Found status change for Certificate "opentelemetry-operator-serving-cert" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2021-04-14 08:00:26.087269032 +0000 UTC m=+60614.220927022
E0414 08:00:26.117437       1 controller.go:158] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opentelemetry-operator-serving-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert"
I0414 08:00:26.120755       1 conditions.go:171] Found status change for Certificate "opentelemetry-operator-serving-cert" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2021-04-14 08:00:26.120746135 +0000 UTC m=+60614.254404130
E0414 08:00:26.190538       1 controller.go:158] cert-manager/controller/certificates-issuing "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opentelemetry-operator-serving-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert"
E0414 08:00:26.304743       1 controller.go:158] cert-manager/controller/certificates-key-manager "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opentelemetry-operator-serving-cert\": the object has been modified; please apply your changes to the latest version and try again" "key"="opentelemetry-operator-system/opentelemetry-operator-serving-cert"
  • opentelemetry-operator-controller-manager (manager pod)
{"level":"info","ts":1618387229.828056,"msg":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.23.0","opentelemetry-collector":"0.23.0","build-date":"2021-04-07T08:49:35Z","go-version":"go1.15.11","go-arch":"amd64","go-os":"linux"}
{"level":"info","ts":1618387229.8285353,"logger":"setup","msg":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
I0414 08:00:31.028974       1 request.go:655] Throttling request took 1.002536146s, request: GET:https://10.148.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
{"level":"info","ts":1618387231.2243612,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1618387231.2247417,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1618387231.225007,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1618387231.2252865,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1618387231.225385,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1618387231.2256408,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":1618387231.225721,"logger":"setup","msg":"starting manager"}
I0414 08:00:31.226068       1 leaderelection.go:243] attempting to acquire leader lease opentelemetry-operator-system/9f7554c3.opentelemetry.io...
{"level":"info","ts":1618387231.2260823,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1618387231.2262328,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1618387231.2269692,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1618387231.2271342,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":1618387231.227137,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
I0414 08:00:31.246072       1 leaderelection.go:253] successfully acquired lease opentelemetry-operator-system/9f7554c3.opentelemetry.io
{"level":"info","ts":1618387231.2463834,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387231.3241131,"logger":"upgrade","msg":"looking for managed instances to upgrade"}
{"level":"info","ts":1618387231.424108,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387231.4245396,"logger":"upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":1618387233.3249664,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387233.4256015,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387233.5269732,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387233.727477,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1618387233.8280468,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting Controller","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector"}
{"level":"info","ts":1618387233.8282218,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting workers","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","worker count":1}
  • opentelemetry-operator-controller-manager (kube-rbac-proxy pod)
I0414 08:00:33.063002       1 main.go:186] Valid token audiences:
I0414 08:00:33.063176       1 main.go:232] Generating self signed cert as no cert is provided
I0414 08:00:33.279236       1 main.go:281] Starting TCP socket on 0.0.0.0:8443
I0414 08:00:33.279806       1 main.go:288] Listening securely on 0.0.0.0:8443
  • Events from the opentelemetry-operator-system namespace
7m44s       Normal    LeaderElection      configmap/9f7554c3.opentelemetry.io                               opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8_90f783b5-4197-411e-a8ae-f4cb7557b16f became leader
7m44s       Normal    LeaderElection      lease/9f7554c3.opentelemetry.io                                   opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8_90f783b5-4197-411e-a8ae-f4cb7557b16f became leader
7m50s       Normal    Scheduled           pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Successfully assigned opentelemetry-operator-system/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8 to gke-staging-cpu-3c07030d-ckgg
7m50s       Warning   FailedMount         pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    MountVolume.SetUp failed for volume "cert" : secret "opentelemetry-operator-controller-manager-service-cert" not found
7m49s       Normal    Pulling             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Pulling image "quay.io/opentelemetry/opentelemetry-operator:0.23.0"
7m46s       Normal    Pulled              pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Successfully pulled image "quay.io/opentelemetry/opentelemetry-operator:0.23.0"
7m46s       Normal    Created             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Created container manager
7m46s       Normal    Started             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Started container manager
7m46s       Normal    Pulling             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Pulling image "gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0"
7m43s       Normal    Pulled              pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Successfully pulled image "gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0"
7m43s       Normal    Created             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Created container kube-rbac-proxy
7m42s       Normal    Started             pod/opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8    Started container kube-rbac-proxy
7m50s       Normal    SuccessfulCreate    replicaset/opentelemetry-operator-controller-manager-67c98bf7b5   Created pod: opentelemetry-operator-controller-manager-67c98bf7b5-7bbb8
7m50s       Normal    ScalingReplicaSet   deployment/opentelemetry-operator-controller-manager              Scaled up replica set opentelemetry-operator-controller-manager-67c98bf7b5 to 1
7m49s       Warning   BadConfig           certificaterequest/opentelemetry-operator-serving-cert-t88lc      Certificate will be issued with an empty Issuer DN, which contravenes RFC 5280 and could break some strict clients
7m49s       Normal    CertificateIssued   certificaterequest/opentelemetry-operator-serving-cert-t88lc      Certificate fetched from issuer successfully
7m50s       Normal    Issuing             certificate/opentelemetry-operator-serving-cert                   Issuing certificate as Secret does not exist
7m50s       Normal    Generated           certificate/opentelemetry-operator-serving-cert                   Stored new private key in temporary Secret resource "opentelemetry-operator-serving-cert-qsb4q"
7m49s       Normal    Requested           certificate/opentelemetry-operator-serving-cert                   Created new CertificateRequest resource "opentelemetry-operator-serving-cert-t88lc"
7m49s       Normal    Issuing             certificate/opentelemetry-operator-serving-cert                   The certificate has been successfully issued

@jpkrohling
Member

This is running in GKE, right? I have meetings the whole day today and tomorrow, and I'm off Friday, but I'll try to get to this one early next week.

@krak3n

krak3n commented Apr 14, 2021

Hi @jpkrohling, yup, it's in GKE. Yeah, no worries, hit me up whenever you get some time 👍

@krak3n

krak3n commented Apr 20, 2021

Hi @jpkrohling, we can't give you direct access to the cluster, but we could jump on a call or something and you could debug through me lol. Not ideal, I know 🤷

@CyberHippo
Contributor Author

@krak3n Is your GKE cluster private? If yes, which ports are open on your GCP firewall rule gke-<name>-<id>-master?

@krak3n

krak3n commented May 4, 2021

It is private and allows 443

@krak3n

krak3n commented May 6, 2021

@CyberHippo do other ports need to be open?

@CyberHippo
Contributor Author

@krak3n I'm not sure, but I suspect that another port needs to be open as well. I encountered a similar issue with an ingress controller and a private GKE cluster. @jpkrohling Do you know if specific ports need to be open on the master?

@jpkrohling
Member

Off the top of my head, I can't think of any other ports we might use, but I would check the kubebuilder docs.

@jpkrohling
Member

There's some information scattered around this issue, so let me consolidate it here (a quick way to verify these is shown after the list):

  • 8080 for the controller metrics (should not be exposed)
  • 9443 for the webhook in the controller container (exposed as opentelemetry-operator-webhook-service, port 443)
  • 8443 for the rbac sidecar proxy (exposed as opentelemetry-operator-controller-manager-metrics-service, port 8443, proxies to 8080)
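
A quick way to confirm what is actually exposed, and whether the webhook service has a backing endpoint (names from the install output earlier in this issue):

$ kubectl -n opentelemetry-operator-system get svc
$ kubectl -n opentelemetry-operator-system get endpoints opentelemetry-operator-webhook-service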

@ruibinghao
Contributor

ruibinghao commented May 12, 2021

I am running into a similar issue on OpenShift but with a different signature in the error message. (BTW, I did have cert-manager installed without any issue):

% oc apply -f otel-collector.yaml                                                 
Error from server (InternalError): error when creating "otel-collector.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": no endpoints available for service "opentelemetry-operator-webhook-service"

The operator controller pod was initially running fine, but as soon as I ran the above command to create a collector instance, it got into a CrashLoopBackOff situation.

% oc get pods
NAME                                                         READY   STATUS             RESTARTS   AGE
opentelemetry-operator-controller-manager-7686c8f98d-kk549   1/2     CrashLoopBackOff   10         31m

% oc describe po opentelemetry-operator-controller-manager-7686c8f98d-kk549
Name:         opentelemetry-operator-controller-manager-7686c8f98d-kk549
Namespace:    opentelemetry-operator-system
Priority:     0
Node:         vivaocp-245mm-worker-0-gzc2s/10.0.0.20
Start Time:   Wed, 12 May 2021 10:30:42 -0400
Labels:       control-plane=controller-manager
              pod-template-hash=7686c8f98d
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.64.2.165"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.64.2.165"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           10.64.2.165
IPs:
  IP:           10.64.2.165
Controlled By:  ReplicaSet/opentelemetry-operator-controller-manager-7686c8f98d
Containers:
  manager:
    Container ID:  cri-o://f18f34b715537c6185cdf79d30374e96116284bf49bf7445d3eb570715320916
    Image:         quay.io/opentelemetry/opentelemetry-operator:v0.25.0
    Image ID:      quay.io/opentelemetry/opentelemetry-operator@sha256:c55858ac62c7086503cdc77cf789c4fec283c85b804ed84bb1e1a121c065025c
    Port:          9443/TCP
    Host Port:     0/TCP
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
      --enable-leader-election
    State:          Running
      Started:      Wed, 12 May 2021 11:06:44 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 12 May 2021 11:01:12 -0400
      Finished:     Wed, 12 May 2021 11:01:37 -0400
    Ready:          True
    Restart Count:  11
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:        100m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wgwgm (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://797b1c5f8922a864ee1306d62acf0db3669b5f657a6cf00af81d4739300d50cf
    Image:         gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0
    Image ID:      gcr.io/kubebuilder/kube-rbac-proxy@sha256:e10d1d982dd653db74ca87a1d1ad017bc5ef1aeb651bdea089debf16485b080b
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Wed, 12 May 2021 10:30:48 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wgwgm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  opentelemetry-operator-controller-manager-service-cert
    Optional:    false
  default-token-wgwgm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wgwgm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       36m                  default-scheduler  Successfully assigned opentelemetry-operator-system/opentelemetry-operator-controller-manager-7686c8f98d-kk549 to vivaocp-245mm-worker-0-gzc2s
  Normal   AddedInterface  36m                  multus             Add eth0 [10.64.2.165/23]
  Normal   Pulling         36m                  kubelet            Pulling image "quay.io/opentelemetry/opentelemetry-operator:v0.25.0"
  Normal   Pulled          36m                  kubelet            Successfully pulled image "quay.io/opentelemetry/opentelemetry-operator:v0.25.0" in 2.779485135s
  Normal   Pulled          36m                  kubelet            Container image "gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0" already present on machine
  Normal   Created         36m                  kubelet            Created container kube-rbac-proxy
  Normal   Started         36m                  kubelet            Started container kube-rbac-proxy
  Normal   Started         34m (x4 over 36m)    kubelet            Started container manager
  Normal   Created         32m (x5 over 36m)    kubelet            Created container manager
  Normal   Pulled          30m (x5 over 35m)    kubelet            Container image "quay.io/opentelemetry/opentelemetry-operator:v0.25.0" already present on machine
  Warning  BackOff         63s (x140 over 35m)  kubelet            Back-off restarting failed container

Logs from the controller pod:

% oc logs -f opentelemetry-operator-controller-manager-7686c8f98d-kk549 -c manager
{"level":"info","ts":1620831672.3603256,"msg":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.25.0","opentelemetry-collector":"0.25.0","build-date":"2021-05-10T10:22:53Z","go-version":"go1.15.12","go-arch":"amd64","go-os":"linux"}
{"level":"info","ts":1620831672.3619967,"logger":"setup","msg":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
I0512 15:01:13.462670       1 request.go:655] Throttling request took 1.001831764s, request: GET:https://172.30.0.1:443/apis/k8s.nginx.org/v1?timeout=32s
{"level":"info","ts":1620831676.8576038,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1620831676.8578858,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1620831676.8579705,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1620831676.8580375,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1620831676.858082,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1620831676.8581784,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":1620831676.8582199,"logger":"setup","msg":"starting manager"}
I0512 15:01:16.858514       1 leaderelection.go:243] attempting to acquire leader lease opentelemetry-operator-system/9f7554c3.opentelemetry.io...
{"level":"info","ts":1620831676.8585513,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1620831676.8585052,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1620831676.85886,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1620831676.8590858,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1620831676.8591828,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0512 15:01:34.419454       1 leaderelection.go:253] successfully acquired lease opentelemetry-operator-system/9f7554c3.opentelemetry.io
{"level":"info","ts":1620831694.419665,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1620831694.4199674,"logger":"upgrade","msg":"looking for managed instances to upgrade"}
{"level":"info","ts":1620831694.520045,"logger":"upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":1620831694.520072,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}

As a result of the pod crash, there were no endpoints for the opentelemetry-operator-webhook-service service.

% oc get services
I0512 12:12:14.450657   13283 request.go:621] Throttling request took 1.007599832s, request: GET:https://api.vivaocp.comcast.net:6443/apis/operator.knative.dev/v1alpha1?timeout=32s
NAME                                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
opentelemetry-operator-controller-manager-metrics-service   ClusterIP   172.30.17.111   <none>        8443/TCP   147m
opentelemetry-operator-webhook-service                      ClusterIP   172.30.170.58   <none>        443/TCP    147m
% oc describe service opentelemetry-operator-webhook-service
Name:              opentelemetry-operator-webhook-service
Namespace:         opentelemetry-operator-system
Labels:            <none>
Annotations:       Selector:  control-plane=controller-manager
Type:              ClusterIP
IP:                172.30.170.58
Port:              <unset>  443/TCP
TargetPort:        9443/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>

Not sure why the controller pod crashed in the first place.

@ruibinghao
Contributor

ruibinghao commented May 12, 2021

Update: I did some research; exit code 137 indicates the container was killed for running out of memory (OOMKilled). So I modified the resource requests and limits for the operator manager container to the following:

  manager:
    Container ID:  cri-o://6f953d74c94f7937e92a377614dcbf022829bc3f2c251e1b016e365ea39bc990
    Image:         quay.io/opentelemetry/opentelemetry-operator:0.24.0
    Image ID:      quay.io/opentelemetry/opentelemetry-operator@sha256:94121fe1290eff08b89db89ca2e848a1684b9fb6805f99523607542a4436bb92
    Port:          9443/TCP
    Host Port:     0/TCP
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
      --enable-leader-election
    State:          Running
      Started:      Wed, 12 May 2021 19:19:16 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:        100m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bk97q (ro)

With this change in place, the operator is no longer crashing and I was able to deploy the simplest collector successfully.
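
For anyone hitting the same OOMKilled loop, a sketch of applying that change in place with kubectl patch (container index 0 is the manager container in the pod spec shown above; verify that before patching, and note that re-applying the release manifest will revert it):

$ kubectl -n opentelemetry-operator-system patch deployment opentelemetry-operator-controller-manager \
    --type=json -p='[
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "128Mi"},
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "64Mi"}
    ]'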

@jpkrohling
Member

@ruibinghao, would you be able to send in a PR bumping that? Here's the relevant place:

resources:
  limits:
    cpu: 100m
    memory: 30Mi
  requests:
    cpu: 100m
    memory: 20Mi
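
For reference, the kind of bump that worked in the report above (the exact values are the ones ruibinghao used; the final numbers in a PR may well differ):

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 64Mi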

@sallyom

sallyom commented May 14, 2021

Nice! I was seeing exactly the same with OpenShift (the no endpoints available for service "opentelemetry-operator-webhook-service" error) and fixed it by updating the limits and requests in the deployment for the manager container. :)

@krak3n

krak3n commented May 21, 2021

Managed to get some time debugging this today. I set myself up a private GKE cluster and was able to reproduce the context deadline exceeded issue. I can confirm that this is due to port 9443 not being open to the master.

I created this firewall rule for the master and everything works as it should:

gcloud compute firewall-rules create cert-manager-9443 \
  --source-ranges ${GKE_MASTER_CIDR} \
  --target-tags ${GKE_MASTER_TAG}  \
  --allow TCP:9443

Maybe some documentation should be added regarding private clusters to ensure these ports are open.

Edit: 8443 not needed, just 9443.
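
For others setting this up on a private GKE cluster, the values for such a rule can be looked up like this (CLUSTER_NAME, ZONE and NODE_NAME are placeholders; the source range is the control plane's CIDR, and the target tag is normally the network tag on the nodes that run the operator):

$ gcloud container clusters describe CLUSTER_NAME --zone ZONE \
    --format='value(privateClusterConfig.masterIpv4CidrBlock)'
$ gcloud compute instances describe NODE_NAME --zone ZONE \
    --format='value(tags.items)'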

@jpkrohling
Member

@krak3n, would you be able to document this in the place you'd expect to see this documented?

@krak3n

krak3n commented May 25, 2021

Sure will do, assign to me 😄

@jpkrohling assigned krak3n and unassigned jpkrohling on May 26, 2021
@jpkrohling added the documentation (Improvements or additions to documentation) label and removed needs-info on May 26, 2021
@Mariusko82

Mariusko82 commented Jun 8, 2021

Hi,
I am receiving this on OpenShift when I try to create an instance from the OpenTelemetryCollector operator:
Danger alert: An error occurred
Error "failed calling webhook "mopentelemetrycollector.kb.io": Post https://opentelemetry-operator-controller-manager-service.acn-dev02.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s: dial tcp 10.131.4.237:443: connect: connection refused" for field "undefined".

Logs from opentelemetry operator:

[root@localhost ~]# oc logs opentelemetry-operator-controller-manager-788f5d5866-rbmwh manager
{"level":"info","ts":1622624370.50199,"msg":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.17.1","opentelemetry-collector":"0.17.0","build-date":"2020-12-17T11:40:23Z","go-version":"go1.13.15","go-arch":"amd64","go-os":"linux"}
{"level":"info","ts":1622624370.5027392,"logger":"setup","msg":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
I0602 08:59:31.553349       1 request.go:655] Throttling request took 1.036648884s, request: GET:https://172.30.0.1:443/apis/iam.policies.ibm.com/v1alpha1?timeout=32s
{"level":"info","ts":1622624377.758386,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1622624377.7587585,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1622624377.7588506,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1622624377.758937,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1622624377.758973,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":1622624377.7590606,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":1622624377.7590933,"logger":"setup","msg":"starting manager"}
I0602 08:59:37.759287       1 leaderelection.go:243] attempting to acquire leader lease acn-dev02/9f7554c3.opentelemetry.io...
{"level":"info","ts":1622624377.760516,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1622624377.760666,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1622624377.7697167,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1622624377.7699249,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1622624377.7700403,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0602 08:59:55.178474       1 leaderelection.go:253] successfully acquired lease acn-dev02/9f7554c3.opentelemetry.io
{"level":"info","ts":1622624395.1786876,"logger":"upgrade","msg":"looking for managed instances to upgrade"}
{"level":"info","ts":1622624395.1786964,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624395.2792726,"logger":"upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":1622624395.2792797,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624396.38059,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624396.482058,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624396.5835063,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624396.7847738,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting EventSource","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","source":"kind source: /, Kind="}
{"level":"info","ts":1622624396.885339,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting Controller","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector"}
{"level":"info","ts":1622624396.8854218,"logger":"controller-runtime.manager.controller.opentelemetrycollector","msg":"Starting workers","reconciler group":"opentelemetry.io","reconciler kind":"OpenTelemetryCollector","worker count":1}

@jpkrohling
Member

I'm closing this, as the original report has been fixed.

@CyberHippo
Contributor Author

Thanks again for all your kind help @jpkrohling

@dlouvier

dlouvier commented Jan 3, 2022

Managed to get some time debugging this today. I set myself up a private GKE cluster and was able to reproduce the context deadline exceeded issue. I can confirm that this is due to port 9443 not being open to the master.

I created this firewall rule for the master and everything works as it should:

gcloud compute firewall-rules create cert-manager-9443 \
  --source-ranges ${GKE_MASTER_CIDR} \
  --target-tags ${GKE_MASTER_TAG}  \
  --allow TCP:9443

Maybe some documentation should be added regarding private clusters to ensure these ports are open.

Edit: 8443 not needed, just 9443.

Thank you!

This also resolves the issue in GKE Autopilot!

@dbacinski

Related issue to document this: #1009
