
Webhook timeout when creating an Instrumentation on GKE private cluster #1009

Closed
CyberHippo opened this issue Jul 28, 2022 · 8 comments · Fixed by #1010
Labels
area:auto-instrumentation Issues for auto-instrumentation

Comments

@CyberHippo
Contributor

Hi,
When creating an Instrumentation resource on a private GKE cluster I encounter the following error:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "minstrumentation.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-instrumentation?timeout=10s": context deadline exceeded

This issue is similar to a known ingress-nginx issue on private GKE clusters and is caused by the default GCP firewall rules of private GKE clusters.
I would suggest adding a note about it in the documentation (similar to kubernetes/ingress-nginx#5487) to help other users of this great operator.

Steps to reproduce

Install the operator in an existing private GKE cluster where cert-manager is installed:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Once the opentelemetry-operator deployment is ready, create an OpenTelemetry Instrumentation resource:

kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
EOF

Solution

You will need to either add a firewall rule that allows the control plane (master) nodes to reach port 9443/tcp on the worker nodes, or extend the existing rule that allows access to ports 80/tcp, 443/tcp and 10254/tcp so that it also covers port 9443/tcp. You can confirm the webhook port by inspecting the webhook service's endpoints:

kubectl get endpoints --namespace opentelemetry-operator-system opentelemetry-operator-webhook-service
NAME                                     ENDPOINTS         AGE
opentelemetry-operator-webhook-service   10.56.6.98:9443   60m
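For the first option, a firewall rule along these lines should work. This is a sketch, not a command taken from the thread: the rule name is arbitrary, and MASTER_CIDR, NETWORK and NODE_TAG are placeholders you must replace with your cluster's control-plane CIDR, VPC network and node network tag (visible via `gcloud container clusters describe`).

```shell
# Allow the GKE control plane to reach the operator's webhook port (9443/tcp)
# on the worker nodes. All capitalized values are placeholders.
gcloud compute firewall-rules create allow-otel-operator-webhook \
  --network NETWORK \
  --direction INGRESS \
  --source-ranges MASTER_CIDR \
  --target-tags NODE_TAG \
  --allow tcp:9443
```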

I could open a PR to add a note in the operator documentation about this issue if needed.

@pavolloffay pavolloffay added the area:auto-instrumentation Issues for auto-instrumentation label Jul 28, 2022
@pavolloffay
Member

@CyberHippo would you like to document it?

@CyberHippo
Contributor Author

@pavolloffay Yes, I will work on a PR to document it.

@kamalmarhubi

An alternative that works is to use port 10250 for the webhook. That way the built-in firewall rule allows the traffic. I tested this and it works nicely.

@pavolloffay
Member

10250

Does this apply only for GKE or is it a commonly used port for webhooks?

@kamalmarhubi

@pavolloffay it's actually the port that the API server uses to talk to the kubelet. I think that, because of how firewall rules interact with alias IP ranges (used in VPC-native networking), the API server subnet ends up able to reach the pods on that port. Using it for webhooks seems to be a handy workaround for GKE private clusters, without needing to add or edit firewall rules for every webhook you add.

For private clusters, GKE creates firewall rules that allow access on 443 and 10250. You can run the command here to see that rule: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#view_firewall_rules.
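The automatically created rules can be listed with the command from the linked GKE docs. CLUSTER_NAME below is a placeholder; the rules GKE creates are prefixed with `gke-CLUSTER_NAME`.

```shell
# List the firewall rules GKE created for the cluster; the rule covering
# control-plane-to-node traffic should show 443 and 10250 as allowed ports.
gcloud compute firewall-rules list \
  --filter 'name~^gke-CLUSTER_NAME'
```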

@dbacinski

dbacinski commented Nov 3, 2022

I had to modify the existing GKE firewall rule in the Google Cloud console and add port 9443.

[Screenshot (2022-11-03): the GKE firewall rule edited in the Cloud console to also allow port 9443]
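The same console edit can be sketched with gcloud. RULE_NAME is a placeholder for the automatically created control-plane rule; note that `--allow` replaces the whole allowed list, so the existing ports must be repeated.

```shell
# Extend the existing control-plane firewall rule to include the
# operator's webhook port. RULE_NAME is a placeholder.
gcloud compute firewall-rules update RULE_NAME \
  --allow tcp:443,tcp:10250,tcp:9443
```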

Now creating resources works:

$ kubectl apply -f collector.yaml
opentelemetrycollector.opentelemetry.io/apm-sidecar created
$ kubectl apply -f instrumentation.yaml 
instrumentation.opentelemetry.io/apm-instrumentation created

Before, it was failing with:

$ kubectl apply -f collector.yaml
Error from server (InternalError): error when creating "collector.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

@sreejesh-radhakrishnan-db

Hi, has this been added to the documentation? If so, could you let me know where I can read more about it? I hit the same issue and worked around it by repurposing a port (open to the master) that had been opened for Istio. It would be good to see this in the prerequisites section of the documentation if possible, to save others a lot of searching for the fix :-)

@Minivolk02

Hi, I have the same issue. I added a firewall rule for all ports and networks, but I still hit this error.
