OpTel and Jaeger integration is not working properly #492

Closed
jvanz opened this issue Jul 26, 2024 · 4 comments
Labels: area/observability, kind/bug
Milestone: 1.15

Comments
jvanz (Member) commented Jul 26, 2024

During the testing of the Kubewarden v1.15.0 release candidates we noticed that the Kubewarden integration with our observability stack is not working as expected with the latest OpTel and Jaeger Helm chart versions. The configuration currently used in our Helm charts to deploy the Kubewarden stack appears to be broken: the OpTel collector is not able to find the Jaeger service to send tracing data to. We need to investigate whether there is a compatibility issue between the latest Jaeger and OpTel versions, or whether we just need to update the configuration in our Helm charts.

These are the versions in use during the RC testing with the latest charts (helm list output: release, namespace, revision, updated, status, chart, app version):

cert-manager                    cert-manager    1               2024-07-26 09:30:01.528778837 -0300 -03 deployed        cert-manager-v1.13.1            v1.13.1    
jaeger-operator                 jaeger          1               2024-07-26 09:33:31.375358011 -0300 -03 deployed        jaeger-operator-2.54.0          1.57.0     
kubewarden-controller           kubewarden      1               2024-07-26 09:43:30.764117351 -0300 -03 deployed        kubewarden-controller-2.3.0-rc2 v1.15.0-rc2
kubewarden-crds                 kubewarden      1               2024-07-26 09:43:27.75106705 -0300 -03  deployed        kubewarden-crds-1.7.0-rc2       v1.15.0-rc2
kubewarden-defaults             kubewarden      1               2024-07-26 09:43:50.543045482 -0300 -03 deployed        kubewarden-defaults-2.2.0-rc2   v1.15.0-rc2
my-opentelemetry-operator       open-telemetry  1               2024-07-26 09:31:27.972671429 -0300 -03 deployed        opentelemetry-operator-0.64.4   0.103.0    
prometheus                      prometheus      1               2024-07-26 09:36:12.743431631 -0300 -03 deployed        kube-prometheus-stack-61.3.2    v0.75.1    

Issue(s) found:

  • The OpTel collector is not able to connect to Jaeger and send tracing data. The service configured in our collector, my-open-telemetry-collector.jaeger.svc.cluster.local:4317, does not exist (see the diagnostic sketch after the logs):
2024-07-26T14:04:14.693Z        info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "21.017114373s"}
2024-07-26T14:04:17.424Z        info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "25.895047337s"}
2024-07-26T14:04:21.184Z        info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "19.77908214s"}
2024-07-26T14:04:35.711Z        info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "21.097872688s"}
2024-07-26T14:04:38.278Z        error   exporterhelper/queue_sender.go:90       Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "no more retries left: rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "dropped_items": 28}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
        go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
2024-07-26T14:04:40.963Z        info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "17.110600816s"}
2024-07-26T14:04:43.320Z        error   exporterhelper/queue_sender.go:90       Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "no more retries left: rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
        go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
2024-07-26T14:04:56.809Z        error   exporterhelper/queue_sender.go:90       Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "no more retries left: rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "dropped_items": 128}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
        go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
2024-07-26T14:04:58.075Z        error   exporterhelper/queue_sender.go:90       Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "traces", "name": "otlp/jaeger", "error": "no more retries left: rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "dropped_items": 25}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
        go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
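A quick diagnostic sketch (not part of any chart) to confirm the failure mode, assuming the namespaces shown in the helm list above:

# The otlp/jaeger exporter targets my-open-telemetry-collector.jaeger.svc.cluster.local:4317,
# i.e. a Service named "my-open-telemetry-collector" in the "jaeger" namespace exposing
# port 4317. "produced zero addresses" from the gRPC name resolver is consistent with
# that Service (or its endpoints) being absent.
kubectl get svc -n jaeger
kubectl get svc -n jaeger my-open-telemetry-collector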

Acceptance criteria

  • Discover whether there is a compatibility issue between Jaeger and OpTel.
  • If there are incompatibilities between Jaeger and OpTel, find the latest versions that work with our current configuration and update the docs, changing the maximum versions to be used for both dependencies.
  • If there are no compatibility issues between Jaeger and OpTel, update our Helm chart or any other component necessary to make the integration work again.
  • Tracing data should be visible in the Jaeger UI using the required versions (considering the previous acceptance criteria).
jvanz added this to the Kubewarden project Jul 26, 2024
jvanz added the kind/bug and area/observability labels Jul 26, 2024
jvanz moved this to Todo in Kubewarden Jul 26, 2024
jvanz added this to the 1.15 milestone Jul 26, 2024
viccuad (Member) commented Jul 26, 2024

Adding info:
Kubewarden 1.14 with the OTel stack dependencies at the versions listed on docs.kubewarden.io works fine, both tracing and metrics. That is, opentelemetry-operator-0.56.0, jaeger-operator-2.49.0, kube-prometheus-stack-51.5.3.
These versions are rather old, though.

kkaempf commented Jul 26, 2024

is not working as expected

Please, more details.

  • what's the failed test case ?
  • how does it fail ?
  • what is the expected output ?
  • logs ?

jvanz (Member, Author) commented Jul 26, 2024

is not working as expected

Please, more details.

* what's the failed test case ?

None, this is not spotted by test cases.

* how does it fail ?

The OpTel collector is not able to find the Jaeger service to send some tracing data

I've rephrased that in a separate section in the issue description.

* what is the expected output ?

Acceptance criteria updated to make that clearer.

* logs?

Description updated

viccuad (Member) commented Jul 29, 2024

After fixing kubewarden/policy-server#847, testing with 1.15.0-rc2, policy-server:latest, and the opentelemetry stack versions from docs.kubewarden.io shows that everything is working as expected :).

With the latest opentelemetry stack and 1.15.0-rc2, with policy-server:latest, everything works too.
However, we hit jaegertracing/helm-charts#549. The workaround is to expand the ClusterRole jaeger-operator with get and list permissions for ingressclasses; a sketch follows below. I will add the workaround to the e2e tests.
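A minimal sketch of that workaround, assuming the ClusterRole is indeed named jaeger-operator as mentioned above and that IngressClass lives in the networking.k8s.io API group:

# Workaround sketch: append a rule granting get/list on ingressclasses to the
# existing jaeger-operator ClusterRole (adjust the name if your install differs).
kubectl patch clusterrole jaeger-operator --type=json -p '[
  {"op": "add", "path": "/rules/-", "value":
    {"apiGroups": ["networking.k8s.io"], "resources": ["ingressclasses"], "verbs": ["get", "list"]}}
]'

A JSON-patch add on /rules/- leaves the operator's existing rules untouched, which makes it easy to apply (and revert) from the e2e tests.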

I consider this done.

What to expect when testing:


On Jaeger:
There must be a service kubewarden-policy-server that exposes 6 operations (validate_settings, validate, validation, audit, request, policy_log). These come from the policy-server.

On Prometheus:
kubewarden_policy_evaluation_latency_milliseconds_sum is present (created by policy-server)
kubewarden_policy_total is present (created by kubewarden-controller)

On Grafana:
With the default policies installed, targeting a ClusterAdmissionPolicy (note the clusterwide- prefix) such as clusterwide-no-privilege-escalation, and after an audit scanner run, each metric has some values. A quick verification sketch for the Prometheus and Jaeger checks follows.
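A rough command-line version of those checks; the Service names used for the port-forwards depend on the Helm release names and on the Jaeger instance name, so they are assumptions to adjust to your install.

# Verification sketch. Service names are assumptions based on common defaults
# for kube-prometheus-stack (release "prometheus") and an operator-managed
# Jaeger instance; adjust them to your cluster.
kubectl -n prometheus port-forward svc/prometheus-kube-prometheus-prometheus 9090 &
# Both Kubewarden metrics should return at least one series:
curl -s 'http://localhost:9090/api/v1/query?query=kubewarden_policy_total'
curl -s 'http://localhost:9090/api/v1/query?query=kubewarden_policy_evaluation_latency_milliseconds_sum'

# The Jaeger query API should list the kubewarden-policy-server service:
kubectl -n jaeger port-forward svc/<jaeger-instance>-query 16686 &
curl -s 'http://localhost:16686/api/services'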
