
Customize the Liveness Probe of the Jaeger Ingester by extending the CRD #1539

Closed
ricoberger opened this issue Aug 19, 2021 · 0 comments · Fixed by #1605

Comments

@ricoberger
Contributor

Requirement - what kind of business use case are you trying to solve?

I'm trying out the Jaeger Operator v1.25.0 with the streaming strategy and the Jaeger ClickHouse storage plugin. It can happen that the plugin in the Ingester Pod is killed with the following error message when the container doesn't have enough resources:

2021-08-19T15:52:11.780Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=11 error="signal: killed"

In that case the Ingester Pod is still running but no longer writes any messages from Kafka to ClickHouse. Therefore it would be good if we could adjust the liveness probe of the Pod, so that the Pod is restarted when it isn't working anymore.

Problem - what in Jaeger blocks you from solving the requirement?

When the messages from Kafka are not consumed anymore because the plugin was killed, we currently have to restart the Pod manually. A better approach would be for the Pod to be restarted automatically.

For that we would need to be able to adjust the liveness probe of the Pod via the created Jaeger CR, which is currently not possible.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Extend the Jaeger CRD with a new field where a user can set a custom liveness probe for the Jaeger Ingester:

---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: streaming

  ingester:
    autoscale: false
    replicas: 5
    resources:
      limits:
        cpu: 2000m
        memory: 1024Mi
      requests:
        cpu: 500m
        memory: 256Mi

    # First approach: Add a liveness probe to check the Port of the storage plugin
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /metrics # Maybe a health check endpoint can be added to the plugin, which can then be used instead of the metrics endpoint
        port: 9090
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 1

    # Second approach: Check if the process is running
    livenessProbe:
      exec:
        command:
          - sh
          - -ec
          - ps -ef | grep "/plugin/jaeger-clickhouse --config /plugin-config/config.yaml" | grep -v grep

    options:
      log-level: warn
      ingester:
        deadlockInterval: 300s
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: kafka-kafka-0.kafka-kafka-brokers.tracing.svc.cluster.local:9092,kafka-kafka-1.kafka-kafka-brokers.tracing.svc.cluster.local:9092,kafka-kafka-2.kafka-kafka-brokers.tracing.svc.cluster.local:9092
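
For illustration, a rough sketch of how the new field could look in the operator's Go types. This is not the actual jaeger-operator code: the name and location of the Ingester spec type and its existing fields are assumptions here, and only the proposed field is shown.

// Sketch only: existing fields of the Ingester spec are omitted.
package v1

import corev1 "k8s.io/api/core/v1"

// JaegerIngesterSpec defines the options for the Jaeger Ingester.
type JaegerIngesterSpec struct {
	// LivenessProbe, when set, replaces the liveness probe of the Ingester
	// container that the operator would otherwise generate.
	// +optional
	LivenessProbe *corev1.Probe `json:"livenessProbe,omitempty"`
}

The operator would then copy this probe onto the Ingester container when building the Deployment, falling back to its current behaviour when the field is not set.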

Any open questions to address

Maybe this could also be checked by the Jaeger Ingester itself, so that the liveness probe of the Ingester fails when the storage plugin was killed. If you think this would be the better approach, I can also create an issue in the jaegertracing/jaeger repository. A rough sketch of that idea follows.
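
The sketch below assumes the ingester keeps a handle to the hashicorp go-plugin client that launched the storage plugin; the package name, handler, and endpoint are made up for illustration, while plugin.Client.Exited() is part of github.com/hashicorp/go-plugin.

// Sketch only, not the actual Jaeger ingester code.
package ingesterhealth

import (
	"net/http"

	plugin "github.com/hashicorp/go-plugin"
)

// PluginLivenessHandler returns 200 while the storage plugin subprocess is
// alive and 503 once it has exited, so a liveness probe pointed at it would
// restart the Pod after the plugin is killed.
func PluginLivenessHandler(c *plugin.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Exited reports whether the plugin process has terminated.
		if c.Exited() {
			http.Error(w, "storage plugin exited", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}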

Processes when everything looks fine
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      2:56 /go/bin/ingester-linux --grpc-storage-plugin.binary=/plugin/jaeger-clickhouse --grpc-storage-plugin.configuration-file=/plugin-config/config.yaml --grpc-storage-plugin.log-level=debug --ingester.deadlockInterval=300s --kafk
   12 root      3:16 /plugin/jaeger-clickhouse --config /plugin-config/config.yaml
   35 root      0:00 sh
   40 root      0:00 ps -ef
Processes after the storage plugin was killed
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      1:29 /go/bin/ingester-linux --grpc-storage-plugin.binary=/plugin/jaeger-clickhouse --grpc-storage-plugin.configuration-file=/plugin-config/config.yaml --grpc-storage-plugin.log-level=debug --ingester.deadlockInterval=300s --kafk
   33 root      0:00 sh
   38 root      0:00 ps -ef
Jaeger Ingester logs
2021/08/19 15:51:35 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
2021-08-19T15:51:35.174Z [DEBUG] starting plugin: path=/plugin/jaeger-clickhouse args=["/plugin/jaeger-clickhouse", "--config", "/plugin-config/config.yaml"]
2021-08-19T15:51:35.177Z [DEBUG] plugin started: path=/plugin/jaeger-clickhouse pid=11
2021-08-19T15:51:35.178Z [DEBUG] waiting for RPC address: path=/plugin/jaeger-clickhouse
2021-08-19T15:51:35.198Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_index_local ON CLUSTER '{cluster}'
  | (
  |     timestamp  DateTime CODEC (Delta, ZSTD(1)),
  |     traceID    String CODEC (ZSTD(1)),
  |     service    LowCardinality(String) CODEC (ZSTD(1)),
  |     operation  LowCardinality(String) CODEC (ZSTD(1)),
  |     durationUs UInt64 CODEC (ZSTD(1)),
  |     tags       Array(String) CODEC (ZSTD(1)),
  |     INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64,
  |     INDEX idx_duration durationUs TYPE minmax GRANULARITY 1
  | ) ENGINE ReplicatedMergeTree
  |       TTL timestamp + INTERVAL 3 DAY DELETE
  |       PARTITION BY toDate(timestamp)
  |       ORDER BY (service, -toUnixTimestamp(timestamp))
  |       SETTINGS index_granularity = 1024;
   timestamp=2021-08-19T15:51:35.198Z
2021-08-19T15:51:35.472Z [DEBUG] jaeger-clickhouse: Running SQL statement:
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_spans_local ON CLUSTER '{cluster}'
  | (
  |     timestamp DateTime CODEC (Delta, ZSTD(1)),
  |     traceID   String CODEC (ZSTD(1)),
  |     model     String CODEC (ZSTD(3))
  | ) ENGINE ReplicatedMergeTree
  |       TTL timestamp + INTERVAL 3 DAY DELETE
  |       PARTITION BY toDate(timestamp)
  |       ORDER BY traceID
  |       SETTINGS index_granularity = 1024;
   @module=jaeger-clickhouse timestamp=2021-08-19T15:51:35.472Z
2021-08-19T15:51:35.797Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE MATERIALIZED VIEW IF NOT EXISTS jaeger_operations_local ON CLUSTER '{cluster}'
  |         ENGINE ReplicatedMergeTree
  |             TTL date + INTERVAL 3 DAY DELETE
  |             PARTITION BY toYYYYMM(date) ORDER BY (date, service, operation)
  |             SETTINGS index_granularity=32
  |         POPULATE
  | AS SELECT toDate(timestamp) AS date,
  |           service,
  |           operation,
  |           count()           as count
  |    FROM jaeger.jaeger_index_local -- here goes local index table
  |    GROUP BY date, service, operation;
   timestamp=2021-08-19T15:51:35.797Z
2021-08-19T15:51:36.149Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_spans_archive_local ON CLUSTER '{cluster}'
  | (
  |     timestamp DateTime CODEC (Delta, ZSTD(1)),
  |     traceID   String CODEC (ZSTD(1)),
  |     model     String CODEC (ZSTD(3))
  | ) ENGINE ReplicatedMergeTree
  |       TTL timestamp + INTERVAL 3 DAY DELETE
  |       PARTITION BY toYYYYMM(timestamp)
  |       ORDER BY traceID
  |       SETTINGS index_granularity = 1024
   timestamp=2021-08-19T15:51:36.149Z
2021-08-19T15:51:36.484Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_spans -- global table name
  |     ON CLUSTER '{cluster}' AS jaeger.jaeger_spans_local -- local table name
  |     ENGINE = Distributed('{cluster}', jaeger, jaeger_spans_local, cityHash64(traceID)); -- local table name
   timestamp=2021-08-19T15:51:36.483Z
2021-08-19T15:51:36.852Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_index -- global table name
  |     ON CLUSTER '{cluster}' AS jaeger.jaeger_index_local -- local table name
  |     ENGINE = Distributed('{cluster}', jaeger, jaeger_index_local, cityHash64(traceID)); -- local table name
   timestamp=2021-08-19T15:51:36.851Z
2021-08-19T15:51:37.322Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_spans_archive -- global table name
  |     ON CLUSTER '{cluster}' AS jaeger.jaeger_spans_archive_local -- local table name
  |     ENGINE = Distributed('{cluster}', jaeger, jaeger_spans_archive_local, cityHash64(traceID)); -- local table name
   timestamp=2021-08-19T15:51:37.321Z
2021-08-19T15:51:37.663Z [DEBUG] jaeger-clickhouse: Running SQL statement: @module=jaeger-clickhouse
  statement=
  | CREATE TABLE IF NOT EXISTS jaeger_operations -- operations table
  |     ON CLUSTER '{cluster}' AS jaeger.jaeger_operations_local -- local operations table
  |     ENGINE = Distributed('{cluster}', jaeger, jaeger_operations_local, rand()); -- local operations table
   timestamp=2021-08-19T15:51:37.662Z
2021-08-19T15:51:37.978Z [DEBUG] using plugin: version=1
2021-08-19T15:51:37.979Z [DEBUG] jaeger-clickhouse: plugin address: address=/tmp/plugin792476791 network=unix timestamp=2021-08-19T15:51:37.977Z
2021-08-19T15:52:11.780Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=11 error="signal: killed"
{"level":"error","ts":1629388331.78125,"caller":"grpclog/component.go:79","msg":"[transport]transport: loopyWriter.run returning. Err: write unix @->/tmp/plugin792476791: write: broken pipe","system":"grpc","grpc_log":true,"stacktrace":"google.golang.org/grpc/grpclog.(*componentData).Errorf\n\tgoogle.golang.org/[email protected]/grpclog/component.go:79\ngoogle.golang.org/grpc/internal/transport.newHTTP2Client.func3\n\tgoogle.golang.org/[email protected]/internal/transport/http2_client.go:399"}
2021-08-19T15:52:11.784Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Canceled desc = context canceled"
{"level":"warn","ts":1629388331.7952156,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {unused unused <nil> 0 <nil>}. Err: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin792476791: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}