From 4238012c349ab3346a82ec49eebbb0d515e848e0 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 22 Jul 2020 14:39:09 -0700 Subject: [PATCH] docs: new monitoring and metrics docs (DOCS-3693) (#5808) * docs: first draft of new monitoring docs * docs: delete old docs out of place * docs: first set of metrics * docs: fix text * docs: fix text * docs: fix copy * docs: more metrics wip * docs: full list of metrics * docs: enable PQ metrics by default * docs: update rates * Apply suggestions from code review Co-authored-by: Jim Galasyn * Apply suggestions from code review Co-authored-by: Jim Galasyn * docs: fix link Co-authored-by: Jim Galasyn --- docs/operate-and-deploy/index.md | 40 --- .../installation/server-config/index.md | 56 --- docs/operate-and-deploy/monitoring.md | 79 +++++ docs/reference/metrics.md | 332 ++++++++++++++++++ mkdocs.yml | 3 + 5 files changed, 414 insertions(+), 96 deletions(-) create mode 100644 docs/operate-and-deploy/monitoring.md create mode 100644 docs/reference/metrics.md diff --git a/docs/operate-and-deploy/index.md b/docs/operate-and-deploy/index.md index ce66d5191879..16534e2ce5c4 100644 --- a/docs/operate-and-deploy/index.md +++ b/docs/operate-and-deploy/index.md @@ -51,46 +51,6 @@ Troubleshooting If ksqlDB isn't behaving as expected, see [Troubleshoot ksqlDB issues](../troubleshoot-ksqldb.md) -Monitoring and Metrics ----------------------- - -ksqlDB includes JMX (Java Management Extensions) metrics which give -insights into what is happening inside your ksqlDB servers. These metrics -include the number of messages, the total throughput, throughput -distribution, error rate, and more. - -To enable JMX metrics, set `JMX_PORT` before starting the ksqlDB server: - -```bash -export JMX_PORT=1099 && \ -/bin/ksql-server-start /etc/ksqldb/ksql-server.properties -``` - -The `ksql-print-metrics` command line utility collects these metrics and -prints them to the console. 
You can invoke this utility from your -terminal: - -```bash -/bin/ksql-print-metrics -``` - -Your output should resemble: - -``` -messages-consumed-avg: 96416.96196183885 -messages-consumed-min: 88900.3329377909 -error-rate: 0.0 -num-persistent-queries: 2.0 -messages-consumed-per-sec: 193024.78294586178 -messages-produced-per-sec: 193025.4730374501 -num-active-queries: 2.0 -num-idle-queries: 0.0 -messages-consumed-max: 103397.81191436431 -``` - -For more information about {{ site.kstreams }} metrics, see -[Monitoring Streams Applications](https://docs.confluent.io/current/streams/monitoring.html). - Next Steps ---------- diff --git a/docs/operate-and-deploy/installation/server-config/index.md b/docs/operate-and-deploy/installation/server-config/index.md index e859725560c9..6161bc26ea52 100644 --- a/docs/operate-and-deploy/installation/server-config/index.md +++ b/docs/operate-and-deploy/installation/server-config/index.md @@ -171,62 +171,6 @@ JAVA_HOME export JAVA_HOME= ``` -JMX Metrics ------------ - -To enable JMX metrics, set `JMX_PORT` before starting the ksqlDB server: - -```bash -export JMX_PORT=1099 && \ -/bin/ksql-server-start /etc/ksqldb/ksql-server.properties -``` - -Run the `ksql-print-metrics` tool to see the available JMX metrics for -ksqlDB. 
- -```bash -/bin/ksql-print-metrics -``` - -Your output should resemble: - -``` - _confluent-ksql-default_bytes-consumed-total: 926543.0 - _confluent-ksql-default_num-active-queries: 4.0 - _confluent-ksql-default_ksql-engine-query-stats-RUNNING-queries: 4 - _confluent-ksql-default_ksql-engine-query-stats-NOT_RUNNING-queries: 0 - _confluent-ksql-default_messages-consumed-min: 0.0 - _confluent-ksql-default_messages-consumed-avg: 29.48784732897881 - _confluent-ksql-default_num-persistent-queries: 4.0 - _confluent-ksql-default_ksql-engine-query-stats-ERROR-queries: 0 - _confluent-ksql-default_num-idle-queries: 0.0 - _confluent-ksql-default_messages-consumed-per-sec: 105.07699698626074 - _confluent-ksql-default_messages-produced-per-sec: 11.256903025105757 - _confluent-ksql-default_error-rate: 0.0 - _confluent-ksql-default_ksql-engine-query-stats-PENDING_SHUTDOWN-queries: 0 - _confluent-ksql-default_ksql-engine-query-stats-REBALANCING-queries: 0 - _confluent-ksql-default_messages-consumed-total: 10503.0 - _confluent-ksql-default_ksql-engine-query-stats-CREATED-queries: 0 - _confluent-ksql-default_messages-consumed-max: 100.1243737430132 -``` - -The following table describes the available ksqlDB metrics. - -| JMX Metric | Description | -| ------------------------- | -------------------------------------------------------------------------------------------------- | -| bytes-consumed-total | Number of bytes consumed across all queries. | -| error-rate | Number of messages that have been consumed but not processed across all queries. | -| messages-consumed-avg | Average number of messages consumed by a query per second. | -| messages-consumed-per-sec | Number of messages consumed per second across all queries. | -| messages-consumed-min | Number of messages consumed per second for the query with the fewest messages consumed per second. | -| messages-consumed-max | Number of messages consumed per second for the query with the most messages consumed per second. 
| -| messages-consumed-total | Number of messages consumed across all queries. | -| messages-produced-per-sec | Number of messages produced per second across all queries. | -| num-persistent-queries | Number of persistent queries that are currently executing. | -| num-active-queries | Number of queries that are actively processing messages. | -| num-idle-queries | Number of queries with no messages available to process. | - - Non-interactive (Headless) ksqlDB Usage --------------------------------------- diff --git a/docs/operate-and-deploy/monitoring.md b/docs/operate-and-deploy/monitoring.md new file mode 100644 index 000000000000..dfd08622c766 --- /dev/null +++ b/docs/operate-and-deploy/monitoring.md @@ -0,0 +1,79 @@ +# Monitoring + +## Context + +ksqlDB publishes metrics via JMX ([Java Management +Extensions](https://www.oracle.com/java/technologies/javase/javamanagement.html)) +which help you monitor what is happening inside of ksqlDB's servers. For a +comprehensive list of metrics, see [the reference section](../reference/metrics.md). + +## Enable monitoring + +You must enable monitoring explicitly on each ksqlDB server. To enable +it in a Docker-based deployment, export an environment variable named +`KSQL_JMX_OPTS` with your JMX configuration and expose the port that +JMX will communicate over. + +The following Docker Compose example shows how you can configure +monitoring for ksqlDB server. The surrounding components, like the +broker and CLI, are omitted for brevity. You can see an example of a +complete setup in the [quickstart](https://ksqldb.io/quickstart.html). 
+ +```yaml +ksqldb-server: + image: confluentinc/ksqldb-server:0.10.1 + hostname: ksqldb-server + container_name: ksqldb-server + depends_on: + - broker + - schema-registry + ports: + - "8088:8088" + - "1099:1099" + environment: + KSQL_LISTENERS: "http://0.0.0.0:8088" + KSQL_BOOTSTRAP_SERVERS: "broker:9092" + KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081" + KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true" + KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true" + KSQL_KSQL_QUERY_PULL_METRICS_ENABLED: "true" + KSQL_JMX_OPTS: > + -Djava.rmi.server.hostname=localhost + -Dcom.sun.management.jmxremote + -Dcom.sun.management.jmxremote.port=1099 + -Dcom.sun.management.jmxremote.authenticate=false + -Dcom.sun.management.jmxremote.ssl=false + -Dcom.sun.management.jmxremote.rmi.port=1099 +``` + +With respect to monitoring, here is what this does: + +- The environment variable `KSQL_JMX_OPTS` is supplied to the server + with various arguments. The `>` character lets you write a + multi-line string in YAML, which makes this long argument easier to + read. The advertised hostname, port, and security settings are + configured. JMX has a wide range of [configuration + options](https://docs.oracle.com/javase/8/docs/technotes/guides/management/agent.html), + and you can set these however you like. + +- Port `1099` is exposed, which corresponds to the JMX port set in the + `KSQL_JMX_OPTS` configuration. This enables remote monitoring tools + to communicate with ksqlDB's process. + +## Verifying your monitoring setup + +An easy way to check that ksqlDB is properly emitting metrics is by +using `jconsole`. JConsole is a graphical tool for monitoring +the JVM, and it ships by default with Oracle JDK installations. + +On your host machine, run the command: + +```bash +jconsole +``` + +You will be prompted for a host and port. If you used the example +configuration, `localhost:1099` establishes the connection. 
You +should see a series of graphs showing resource utilization. If you +don't, make sure the networking between your machine and the Docker +container is configured correctly. diff --git a/docs/reference/metrics.md b/docs/reference/metrics.md new file mode 100644 index 000000000000..4a1b32cfeed6 --- /dev/null +++ b/docs/reference/metrics.md @@ -0,0 +1,332 @@ +# Metrics + +ksqlDB emits a variety of JMX metrics to help you +[monitor](../operate-and-deploy/monitoring.md) what its servers are +doing. This reference describes each metric and grouping. + +## All persistent queries + +Metrics that describe the full set of persistent queries on a given server. + +``` +io.confluent.ksql.metrics:type=ksql-engine-query-stats +``` + +### Attributes + +**Number of persistent queries** + +`_confluent-ksql-default_num-persistent-queries` + +The current number of persistent queries running in this engine. + +**Number of active queries** + +`_confluent-ksql-default_num-active-queries` + +The current number of queries in this engine that are actively processing messages. + +**Number of running queries** + +`_confluent-ksql-default_ksql-engine-query-stats-RUNNING-queries` + +Count of queries in `RUNNING` state. + +**Number of idle queries** + +`_confluent-ksql-default_num-idle-queries` + +Number of queries with no messages available to process. + +**Number of not running queries** + +`_confluent-ksql-default_ksql-engine-query-stats-NOT_RUNNING-queries` + +Count of queries in `NOT_RUNNING` state. + +**Number of rebalancing queries** + +`_confluent-ksql-default_ksql-engine-query-stats-REBALANCING-queries` + +Count of queries in `REBALANCING` state. + +**Number of created queries** + +`_confluent-ksql-default_ksql-engine-query-stats-CREATED-queries` + +Count of queries in `CREATED` state. + +**Number of pending shutdown queries** + +`_confluent-ksql-default_ksql-engine-query-stats-PENDING_SHUTDOWN-queries` + +Count of queries in `PENDING_SHUTDOWN` state. 
+ +**Number of error queries** + +`_confluent-ksql-default_ksql-engine-query-stats-ERROR-queries` + +Count of queries in `ERROR` state. + +**Total bytes consumed** + +`_confluent-ksql-default_bytes-consumed-total` + +The total number of bytes consumed across all queries. + +**Minimum messages consumed** + +`_confluent-ksql-default_messages-consumed-min` + +The number of messages consumed per second by the query with the fewest messages consumed per second. + +**Maximum messages consumed** + +`_confluent-ksql-default_messages-consumed-max` + +The number of messages consumed per second by the query with the most messages consumed per second. + +**Average messages consumed** + +`_confluent-ksql-default_messages-consumed-avg` + +The average number of messages consumed by a query per second. + +**Messages consumed per second** + +`_confluent-ksql-default_messages-consumed-per-sec` + +The number of messages consumed per second across all queries. + +**Messages consumed total** + +`_confluent-ksql-default_messages-consumed-total` + +The total number of messages consumed across all queries. + +**Messages produced per second** + +`_confluent-ksql-default_messages-produced-per-sec` + +The number of messages produced per second across all queries. + +**Error rate** + +`_confluent-ksql-default_error-rate` + +The number of messages that were consumed but not processed. Messages may not be processed if, for instance, the message contents could not be deserialized due to an incompatible schema. Alternatively, a consumed message may not have been produced, and so was effectively dropped. Such messages are also counted toward the error rate. + +**Liveness indicator** + +`_confluent-ksql-default_liveness-indicator` + +A metric with constant value `1` indicating the server is up and emitting metrics. + +## Persistent query status + +Metrics that describe the health of each persistent query. + +``` +io.confluent.ksql.metrics:type=ksql-queries +``` + +### Attributes + +**Query status** + +`query-status` + +The current status of the given query. 
+ +**Error status** + +`error-status` + +The current error status of the given query, if the query is in the `ERROR` state. + +## Persistent query production + +Metrics that describe the producer activity of each persistent query. + +``` +io.confluent.ksql.metrics:type=producer-metrics +``` + +### Attributes + +**Total messages** + +`total-messages` + +The total number of messages produced. + +**Messages per second** + +`messages-per-sec` + +The number of messages produced per second. + +## Persistent query consumption + +Metrics that describe the consumer activity of each persistent query. + +``` +io.confluent.ksql.metrics:type=consumer-metrics +``` + +### Attributes + +**Total messages** + +`consumer-total-messages` + +The total number of messages consumed. + +**Messages per second** + +`consumer-messages-per-sec` + +The number of messages consumed per second. + +**Total bytes** + +`consumer-total-bytes` + +The total number of bytes consumed. + +## Runtime + +Because ksqlDB persistent queries directly compile into {{ +site.kstreams }} topologies, many useful [{{ site.kstreams }} +metrics](https://docs.confluent.io/current/streams/monitoring.html) +are emitted for each persistent query. These metrics are omitted from +this reference to avoid redundancy. + +## HTTP server + +ksqlDB's REST API is built on top of Vert.x, and consequently exposes +many [Vert.x +metrics](https://vertx.io/docs/vertx-dropwizard-metrics/java/) +directly. These metrics are omitted from this reference to avoid redundancy. + + +## Pull queries + +Metrics that describe the activity of pull queries on each server. + +``` +io.confluent.ksql.metrics:type=_confluent-ksql-default_pull-query +``` + +!!! info + Pull query metrics must be enabled explicitly by setting + the `ksql.query.pull.metrics.enabled` server configuration to `true`. + +### Attributes + +**Pull query total requests** + +`pull-query-requests-total` + +Total number of pull query requests. 
+ +**Pull query request rate** + +`pull-query-requests-rate` + +Rate of pull query requests per second. + +**Pull query requests error count** + +`pull-query-requests-error-total` + +Total number of erroneous pull query requests. + +**Pull query request error rate** + +`pull-query-requests-error-rate` + +Rate of erroneous pull query requests per second. + +**Local pull query requests count** + +`pull-query-requests-local` + +Count of local pull query requests. + +**Local pull query requests rate** + +`pull-query-requests-local-rate` + +Rate of local pull query requests per second. + +**Remote pull query requests count** + +`pull-query-requests-remote` + +Count of remote pull query requests. + +**Remote pull query requests rate** + +`pull-query-requests-remote-rate` + +Rate of remote pull query requests per second. + +**Pull query minimum request latency** + +`pull-query-requests-latency-latency-min` + +Min time for a pull query request. + +**Pull query maximum request latency** + +`pull-query-requests-latency-latency-max` + +Max time for a pull query request. + +**Pull query average request latency** + +`pull-query-requests-latency-latency-avg` + +Average time for a pull query request. + +**Pull query latency 50th percentile** + +`pull-query-requests-latency-distribution-50` + +Latency distribution of the 50th percentile. + +**Pull query latency 75th percentile** + +`pull-query-requests-latency-distribution-75` + +Latency distribution of the 75th percentile. + +**Pull query latency 90th percentile** + +`pull-query-requests-latency-distribution-90` + +Latency distribution of the 90th percentile. + +**Pull query latency 99th percentile** + +`pull-query-requests-latency-distribution-99` + +Latency distribution of the 99th percentile. + +## Command runner + +Metrics that describe the health of the `CommandRunner` thread, which +enables each node to participate in distributed computation. 
+ +``` +io.confluent.ksql.metrics:type=_confluent-ksql-default_ksql-rest-app-command-runner +``` + +### Attributes + +**Thread status** + +`status` + +The status of the `CommandRunner` thread as it processes the command topic. diff --git a/mkdocs.yml b/mkdocs.yml index a781408604af..0d0aeb3de4d9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -113,6 +113,7 @@ nav: - Scalar functions: developer-guide/ksqldb-reference/scalar-functions.md - Aggregation functions: developer-guide/ksqldb-reference/aggregate-functions.md - Table Functions: developer-guide/ksqldb-reference/table-functions.md + - Metrics: reference/metrics.md - REST API: - REST API Index: developer-guide/api.md # old reference topic, rename file to index.md - Introspect query status: developer-guide/ksqldb-rest-api/status-endpoint.md @@ -137,6 +138,7 @@ nav: - Configuration Parameter Reference: operate-and-deploy/installation/server-config/config-reference.md - Configure Security for ksqlDB: operate-and-deploy/installation/server-config/security.md - Upgrade ksqlDB: operate-and-deploy/installation/upgrading.md + - Monitoring: operate-and-deploy/monitoring.md - Plan Capacity: operate-and-deploy/capacity-planning.md - KSQL and ksqlDB: operate-and-deploy/ksql-vs-ksqldb.md - Changelog: operate-and-deploy/changelog.md @@ -150,6 +152,7 @@ nav: - Integrate with PostgreSQL: tutorials/connect-integration.md - Troubleshooting: troubleshoot-ksqldb.md - Frequently Asked Questions: faq.md + markdown_extensions: - toc: permalink: true