Skip to content

Latest commit

 

History

History
243 lines (191 loc) · 13.4 KB

2020-07-28-3642-jmx_rfc.md

File metadata and controls

243 lines (191 loc) · 13.4 KB

RFC 3642 - 2020-08-28 - Collecting metrics from the JVM via JMX

This RFC is to introduce a new metrics source to consume metrics from Java Virtual Machines (JVM) using the JMX protocol. The high level plan is to implement one source that collects metrics from JVM servers.

Background reading on JVM/JMX monitoring:

Scope

This RFC will cover:

  • A new source for JVM-based metrics using JMX.

Motivation

Users want to collect, transform, and forward metrics to better observe how their JVM-based applications are performing.

Internal Proposal

Build a single source called jmx_metrics (name to be confirmed) to collect JVM metrics.

The recommended implementation is to use a/the Rust JMX client to connect to a JVM server by an address specified in configuration.

This will require the end user to configure JMX on their JVM instances bound to an external port that can be queried by Vector.

Both the Telegraf and Prometheus JMX exporter provide a Java-based agent. Prometheus uses a self-maintained agent and Telegraf uses the Jolokia agent through a collector.

The Prometheus JMX exporter states:

This exporter is intended to be run as a Java Agent, exposing a HTTP server and serving metrics of the local JVM. It can be also run as an independent HTTP server and scrape remote JMX targets, but this has various disadvantages, such as being harder to configure and being unable to expose process metrics (e.g., memory and CPU usage). Running the exporter as a Java Agent is thus strongly encouraged.

This may require some testing as we proceed.

The query will return these metrics by parsing the query results and converting them into metrics:

  • jmx_up Used as an uptime metric (0/1)
  • jmx_config_reload_success_total (counter)
  • jmx_process_cpu_seconds_total (counter)
  • jmx_process_start_time_seconds (gauge)
  • jmx_process_open_fds (gauge)
  • jmx_process_max_fds (gauge)
  • jmx_jvm_threads_current (gauge)
  • jmx_jvm_threads_daemon (gauge)
  • jmx_jvm_threads_peak (gauge)
  • jmx_jvm_threads_started_total (counter)
  • jmx_jvm_threads_deadlocked (gauge)
  • jmx_jvm_threads_deadlocked_monitor (gauge)
  • jmx_jvm_threads_state tagged with state (gauge)
  • jmx_config_reload_failure_total (counter)
  • jmx_jvm_buffer_pool_used_bytes tagged with pool (gauge)
  • jmx_jvm_buffer_pool_capacity_bytes tagged with pool (gauge)
  • jmx_jvm_buffer_pool_used_buffers tagged with pool (gauge)
  • jmx_jvm_classes_loaded (gauge)
  • jmx_jvm_classes_loaded_total (counter)
  • jmx_jvm_classes_unloaded_total (counter)
  • jmx_java_lang_MemoryPool_UsageThresholdSupported tagged with name (gauge)
  • jmx_java_lang_Threading_ThreadContentionMonitoringEnabled (gauge)
  • jmx_java_lang_OperatingSystem_CommittedVirtualMemorySize (gauge)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageAfterGc_used tagged with name, key (counter?)
  • jmx_java_lang_Threading_ThreadContentionMonitoringSupported (gauge)
  • jmx_jvm_gc_collection_seconds tagged with gc (summary)
  • jmx_jvm_memory_bytes_used tagged with area (gauge)
  • jmx_jvm_memory_bytes_committed tagged with area (gauge)
  • jmx_jvm_memory_bytes_max tagged with area (gauge)
  • jmx_jvm_memory_bytes_init tagged with area (gauge)
  • jmx_jvm_memory_pool_bytes_used tagged with pool (gauge)
  • jmx_jvm_memory_pool_bytes_committed tagged with pool (gauge)
  • jmx_jvm_memory_pool_bytes_max tagged with pool (gauge)
  • jmx_jvm_memory_pool_bytes_init tagged with pool (gauge)
  • jmx_jvm_info tagged with version, vendor, runtime (gauge)
  • jmx_jvm_memory_pool_allocated_bytes_total tagged with pool (counter)
  • jmx_java_lang_Memory_HeapMemoryUsage_committed maybe gauge? TBD
  • jmx_java_lang_OperatingSystem_TotalSwapSpaceSize maybe gauge? TBD
  • jmx_java_lang_MemoryPool_CollectionUsage_max tagged with name (untyped)
  • jmx_java_lang_Runtime_StartTime (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_endTime tagged with name (untyped)
  • jmx_java_lang_Memory_HeapMemoryUsage_max (untyped)
  • jmx_java_lang_MemoryPool_UsageThreshold tagged with name (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsageThresholdCount tagged with name (untyped)
  • jmx_java_lang_Memory_NonHeapMemoryUsage_used (untyped)
  • jmx_java_lang_Threading_PeakThreadCount (untyped)
  • jmx_java_lang_MemoryPool_PeakUsage_used tagged with name (untyped)
  • jmx_java_lang_ClassLoading_TotalLoadedClassCount (untyped)
  • jmx_java_lang_OperatingSystem_MaxFileDescriptorCount (untyped)
  • jmx_java_lang_ClassLoading_Verbose (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_id tagged with name (untyped)
  • jmx_java_lang_Threading_CurrentThreadUserTime (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageAfterGc_committed tagged with name and key (untyped)
  • jmx_java_lang_Threading_ThreadCount (untyped)
  • jmx_java_lang_MemoryPool_PeakUsage_committed tagged with name (untyped)
  • jmx_java_lang_Memory_ObjectPendingFinalizationCount (untyped)
  • jmx_java_lang_MemoryPool_Usage_used tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_CollectionCount tagged with name (untyped)
  • jmx_java_lang_Threading_SynchronizerUsageSupported (untyped)
  • jmx_java_lang_Runtime_BootClassPathSupported (untyped)
  • jmx_java_nio_BufferPool_Count tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_GcThreadCount tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageBeforeGc_committed tagged with name and key (untyped)
  • jmx_java_lang_Threading_CurrentThreadCpuTimeSupported (untyped)
  • jmx_java_lang_ClassLoading_LoadedClassCount (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsage_init tagged with name (untyped)
  • jmx_java_lang_MemoryPool_PeakUsage_max tagged with name (untyped)
  • jmx_java_lang_MemoryPool_Usage_max tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_Valid tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageBeforeGc_used tagged with name and key (untyped)
  • jmx_java_lang_Threading_ThreadAllocatedMemoryEnabled (untyped)
  • jmx_java_lang_MemoryManager_Valid tagged with name (untyped)
  • jmx_java_lang_MemoryPool_Usage_init tagged with name (untyped)
  • jmx_java_lang_OperatingSystem_ProcessCpuLoad (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsage_committed tagged with name (untyped)
  • jmx_java_lang_OperatingSystem_TotalPhysicalMemorySize (untyped)
  • jmx_java_lang_Memory_NonHeapMemoryUsage_committed (untyped)
  • jmx_java_lang_Compilation_TotalCompilationTime (untyped)
  • jmx_java_lang_Memory_Verbose (untyped)
  • jmx_java_lang_MemoryPool_Valid tagged with name (untyped)
  • jmx_java_lang_OperatingSystem_FreeSwapSpaceSize (untyped)
  • jmx_java_lang_MemoryPool_UsageThresholdExceeded tagged with name (untyped)
  • jmx_java_lang_Threading_CurrentThreadCpuTime (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsageThreshold tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_CollectionTime tagged with name (untyped)
  • jmx_java_lang_Compilation_CompilationTimeMonitoringSupported (untyped)
  • jmx_java_lang_MemoryPool_Usage_committed tagged with name (untyped)
  • jmx_java_lang_Memory_NonHeapMemoryUsage_init (untyped)
  • jmx_java_lang_MemoryPool_PeakUsage_init tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_startTime tagged with name (untyped)
  • jmx_java_lang_OperatingSystem_AvailableProcessors (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsageThresholdSupported tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageBeforeGc_max tagged with name and key (untyped)
  • jmx_java_lang_ClassLoading_UnloadedClassCount (untyped)
  • jmx_java_nio_BufferPool_MemoryUsed tagged with name (untyped)
  • jmx_java_nio_BufferPool_TotalCapacity tagged with name (untyped)
  • jmx_java_lang_Memory_HeapMemoryUsage_used (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsage_used tagged with name (untyped)
  • jmx_java_lang_Memory_HeapMemoryUsage_init (untyped)
  • jmx_java_lang_OperatingSystem_SystemCpuLoad (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageAfterGc_init tagged with name and keys (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageBeforeGc_init tagged with name and key (untyped)
  • jmx_java_lang_Threading_ThreadAllocatedMemorySupported (untyped)
  • jmx_java_lang_Memory_NonHeapMemoryUsage_max (untyped)
  • jmx_java_lang_Threading_DaemonThreadCount (untyped)
  • jmx_java_lang_Threading_ThreadCpuTimeSupported (untyped)
  • jmx_java_lang_OperatingSystem_SystemLoadAverage (untyped)
  • jmx_java_lang_Threading_TotalStartedThreadCount (untyped)
  • jmx_java_lang_OperatingSystem_ProcessCpuTime (untyped)
  • jmx_java_lang_OperatingSystem_FreePhysicalMemorySize (untyped)
  • jmx_java_lang_Runtime_Uptime (untyped)
  • jmx_java_lang_MemoryPool_CollectionUsageThresholdExceeded tagged with name (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_duration tagged with name (untyped)
  • jmx_java_lang_Threading_ObjectMonitorUsageSupported (untyped)
  • jmx_java_lang_GarbageCollector_LastGcInfo_memoryUsageAfterGc_max tagged with name and key (untyped)
  • jmx_java_lang_MemoryPool_UsageThresholdCount tagged with name (untyped)
  • jmx_java_lang_Threading_ThreadCpuTimeEnabled (untyped)
  • jmx_java_lang_OperatingSystem_OpenFileDescriptorCount (untyped)

The type of metric for any metric marked untyped is unclear but likely gauges. We'll need to do deeper research to determine:

  1. Whether we want to keep them.
  2. What type they are if we retain them.

Naming of metrics is determined via:

  • jmx _ metric_name

This is in line with JMX's internal naming.

All metrics will be tagged with the endpoint and host.

This list is the default JVM metrics available via JMX. Individual applications (Kafka, Cassandra, Tomcat, etc) will expose their own MBeans and metrics and we can walk each of them and parse their output to construct and return the metrics as the Prometheus jmx_exporter does.

The Prometheus exporter contains accept/deny patterns for specific MBean objects and a rule-based rewrite engine to match and construct metrics from specific objects. Both of these seems like reasonable future work items, albeit we may want to look at whether a transform might be more effective?

Doc-level Proposal

The following additional source configuration will be added:

[sources.my_source_id]
  type = "jmx_metrics" # required
  endpoint = "service:jmx:rmi:///jndi/rmi://127.0.0.1:1234/jmxrmi" # required - address of the JMX webserver to scrape.
  username = "user" # optional - username for any JMX authentication.
  password = "password" # optional - password for any JMX authentication.
  scrape_interval_secs = 15 # optional, default, seconds
  namespace = "jmx" # optional, default is "jmx", namespace to put metrics under
  • We'd also add a guide for doing this with authentication.

Rationale

The JVM is a popular platform for running applications. Additionally, it is the basis for running a number of other infrastructure tools, including Kafka and Cassandra. User frequently want to understand it's performance.

Additionally, as part of Vector's vision to be the "one tool" for ingesting and shipping observability data, it makes sense to add as many sources as possible to reduce the likelihood that a user will not be able to ingest metrics from their tools.

Prior Art

Drawbacks

  • Additional maintenance and integration testing burden of a new source

Alternatives

  1. Having users run telegraf or Prom node exporter and using Vector's Prometheus source to scrape it. We could not add the source directly to Vector and instead instruct users to run Prometheus' jmx_exporter and point Vector at the resulting data.

  2. Or we could use something like jmxtrans and add a Vector OutputWriter.

I decided against both of these this as they would be in conflict with one of the listed principles of Vector:

One Tool. All Data. - One simple tool gets your logs, metrics, and traces (coming soon) from A to B.

Vector principles

If users are already running Prometheus though, they could opt for the Prometheus path.

Outstanding Questions

  • Java agent or not.

Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

  • Submit a PR with the initial source implementation

Future Work

  • Accept/Deny list for MBean object patterns.
  • Rule-based engine for matching and constructing specific metrics.