Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON output data format, support nanosecond timestamps. #2124

Closed
tjmcs opened this issue Dec 4, 2016 · 6 comments · Fixed by #2587
Closed

JSON output data format, support nanosecond timestamps. #2124

tjmcs opened this issue Dec 4, 2016 · 6 comments · Fixed by #2587

Comments

@tjmcs
Copy link

tjmcs commented Dec 4, 2016

Bug report

In the current released version of telegraf and the latest version on the master branch, the precision value declared for metrics generated by the agent is not honored for all Serializers. Specifically, when we select a file output filter and request an influx data_format, with the precision value for the agent set to 1ns (for nanosecond precision), we see values that look like this in the output:

system,host=192-168-1-7.tpgi.com.au load1=1.93,load5=2.11,load15=2.14,n_users=7i,n_cpus=8i 1480843620054960352
system,host=192-168-1-7.tpgi.com.au uptime=2339551i,uptime_format="27 days,  1:52" 1480843620054996864

On the other hand, when we change that output data_format value to json, the values that we see look like this:

{"fields":{"load1":2.16,"load15":2.14,"load5":2.13,"n_cpus":8,"n_users":7},"name":"system","tags":{"host":"192-168-1-7.tpgi.com.au"},"timestamp":1480843720}
{"fields":{"uptime":2339651,"uptime_format":"27 days,  1:54"},"name":"system","tags":{"host":"192-168-1-7.tpgi.com.au"},"timestamp":1480843720}

It's obvious that the timestamp values shown in the influx data_format example shown above are actual nanosecond timestamp values, while the timestamps shown for the json data_format example have been truncated to the nearest second.

It's also obvious where this occurs. While the precision for the agent does set the precision of the time stamp for the metrics that are collected, when converting that data to JSON in this line of code, that timestamp is converted to a time in seconds. The influx data_format is actually generated in a completely different fashion than the json data_format; by calling a simple m.Serialize() method in the current commit on the master branch of the repository or []string{metric.String()} method in the v1.1.1 release of telegraf.

Relevant telegraf.conf:

This is the telegraf.conf file that was used for the influx data_format example:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"


# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "ns", "us" (or "µs"), "ms", "s".
  precision = "1ns"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false


###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/tmp/metrics.out"]

  ## Data format to output.
  ## Each data format has it's own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"



###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################

# # Print all metrics that pass through this filter.
# [[processors.printer]]



###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################

# # Keep the aggregate min/max of each metric passing through.
# [[aggregators.minmax]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false



###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration


while this is the telegraf.conf file used for the json data_format example:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"


# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "ns", "us" (or "µs"), "ms", "s".
  precision = "1ns"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do no set the "host" tag in the telegraf agent.
  omit_hostname = false


###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/tmp/metrics.out"]

  ## Data format to output.
  ## Each data format has it's own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "json"



###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################

# # Print all metrics that pass through this filter.
# [[processors.printer]]



###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################

# # Keep the aggregate min/max of each metric passing through.
# [[aggregators.minmax]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false



###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration


both were generated by using the telegraf -input-filter system -output-filter file config command and hand editing the resulting configuration file to set the precision to 1ns so that we would get timestamps to the nearest nanosecond in the metrics that were measured by the agent; the only difference is the data_format for the file output filter.

System info:

Tested using Telegraf version 1.1.1 and a version of Telegraf built from the latest commit to the master branch (504f4e6 ) under both CentOS 7.2 and OS X 10.12.7

Steps to reproduce:

  1. create a configuration file with a file output filter using the telegraf -input-filter system -output-filter file config command
  2. hand edit that file to set the precision to 1ns
  3. create a copy of that file and change the data_format to json in the copy that you just created
  4. run the command telegraf --config <filename> where the string <filename> is the configuration that you want to test (the first file and the second file)

Expected behavior:

I expected to see that the timestamps for both the influx and json data_format cases would have the same precision; the precision declared in the configuration of the telegraf agent.

Actual behavior:

The precision of the metric timestamps depends not only on the precision declared in the configuration of the telegraf agent but also on the data_format declared for the output filter, specifically:

  • In the influx data_format case, the nanosecond precision that was declared for the agent was honored, and the timestamps shown in the output generated by the agent were to the nearest nanosecond
  • In the json data_format case the precision declared for the agent was not honored, instead the timestamps associated with the metric (which were to the nearest nanosecond) were arbitrarily converted to timestamp values in seconds (by dividing the timestamp associated with each metric value, converted to nanoseconds, by 1000000000).
@sparrc sparrc added this to the Future Milestone milestone Dec 5, 2016
@sparrc
Copy link
Contributor

sparrc commented Dec 5, 2016

The agent precision refers more to the precision at the metric collection point. InfluxDB is specifically a data protocol which supports nanosecond precision.

So this is just a request for the JSON output format to support nanosecond timestamps.

@sparrc sparrc changed the title The agent precision should be honored for all Serializers JSON output data format, support nanosecond timestamps. Dec 5, 2016
@tjmcs
Copy link
Author

tjmcs commented Dec 5, 2016

If you'd rather call this a feature request than a bug fix I'm open to that; I'm just having difficulty with the fact that the timestamp in the influx data_format is returned in the units that were used when the metric was collected (nanosecond, milliseconds, microseconds or seconds) but the timestamp returned in the json data_format is always in seconds, regardless of the precision that was used when the metric was collected.

It just seemed to me that the timestamp should be consistent, regardless of the data_format that is defined (so either both should always return seconds or the timestamp returned for both should be the timestamp that was generated when the metric was collected, not a truncated timestamp for one and a full-fidelity timestamp for the other.

@sparrc
Copy link
Contributor

sparrc commented Dec 5, 2016

I disagree, I think that the output format defines the precision. Influx is defined to be nanosecond, graphite is second, and JSON in this case is second as well.

@tjmcs
Copy link
Author

tjmcs commented Dec 5, 2016

OK, I see that now. If I change the precision to 1us, 1ms, or 1s for the influx data format it continues to return a time in nanoseconds, regardless of the precision of the data collected.

Is that documented somewhere (that the time returned for the influx data format is always in nanoseconds and the time returned for the graphite and json data formats is always in seconds), other than in the statements used in the code to display the event's timestamp, and I just missed it?

Given that behavior, then I agree that this is a feature request. That said, I don't know how the change would be supported since it would be a breaking change for people who are currently collecting json and graphite data formats and are expecting the timestamp to be in seconds. To make the change I guess we'd have to add some sort of output_precision parameter to the configuration file and have it default to the current behavior for those data formats if the parameter wasn't specified.

@sparrc
Copy link
Contributor

sparrc commented Dec 5, 2016

Yes it would need to be an argument for the json output format, graphite doesn't support anything other than s.

@tjmcs
Copy link
Author

tjmcs commented Dec 5, 2016

That actually makes things much simpler than I thought it would be; let me see if I can throw together a pull request and submit something in short order on this. I was afraid I'd have to pull the precision from the Agent configuration somehow; having it as a configuration parameter in the json data_format output simplifies things greatly.

Look for a CLA signature from me and a pull request sometime this week :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants