All metrics values delayed when inserting in the past beyond the default retention policy #3144

Closed
lgosselin opened this issue Aug 21, 2017 · 0 comments · Fixed by #3155
Labels
area/influxdb bug unexpected problem or unintended behavior
lgosselin commented Aug 21, 2017

I have a process that posts measurements to telegraf with timestamps tied to the data being processed. Since it usually works on (almost) real-time data, the timestamps are more or less current. Occasionally, however, it is asked to reprocess old data, and the measurements are sent again with their original timestamps.

If the database has a default retention policy, then while older data is being reprocessed, all the metrics in the Chronograf dashboard are delayed by a few minutes (how much seems to vary between environments). When the process stops emitting past events, the dashboard still lags for a while before returning to normal.

Setting the retention policy is critical to reproducing the issue. It causes partial writes at the InfluxDB level; telegraf seems confused by them and appears to hold other measurements hostage, even those issued by another input plugin. However, I have not seen any measurement values missing once it gets back to normal.

Environment setup using docker on linux:

  • Use the attached docker-compose.yml (shamelessly ripped off from https://github.com/influxdata/TICK-docker/tree/master/1.2, with updated versions, small adjustments, and http_listener enabled)
  • Use the attached telegraf configuration
  • Start the environment using: docker-compose up
  • Create a default retention policy on the telegraf database: docker-compose run influxdb-cli -execute 'CREATE RETENTION POLICY realtime ON telegraf DURATION 4w REPLICATION 1 DEFAULT;'
  • Open a browser on Chronograf (localhost:8888), go to the host list, and use the "system" dashboard for your host.
  • Set refresh to "Every 10s" and the time range to "Past 15 minutes".
  • Wait a few minutes for data points to be collected
  • Validate that the collected data is up-to-date (for example, use the tooltip on the CPU usage measurements to validate the time)
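
As an alternative to reading the tooltip, the freshness check can be scripted. This is only a rough sketch: it assumes InfluxDB's 1.x HTTP query API is reachable on localhost:8086 (the port is an assumption, check the compose file) and that telegraf's cpu input is writing the usage_idle field. The function names are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed InfluxDB 1.x query endpoint; adjust host/port to match your compose file.
INFLUX_QUERY_URL = "http://localhost:8086/query"


def newest_timestamp_ns(response_body):
    """Return the timestamp (ns) of the first returned row, or None if no rows."""
    doc = json.loads(response_body)
    for result in doc.get("results", []):
        for series in result.get("series", []):
            values = series.get("values") or []
            if values:
                return int(values[0][0])
    return None


def lag_seconds(newest_ns, now_ns):
    """Age of the newest point, in seconds."""
    return (now_ns - newest_ns) / 1e9


def fetch_cpu_lag(db="telegraf"):
    """Ask InfluxDB for the most recent cpu point and return its age in seconds."""
    query = "SELECT usage_idle FROM cpu ORDER BY time DESC LIMIT 1"
    params = urllib.parse.urlencode({"db": db, "q": query, "epoch": "ns"})
    with urllib.request.urlopen(f"{INFLUX_QUERY_URL}?{params}") as resp:
        body = resp.read().decode("utf-8")
    newest = newest_timestamp_ns(body)
    if newest is None:
        raise RuntimeError("no cpu points returned")
    return lag_seconds(newest, time.time_ns())
```

Calling fetch_cpu_lag() every few seconds should return a small value (roughly the collection interval) while everything is healthy, and a growing one while the problem is occurring.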

Then begin to reproduce:

  • Post a few events with timestamps in the past, beyond the retention policy: curl -i -XPOST "http://localhost:8186/write?db=telegraf&precision=ns" --data-binary "@test.txt"
  • Wait 1 or 2 minutes and confirm that most of the measurements no longer reach the dashboard. You should see a gap on almost all charts (at least those that refresh their X axis).

If it does not work for you, try posting several times (5 times, 1 or 2 seconds apart seems to be enough for me).
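
If you don't have the attachment at hand, a file similar to test.txt can be generated. The sketch below is an assumption about what such a file could look like, not the content of the attached one: the measurement name reprocess_test, the source=replay tag, and the one-hour margin are all hypothetical. It writes line-protocol points whose nanosecond timestamps fall just outside the 4w retention window.

```python
import time

NS_PER_SECOND = 1_000_000_000
RETENTION_SECONDS = 4 * 7 * 24 * 3600  # 4w, matching the policy created above


def old_points(count, now_ns, margin_seconds=3600, measurement="reprocess_test"):
    """Build line-protocol lines with timestamps older than the retention window.

    The measurement and tag names are hypothetical placeholders.
    """
    start_ns = now_ns - (RETENTION_SECONDS + margin_seconds) * NS_PER_SECOND
    lines = []
    for i in range(count):
        ts = start_ns - i * NS_PER_SECOND  # one point per second, going backwards
        lines.append(f"{measurement},source=replay value={i}i {ts}")
    return lines


def write_test_file(path="test.txt", count=5):
    """Write the generated points to a file usable with curl --data-binary."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(old_points(count, time.time_ns())) + "\n")
```

Run write_test_file() and then post the resulting file with the curl command above.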

The telegraf logs should reveal something along the lines of:

E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: points beyond retention policy dropped=xx]
E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
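
To make the distinction concrete, here is a small sketch that recognises this specific error and extracts the dropped count. The regular expression is based only on the log line above. The should_retry policy is hypothetical, shown to illustrate the behaviour I would expect; it is not telegraf's actual logic.

```python
import re

# Pattern taken from the log line above; the count is assumed to be an integer.
PARTIAL_WRITE_RE = re.compile(
    r"partial write: points beyond retention policy dropped=(\d+)"
)


def dropped_points(error_text):
    """Return the number of dropped points, or None if the text does not match."""
    match = PARTIAL_WRITE_RE.search(error_text)
    return int(match.group(1)) if match else None


def should_retry(status_code, error_text):
    """Hypothetical policy: a partial write is not worth retrying.

    The in-range points were already accepted and the out-of-range ones
    would be rejected again, so resending the batch gains nothing.
    """
    if status_code == 400 and dropped_points(error_text) is not None:
        return False
    return status_code >= 500  # only server-side failures are treated as retryable
```

With a policy like this, a batch containing a few out-of-range points would be considered written, rather than treated as a failure of the whole output.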

The expected behaviour would be to have no delay at all (or close to none) in unrelated metrics, especially if coming from other plugins.

The actual behaviour: no new metric values are available for a variable period of time (at least 4-5 min, sometimes much longer).

issue_telegraf.zip

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Aug 21, 2017
@danielnelson danielnelson added this to the 1.4.0 milestone Aug 21, 2017