
[[input.kapacitor]] Invalid character 'x' in string escape code #7563

Closed
mertbakir opened this issue May 22, 2020 · 21 comments
Labels
bug unexpected problem or unintended behavior

Comments

@mertbakir

mertbakir commented May 22, 2020

I've tried to use the kapacitor input plugin and I get the error below. I couldn't find a way to solve it. I think it's related to Kapacitor's JSON output; I do get JSON output when I type the URL into a browser manually.

Relevant telegraf.conf:

[[inputs.kapacitor]]
  urls = ["http://localhost:9092/kapacitor/v1/debug/vars"]
  timeout = "5s"

Telegraf Log

E! [inputs.kapacitor] Error in plugin: [url=http://localhost:9092/kapacitor/v1/debug/vars]: invalid character 'x' in string escape code

System info:

Centos 7.6
Telegraf: 1.14.2
Kapacitor: 1.5

@danielnelson
Contributor

Can you attach the output from the link?:

curl http://localhost:9092/kapacitor/v1/debug/vars -o kapa.txt

@danielnelson added the bug (unexpected problem or unintended behavior) and need more info labels on May 22, 2020
@mertbakir
Author

@danielnelson here it is.
kapa.txt

@danielnelson
Contributor

I ran jsonlint on the output to get a bit more info:

$ cat ~/kapa.txt | jsonlint -s -
<stdin>:5:78702: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78702, offset 78854
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78706: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78706, offset 78858
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78710: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78710, offset 78862
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78714: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78714, offset 78866
   |  String started at line 5, column 78701, offset 78853
<stdin>: has errors

Here is the section of the file containing the issue, formatted.

{
    "name": "ingress",
    "tags": {
        "cluster_id": "85ac55eb-351e-46c1-ade0-a784dd04625f",
        "database": "SERVERS",
        "host": "ALTAIR",
        "measurement": "\xb2\x8e\xee\x80disk",
        "retention_policy": "autogen",
        "server_id": "c64c5903-e50b-459c-a519-b30f9791d943",
        "task_master": "main"
    }
}

It appears that Kapacitor is not properly escaping non-UTF-8 characters in the JSON, and it may be that this value cannot be correctly escaped for JSON at all.

https://github.com/influxdata/kapacitor/blob/master/services/httpd/handler.go#L543-L554
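
As a minimal illustration (not Kapacitor's actual code) of how \x escapes can end up here: Go's %q formatting escapes invalid UTF-8 bytes as \x.., which JSON does not allow, while encoding/json replaces them with the Unicode replacement character:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // a measurement name containing bytes that are not valid UTF-8,
    // like the corrupted "disk" measurement above
    s := "\xb2\x8e\xee\x80disk"

    // Go's %q verb escapes the invalid bytes as \x.. sequences; JSON has
    // no \x escape, so output built this way is not valid JSON
    fmt.Printf("%q\n", s) // "\xb2\x8e\xee\x80disk"

    // encoding/json replaces each invalid byte with the replacement
    // character instead, producing valid (but lossy) JSON
    b, _ := json.Marshal(s)
    fmt.Println(string(b)) // "\ufffd\ufffd\ufffd\ufffddisk"
}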

I can move this issue over to Kapacitor, since we won't be able to fix it in Telegraf, but first let's try to track down where this measurement is coming from. Does it appear in InfluxDB if you run show measurements?

@mertbakir
Author

mertbakir commented May 23, 2020

Oh, I see now that it's because of that corrupted measurement. Yes, I see it in InfluxDB.
I had opened a topic on the community forum about this before; it looks like these are the same case. Last time I wasn't able to delete the corrupted measurement and just waited for the retention policy to kick in.

I think what's important here is to find the cause of this corrupted measurement.

@danielnelson
Contributor

I think what's important here is to find the cause of this corrupted measurement.

Agreed. Are you able to show the tags/fields for the measurement? I'm not exactly sure how to escape this measurement name in InfluxDB; normally the queries would be:

show field keys from disk
show tag keys from disk

How often are values for this measurement reported?

@mertbakir
Author

mertbakir commented May 27, 2020

10s interval, from around 400-500 nodes. It's the default interval set at the top of the config.

> show field keys from disk
name: disk
fieldKey     fieldType
--------     ---------
free         integer
inodes_free  integer
inodes_total integer
inodes_used  integer
total        integer
used         integer
used_percent float
> show tag keys from disk
name: disk
tagKey
------
device
env
fstype
host
ip
mode
os
path
role
site

I can query the measurement, but it returns the working one. There are two of them: one is corrupted, and I couldn't manage to query it.

These are from the influx terminal:

disk
����disk

Today I got one more corrupt measurement: ���zcpu

@mertbakir
Author

mertbakir commented May 30, 2020

@danielnelson
I have redirected pretty much all measurements to a new database (bucket), except the ones that come from the sqlserver plugin, since you said people have reported that the sqlserver input causes this kind of bug. I'll report back in the next weeks, hopefully with no corrupt measurements in the new bucket.

@mertbakir
Author

mertbakir commented Jun 16, 2020

Here is an update, with bad news for me :/

I have another corrupt measurement in the new database. I actually have no clue what could be causing this, so I'll try to explain my setup and what I did step by step.

  • InfluxDB 1.8 container running on Docker on CentOS 7.6 (this happened with InfluxDB 1.7 too).
  • There are ~500 Telegraf agents, all pushing data to one Telegraf instance (the gateway).
  • The gateway separates data into different databases (buckets, all in the same InfluxDB instance) using the namepass and namedrop features.
  • I recently noticed a continuous query for downsampling that was failing and returning errors. I removed it, but that didn't fix anything; the problem recurred.
  • We were suspicious of the sqlserver plugin, so I put all data into one bucket and the sqlserver data into another bucket, again using the namepass/namedrop features with Telegraf as a gateway (a simplified sketch follows this list). This also didn't fix it; the problem recurred in the new bucket I created.
  • There is also Kapacitor; however, Kapacitor doesn't write data into that bucket, it only creates alerts using stream nodes.
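
To make the routing concrete, here is a simplified sketch of the gateway config (database names and patterns are illustrative, not my exact config):

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "SQL"
  # only sqlserver metrics go to this bucket
  namepass = ["sqlserver_*"]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "MAIN"
  # everything except sqlserver metrics goes here
  namedrop = ["sqlserver_*"]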

Where should I check? What do you suggest? The corrupted measurements seem pretty random. It doesn't happen very often; probably more than a month passed since the previous occurrence.

image

@danielnelson
Contributor

Hmm, w?em. Can you tell which measurement this is supposed to be from the tags/fields?

Would you be willing to run a development version of Telegraf with some extra checks and logging?

@mertbakir
Author

mertbakir commented Jun 16, 2020

I am not sure which measurement it should be, but it seems like win_mem or win_system. The older errors were obvious: ???cpu and ???disk.

I can run a dev version, but currently I don't have the automation to update all agents. Should I look into the logs from all Telegraf instances? Maybe I should collect the Telegraf logs into InfluxDB.

@mertbakir
Author

mertbakir commented Jun 16, 2020

Is there a chance that this happens at the Telegraf gateway? It doesn't occur in the same measurement each time: it happened in the cpu and the win_mem/win_system measurements, and I have no Windows servers using the cpu plugin/measurement and no Linux servers using win_mem. So the error doesn't come from a specific Telegraf agent; it either happens at the gateway or at multiple agents, not at a single one.

I'm just laying out my thoughts here. Does that sound reasonable?

@mertbakir
Author

Updates:

I've found similar fields in other measurements.

image

image

@danielnelson
Contributor

Here are some builds of #7696, based on 1.14.4. It includes UTF-8 checking whenever metrics are created; if it finds an invalid byte sequence, it is replaced with the replacement character and a message is logged. There are still other places the corrupted data could be coming from, but this is the most likely place to catch it.

I'll have to think about exactly how/if we can roll this out to an official version, since I know some people depend on being able to send binary data.

Can you start by running this on your gateway node and keeping an eye out for log messages?
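
The check is roughly of this shape (a simplified sketch, not the actual code from #7696):

package main

import (
    "log"
    "strings"
    "unicode/utf8"
)

// sanitizeName sketches the kind of check described above: if a metric
// name, tag, or field key is not valid UTF-8, replace the invalid bytes
// with the replacement character and log a message.
func sanitizeName(name string) string {
    if utf8.ValidString(name) {
        return name
    }
    fixed := strings.ToValidUTF8(name, "\uFFFD")
    log.Printf("E! metric name %q is not valid UTF-8, replaced with %q", name, fixed)
    return fixed
}

func main() {
    log.Println(sanitizeName("\xb2\x8e\xee\x80disk"))
}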

@mertbakir
Author

@danielnelson should I set debug = true ?

Current setting:

  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

@mertbakir
Author

mertbakir commented Jun 17, 2020

Is there a chance that this happens at the Telegraf gateway? It doesn't occur in the same measurement each time: it happened in the cpu and the win_mem/win_system measurements, and I have no Windows servers using the cpu plugin/measurement and no Linux servers using win_mem. So the error doesn't come from a specific Telegraf agent; it either happens at the gateway or at multiple agents, not at a single one.

I'm just laying out my thoughts here. Does that sound reasonable?

I want to update my logic here. ???zcpu could be win_cpu, so there is a good chance these errors come from Windows servers instead of the gateway node, because all the bad measurements I've noticed are collected from Windows servers. Today I noticed bad metrics and corrupted hostnames that belong to SQL servers.

  1. I've deployed the new Telegraf version you provided on the gateway node and 2 other SQL nodes.
  2. I've created 2 new buckets: one for the SQL nodes (all metrics collected from SQL servers go there directly, not through the gateway) and one for all other metrics, which come through the gateway.

@danielnelson
Contributor

should I set debug = true ?

It's not required; the new errors are logged at error level.

Could definitely be related to sqlserver or even win_perf_counters.

@mertbakir
Author

I've got no new errors about UTF-8 parsing from the new Telegraf agent, which was deployed on 2 SQL servers and the gateway node. But I found some others:

This occurs on all Windows servers:

2020-06-19T04:18:33Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

From some of the SQL servers:

2020-06-18T21:01:45Z E! [outputs.influxdb] When writing to [http://...8086]: Post http://...:8086/write?db=SQL: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-06-18T21:01:45Z E! [agent] Error writing to outputs.influxdb: could not write any address
2020-06-19T05:13:48Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-06-02T03:34:25Z E! [inputs.win_perf_counters] Error in plugin: (pdhErr=258) The system cannot find message text for message number 0x%1 in the message file for %2.

This is from 1.14.4 on an SQL server:

2020-06-17T15:08:48Z E! [outputs.influxdb] When writing to [http://....:8086]: received error partial write: unable to parse 'sqlserver_requests
...very long statement_text here....
Committed granted_query_memory_pages=0i,request_id=0i,cpu_time_ms=4i,total_elasped_time_ms=4i,writes=0i,logical_reads=858i,objectid=531532977i,blocking_session_id=0i,wait_time_ms=0i,percent_complete=0,open_transaction=0i,session_id=564i 1592424390000000000': max key length exceeded: 82027 > 65535 dropped=0; discarding points

@danielnelson
Contributor

E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

Not sure on this one, could be a Telegraf bug or an issue with the perf counters. How often does it occur?

E! [outputs.influxdb] When writing to [http://...8086]: Post http://...:8086/write?db=SQL: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is a general timeout writing to the output; it may not be something to worry about unless it happens frequently.

E! [inputs.win_perf_counters] Error in plugin: (pdhErr=258) The system cannot find message text for message number 0x%1 in the message file for %2.

This one is new to me; it would be interesting if you could narrow down the reproduction steps for it.

E! [outputs.influxdb] When writing to [http://....:8086]: received error partial write: unable to parse 'sqlserver_requests

This one is the prime candidate for the source of the corrupted data, but I know you said earlier that you disabled the sqlserver plugin and still had the issue. There is some additional discussion about this in #6976 and #7037.

I do recommend either disabling this query with exclude_query = [ 'SqlRequests'] or converting the statement_text tag to a field using the converter processor.
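
Roughly, either option would look like this in telegraf.conf (an untested sketch; adjust to your setup):

[[inputs.sqlserver]]
  # skip the query that produces the very large statement_text tag
  exclude_query = [ 'SqlRequests' ]

[[processors.converter]]
  namepass = ["sqlserver_requests"]
  # convert the statement_text tag into a string field
  [processors.converter.tags]
    string = ["statement_text"]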

@mertbakir
Author

mertbakir commented Jul 6, 2020

Hi @danielnelson,
I'm back with some information. Sorry for the late response.

The first one occurs pretty often. This is just a random sample from a random server; there are hundreds of similar rows.

2020-02-06T03:51:39Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T03:51:45Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T03:58:06Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:14:14Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:21:41Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:23:53Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:54:01Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:55:08Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:06:52Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:06:59Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:14:23Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:24:04Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

I'll try to get some insight on the third one.

Lastly, about our prime candidate: I've excluded 'SqlRequests' and now we'll wait and see if the error occurs again. I've checked the logs again; there are no logs about parsing errors. Also, I haven't seen a corrupt measurement/tag in the main bucket, which was separated from the SQL one.

@mertbakir
Author

Update: Yesterday I moved the gateway node back to a container, and today corrupted data (a measurement) occurred again in the non-SQL bucket. I believe it comes from a Windows server; this time the corrupted measurement is "win_pagingfile". This proves it's not SQL related.

@mertbakir
Author

Update: It's been more than a month and the problem hasn't recurred, so I'm going to close this thread. Based on my experiments, it happens when the gateway node runs in a Docker container. Now I run multiple Telegraf instances as services directly on the host, and all is good so far.
