
[[input.kapacitor]] Invalid character 'x' in string escape code #7563

Closed
mertbakir opened this issue May 22, 2020 · 21 comments
Labels
bug unexpected problem or unintended behavior

Comments

@mertbakir

mertbakir commented May 22, 2020

I've tried to use the kapacitor input plugin and I get the error below. I couldn't find a way to solve it. I think it's related to Kapacitor's JSON output; I do get JSON output when I type the URL into a browser manually.

Relevant telegraf.conf:

[[inputs.kapacitor]]
  urls = ["http://localhost:9092/kapacitor/v1/debug/vars"]
  timeout = "5s"

Telegraf Log

E! [inputs.kapacitor] Error in plugin: [url=http://localhost:9092/kapacitor/v1/debug/vars]: invalid character 'x' in string escape code

System info:

Centos 7.6
Telegraf: 1.14.2
Kapacitor: 1.5

@danielnelson
Contributor

Can you attach the output from the link?:

curl http://localhost:9092/kapacitor/v1/debug/vars -o kapa.txt

@danielnelson added the bug (unexpected problem or unintended behavior) and need more info labels on May 22, 2020
@mertbakir
Author

@danielnelson here it is.
kapa.txt

@danielnelson
Contributor

I ran jsonlint on the output to get a bit more info:

$ cat ~/kapa.txt | jsonlint -s -
<stdin>:5:78702: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78702, offset 78854
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78706: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78706, offset 78858
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78710: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78710, offset 78862
   |  String started at line 5, column 78701, offset 78853
<stdin>:5:78714: Error: JSON strings may not use the \x hex-escape
   |  At line 5, column 78714, offset 78866
   |  String started at line 5, column 78701, offset 78853
<stdin>: has errors

Here is the section of the file containing the issue, formatted.

{
    "name": "ingress",
    "tags": {
        "cluster_id": "85ac55eb-351e-46c1-ade0-a784dd04625f",
        "database": "SERVERS",
        "host": "ALTAIR",
        "measurement": "\xb2\x8e\xee\x80disk",
        "retention_policy": "autogen",
        "server_id": "c64c5903-e50b-459c-a519-b30f9791d943",
        "task_master": "main"
    }
}

It appears that Kapacitor is not properly escaping non-UTF-8 characters in the JSON, and it may be that this value cannot be correctly escaped for JSON at all.

https://github.com/influxdata/kapacitor/blob/master/services/httpd/handler.go#L543-L554
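
As a minimal illustration (not Kapacitor's actual code) of how \x escapes can end up here: Go's %q formatting escapes invalid UTF-8 bytes as \x.., which JSON does not allow, while encoding/json replaces them with the Unicode replacement character:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // a measurement name containing bytes that are not valid UTF-8,
    // like the corrupted "disk" measurement above
    s := "\xb2\x8e\xee\x80disk"

    // Go's %q verb escapes the invalid bytes as \x.. sequences; JSON has
    // no \x escape, so output built this way is not valid JSON
    fmt.Printf("%q\n", s) // "\xb2\x8e\xee\x80disk"

    // encoding/json replaces each invalid byte with the replacement
    // character instead, producing valid (but lossy) JSON
    b, _ := json.Marshal(s)
    fmt.Println(string(b)) // "\ufffd\ufffd\ufffd\ufffddisk"
}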

I can move this issue over to Kapacitor, since we won't be able to fix it in Telegraf, but first let's try to track down where this measurement is coming from. Does it appear in InfluxDB if you run show measurements?

@mertbakir
Author

mertbakir commented May 23, 2020

Oh, I see now that it's because of that corrupted measurement. Yes, I see it in InfluxDB.
I had opened a topic on the community forum about this before; it looks like these are the same case. Last time I wasn't able to delete the corrupted measurement and just waited for the retention policy to kick in.

I think what's important here is to find the cause of this corrupted measurement.

@danielnelson
Contributor

I think what's important here is to find the cause of this corrupted measurement.

Agreed. Are you able to show the tags/fields for the measurement? I'm not exactly sure how to escape this measurement name in InfluxDB; normally the queries would be:

show field keys from disk
show tag keys from disk

How often are values for this measurement reported?

@mertbakir
Author

mertbakir commented May 27, 2020

10s interval, from around 400-500 nodes. It's the default interval set at the top of the config.

> show field keys from disk
name: disk
fieldKey     fieldType
--------     ---------
free         integer
inodes_free  integer
inodes_total integer
inodes_used  integer
total        integer
used         integer
used_percent float
> show tag keys from disk
name: disk
tagKey
------
device
env
fstype
host
ip
mode
os
path
role
site

I can query the measurement, but it returns the working one. There are two of them: one is corrupted, and I couldn't manage to query it.

These are from the influx terminal:

disk
����disk

Today I got one more corrupt measurement: ���zcpu

@mertbakir
Author

mertbakir commented May 30, 2020

@danielnelson
I have redirected pretty much all measurements to a new database (bucket), except the ones that come from the sqlserver plugin, since you said people have reported that the sqlserver input causes this kind of bug. I'll report back in the next weeks, hopefully with no corrupt measurements in the new bucket.

@mertbakir
Author

mertbakir commented Jun 16, 2020

Here is an update, with bad news for me :/

I have another corrupt measurement in the new database. I actually have no clue what could be causing this, so I'll try to explain my setup and what I did step by step.

  • InfluxDB 1.8 container running on Docker on CentOS 7.6 (this happened with InfluxDB 1.7 too).
  • There are ~500 Telegraf agents, all pushing data to one Telegraf instance (the gateway).
  • The gateway separates data into different databases (buckets, all in the same InfluxDB instance) using the namepass and namedrop features.
  • I recently noticed a continuous query for downsampling that was failing and returning errors. I removed it, but that didn't fix anything; the problem recurred.
  • We were suspicious of the sqlserver plugin, so I put all data into one bucket and the sqlserver data into another bucket, again using the namepass/namedrop features with Telegraf as a gateway (a simplified sketch follows this list). This also didn't fix it; the problem recurred in the new bucket I created.
  • There is also Kapacitor; however, Kapacitor doesn't write data into that bucket, it only creates alerts using stream nodes.
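
To make the routing concrete, here is a simplified sketch of the gateway config (database names and patterns are illustrative, not my exact config):

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "SQL"
  # only sqlserver metrics go to this bucket
  namepass = ["sqlserver_*"]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "MAIN"
  # everything except sqlserver metrics goes here
  namedrop = ["sqlserver_*"]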

Where should I check? What do you suggest? The corrupted measurements seem pretty random. It doesn't happen very often; probably more than a month passed since the previous occurrence.

image

@danielnelson
Contributor

Hmm, w?em. Can you tell which measurement this is supposed to be from the tags/fields?

Would you be willing to run a development version of Telegraf with some extra checks and logging?

@mertbakir
Author

mertbakir commented Jun 16, 2020

I am not sure which measurement it should be, but it seems like win_mem or win_system. The older errors were obvious: ???cpu and ???disk.

I can run a dev version, but currently I don't have the automation to update all agents. Should I look into the logs from all Telegraf instances? Maybe I should collect the Telegraf logs into InfluxDB.

@mertbakir
Author

mertbakir commented Jun 16, 2020

Is there a chance that this happens at the Telegraf gateway? It doesn't occur in the same measurement each time: it happened in the cpu and the win_mem/win_system measurements, and I have no Windows servers using the cpu plugin/measurement and no Linux servers using win_mem. So the error doesn't come from a specific Telegraf agent; it either happens at the gateway or at multiple agents, not at a single one.

I'm just laying out my thoughts here. Does that sound reasonable?

@mertbakir
Author

Updates:

I've found similar fields in other measurements.

image

image

@danielnelson
Contributor

Here are some builds of #7696, based on 1.14.4. It includes UTF-8 checking whenever metrics are created; if it finds an invalid byte sequence, it is replaced with the replacement character and a message is logged. There are still other places the corrupted data could be coming from, but this is the most likely place to catch it.

I'll have to think about exactly how/if we can roll this out to an official version, since I know some people depend on being able to send binary data.

Can you start by running this on your gateway node and keeping an eye out for log messages?
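
The check is roughly of this shape (a simplified sketch, not the actual code from #7696):

package main

import (
    "log"
    "strings"
    "unicode/utf8"
)

// sanitizeName sketches the kind of check described above: if a metric
// name, tag, or field key is not valid UTF-8, replace the invalid bytes
// with the replacement character and log a message.
func sanitizeName(name string) string {
    if utf8.ValidString(name) {
        return name
    }
    fixed := strings.ToValidUTF8(name, "\uFFFD")
    log.Printf("E! metric name %q is not valid UTF-8, replaced with %q", name, fixed)
    return fixed
}

func main() {
    log.Println(sanitizeName("\xb2\x8e\xee\x80disk"))
}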

@mertbakir
Author

@danielnelson should I set debug = true ?

Current setting:

  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

@mertbakir
Author

mertbakir commented Jun 17, 2020

Is there a chance that this happens at the Telegraf gateway? It doesn't occur in the same measurement each time: it happened in the cpu and the win_mem/win_system measurements, and I have no Windows servers using the cpu plugin/measurement and no Linux servers using win_mem. So the error doesn't come from a specific Telegraf agent; it either happens at the gateway or at multiple agents, not at a single one.

I'm just laying out my thoughts here. Does that sound reasonable?

I want to update my logic here. ???zcpu could be win_cpu, so there is a good chance these errors come from Windows servers instead of the gateway node, because all the bad measurements I've noticed are collected from Windows servers. Today I noticed bad metrics and corrupted hostnames that belong to SQL servers.

  1. I've deployed the new Telegraf version you provided on the gateway node and 2 other SQL nodes.
  2. I've created 2 new buckets: one for the SQL nodes (all metrics collected from SQL servers go there directly, not through the gateway) and one for all other metrics, which come through the gateway.

@danielnelson
Contributor

should I set debug = true ?

It's not required; the new errors are logged at error level.

Could definitely be related to sqlserver or even win_perf_counters.

@mertbakir
Author

I've got no new errors about UTF-8 parsing from the new Telegraf agent, which was deployed on 2 SQL servers and the gateway node. But I found some others:

This occurs on all Windows servers:

2020-06-19T04:18:33Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

From some of the SQL servers:

2020-06-18T21:01:45Z E! [outputs.influxdb] When writing to [http://...8086]: Post http://...:8086/write?db=SQL: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-06-18T21:01:45Z E! [agent] Error writing to outputs.influxdb: could not write any address
2020-06-19T05:13:48Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-06-02T03:34:25Z E! [inputs.win_perf_counters] Error in plugin: (pdhErr=258) The system cannot find message text for message number 0x%1 in the message file for %2.

This is from 1.14.4 on an SQL server:

2020-06-17T15:08:48Z E! [outputs.influxdb] When writing to [http://....:8086]: received error partial write: unable to parse 'sqlserver_requests
...very long statement_text here....
Committed granted_query_memory_pages=0i,request_id=0i,cpu_time_ms=4i,total_elasped_time_ms=4i,writes=0i,logical_reads=858i,objectid=531532977i,blocking_session_id=0i,wait_time_ms=0i,percent_complete=0,open_transaction=0i,session_id=564i 1592424390000000000': max key length exceeded: 82027 > 65535 dropped=0; discarding points

@danielnelson
Contributor

E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

Not sure on this one, could be a Telegraf bug or an issue with the perf counters. How often does it occur?

E! [outputs.influxdb] When writing to [http://...8086]: Post http://...:8086/write?db=SQL: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is a general timeout writing to the output; it may not be something to worry about unless it happens frequently.

E! [inputs.win_perf_counters] Error in plugin: (pdhErr=258) The system cannot find message text for message number 0x%1 in the message file for %2.

This one is new to me; it would be interesting if you could narrow down the reproduction steps for it.

E! [outputs.influxdb] When writing to [http://....:8086]: received error partial write: unable to parse 'sqlserver_requests

This one is the prime candidate for the source of the corrupted data, but I know you said earlier that you disabled the sqlserver plugin and still had the issue. There is some additional discussion about this in #6976 and #7037.

I do recommend either disabling this query with exclude_query = [ 'SqlRequests'] or converting the statement_text tag to a field using the converter processor.
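
Roughly, either option would look like this in telegraf.conf (an untested sketch; adjust to your setup):

[[inputs.sqlserver]]
  # skip the query that produces the very large statement_text tag
  exclude_query = [ 'SqlRequests' ]

[[processors.converter]]
  namepass = ["sqlserver_requests"]
  # convert the statement_text tag into a string field
  [processors.converter.tags]
    string = ["statement_text"]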

@mertbakir
Author

mertbakir commented Jul 6, 2020

Hi @danielnelson,
I'm back with some information. Sorry for the late response.

The first one occurs pretty often. This is just a random sample from a random server; there are hundreds of similar rows.

2020-02-06T03:51:39Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T03:51:45Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T03:58:06Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:14:14Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:21:41Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:23:53Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:54:01Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T04:55:08Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:06:52Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:06:59Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:14:23Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.
2020-02-06T05:24:04Z E! [inputs.win_perf_counters] Error in plugin: error while getting value for counter \Processor(_Total)\% Processor Time: A counter with a negative denominator value was detected.

I'll try to get some insight on the third one.

Lastly, about our prime candidate: I've excluded 'SqlRequests' and now we'll wait and see if the error occurs again. I've checked the logs again; there are no logs about parsing errors. Also, I haven't seen a corrupt measurement/tag in the main bucket, which was separated from the SQL one.

@mertbakir
Author

Update: Yesterday I moved the gateway node back to a container, and today corrupted data (a measurement) occurred again in the non-SQL bucket. I believe it comes from a Windows server; this time the corrupted measurement is "win_pagingfile". This proves it's not SQL related.

@mertbakir
Author

Update: It's been more than a month and the problem hasn't recurred, so I'm going to close this thread. Based on my experiments, it happens when the gateway node runs in a Docker container. Now I run multiple Telegraf instances as services directly on the host, and all is good so far.
