[[input.kapacitor]] Invalid character 'x' in string escape code #7563
Can you attach the output from the link?:
@danielnelson here it is.
I ran jsonlint on the output to get a bit more info:
Here is the section of the file containing the issue, formatted.
It appears that Kapacitor is not properly escaping non-UTF-8 characters in the JSON, and it might be that this value cannot be correctly escaped for JSON at all. https://github.com/influxdata/kapacitor/blob/master/services/httpd/handler.go#L543-L554 I can move this issue over to Kapacitor; we won't be able to fix it in Telegraf. But first let's try to track down where this measurement is coming from. Does it appear in InfluxDB?
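For illustration only, here is a minimal Go sketch (an assumption about the failure mode, not the actual Kapacitor code path) showing how a string containing an invalid UTF-8 byte, once rendered with Go-style escapes, produces exactly the JSON decode error in this issue's title:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A measurement name containing a raw, invalid UTF-8 byte (0x80).
	name := "cpu\x80load"

	// Go-style quoting renders the invalid byte as \x80, which is a legal
	// Go escape but NOT a legal JSON escape sequence.
	quoted := fmt.Sprintf("%q", name) // "cpu\x80load"

	var out string
	err := json.Unmarshal([]byte(quoted), &out)
	fmt.Println(err) // invalid character 'x' in string escape code
}
```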
Oh, I see now it's because of that corrupted measurement. Yes, I see it in InfluxDB. I think what's important here is to find the cause of this measurement.
Agreed. Are you able to show the tags/fields for the measurement? I'm not exactly sure how to escape this in InfluxDB; normally the queries would be:
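The query block itself did not survive here; the standard InfluxQL commands for inspecting a measurement's tags and fields look roughly like this (the measurement name below is only a placeholder):

```sql
-- Placeholder name; the corrupted measurement would need additional escaping.
SHOW TAG KEYS FROM "some_measurement"
SHOW FIELD KEYS FROM "some_measurement"
```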
How often are values for this measurement reported?
10s interval, from around 400-500 nodes. It's the default interval set at the top of the config.
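For reference, this corresponds to the agent-level interval near the top of telegraf.conf, roughly like the sketch below (the value is assumed from the description above, not taken from the actual file):

```toml
[agent]
  ## Default collection interval applied to all inputs.
  interval = "10s"
```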
I can query the measurement, but it returns the working one. There are two of them; one is corrupted and I couldn't manage to query it. These are from the influx terminal:
Today, I got one more corrupt measurement.
@danielnelson
Here is the update, with bad news for me :/ I have another corrupt measurement in the new database. I actually have no clue what could be causing this, so I'll try to explain my setup and what I did step by step.
Where should I check? What do you suggest? Corrupted measurements seem pretty random; it doesn't happen very often. Probably more than a month has passed since the previous occurrence.
Hmm, would you be willing to run a development version of Telegraf with some extra checks and logging?
I am not sure which measurement it should be, but it seems like I can run a dev version. However, I currently don't have the automation to update all agents. Should I look into logs from all Telegraf instances? Maybe I should collect Telegraf logs into InfluxDB.
Is there a chance that this happens at the Telegraf gateway? Because it doesn't occur in the same measurement each time. Also, it happened in ... I'm just laying out my thoughts here. Does that sound reasonable?
Here are some builds of #7696, based on 1.14.4. It includes UTF-8 checking whenever metrics are created; if it finds an invalid byte sequence, it will be replaced with the replacement character and a message will be logged. There are still other places the corrupted data could be coming from, but this is the most likely place to catch it. I'll have to think about exactly how/if we can roll this out to an official version, since I know some people depend on being able to send binary data. Can you start by running this on your gateway node and keeping an eye out for log messages?
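As a rough illustration of the kind of check described (a sketch, not the actual patch in #7696; the helper name is hypothetical), invalid UTF-8 can be detected and replaced with the Unicode replacement character like this:

```go
package main

import (
	"log"
	"strings"
	"unicode/utf8"
)

// sanitize is a hypothetical helper: it replaces any invalid UTF-8 byte
// sequence in s with the Unicode replacement character (U+FFFD) and logs
// when it had to do so.
func sanitize(field, s string) string {
	if utf8.ValidString(s) {
		return s
	}
	log.Printf("E! invalid UTF-8 in %s: %q", field, s)
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	name := "win_pagingfile\x80" // example of a corrupted measurement name
	log.Println(sanitize("measurement", name))
}
```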
@danielnelson should I set ...? Current setting:
I want to update my logic here.
It's not required; the new errors are logged at error level. It could definitely be related to sqlserver or even win_perf_counters.
I've got no new errors about UTF-8 parsing from the new Telegraf agent. It was deployed on 2 SQL servers and the gateway node. Yet, I found some other errors. This occurs on all Windows servers:
From some of the SQL servers:
This is from 1.14.4 on an SQL server:
Not sure on this one, could be a Telegraf bug or an issue with the perf counters. How often does it occur?
This is a general timeout writing to the output, may not be something to worry about unless it happens frequently.
I think this one is new to me, would be interesting if you could narrow down the reproduction steps for this.
This one is a prime candidate for the source of the corrupted data, but I know you said earlier that you disabled the sqlserver plugin and still had the issue. There is some additional discussion about this in #6976 and #7037. I do recommend either disabling this query with ...
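Judging by the later comment about excluding 'SqlRequests', the option being referred to is most likely the sqlserver input's exclude_query setting; a minimal sketch:

```toml
[[inputs.sqlserver]]
  ## Skip the query suspected of returning corrupted data.
  exclude_query = ["SqlRequests"]
```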
Hi @danielnelson, the first one occurs pretty often. This is just a random sample from a random server; there are hundreds of similar rows.
I'll try to get some insight on the third one. Lastly, about our prime candidate: I've excluded 'SqlRequests', and now we wait and see if the error occurs again. I've checked the logs again; there are no logs about parsing errors. Also, I haven't seen a "corrupt measurement/tag" in the main bucket, which was separated from SQL.
Update: Yesterday, I moved the gateway node back to a container, and today corrupted data (a measurement) occurred again in the non-SQL bucket. I believe it comes from a Windows server; this time the corrupted measurement is "win_pagingfile". This proves it's not SQL related.
Update: It's been more than a month and the problem hasn't re-occurred, so I'm going to close this thread. Based on my experiments, it happens when the gateway node is running in a Docker container. Now I run multiple Telegraf instances as a service directly on the host, and all is good so far.
I've tried to use the kapacitor input plugin and get the error below. I couldn't find a way to solve it. I think it's about Kapacitor's JSON output; I do get the JSON output when manually typing the URL into a browser.
Relevant telegraf.conf:
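The config snippet did not survive the formatting; a typical [[inputs.kapacitor]] section looks roughly like this (the URL shown is the plugin's documented default, not necessarily this setup's actual value):

```toml
[[inputs.kapacitor]]
  ## Kapacitor debug/vars endpoints to poll.
  urls = ["http://localhost:9092/kapacitor/v1/debug/vars"]
  ## HTTP request timeout.
  timeout = "5s"
```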
Telegraf Log
System info:
CentOS 7.6
Telegraf: 1.14.2
Kapacitor: 1.5