-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More detail in Telegraf logs for log parser errors #3187
Comments
Sorry about that, what the log message is referring to is that one of the metrics collected has a tag value that ends with a backslash. This value is impossible to encode in line protocol due to some unfortunate escaping rules. I'll add a few more details to the log message. I don't think its very likely that this error caused the crash, it does indicate that the logfiles being parsed have complex escaping though. Its most likely that there was a line in your logfile that triggered an error in Telegraf, I'd like to try to find which line this was. You should be able to use the Telegraf log to determine the approximate time in your logfile. Can you copy the log files it was processing to a location where they will not be modified and run a instance of telegraf with only your logparser configuration with |
Alright... so I processed every single log file individually, and none of them resulted in a panic. So this is clearly not helpful haha... so perhaps something external caused it? I'm not sure... |
I have had the same sort of trouble in this piece of code before. Since we use a go map reproducing the error can depend on the ordering of the random ordering that it is iterated over. Do you think you could try running it multiple times, perhaps 10 times? |
Just so I'm clear on what you want me to try... You want me to iterate 10 times on each file individually? Or just iterate through all files in one shot over and over again? I would think the former? Cause otherwise, the issue is we don't even know which file the error is happening in, right? |
Ultimately, the goal is to use the process of elimination to narrow it down to a small set of lines or even the exact line, but first lets just confirm we can reproduce it. So if doing them in one shot is easier I think you could start with that, but after reproducing it we will want to narrow it down by doing it one file at a time. |
I think I have found the cause of this crash, will try to have a fix tonight. |
HUZZAH!!!!!! I was going to comment back in the morning... I was running the tests and replicated the failure a couple times, and was trying to narrow it down to which file was causing it... it's slow going haha... so this is good news :) Thanks! |
Since you were able to reproduce a few times, maybe you can try to trigger it again once rc4 is out? |
Yeah, absolutely!! Will give it a whirl as soon as it's out! |
Here are the official builds of rc4: https://github.com/influxdata/telegraf/releases/tag/1.4.0-rc4 |
Looks like the fix must've worked... I've re-run the test a number of times now, and haven't been able to reproduce the panic. Awesome! Thanks so much! |
Thanks for doing all that testing. |
Bug report
I just got the below error in my telegraf log, and given that I have multiple log parsers configured in my telegraf.conf, it's fairly difficult to figure out which log parser it's talking about.
Also, I looked through the logs to find a record where there might be a trailing backslash, and I couldn't find one... so not sure what this error is talking about, but ultimately, this error caused Telegraf to die, and so I lost lots of data overnight.
It would be great to have just a little bit more information in the standard logging.
Relevant telegraf.conf:
System info:
Ubuntu 14.04
Telegraf 1.4.0-RC2
[Include Telegraf version, operating system name, and other relevant details]
Steps to reproduce:
Expected behavior:
Actual behavior:
Additional info:
[Include gist of relevant config, logs, etc.]
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
Current behavior:
Desired behavior:
Use case: [Why is this important (helps with prioritizing requests)]
The text was updated successfully, but these errors were encountered: