Missing Logs With Python Rotating File Handler And Minimal Log Messages #447
Comments
Thanks for the very succinct repro case. I just reproduced the issue on my own. I'll take a look at this.
As an aside, I tweaked the log generator to publish fewer characters and noticed that the partial log is still reproducible:
Newlines didn't paste well from CloudWatch Logs, but to illustrate: after changing the second log emission to `logging.info(json.dumps({"Metric": "0987"*10}))`, which publishes 10 fewer characters, I observed two main things:
It looks to me like there is potentially an issue with state management within the agent with respect to log file offsets. From the first (unaltered) test, you can see from the info-level logging that when the file gets truncated/rotated, the agent starts reading again at a specified offset. This offset gets stored in a state file:
So what is happening is that the agent starts reading at offset 65, which, as you can probably tell, is the end of the log file in the original test - you only write out 64 bytes. This seems to explain why the second log line does not get published. Then the file gets rotated a second time, and the agent again starts reading from offset 65. This time, when you emit the third log, the agent starts reading from that offset and publishes only the portion of the log line past the 64th byte. This explains the partial third message.

**Altered test**

Now what's interesting is the behavior with my modified test. Agent logs:
It starts reading from offset 115, which is the end of the previous third message. Then the log gets rotated, and the agent starts reading from offset 65 again. This indicates that the first log message was published and then the log got rotated. Then the log gets rotated again and the agent starts reading from offset 55, which is what I expected because I published 10 fewer characters in the generator script. After this, the agent publishes a partial log of the third message, but because it read from a lower offset, the partial message that gets published to CloudWatch includes more of the content. I still need to dig into this more to see if there's something obvious that can be configured to resolve it or if it requires a code change.
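To make the offset arithmetic concrete, here is a minimal Python simulation (hypothetical - not agent code, just a sketch of the behavior described above using stand-in message sizes):

```python
# Simulation of the saved-offset behavior described above (hypothetical,
# not the agent's actual implementation). Each "message" stands in for one
# line written to logs.log between rotations.
def read_from_offset(file_contents: str, saved_offset: int) -> str:
    """Return what a tailer resuming at saved_offset would read."""
    return file_contents[saved_offset:]

first = "x" * 64 + "\n"    # first message, roughly 64 bytes plus newline
second = "y" * 64 + "\n"   # second message, about the same size
third = "z" * 113 + "\n"   # longer third message, roughly 114 bytes

offset = len(first)        # agent saves ~65 after reading the first message

# Rotation 1: logs.log now contains only the second message. Resuming at the
# saved offset lands at the end of the file, so nothing is read.
print(repr(read_from_offset(second, offset)))  # ''

# Rotation 2: logs.log now contains only the third message. Resuming at the
# saved offset skips its first ~65 bytes, so only a partial line is read.
print(repr(read_from_offset(third, offset)))
```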
At this point, it seems to me like there are two potential bugs related to log rotation:
I still want to investigate this a little further, but I think I found the root cause.

**Analyzing repro case logs**

Looking at a snippet of the original repro case logs:
What happens here is:
I assume this is just how the logger works under the hood. Not super important overall, but it confused me for a while until I finally looked at the modified time of the files.

**Analyzing altered case logs**

This is to answer two questions about the altered scenario:
Looking at the logs - this is from the 2nd message of the test:
This is mostly the same; however, there is one additional log line:
What happened here in my altered test scenario was that I modified the second log to emit a message smaller than the first one. Based on this, the agent determined that the file had been truncated, completely separate from the actual log rotation that happened, so it started reading the file from the beginning. That is why, even though there was a log line saying it would start at offset 65, it still consumed the log line that was smaller than 64 bytes. Then, as noted from the CloudWatch Logs output of the second test, the third message still only partially gets uploaded to CloudWatch, though a larger portion of it than before. This is because the agent did not know that the log file had been rotated, so it started reading from offset 55 and consumed part of the third log message as new log content (skipping the first 55 bytes). In this scenario, we don't want the agent to start from an offset: the file got rotated, which means the file got truncated at some point, and as this repro case illustrates, the agent can miss logs. I was digging through the code and found this:
Then in the tail plugin, it looks like it uses this flag to determine whether to reopen a file or to stop: amazon-cloudwatch-agent/plugins/inputs/logfile/tail/tail.go, lines 442 to 455 in 6bc4507.
Looking at the original repro case: what happens here is that when the rotation occurs, this function returns an error, which ultimately closes the file monitor. As far as I can tell, this does not cleanly restart tailing the file. It treats external log rotation as a reason to stop monitoring the file.
The main agent has a long-running process for picking up log files. This is mostly important in the scenario where someone configures a log file path with wildcards in it, in order to pick up new files regularly: amazon-cloudwatch-agent/logs/logs.go, lines 93 to 116 in 6bc4507.
So the reason the agent is able to semi-gracefully handle killing the original tail routine when the file gets truncated is that this separate routine will rediscover the log file and start tailing it again, reading from the saved offset as usual. This is not how we want it to behave in this scenario, though. What we want is to reopen the file and start from the beginning again. I cut a branch and flipped the ReOpen flag. The logs from that test build show that the agent never cleaned up and recreated the tailer process:
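As an illustration of the two behaviors (a hypothetical Python analogue - the agent's real logic lives in tail.go), the difference boils down to what the tailer does when the file it is watching suddenly shrinks below its saved position:

```python
# Hypothetical sketch of a polling tailer illustrating "stop vs. reopen".
# When the watched file shrinks below the saved position, it was truncated
# or rotated: the tailer can either give up, or seek back to offset 0.
import os
import time


def tail(path: str, reopen_on_rotation: bool, iterations: int = 100) -> None:
    position = 0
    for _ in range(iterations):
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            time.sleep(0.1)
            continue

        if size < position:
            if not reopen_on_rotation:
                # Current behavior: stop tailing; a separate discovery loop
                # later re-adds the file and resumes from the saved offset.
                return
            # Desired behavior for this scenario: start over from the top.
            position = 0

        if size > position:
            with open(path, "rb") as f:
                f.seek(position)
                data = f.read()
                position = f.tell()
            for line in data.splitlines():
                print("published:", line.decode(errors="replace"))

        time.sleep(0.1)
```

With `reopen_on_rotation=False`, every rotation tears the tailer down and the rediscovery path resumes at the saved offset, which is exactly where the missing and partial messages come from.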
My biggest concern is that this is definitely a change in behavior - though it could be a change that fixes a longstanding bug that has gone unnoticed. As a sanity check, I ran another simple test where I wrote N logs, one log line per second, and had the logger rotate the file after one minute, to verify that it handled that case, which it did. But I want to do some more testing with this change because it feels like too obvious/minor a change to be such an impactful fix.
Thank you for the excellent deep dive. I agree that the ReOpen flag will handle reopening the log file after it is rotated. Good Stack Exchange post. Hopefully this really is a simple one-line fix...
Also confirmed that the same bug appears on Windows:
That is what gets published to CWL. Now, one interesting thing I saw is that Windows seems to add an extra hidden byte in there. The log line I see in the agent log file on Windows indicates that the offset is 1 byte farther along than on Linux. Ultimately, I got the same result, so it's not too important to drill down on where that extra byte comes from, but I thought it was interesting.
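One possible explanation for that extra byte - purely a guess, not confirmed in this thread - is Windows-style CRLF line endings, which make each written line one byte longer than on Linux:

```python
# Rough illustration (an assumption, not verified here): a line terminated
# with "\r\n" is one byte longer than the same line terminated with "\n",
# which would nudge the saved offset one byte farther along per line.
line = '{"Metric": "1234"}'
print(len((line + "\n").encode()))    # Linux-style line ending
print(len((line + "\r\n").encode()))  # Windows-style: one byte more
```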
This morning I set up two Linux boxes, one with my fixed version and one with the latest release, and started publishing logs to them continually just to make sure there wasn't some obvious regression. I'll work on setting something up on two Windows machines, and once I see that everything looks normal, I'll work on publishing a PR for this fix. I will note that there is one unit test that fails with my change, because that test explicitly checks for the existing behavior where the agent terminates the tailing process when a file gets removed.
I've still been looking into this. Although flipping that ReOpen flag does alleviate this scenario, I realized that this test indicates that what you reproduced was expected behavior: amazon-cloudwatch-agent/plugins/inputs/logfile/logfile_test.go, lines 717 to 721 in 8eb0d14.
The above test explicitly validates that recreating the log file with the same name starts from the saved offset. I need to figure out why this behavior is by design, since it seems counterintuitive to me.
Started up one final test to see how it handles the …
There is potential for logs to go missing and to have partial logging messages when using a rotating file handler within an application that produces very few logs.
The behavior seems to be: when a log file is rotated (e.g. `logfile.log` is renamed to `logfile.log.2022-04-18`) and a new message is written to `logfile.log`, if the byte length of the single new message is >= the byte length of the entire old log file, then the new message will be entirely or partially missing.

We are using the latest version of the CloudWatch Agent, no code modifications have been made, and there is nothing unusual about our environment.
We are currently mitigating this issue just by decreasing the rate at which log files get rotated.
Minimal reproducible example:
CloudWatch agent `config.json`:

`log_generator.py`:
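(The original generator script is not reproduced above, so the following is a hypothetical reconstruction. The handler type, rotation interval, and message contents are assumptions inferred from details elsewhere in this thread: the `logging.info(json.dumps(...))` call, the ~64-byte rotated files, and the minute-based `logs.log.2022-04-18_15-09` rotation suffix.)

```python
# Hypothetical reconstruction of log_generator.py - names, sizes, and the
# rotation interval are guesses based on details mentioned in this thread.
import json
import logging
import time
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler("logs.log", when="M", interval=4)
handler.setFormatter(logging.Formatter("%(message)s"))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.info(json.dumps({"Metric": "12345" * 10}))  # first message, ~64 bytes
time.sleep(5 * 60)                                   # long enough for a rotation

logging.info(json.dumps({"Metric": "67890" * 10}))  # second message, same size
time.sleep(5 * 60)                                   # long enough for another rotation

logging.info(json.dumps({"Metric": "12345" * 20}))  # longer third message, ~114 bytes
```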
CloudWatch Logs:
Notice that the second log message is entirely missing, and that the third message is missing its first 65 bytes.
Log Files on EC2:
`logs.log` (114 bytes):

`logs.log.2022-04-18_15-09` (64 bytes):

`logs.log.2022-04-18_15-05` (64 bytes):