
Missing Logs With Python Rotating File Handler And Minimal Log Messages #447

Closed
akozd opened this issue Apr 18, 2022 · 8 comments · Fixed by #457
Labels: bug (Something isn't working)

akozd commented Apr 18, 2022

Logs can go missing, or be delivered only partially, when using a rotating file handler within an application that produces very few log messages.

The behavior seems to occur when a log file is rotated (e.g., logfile.log is renamed to logfile.log.2022-04-18) and a new message is written to logfile.log: if the byte length of that single new message is >= the byte length of the entire old log file, the new message will be entirely or partially missing.

We are using the latest version of the CloudWatch Agent, no code modifications have been made, and there is nothing unusual about our environment.

We are currently mitigating this issue by decreasing the rate at which log files get rotated.

Minimal reproducible example:

CloudWatch agent config.json:

{
  "agent": {
    "run_as_user": "root",
    "credentials": {
      "role_arn": "arn:aws:iam::------------:role/EC2AdminRole"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/home/ec2-user/amazon-cloudwatch-agent/logdest/logs.log*",
            "log_group_name": "custom-agent-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

log_generator.py:

import logging
from logging.handlers import TimedRotatingFileHandler
import time
import json

# get root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# rotate our log file every 3 minutes
handler = TimedRotatingFileHandler("logdest/logs.log", when="M", interval=3)
logger.addHandler(handler)

# log a message
logging.info(json.dumps({"Metric": "12345"*10}))
# sleep for 4 minutes so that file will rotate upon next log message
time.sleep(60*4)
# log another message (this one will not appear since byte length of message == byte length of old log file)
logging.info(json.dumps({"Metric": "09876"*10}))
# sleep for another 4 minutes so that file will rotate upon next log message
time.sleep(60*4)
# this message will be partially written (logged as a dict, not JSON, hence the single quotes in the output)
logging.info({"Metric": "1234567890"*10})

CloudWatch Logs:
Notice that the second log message is entirely missing, and that the third message is missing its first 65 bytes.

Timestamp | Message | Bytes
2022-04-18T11:04:49.539-04:00 | {"Metric": "12345123451234512345123451234512345123451234512345"} | 64
2022-04-18T11:13:21.539-04:00 | 45678901234567890123456789012345678901234567890'} | 49

Log Files on EC2:
logs.log (114 bytes):

{'Metric': '1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'}

logs.log.2022-04-18_15-09 (64 bytes):

{"Metric": "09876098760987609876098760987609876098760987609876"}

logs.log.2022-04-18_15-05 (64 bytes):

{"Metric": "12345123451234512345123451234512345123451234512345"}
@SaxyPandaBear SaxyPandaBear added bug Something isn't working status/investigate labels Apr 18, 2022
@SaxyPandaBear SaxyPandaBear self-assigned this Apr 19, 2022

SaxyPandaBear (Contributor) commented:

Thanks for the very succinct repro case. I just reproduced the issue on my own. I'll take a look at this.

SaxyPandaBear (Contributor) commented:

As an aside, I tweaked the log generator to publish less data and noticed that the partial log is still reproducible:

{     "Metric": "12345123451234512345123451234512345123451234512345" }
--
{     "Metric": "0987098709870987098709870987098709870987" }
--
456789012345678901234567890123456789012345678901234567890'}

Newlines didn't paste well from CloudWatch Logs, but to illustrate: after changing the second log emission to

logging.info(json.dumps({"Metric": "0987"*10}))

so that it publishes 10 fewer characters, I observed two main things:

  1. It published the second line in full
  2. The truncated third log has more content in it (10 more chars).

It looks to me like there is an issue with state management within the agent with respect to log file offsets.

From the first test (unaltered)

2022-04-19T14:11:00Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T14:11:10Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 190.261007ms before retrying.
2022-04-19T14:15:00Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T14:15:00Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T14:15:00Z I! [inputs.logfile] Reading from offset 65 in /home/ec2-user/logs.log
2022-04-19T14:15:00Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T14:19:00Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T14:19:00Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T14:19:00Z I! [inputs.logfile] Reading from offset 65 in /home/ec2-user/logs.log
2022-04-19T14:19:00Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14

You can see from the info-level logging that when the file gets truncated/rotated, the agent starts reading again at a saved offset. This offset gets stored in a state file:

[ec2-user@ip-172-31-57-211 ~]$ ls -al /opt/aws/amazon-cloudwatch-agent/logs/state
total 4
drwxr-xr-x 2 root root 37 Apr 19 14:11 .
drwxr-xr-x 3 root root 90 Apr 19 14:10 ..
-rw-r--r-- 1 root root 27 Apr 19 16:27 _home_ec2-user_logs.log

So what is happening is that the agent starts reading at offset 65, which, as you can probably tell, is the end of the log file in the original test: you only write out 64 bytes.

This seems to explain why the second log line does not get published.

Then the file gets rotated a second time, and the agent again starts reading from the same offset, 65. This time, when you emit the third log, the agent reads from there and publishes only the portion of the log line past the 64th byte, which explains the partial third message.
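
To make this concrete, here is a minimal, hypothetical sketch (not the agent's actual code) of what resuming from a saved byte offset does to a freshly recreated file:

// offset_sketch.go - a minimal, hypothetical illustration (not the agent's
// actual code) of why resuming from a saved offset drops data after an
// external rotation recreates the file.
package main

import (
	"fmt"
	"io"
	"os"
)

// readFromOffset returns everything in the file after the given byte offset.
func readFromOffset(path string, offset int64) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	// Seek to the offset recorded in the state file. If the recreated file is
	// no larger than the offset, the seek lands at or past EOF and nothing is
	// read, so the new log line is silently skipped.
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return "", err
	}
	data, err := io.ReadAll(f)
	return string(data), err
}

func main() {
	// With the 65-byte logs.log from the repro and a saved offset of 65, this
	// reads zero bytes: the second message never gets published. With the
	// larger third message, only the bytes past offset 65 come back.
	content, err := readFromOffset("logs.log", 65)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes past the offset: %q\n", len(content), content)
}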

Altered test

Now what's interesting is the behavior with my modified test.

Agent logs:

2022-04-19T16:19:03Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T16:19:04Z I! [inputs.logfile] Reading from offset 115 in /home/ec2-user/logs.log
2022-04-19T16:19:04Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T16:19:05Z I! [inputs.tail] Re-opening truncated file /home/ec2-user/logs.log ...
2022-04-19T16:23:03Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T16:23:03Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T16:23:04Z I! [inputs.logfile] Reading from offset 65 in /home/ec2-user/logs.log
2022-04-19T16:23:04Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T16:23:05Z I! [inputs.tail] Re-opening truncated file /home/ec2-user/logs.log ...
2022-04-19T16:27:04Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T16:27:04Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T16:27:04Z I! [inputs.logfile] Reading from offset 55 in /home/ec2-user/logs.log
2022-04-19T16:27:04Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14

It starts reading from offset 115, which is the end of the previous third message.

Then the log gets rotated, and the agent starts reading from offset 65 again. This indicates that the first log message was published and then the log got rotated. Then the log gets rotated again and the agent starts reading from offset 55, which is what I expected because I published 10 fewer characters in the generator script.

After this, the agent publishes a partial log of the third message, but because it read from a lower offset, the partial message that gets published to CloudWatch includes more content in it.


Still need to dig into it more to see if there's just something obvious that can be configured to resolve this or if this requires a code change.

SaxyPandaBear (Contributor) commented:

At this point, it seems to me like there are two potential bugs related to log rotation:

  1. Your specific scenario where that second log in your repro case gets dropped because the size of the log == the agent state file offset
  2. The partial message that only publishes the end of the log line after the offset.

SaxyPandaBear commented Apr 20, 2022

I still want to investigate this a little further, but I think I found the root cause.

Analyzing repro case logs

Looking at a snippet of the original repro case logs:

2022-04-19T14:15:00Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T14:15:00Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T14:15:00Z I! [inputs.logfile] Reading from offset 65 in /home/ec2-user/logs.log
2022-04-19T14:15:00Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14

What happens here is:

Log | Explanation
"Stopping tail as file no longer exists: /home/ec2-user/logs.log" | The log file gets rotated*
"Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)" | The tailing goroutine for the log file is killed/cleaned up
"Reading from offset 65 in /home/ec2-user/logs.log" | The agent reads the state file for this log to figure out where to seek in the file and continue monitoring
"piping log from custom-agent-logs/i-0e3adad213edfd7b5" | The agent starts reading from the log file again**

* - One thing that tripped me up for a while was that in the repro case, even though the logger is configured to rotate every 3 minutes, the rotation doesn't actually occur until the next log line gets emitted:

-rw-rw-r-- 1 ec2-user ec2-user      115 Apr 19 22:02 logs.log
-rw-rw-r-- 1 ec2-user ec2-user     3900 Apr 19 20:39 logs.log.2022-04-19_20-38
-rw-rw-r-- 1 ec2-user ec2-user     3900 Apr 19 20:40 logs.log.2022-04-19_20-40
-rw-rw-r-- 1 ec2-user ec2-user       65 Apr 19 21:54 logs.log.2022-04-19_21-54
-rw-rw-r-- 1 ec2-user ec2-user       65 Apr 19 21:58 logs.log.2022-04-19_21-58

I assume this is just how the logger works under the hood (TimedRotatingFileHandler only performs the rollover when the next record is emitted, not on a background timer). Not super important overall, but something that confused me for a while until I finally looked at the modified times of the files.
** - Will get into this deeper, but essentially, the agent does not gracefully handle the log rotation in this scenario and relies on something else to pick up the file again after the previous routine gets killed.

Analyzing altered case logs

This is to answer two questions about the altered scenario:

  1. Why did the second message get published in this scenario?
  2. Why did the third message still get truncated?

Looking at the logs - this is from the 2nd message of the test:

2022-04-19T16:23:03Z W! [inputs.tail] Stopping tail as file no longer exists: /home/ec2-user/logs.log
2022-04-19T16:23:03Z I! [logagent] Log src has stopped for custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log)
2022-04-19T16:23:04Z I! [inputs.logfile] Reading from offset 65 in /home/ec2-user/logs.log
2022-04-19T16:23:04Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T16:23:05Z I! [inputs.tail] Re-opening truncated file /home/ec2-user/logs.log ...

This is mostly the same, however there is one additional log line:

Re-opening truncated file /home/ec2-user/logs.log ...

What happened here in my altered test scenario was that I modified the second log to emit a message smaller than the first one. Based on that, the agent determined that the file had been truncated at some point, completely separate from the actual log rotation that happened, so it started reading the file from the beginning. That is why, even though there was a log line saying it would start at offset 65, it still consumed the log line that was < 64 bytes.

Then, as noted from the CloudWatch Logs output for the second test, the third message still only partially gets uploaded to CloudWatch, though a larger portion of it than before. This is because the agent did not know that the log file had been rotated, so it started reading from offset 55 and treated only the bytes past that point as new log content (skipping the first 55 bytes of the third message).
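
To illustrate the difference between the two tests, here is a rough, hypothetical sketch of the usual size-versus-offset heuristic a tailer uses to detect truncation (not the agent's actual code):

// truncation_sketch.go - a rough, hypothetical sketch of the common heuristic
// a tailer uses to detect truncation: the file is now smaller than the number
// of bytes already consumed. Not the agent's actual code.
package main

import (
	"fmt"
	"os"
)

// wasTruncated reports whether the file shrank below the consumed offset.
func wasTruncated(path string, consumedOffset int64) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return info.Size() < consumedOffset, nil
}

func main() {
	// Altered test: saved offset 65, new file is 55 bytes -> truncation is
	// detected, the tailer re-reads from the start, and the second message
	// gets published in full.
	//
	// Original test: saved offset 65, new file is 65 bytes -> no truncation is
	// detected, so the second message is skipped entirely.
	truncated, err := wasTruncated("logs.log", 65)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("file looks truncated:", truncated)
}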


** Picking the second footnote back up: in this scenario, we don't want the agent to start from an offset at all. The file got rotated, which means it was effectively truncated at some point, and as this repro case illustrates, the agent can miss logs because of it.

I was digging through the code and found this ReOpen boolean flag, which originates from the forked telegraf tail plugin and isn't exposed as a configuration option in the CloudWatch agent.

In the tail plugin, this flag determines whether to reopen the file or to stop tailing it:

case <-tail.changes.Deleted:
	tail.changes = nil
	if tail.ReOpen {
		tail.Logger.Infof("Re-opening moved/deleted file %s ...", tail.Filename)
		if err := tail.reopen(); err != nil {
			return err
		}
		tail.Logger.Debugf("Successfully reopened %s", tail.Filename)
		tail.openReader()
		return nil
	} else {
		tail.Logger.Warnf("Stopping tail as file no longer exists: %s", tail.Filename)
		return ErrDeletedNotReOpen
	}

Looking at the original repro case, this Stopping tail log is the first log line that shows up after a rotation occurs.

What happens here is that when the rotation occurs, this function returns an error, which ultimately closes the file monitor. As far as I can tell, this does not cleanly restart tailing the file; the agent treats the external log rotation as a reason to stop monitoring the file.

But then what is causing the file to be read from again?

The main agent has a long-running process for picking up log files. This is mostly important in the scenario where someone configures a log file path with wildcards in it, in order to pick up new files regularly:

t := time.NewTicker(time.Second)
defer t.Stop()
for {
	select {
	case <-t.C:
		for _, c := range l.collections {
			srcs := c.FindLogSrc()
			for _, src := range srcs {
				dname := src.Destination()
				backend, ok := l.backends[dname]
				if !ok {
					log.Printf("E! [logagent] Failed to find destination %v for log source %v/%v(%v) ", dname, src.Group(), src.Stream(), src.Description())
					continue
				}
				dest := backend.CreateDest(src.Group(), src.Stream(), src.Retention())
				l.destNames[dest] = dname
				log.Printf("I! [logagent] piping log from %v/%v(%v) to %v with retention %v", src.Group(), src.Stream(), src.Description(), dname, src.Retention())
				go l.runSrcToDest(src, dest)
			}
		}
	case <-ctx.Done():
		return
	}
}

So the reason the agent is able to semi-gracefully handle killing the original tail routine when the file gets truncated is that this separate routine rediscovers the log file and starts tailing it again, reading from the saved offset as usual.

This is not how we want it to behave in this scenario though. What we want is to reopen the file and start from the beginning again.

I cut a branch, flipped the ReOpen flag to true, and that one-line code change resolves the repro case. I was able to publish all 3 logs, in full, with that change.
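
For reference, here is a sketch of roughly what that one-line change looks like, assuming the forked tail plugin keeps the same Config shape as the upstream hpcloud/tail library it came from (the agent's actual call site may differ):

// A hypothetical sketch of the fix, using the upstream hpcloud/tail API that
// the forked telegraf plugin is based on; the agent's actual code may differ.
package tailsketch

import (
	"io"

	"github.com/hpcloud/tail"
)

func openTailer(filename string, savedOffset int64) (*tail.Tail, error) {
	return tail.TailFile(filename, tail.Config{
		Follow: true,
		// The one-line change: follow the file by name (tail -F semantics) so
		// an externally rotated/recreated file gets reopened instead of the
		// tailer stopping with ErrDeletedNotReOpen.
		ReOpen:   true,
		Location: &tail.SeekInfo{Offset: savedOffset, Whence: io.SeekStart},
	})
}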

The logs from that test build show that the agent no longer tears the tailer down and recreates it; it simply reopens the moved/deleted file:

2022-04-19T19:30:03Z I! [logagent] starting
2022-04-19T19:30:03Z I! [logagent] found plugin cloudwatchlogs is a log backend
2022-04-19T19:30:03Z I! [logagent] found plugin logfile is a log collection
2022-04-19T19:30:04Z I! [inputs.logfile] Reading from offset 115 in /home/ec2-user/logs.log
2022-04-19T19:30:04Z I! [logagent] piping log from custom-agent-logs/i-0e3adad213edfd7b5(/home/ec2-user/logs.log) to cloudwatchlogs with retention 14
2022-04-19T19:30:41Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T19:30:51Z I! [outputs.cloudwatchlogs] First time sending logs to custom-agent-logs/i-0e3adad213edfd7b5 since startup so sequenceToken is nil, learned new token:(0xc0007f7a90): The given sequenceToken is invalid. The next expected sequenceToken is: 49620584302281720869295547825648965512295105266261690210
2022-04-19T19:30:51Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 173.340515ms before retrying.
2022-04-19T19:34:41Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T19:38:41Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T20:38:10Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T20:39:30Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T21:54:59Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T21:58:59Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...
2022-04-19T22:02:59Z I! [inputs.tail] Re-opening moved/deleted file /home/ec2-user/logs.log ...

My biggest concern is that this is definitely a change in behavior - though it could be a change in behavior that fixes a longstanding bug that has gone unnoticed.

As a simple sanity test, I ran another test where I wrote N logs, one log line per second, and had the logger rotate the file after 1 minute, to verify that it handled that case, which it did. But I want to do some more testing with that change because it feels like too obvious/minor a change to be such an impactful fix.

akozd commented Apr 20, 2022

Thank you for the excellent deep dive. I agree that the ReOpen flag should be set to true in the case of rotating logs. I see that when it is true, this flag makes the tailer mimic the tail -F command on Linux:

ReOpen bool // Reopen recreated files (tail -F)

This handles reopening the log file after it is rotated. There is a good Stack Exchange post on the difference.

Hopefully this really is a simple one-line fix...

SaxyPandaBear commented Apr 20, 2022

Also confirmed that the same bug appears on Windows:

{     "Metric": "12345123451234512345123451234512345123451234512345" }
--
5678901234567890123456789012345678901234567890'}

That is what gets published to CloudWatch Logs. Now, one interesting thing I saw is that Windows seems to add an extra hidden byte in there (presumably the carriage return in the CRLF line ending).

The log line I see in the agent log file on Windows indicates that the offset is 1 byte farther along than on Linux. Ultimately, I got the same result, so it's not too important to drill down on where that extra byte comes from, but I thought it was interesting.

2022-04-20T20:25:07Z I! [inputs.logfile] Reading from offset 66 in C:\Users\Administrator\Desktop\logs.log

This morning I set up two Linux boxes, one with my fixed version and one with the latest release, and started publishing logs to them continually just to make sure there wasn't some obvious regression. I'll work on setting something up on two Windows machines, and once everything looks normal, I'll work on publishing a PR for this fix.

I will note that there is one unit test that fails with my change, because that unit test explicitly checks for the existing behavior where the agent terminates the tailing process when a file gets removed.

SaxyPandaBear (Contributor) commented:

I've still been looking into this. Although flipping that ReOpen flag does alleviate the issue for this scenario, I realized that this test indicates that what you reproduced was the expected behavior:

// TestLogsFileRecreate verifies that if a LogSrc matching a LogConfig is detected,
// We only receive log lines beginning at the offset specified in the corresponding state-file.
// And if the file happens to get deleted and recreated we expect to receive log lines beginning
// at that same offset in the state file.
func TestLogsFileRecreate(t *testing.T) {

The above test explicitly validates that recreating the log file with the same name starts reading from the saved offset. I need to figure out why this behavior was chosen by design, since it seems counterintuitive to me.

SaxyPandaBear (Contributor) commented:

Started up one final test to see how the fix interacts with the auto_removal flag. The fix is a mild change in behavior, but I don't foresee it being customer-impacting, pending the results of this last test. I am drafting a PR with some test code cleanup for this.
