Change behavior for handling file offsets for rotated files #455
Conversation
Codecov Report
@@ Coverage Diff @@
## master #455 +/- ##
==========================================
+ Coverage 57.28% 57.62% +0.34%
==========================================
Files 370 358 -12
Lines 17335 16455 -880
==========================================
- Hits 9931 9483 -448
+ Misses 6819 6426 -393
+ Partials 585 546 -39
Continue to review full report at Codecov.
Not sure why this only failed the second time that the test recreated the log file, but I will dig into this. The assertion comes from checking for a nil error from:
err := os.Remove(origFile.Name())
require.NoError(t, err)
So my assumption is that Windows just doesn't want to play nicely with me.
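As an aside, a common way to make this kind of assertion less flaky on Windows is to retry the removal briefly before failing the test. A minimal sketch of such a helper (hypothetical, not part of this PR; assumes the standard os, time, testing, and testify require packages):

// removeWithRetry retries os.Remove for a short window, since Windows can
// keep a handle open briefly after the tailer closes the file.
// Hypothetical test helper, not code from this PR.
func removeWithRetry(t *testing.T, name string) {
    t.Helper()
    var err error
    for i := 0; i < 10; i++ {
        if err = os.Remove(name); err == nil {
            return
        }
        time.Sleep(100 * time.Millisecond)
    }
    require.NoError(t, err)
}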
Weird. It looks like the first two messages use
Successfully ran the integ test
Interesting.
Not sure why
noting this for myself as I've seen this fail before, but not sure if I made the test flakier
numTries += 1
waitDuration *= 2
continue
}
100ms + 200ms + 400ms + 800ms + 1600ms = 3100ms. I think that is fine since it is only for an edge case on the Windows OS. Otherwise I'd suggest adding checks on tail.Dying() or tail.Err() in any long-running functions/methods and returning early.
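For context, the retry shape under discussion is roughly the following. This is a sketch based on this review thread, not the exact diff: the permission check, the cap of five attempts, and the 100ms starting wait are assumptions.

// openWithRetry backs off and retries on permission errors, since Windows
// can briefly deny access to a file that is being recreated.
// Sketch only; names and limits are assumptions, not the actual diff.
func openWithRetry(filename string) (*os.File, error) {
    waitDuration := 100 * time.Millisecond
    for numTries := 0; numTries < 5; numTries++ {
        file, err := os.Open(filename)
        if err == nil {
            return file, nil
        }
        if !os.IsPermission(err) {
            return nil, err
        }
        // Worst case across retries: 100ms + 200ms + 400ms + 800ms + 1600ms = 3100ms.
        time.Sleep(waitDuration)
        waitDuration *= 2
    }
    return nil, fmt.Errorf("gave up waiting for %s to become readable", filename)
}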
It's a little tricky since Dying() and Err() are channel receivers, but if you feel strongly about it, I could rewrite it a little bit with a select, like:
select {
case <-tail.Dying():
    return someError
case <-tail.Err():
    return someError
case <-time.After(waitDuration):
    continue
}
numTries += 1
waitDuration *= 2
continue
}
if os.IsNotExist(err) {
    tail.Logger.Debugf("Waiting for %s to appear...", tail.Filename)
    if err := tail.watcher.BlockUntilExists(&tail.Tomb); err != nil {
So does this mean the tail never dies now? If the user is monitoring /tmp/foo*.log and continuously creates files like:
foo1.log
foo2.log
foo3.log
...
Even if the old files get deleted, will the tail live on, stuck in BlockUntilExists()?
I just noticed this myself and am fixing it.
Let me back up.
"Even if the old files get deleted will the tail live on, stuck in BlockUntilExists()?"
I was concerned about this myself. That was partly why I had the last set of metrics I provided in the PR description. Let me pull debug logs from both of those hosts.
What I was referring to "fixing" just now is that I messed up the retry on file permission errors.
On the host with my change, it didn't hold on to the file for tailing - though it is notable that I did set the auto_removal flag for it. The debug logs are normal:
2022-04-27T14:05:26Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
and don't include that "Waiting to reopen" debug log. Let me spin up a new test where I don't set auto_removal to see what happens.
I would expect that without auto_removal, this function would wait forever for the file to get recreated, and users could see an increase in CPU and/or memory usage due to an increase in goroutines. Can you try testing without auto_removal?
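One way to check that expectation directly is to watch the process's goroutine count while files are rotated and deleted; if tailers get stuck in BlockUntilExists(), the count should grow steadily. A standalone sketch (illustrative only; the agent may already expose equivalent telemetry):

package main

import (
    "log"
    "runtime"
    "time"
)

// Periodically log the goroutine count; a leak from stuck tailers would
// show up as steady growth over time.
func main() {
    for range time.Tick(30 * time.Second) {
        log.Printf("goroutines: %d", runtime.NumGoroutine())
    }
}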
That's what I expect too. The question is the impact. I will repurpose some hosts I used for this to see what happens when I constantly rotate and append to a new file - I already have hosts set up that rotate and reuse a file, but I don't have anything set up to do something like foo1.log, foo2.log, ... fooN.log.
What I'm doing now to validate is using this script to generate 10,000 log files:
import logging
import time

logger = logging.getLogger('test')
logger.setLevel(logging.INFO)

def write_batch(logger, batch):
    for i in range(100):
        logger.info("This is the %dth batch: %d", batch, i)
        time.sleep(0.001)

for batch in range(1, 10001):
    file_name = 'foo.log.' + str(batch)
    fh = logging.FileHandler(file_name)
    logger.addHandler(fh)
    write_batch(logger, batch)
    logger.removeHandler(fh)
    time.sleep(0.1)
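For a sense of scale: each file only has its handler attached during its own batch, so each of the 10,000 files gets 100 lines (roughly a million log lines total), and the sleeps alone put the full run at a bit over half an hour.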
I set auto_removal to false on both test hosts.
After all of the log files get written, CPU/memory spike up on both hosts, which makes sense because the agent is now tailing 10k log files, though the log files aren't being updated after the script finishes.
Now I'm going through and deleting some of the log files, e.g. rm foo.log.1* to delete roughly 1,000 of them, and monitoring the CPU/memory usage as I remove the monitored log files. Basically, I want to see how drastically the two hosts diverge in held CPU and memory when I delete log files that the agent constantly tries to reopen.
Will post my findings here as I get more data.
I'm not very happy with my findings. Only sharing the process CPU usage because the other metrics align pretty closely despite the change in behavior. This is something I would want to drill down on before merging this change.
Something worth noting - as I was looking at existing unit test behavior, I saw that there is a timer for cleaning up a tail process if the file is closed. This is controlled by the exitOnDeletion() function, which, with my change here, is currently not executed:
if !config.ReOpen {
    go t.exitOnDeletion()
}
Based on the existing durations for this monitoring, which max out at 5 minutes, I think it is safe to remove the if check here and just always run this goroutine. I'll make that change and the subsequent unit test changes, and then redo this test.
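For readers following the thread, the cleanup being described is a goroutine that watches for the tailed file disappearing and stops the tailer if it stays gone. A rough sketch under stated assumptions (the polling interval, the 5-minute window, and the Tail/tomb method names are guesses based on the snippets above, not the agent's actual exitOnDeletion implementation):

// Sketch only: stop the tailer once its file has been gone for ~5 minutes.
// The interval, window, and receiver type are illustrative assumptions.
func (t *Tail) exitOnDeletion() {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    var missingSince time.Time
    for {
        select {
        case <-t.Dying():
            return
        case <-ticker.C:
            if _, err := os.Stat(t.Filename); os.IsNotExist(err) {
                if missingSince.IsZero() {
                    missingSince = time.Now()
                } else if time.Since(missingSince) > 5*time.Minute {
                    t.Kill(nil) // file has been gone the whole window; stop tailing
                    return
                }
            } else {
                missingSince = time.Time{} // file exists (or other error); reset
            }
        }
    }
}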
Forgot to update on this: what happens is that exitOnDeletion() never detects that the file is deleted, so this does nothing to resolve the issue. At this point, I feel like this was wasted effort and I need to pivot. I have test hosts and monitoring set up, so it should be quick to get things set up to monitor for leaks in whatever other solution I come up with, but unfortunately for @akozd, I don't think this is a viable solution without serious degradation.
Closing this out for #457, which solves this issue without modifying the
Description of the issue
Closes #447
Description of changes
Updated the reopen function to retry on permission failures, to account for Windows being slow to recreate files - this is noticeable in unit tests on Windows, but was not noticeable in manual/soak testing on Windows test hosts.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
I manually reproduced the issue on a Linux (AL2) and Windows EC2, and also built the agent locally and verified that the code change fixes the issue.
After this, I set up pairs of hosts - one host with the latest release of the agent (v1.247350) and one with my custom build.
For each of these pairs of hosts, I performed a simple test:
After I validated all of this and set up the monitoring dashboard for these hosts, I wrote an integration test to reproduce the issue, in order to verify that my code change resolves the issue via an automated integration test.
logging code to validate auto removal
For this, I had the agent monitor /home/ec2-user/foo.log* so it would pick up the latest log file each time I stopped writing to the old one, and validated that the agent deleted the files foo.log.1 and foo.log.2.
Dashboard for metrics
Integration testing
Added an integration test to accurately reproduce the scenario provided in #447. I tweaked the original script to rotate more quickly, since there is a strict 15 minute timeout currently for the integration tests. The source Python script is located in
integration/test/cloudwatchlogs/resources/write_and_rotate_logs.py
and is invoked directly in the integration test, which is why Python is now a required dependency in the integration test AMIs with this change.
Output of integration test with ReOpen change reverted (set back to false)
Note that I included some tail output from the agent log file to see what happened, and it shows the same output as the repro case. Also, you can see that the two logs (instead of the expected 3) queried from CloudWatch Logs match the expected outcome of the repro case, where the first log line is fully visible, the second log line is completely missing, and the third log line is published truncated.
Output of integration test with change
https://github.com/SaxyPandaBear/amazon-cloudwatch-agent/actions/runs/2229185212
From the AL2 test in this integration test run:
Requirements
Before committing the code, please do the following steps:
make fmt and make fmt-sh
make linter