
Change behavior for handling file offsets for rotated files #455

Closed

Conversation

SaxyPandaBear
Contributor

@SaxyPandaBear SaxyPandaBear commented Apr 26, 2022

Description of the issue

Closes #447

Description of changes

  • Updates the default behavior of our tail plugin configuration to automatically attempt to reopen a file after it is deleted. With this change, the agent reads the recreated file from the beginning rather than resuming from the saved offset, so we don't miss log lines (see the sketch after this list).
  • Updated unit tests that validated the previous behavior, and refactored the Recreate test, breaking it up into smaller functions since it was very long.
  • Added an integration test based on the repro case provided in Missing Logs With Python Rotating File Handler And Minimal Log Messages #447 to validate the fix.
  • I couldn't figure out how to reproduce the issue natively in Go within the integration test code itself, so I opted to copy the repro case and execute it instead. This does what I wanted, but it introduces a dependency on Python in the AMIs. I'm still working through updating the AMIs, so I don't have a successful run of the integration tests yet, but there shouldn't be significant code changes, so I'm publishing the PR now.
  • Updated the integration test README to reflect the previously missing gcc requirement and the new Python requirement.
  • Updated the tail reopen function to retry on permission failures, to account for Windows being slow to recreate files. This is noticeable in unit tests on Windows, but was not noticeable in manual/soak testing on Windows test hosts.
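
As a rough illustration of the ReOpen change in the first bullet, here is a minimal sketch using the upstream hpcloud/tail-style API (the agent vendors its own fork of the plugin, so exact names may differ):

package main

import (
	"fmt"
	"log"

	"github.com/hpcloud/tail"
)

func main() {
	// With ReOpen: true, a deleted or rotated file is reopened from the
	// beginning once it reappears, instead of resuming from a stale offset.
	t, err := tail.TailFile("/tmp/rotate_me.log", tail.Config{
		Follow: true,
		ReOpen: true, // this PR flips the agent's default from false to true
	})
	if err != nil {
		log.Fatal(err)
	}
	for line := range t.Lines {
		fmt.Println(line.Text)
	}
}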

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

I manually reproduced the issue on Linux (AL2) and Windows EC2 instances, and also built the agent locally and verified that the code change fixes the issue.

After this, I set up pairs of hosts - one host with the latest release of the agent (v1.247350) and one with my custom build.

For each of these pairs of hosts, I performed a simple test:

  1. On AL2, consume log lines at 2TPS, rotating regularly
  2. On Windows 2019 Server, consume log lines at 2TPS, rotating regularly
  3. On AL2, have the agent consume log lines and configure the agent to automatically remove old files once it finishes reading them. This is to verify that there is no obvious memory/CPU leak when the agent auto-removes files with this change.

After I validated all of this and set up the monitoring dashboard for these hosts, I wrote an integration test that reproduces the issue, to verify in an automated way that my code change resolves it.

Logging code to validate auto removal:

import logging

logger = logging.getLogger('test')
logger.setLevel(logging.INFO)

# write 100 logs to the log file
fh = logging.FileHandler('foo.log.1')
logger.addHandler(fh)

for i in range(100):
    logger.info('This is the first batch: %d', i)
logger.removeHandler(fh)

fh = logging.FileHandler('foo.log.2')
logger.addHandler(fh)

for i in range(100):
    logger.info('This is the second batch: %d', i)
logger.removeHandler(fh)

fh = logging.FileHandler('foo.log.3')
logger.addHandler(fh)

for i in range(100):
    logger.info('This is the third batch: %d', i)

For this, I had the agent monitor /home/ec2-user/foo.log* so that it would pick up the latest log file each time I stopped writing to the old one, and I validated that the agent deleted the files foo.log.1 and foo.log.2.
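
For reference, the relevant slice of the agent config looked roughly like this (log group and stream names here are placeholders; file_path and auto_removal are the documented agent options in play):

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/home/ec2-user/foo.log*",
            "log_group_name": "rotation-test",
            "log_stream_name": "{instance_id}",
            "auto_removal": true
          }
        ]
      }
    }
  }
}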

Dashboard for metrics

[Screenshots: CloudWatch monitoring dashboards for the test hosts, captured 2022-04-26]

Integration testing

Added an integration test to accurately reproduce the scenario provided in #447. I tweaked the original script to rotate more quickly, since there is currently a strict 15 minute timeout for the integration tests. The source Python script is located in integration/test/cloudwatchlogs/resources/write_and_rotate_logs.py and is invoked directly by the integration test, which is why Python is now a required dependency in the integration test AMIs.
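
Roughly, the test shells out to the script like this (a sketch only; the real test's helpers and error handling may differ):

package cloudwatchlogs

import (
	"os/exec"
	"testing"
)

// writeAndRotateLogs is a hypothetical helper sketching how the integration
// test invokes the Python repro script; "python3" on the AMI is the new
// dependency this change introduces.
func writeAndRotateLogs(t *testing.T) {
	cmd := exec.Command("python3", "resources/write_and_rotate_logs.py")
	if out, err := cmd.CombinedOutput(); err != nil {
		t.Fatalf("rotation script failed: %v\n%s", err, out)
	}
}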

Output of integration test with ReOpen change reverted (set back to false)

aws_instance.integration-test (remote-exec):         2022-04-25T22:41:17Z I! [logagent] piping log from i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated(/tmp/rotate_me.log) to cloudwatchlogs with retention -1
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:27Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 151.497393ms before retrying.
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:32Z W! [inputs.tail] Stopping tail as file no longer exists: /tmp/rotate_me.log
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:32Z I! [logagent] Log src has stopped for i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated(/tmp/rotate_me.log)
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:33Z I! [inputs.logfile] Reading from offset 65 in /tmp/rotate_me.log
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:33Z I! [logagent] piping log from i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated(/tmp/rotate_me.log) to cloudwatchlogs with retention -1
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:47Z W! [inputs.tail] Stopping tail as file no longer exists: /tmp/rotate_me.log
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:47Z I! [logagent] Log src has stopped for i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated(/tmp/rotate_me.log)
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:48Z I! [inputs.logfile] Reading from offset 65 in /tmp/rotate_me.log
aws_instance.integration-test (remote-exec):         2022-04-25T22:41:48Z I! [logagent] piping log from i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated(/tmp/rotate_me.log) to cloudwatchlogs with retention -1
aws_instance.integration-test (remote-exec):         2022-04-25T22:42:02Z I! Profiler is stopped during shutdown
aws_instance.integration-test (remote-exec):         2022-04-25T22:42:02Z I! [agent] Hang on, flushing any cached metrics before shutdown

aws_instance.integration-test (remote-exec): 2022/04/25 22:42:17 Checking i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated since 2022-04-25T22:41:01Z for 3 expected logs
aws_instance.integration-test (remote-exec): 2022/04/25 22:42:17 Found log: {"Metric": "12345123451234512345123451234512345123451234512345"}
aws_instance.integration-test (remote-exec): 2022/04/25 22:42:17 Found log: 45678901234567890123456789012345678901234567890'}
aws_instance.integration-test (remote-exec): 2022/04/25 22:42:17 Done paginating log events for i-04bc8d8d08febf163/i-04bc8d8d08febf163Rotated and found 2 logs
aws_instance.integration-test (remote-exec):     cwl_util.go:128:
aws_instance.integration-test (remote-exec):         	Error Trace:	cwl_util.go:128
aws_instance.integration-test (remote-exec):         	            				publish_logs_test.go:144
aws_instance.integration-test (remote-exec):         	Error:      	"[{"Metric": "12345123451234512345123451234512345123451234512345"} 45678901234567890123456789012345678901234567890'}]" should have 3 item(s), but has 2
aws_instance.integration-test (remote-exec):         	Test:       	TestRotatingLogsDoesNotSkipLines
aws_instance.integration-test (remote-exec):     cwl_util.go:130:
aws_instance.integration-test (remote-exec):         	Error Trace:	cwl_util.go:130
aws_instance.integration-test (remote-exec):         	            				publish_logs_test.go:144
aws_instance.integration-test (remote-exec):         	Error:      	Not equal:
aws_instance.integration-test (remote-exec):         	            	expected: "{\"Metric\": \"09876098760987609876098760987609876098760987609876\"}"
aws_instance.integration-test (remote-exec):         	            	actual  : "45678901234567890123456789012345678901234567890'}"

aws_instance.integration-test (remote-exec):         	            	Diff:
aws_instance.integration-test (remote-exec):         	            	--- Expected
aws_instance.integration-test (remote-exec):         	            	+++ Actual
aws_instance.integration-test (remote-exec):         	            	@@ -1 +1 @@
aws_instance.integration-test (remote-exec):         	            	-{"Metric": "09876098760987609876098760987609876098760987609876"}
aws_instance.integration-test (remote-exec):         	            	+45678901234567890123456789012345678901234567890'}
aws_instance.integration-test (remote-exec):         	Test:       	TestRotatingLogsDoesNotSkipLines
aws_instance.integration-test (remote-exec): --- FAIL: TestRotatingLogsDoesNotSkipLines (75.67s)

Note that I included some tail output from the agent log file to show what happened; it matches the repro case. You can also see that the two logs (instead of the expected 3) queried from CloudWatch Logs match the expected outcome of the repro case: the first log line is fully visible, the second log line is completely missing, and the third log line is published truncated.

Output of integration test with the change

https://github.com/SaxyPandaBear/amazon-cloudwatch-agent/actions/runs/2229185212
From the AL2 test in this integration test run:

aws_instance.integration-test (remote-exec): === RUN   TestRotatingLogsDoesNotSkipLines
aws_instance.integration-test (remote-exec): 2022/04/26 20:34:08 Found instance id i-0ece89a3e737ca584
aws_instance.integration-test (remote-exec): 2022/04/26 20:34:08 Copy File resources/config_log_rotated.json to /opt/aws/amazon-cloudwatch-agent/bin/config.json
aws_instance.integration-test (remote-exec): 2022/04/26 20:34:08 File resources/config_log_rotated.json abs path /home/ec2-user/amazon-cloudwatch-agent/integration/test/cloudwatchlogs/resources/config_log_rotated.json
aws_instance.integration-test (remote-exec): 2022/04/26 20:34:08 File : resources/config_log_rotated.json copied to : /opt/aws/amazon-cloudwatch-agent/bin/config.json
aws_instance.integration-test (remote-exec): 2022/04/26 20:34:08 Agent has started
aws_instance.integration-test: Still creating... [2m50s elapsed]
aws_instance.integration-test (remote-exec):     publish_logs_test.go:99: Writing logs and rotating
aws_instance.integration-test: Still creating... [3m0s elapsed]
aws_instance.integration-test: Still creating... [3m10s elapsed]
aws_instance.integration-test: Still creating... [3m20s elapsed]
aws_instance.integration-test: Still creating... [3m30s elapsed]
aws_instance.integration-test: Still creating... [3m40s elapsed]
aws_instance.integration-test (remote-exec): 2022/04/26 20:35:08 Agent is stopped
aws_instance.integration-test (remote-exec): 2022/04/26 20:35:08 Checking i-0ece89a3e737ca584/i-0ece89a3e737ca584Rotated since 2022-04-26T20:34:08Z for 3 expected logs
aws_instance.integration-test (remote-exec): 2022/04/26 20:35:08 Done paginating log events for i-0ece89a3e737ca584/i-0ece89a3e737ca584Rotated and found 3 logs
aws_instance.integration-test (remote-exec): --- PASS: TestRotatingLogsDoesNotSkipLines (60.69s)
aws_instance.integration-test (remote-exec): PASS
aws_instance.integration-test (remote-exec): ok  	github.com/aws/amazon-cloudwatch-agent/integration/test/cloudwatchlogs	122.873s

Requirements

Before committing the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make linter

@SaxyPandaBear SaxyPandaBear requested a review from a team as a code owner April 26, 2022 13:00
@codecov-commenter

codecov-commenter commented Apr 26, 2022

Codecov Report

Merging #455 (c732b2f) into master (11f1c75) will increase coverage by 0.34%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #455      +/-   ##
==========================================
+ Coverage   57.28%   57.62%   +0.34%     
==========================================
  Files         370      358      -12     
  Lines       17335    16455     -880     
==========================================
- Hits         9931     9483     -448     
+ Misses       6819     6426     -393     
+ Partials      585      546      -39     
Impacted Files Coverage Δ
plugins/inputs/logfile/logfile.go 62.13% <100.00%> (-2.21%) ⬇️
plugins/inputs/logfile/tail/tail.go 36.34% <100.00%> (+3.26%) ⬆️
...md/amazon-cloudwatch-agent-config-wizard/wizard.go 56.94% <0.00%> (-11.12%) ⬇️
plugins/inputs/logfile/tailersrc.go 81.72% <0.00%> (-7.62%) ⬇️
plugins/inputs/demo/demo.go 50.00% <0.00%> (-7.15%) ⬇️
plugins/outputs/cloudwatch/cloudwatch.go 73.21% <0.00%> (-1.16%) ⬇️
plugins/inputs/win_perf_counters/pdh.go
...gins/inputs/win_perf_counters/win_perf_counters.go
plugins/inputs/logfile/tmpfile_windows.go
... and 10 more


@SaxyPandaBear
Contributor Author

--- FAIL: TestLogsFileRecreate (7.01s)
    logfile_test.go:1120: Checking for expected message 'klmnopqrst' to be consumed
    logfile_test.go:1098: Recreating log file C:\Users\RUNNER~1\AppData\Local\Temp\222378890 and writing log line 09876098760987609876
    logfile_test.go:1116: Finished writing '09876098760987609876'
    logfile_test.go:1120: Checking for expected message '09876098760987609876' to be consumed
    logfile_test.go:1098: Recreating log file C:\Users\RUNNER~1\AppData\Local\Temp\222378890 and writing log line 123456789012345678901234567890
    logfile_test.go:1103:
        	Error Trace:	logfile_test.go:1103
        	            				asm_amd64.s:1374
        	Error:      	Received unexpected error:
        	            	remove C:\Users\RUNNER~1\AppData\Local\Temp\222378890: The process cannot access the file because it is being used by another process.
        	Test:       	TestLogsFileRecreate
    logfile_test.go:784: The log file should have been rotated by now
FAIL

Not sure why this only failed the second time the test recreated the log file, but I will dig into it. The assertion comes from checking for a nil error from:

err := os.Remove(origFile.Name())
require.NoError(t, err)

So my assumption is that Windows just doesn't want to play nicely with me.
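
If it keeps flaking, one option is to retry the removal in the test with a short backoff (a sketch only; removeWithRetry is a hypothetical helper, not part of this PR):

package logfile

import (
	"os"
	"time"
)

// removeWithRetry works around Windows briefly holding a file open after the
// tailer closes it: retry os.Remove with exponential backoff until it
// succeeds or attempts are exhausted.
func removeWithRetry(path string, attempts int) error {
	var err error
	delay := 100 * time.Millisecond
	for i := 0; i < attempts; i++ {
		if err = os.Remove(path); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
	}
	return err
}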

@SaxyPandaBear
Contributor Author

aws_instance.integration-test (remote-exec):     cwl_util.go:129:
aws_instance.integration-test (remote-exec):         	Error Trace:	cwl_util.go:129
aws_instance.integration-test (remote-exec):         	            				publish_logs_test.go:140
aws_instance.integration-test (remote-exec):         	Error:      	Not equal:
aws_instance.integration-test (remote-exec):         	            	expected: "{\"Metric\": \"1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\"}"
aws_instance.integration-test (remote-exec):         	            	actual  : "{'Metric': '1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'}"

aws_instance.integration-test (remote-exec):         	            	Diff:
aws_instance.integration-test (remote-exec):         	            	--- Expected
aws_instance.integration-test (remote-exec):         	            	+++ Actual
aws_instance.integration-test (remote-exec):         	            	@@ -1 +1 @@
aws_instance.integration-test (remote-exec):         	            	-{"Metric": "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"}
aws_instance.integration-test (remote-exec):         	            	+{'Metric': '1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890'}

Weird. It looks like the first two messages use \" but the last message uses '. I think I'll just force them to be uniform.

sethAmazon previously approved these changes Apr 26, 2022
@SaxyPandaBear
Contributor Author

aws_instance.integration-test (remote-exec): === RUN   TestRotatingLogsDoesNotSkipLines
aws_instance.integration-test (remote-exec): 2022/04/26 14:30:49 Found instance id i-0153f97e734d72bd4
aws_instance.integration-test (remote-exec): 2022/04/26 14:30:49 Copy File resources/config_log_rotated.json to /opt/aws/amazon-cloudwatch-agent/bin/config.json
aws_instance.integration-test (remote-exec): 2022/04/26 14:30:49 File resources/config_log_rotated.json abs path /home/ec2-user/amazon-cloudwatch-agent/integration/test/cloudwatchlogs/resources/config_log_rotated.json
aws_instance.integration-test (remote-exec): 2022/04/26 14:30:49 File : resources/config_log_rotated.json copied to : /opt/aws/amazon-cloudwatch-agent/bin/config.json
aws_instance.integration-test (remote-exec): 2022/04/26 14:30:49 Agent has started
aws_instance.integration-test: Still creating... [3m0s elapsed]
aws_instance.integration-test (remote-exec):     publish_logs_test.go:99: Writing logs and rotating
aws_instance.integration-test: Still creating... [3m10s elapsed]
aws_instance.integration-test: Still creating... [3m20s elapsed]
aws_instance.integration-test: Still creating... [3m30s elapsed]
aws_instance.integration-test: Still creating... [3m40s elapsed]
aws_instance.integration-test: Still creating... [3m50s elapsed]
aws_instance.integration-test (remote-exec): 2022/04/26 14:31:50 Agent is stopped
aws_instance.integration-test (remote-exec): 2022/04/26 14:31:50 Checking i-0153f97e734d72bd4/i-0153f97e734d72bd4Rotated since 2022-04-26T14:30:49Z for 3 expected logs
aws_instance.integration-test (remote-exec): 2022/04/26 14:31:50 Done paginating log events for i-0153f97e734d72bd4/i-0153f97e734d72bd4Rotated and found 3 logs
aws_instance.integration-test (remote-exec): --- PASS: TestRotatingLogsDoesNotSkipLines (61.20s)
aws_instance.integration-test (remote-exec): PASS
aws_instance.integration-test (remote-exec): ok  	github.com/aws/amazon-cloudwatch-agent/integration/test/cloudwatchlogs	124.025s

Successfully ran the integ test

@SaxyPandaBear
Contributor Author

--- FAIL: TestLogsFileRemove (0.50s)
    logfile_test.go:357: tailerSrc should have reopened after deletion
FAIL

Interesting.

@khanhntd
Contributor

khanhntd commented Apr 26, 2022

I'm not sure why ReOpen: false would be the expected behavior for CWAgent if the offset is the issue. If ReOpen is now expected to be true, should we set it to always be true instead of keeping the flag? (I am still not clear on what Edward said.) Besides that, setting aside the currently failing test, everything looks fine to me.

@SaxyPandaBear
Contributor Author

--- FAIL: TestOffsetDoneCallBack (10.01s)
    tailersrc_test.go:312: Not enough logs have been processed, only 30 are processed
FAIL

Noting this for myself: I've seen this test fail before, but I'm not sure if I made it flakier.

numTries += 1
waitDuration *= 2
continue
}
Contributor

100ms + 200ms + 400ms + 800ms + 1600ms = 3100ms.
I think that is fine since it is only for an edge case on the Windows OS.
Otherwise I'd suggest adding checks on tail.Dying() or tail.Err() in any long-running functions/methods and returning early.
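
For context, the fragment under review comes from a retry loop shaped roughly like this (a hedged reconstruction with assumed names like reopenWithRetry and maxTries, not the exact diff):

package tail

import (
	"os"
	"time"
)

// reopenWithRetry reconstructs the retry discussed in this thread: on
// permission errors (seen on Windows mid-rotation), back off exponentially
// (100ms, 200ms, 400ms, ...) up to maxTries before giving up. With
// maxTries = 5, the sleeps total the 3100ms computed above.
func reopenWithRetry(filename string, maxTries int) (*os.File, error) {
	waitDuration := 100 * time.Millisecond
	numTries := 0
	for {
		file, err := os.Open(filename)
		if err == nil {
			return file, nil
		}
		if os.IsPermission(err) && numTries < maxTries {
			time.Sleep(waitDuration)
			numTries += 1
			waitDuration *= 2
			continue
		}
		return nil, err
	}
}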

Contributor Author

It's a little tricky since Dying() is a channel receiver (Err() just returns an error), but if you feel strongly about it, I could rewrite it a little bit with a select, like:

select {
case <-tail.Dying():
    return someError
case <-time.After(waitDuration):
    continue
}

numTries += 1
waitDuration *= 2
continue
}
if os.IsNotExist(err) {
tail.Logger.Debugf("Waiting for %s to appear...", tail.Filename)
if err := tail.watcher.BlockUntilExists(&tail.Tomb); err != nil {
Contributor

So does this mean the tail never dies now?
If the user is monitoring /tmp/foo*.log and continuously creates files like:

foo1.log
foo2.log
foo3.log
...

Even if the old files get deleted will the tail live on, stuck in BlockUntilExists()?

Contributor Author

I just noticed this myself and am fixing this

Contributor Author

Let me back up

Even if the old files get deleted will the tail live on, stuck in BlockUntilExists()

I was concerned about this myself. That was partly why I had the last set of metrics I provided in the PR description.
[Screenshot: metrics from the PR description dashboards, 2022-04-27]

Let me pull debug logs from both of those hosts


What I was referring to "fixing" just now is that I messed up the retry on file permission errors.

Contributor Author

On the host with my change, the agent didn't hold on to the file for tailing, though it is notable that I did set the auto_removal flag for it. The debug logs are normal:

2022-04-27T14:05:26Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics

and don't include that "Waiting to reopen" debug log. Let me spin up a new test where I don't set auto_removal to see what happens.

Contributor

@adam-mateen adam-mateen Apr 27, 2022

I would expect without auto_removal that this function would wait forever for the file to get recreated.
And users could see an increase in CPU and/or MEM usage due to an increase in goroutines.

Can you try testing without auto_removal?

Contributor Author

@SaxyPandaBear SaxyPandaBear Apr 27, 2022

That's what I expect too. The question is the impact. I will repurpose some hosts I used for this to see what happens when I constantly rotate and append to a new file. I already have hosts set up that rotate and reuse a file, but I don't have something set up to write foo1.log, foo2.log, ... fooN.log.

Contributor Author

@SaxyPandaBear SaxyPandaBear Apr 27, 2022

What I'm doing now to validate is using this script to generate 10,000 log files:

import logging
import time

logger = logging.getLogger('test')
logger.setLevel(logging.INFO)

def write_batch(logger, batch):
    for i in range(100):
        logger.info("This is the %dth batch: %d", batch, i)
        time.sleep(0.001)

for batch in range(1, 10001):
    file_name = 'foo.log.'+str(batch)
    fh = logging.FileHandler(file_name)
    logger.addHandler(fh)
    write_batch(logger, batch)
    logger.removeHandler(fh)
    time.sleep(0.1)

I set auto_removal to false on both test hosts.

After all of the log files get written, CPU/memory spike up on both hosts, which makes sense because the agent is now tailing 10k log files, though the log files aren't being updated after the script finishes.

Now, I'm going through and deleting some of the log files, like rm foo.log.1* to delete 1000 of the log files, and monitoring the CPU/memory usage as I remove them. Basically, I want to see how drastically the two hosts diverge in held CPU and memory when I delete log files that the agent constantly tries to reopen.

Will post my findings here as I get more data.
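
Besides the dashboards, a crude way to see leaked tails directly is to instrument a local agent build to log its goroutine count (a hypothetical sketch; tails stuck in BlockUntilExists() would show up as monotonic growth after files are deleted):

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Sample the goroutine count periodically; in an instrumented agent
	// build, this loop would run in its own goroutine alongside the agent.
	for range time.Tick(30 * time.Second) {
		log.Printf("goroutines=%d", runtime.NumGoroutine())
	}
}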

Contributor Author

[Screenshot: process CPU usage comparison between the two test hosts, 2022-04-27]
I'm not very happy with my findings. Only sharing the process CPU usage because the other metrics align pretty closely despite the change in behavior. This is something I would want to drill down on before merging this change.

Something worth noting: as I was looking at existing unit test behavior, I saw that there is a timer for cleaning up a tail process if the file is closed. This is controlled by the exitOnDeletion() function.

With my change here, this goroutine is currently never started:

	if !config.ReOpen {
		go t.exitOnDeletion()
	}

Based on the existing durations for this monitoring, which max out at 5 minutes, I think it is safe to remove the if check here and just always run this goroutine. I'll make that change and the subsequent unit test changes, and then redo this test.

Contributor Author

Forgot to update on this: exitOnDeletion() never detects that the file is deleted, so this does nothing to resolve the issue. At this point, I feel like this was wasted effort and I need to pivot. I have test hosts and monitoring set up, so it should be quick to get things in place to monitor for leaks in whatever other solution I come up with, but unfortunately for @akozd, I don't think this is a viable solution without serious degradation.

@SaxyPandaBear
Copy link
Contributor Author

Closing this out in favor of #457, which solves this issue without modifying the ReOpen flag; that change is too disruptive to existing behavior for me to be comfortable merging it.

@SaxyPandaBear SaxyPandaBear deleted the file-rotation branch June 18, 2022 01:53