Change behavior for handling file offsets for rotated files #455
Conversation
Codecov Report
@@ Coverage Diff @@
## master #455 +/- ##
==========================================
+ Coverage 57.28% 57.62% +0.34%
==========================================
Files 370 358 -12
Lines 17335 16455 -880
==========================================
- Hits 9931 9483 -448
+ Misses 6819 6426 -393
+ Partials 585 546 -39
Continue to review full report at Codecov.
Not sure why this only failed the second time that the test recreated the log file, but I will dig into this. The assertion comes from checking for a nil error from:
err := os.Remove(origFile.Name())
require.NoError(t, err)
So my assumption is that Windows just doesn't want to play nicely with me.
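As an aside, a common way to make this kind of assertion less flaky on Windows is to retry the removal briefly before failing the test. A minimal sketch of such a helper (hypothetical, not part of this PR; assumes the standard os, time, testing, and testify require packages):

// removeWithRetry retries os.Remove for a short window, since Windows can
// keep a handle open briefly after the tailer closes the file.
// Hypothetical test helper, not code from this PR.
func removeWithRetry(t *testing.T, name string) {
    t.Helper()
    var err error
    for i := 0; i < 10; i++ {
        if err = os.Remove(name); err == nil {
            return
        }
        time.Sleep(100 * time.Millisecond)
    }
    require.NoError(t, err)
}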
Weird. It looks like the first two messages use
Successfully ran the integ test
Interesting.
Not sure why
noting this for myself as I've seen this fail before, but not sure if I made the test flakier
numTries += 1
waitDuration *= 2
continue
}
100ms + 200ms + 400ms + 800ms + 1600ms = 3100ms. I think that is fine since it is only for an edge case on the Windows OS. Otherwise I'd suggest adding checks on tail.Dying() or tail.Err() in any long-running functions/methods and returning early.
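For context, the retry shape under discussion is roughly the following. This is a sketch based on this review thread, not the exact diff: the permission check, the cap of five attempts, and the 100ms starting wait are assumptions.

// openWithRetry backs off and retries on permission errors, since Windows
// can briefly deny access to a file that is being recreated.
// Sketch only; names and limits are assumptions, not the actual diff.
func openWithRetry(filename string) (*os.File, error) {
    waitDuration := 100 * time.Millisecond
    for numTries := 0; numTries < 5; numTries++ {
        file, err := os.Open(filename)
        if err == nil {
            return file, nil
        }
        if !os.IsPermission(err) {
            return nil, err
        }
        // Worst case across retries: 100ms + 200ms + 400ms + 800ms + 1600ms = 3100ms.
        time.Sleep(waitDuration)
        waitDuration *= 2
    }
    return nil, fmt.Errorf("gave up waiting for %s to become readable", filename)
}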
It's a little tricky since Dying() and Err() are channel receivers, but if you feel strongly about it, I could rewrite it a little bit with a select, like:
select {
case <-tail.Dying():
    return someError
case <-tail.Err():
    return someError
case <-time.After(waitDuration):
    continue
}
numTries += 1
waitDuration *= 2
continue
}
if os.IsNotExist(err) {
    tail.Logger.Debugf("Waiting for %s to appear...", tail.Filename)
    if err := tail.watcher.BlockUntilExists(&tail.Tomb); err != nil {
So does this mean the tail never dies now? If the user is monitoring /tmp/foo*.log and continuously creates files like:
foo1.log
foo2.log
foo3.log
...
Even if the old files get deleted, will the tail live on, stuck in BlockUntilExists()?
I just noticed this myself and am fixing it.
Let me back up.
"Even if the old files get deleted will the tail live on, stuck in BlockUntilExists()?"
I was concerned about this myself. That was partly why I had the last set of metrics I provided in the PR description. Let me pull debug logs from both of those hosts.
What I was referring to "fixing" just now is that I messed up the retry on file permission errors.
On the host with my change, it didn't hold on to the file for tailing - though it is notable that I did set the auto_removal flag for it. The debug logs are normal:
2022-04-27T14:05:26Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:27Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:28Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:29Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:30Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:31Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:32Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:33Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:34Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatch] Buffer fullness: 0 / 10000 metrics
2022-04-27T14:05:35Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
and don't include that "Waiting to reopen" debug log. Let me spin up a new test where I don't set auto_removal to see what happens.
I would expect that without auto_removal, this function would wait forever for the file to get recreated, and users could see an increase in CPU and/or memory usage due to an increase in goroutines. Can you try testing without auto_removal?
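One way to check that expectation directly is to watch the process's goroutine count while files are rotated and deleted; if tailers get stuck in BlockUntilExists(), the count should grow steadily. A standalone sketch (illustrative only; the agent may already expose equivalent telemetry):

package main

import (
    "log"
    "runtime"
    "time"
)

// Periodically log the goroutine count; a leak from stuck tailers would
// show up as steady growth over time.
func main() {
    for range time.Tick(30 * time.Second) {
        log.Printf("goroutines: %d", runtime.NumGoroutine())
    }
}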
That's what I expect too. The question is the impact. I will repurpose some hosts I used for this to see what happens when I constantly rotate and append to a new file - I already have hosts set up that rotate and reuse a file, but I don't have anything set up to do something like foo1.log, foo2.log, ... fooN.log.
What I'm doing now to validate is using this script to generate 10,000 log files:
import logging
import time

logger = logging.getLogger('test')
logger.setLevel(logging.INFO)

def write_batch(logger, batch):
    for i in range(100):
        logger.info("This is the %dth batch: %d", batch, i)
        time.sleep(0.001)

for batch in range(1, 10001):
    file_name = 'foo.log.' + str(batch)
    fh = logging.FileHandler(file_name)
    logger.addHandler(fh)
    write_batch(logger, batch)
    logger.removeHandler(fh)
    time.sleep(0.1)
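For a sense of scale: each file only has its handler attached during its own batch, so each of the 10,000 files gets 100 lines (roughly a million log lines total), and the sleeps alone put the full run at a bit over half an hour.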
I set auto_removal to false on both test hosts.
After all of the log files get written, CPU/memory spike up on both hosts, which makes sense because the agent is now tailing 10k log files, though the log files aren't being updated after the script finishes.
Now I'm going through and deleting some of the log files, e.g. rm foo.log.1* to delete roughly 1,000 of them, and monitoring the CPU/memory usage as I remove the monitored log files. Basically, I want to see how drastically the two hosts diverge in held CPU and memory when I delete log files that the agent constantly tries to reopen.
Will post my findings here as I get more data.
I'm not very happy with my findings. Only sharing the process CPU usage because the other metrics align pretty closely despite the change in behavior. This is something I would want to drill down on before merging this change.
Something worth noting - as I was looking at existing unit test behavior, I saw that there is a timer for cleaning up a tail process if the file is closed. This is controlled by the exitOnDeletion() function, which, with my change here, is currently not executed:
if !config.ReOpen {
    go t.exitOnDeletion()
}
Based on the existing durations for this monitoring, which max out at 5 minutes, I think it is safe to remove the if check here and just always run this goroutine. I'll make that change and the subsequent unit test changes, and then redo this test.
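For readers following the thread, the cleanup being described is a goroutine that watches for the tailed file disappearing and stops the tailer if it stays gone. A rough sketch under stated assumptions (the polling interval, the 5-minute window, and the Tail/tomb method names are guesses based on the snippets above, not the agent's actual exitOnDeletion implementation):

// Sketch only: stop the tailer once its file has been gone for ~5 minutes.
// The interval, window, and receiver type are illustrative assumptions.
func (t *Tail) exitOnDeletion() {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    var missingSince time.Time
    for {
        select {
        case <-t.Dying():
            return
        case <-ticker.C:
            if _, err := os.Stat(t.Filename); os.IsNotExist(err) {
                if missingSince.IsZero() {
                    missingSince = time.Now()
                } else if time.Since(missingSince) > 5*time.Minute {
                    t.Kill(nil) // file has been gone the whole window; stop tailing
                    return
                }
            } else {
                missingSince = time.Time{} // file exists (or other error); reset
            }
        }
    }
}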
Forgot to update on this: what happens is that exitOnDeletion() never detects that the file is deleted, so this does nothing to resolve the issue. At this point, I feel like this was wasted effort and I need to pivot. I have test hosts and monitoring set up, so it should be quick to get things set up to monitor for leaks in whatever other solution I come up with, but unfortunately for @akozd, I don't think this is a viable solution without serious degradation.
Closing this out for #457, which solves this issue without modifying the
Description of the issue
Closes #447
Description of changes
Updated the reopen function to retry on permission failures, to account for Windows being slow to recreate files - this is noticeable in unit tests on Windows, but was not noticeable in manual/soak testing on Windows test hosts.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
I manually reproduced the issue on a Linux (AL2) and Windows EC2, and also built the agent locally and verified that the code change fixes the issue.
After this, I set up pairs of hosts - one host with the latest release of the agent (v1.247350) and one with my custom build.
For each of these pairs of hosts, I performed a simple test:
After I validated all of this and set up the monitoring dashboard for these hosts, I wrote an integration test to reproduce the issue, in order to verify that my code change resolves the issue via an automated integration test.
logging code to validate auto removal
For this, I had the agent monitor /home/ec2-user/foo.log* so it would pick up the latest log file each time I stopped writing to the old one, and validated that the agent deleted the files foo.log.1 and foo.log.2.
Dashboard for metrics
Integration testing
Added an integration test to accurately reproduce the scenario provided in #447. I tweaked the original script to rotate more quickly, since there is a strict 15 minute timeout currently for the integration tests. The source Python script is located in
integration/test/cloudwatchlogs/resources/write_and_rotate_logs.py
and is invoked directly in the integration test, which is why Python is now a required dependency in the integration test AMIs with this change.
Output of integration test with ReOpen change reverted (set back to false)
Note that I included some tail output from the agent log file to see what happened, and it shows the same output as the repro case. Also, you can see that the two logs (instead of the expected 3) queried from CloudWatch Logs match the expected outcome of the repro case, where the first log line is fully visible, the second log line is completely missing, and the third log line is published truncated.
Output of integration test with change
https://github.com/SaxyPandaBear/amazon-cloudwatch-agent/actions/runs/2229185212
From the AL2 test in this integration test run:
Requirements
Before committing the code, please do the following steps:
make fmt and make fmt-sh
make linter