
Consider testing enhancements for filelog receiver #32001

Closed · ChrsMark opened this issue Mar 27, 2024 · 12 comments

@ChrsMark (Member) commented Mar 27, 2024

Component(s)

receiver/filelog

Describe the issue you're reporting

As of today, the OTel Collector can handle log collection without loss or duplication across restarts.
This is possible by using the storage extension to persist the receiver's state and a persistent queue to persist the exporter's state.
This is documented in the offset tracking section and the fault-tolerant-logs-collection example.

However, there are reports of data loss, such as #31074, where it is not straightforward to tell where the issue comes from. This also indicates a lack of e2e testing for this important use case.

The need for e2e tests is already mentioned in #20552, but since that issue is more generic, I would like to propose a more specific scope here to cover use cases like #31074.

In this regard, we should consider implementing e2e tests for at least the following scenarios:

A) Basic restart scenarios (offset tracking)

  1. The collector must track file offsets across restarts and continue where it left off, without any data loss or duplication (see the test skeleton after this list)
  2. The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"
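For scenario A1, a rough Go test skeleton could look like the sketch below. This is only an outline under assumed names: the package name, the file layout, and the "start/stop the receiver" steps (left as comments) are placeholders for whatever harness ends up being used (filelog receiver factory plus the file_storage extension, or pkg/stanza/fileconsumer directly), not existing code in the repo.

```go
package filelogtest

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
)

// TestOffsetsSurviveRestart sketches scenario A1: offsets must survive a
// receiver restart with no loss and no duplication.
func TestOffsetsSurviveRestart(t *testing.T) {
	dir := t.TempDir()
	logFile := filepath.Join(dir, "app.log")
	require.NoError(t, os.WriteFile(logFile, []byte("line-1\nline-2\n"), 0o600))

	// 1. Start the receiver and assert that line-1 and line-2 arrive exactly once.
	// 2. Shut the receiver down so offsets are persisted through the storage extension.

	// 3. Append a new line while the receiver is down.
	f, err := os.OpenFile(logFile, os.O_APPEND|os.O_WRONLY, 0o600)
	require.NoError(t, err)
	_, err = f.WriteString("line-3\n")
	require.NoError(t, err)
	require.NoError(t, f.Close())

	// 4. Start the receiver again and assert that only line-3 is received:
	//    no loss and no duplication across the restart.
}
```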

B) Nice to have(s):

  1. We should be able to update/downgrade/re-install without losing the offset data
  2. We should correctly track offsets even in the case of a hard crash

C) Changes on target files

  1. We should be able to rotate files with the move-create strategy without losing or duplicating data (a rotation helper sketch follows this list)
  2. We should be able to rotate files with the copy-truncate strategy without losing or duplicating data
  3. We should be able to read a file to the end even after it's deleted
  4. We should be able to keep reading from a file even after it's moved (e.g. foo.log -> foo.log.1), and there should not be duplicate entries from the moved file (which could be caused by erroneously re-reading it from the beginning)
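To make C1 and C2 concrete, the two rotation strategies could be simulated in tests with small helpers like the ones below. These are a hypothetical sketch (the existing fileconsumer rotation tests may already have equivalents); the point is that copy-truncate keeps the inode while move-create does not, which is exactly what offset tracking has to cope with.

```go
package filelogtest

import (
	"os"
	"testing"

	"github.com/stretchr/testify/require"
)

// moveCreate simulates logrotate's default strategy: the active file is
// renamed and a fresh empty file is created at the original path (new inode).
func moveCreate(t *testing.T, path string) {
	require.NoError(t, os.Rename(path, path+".1"))
	f, err := os.Create(path)
	require.NoError(t, err)
	require.NoError(t, f.Close())
}

// copyTruncate simulates logrotate's copytruncate option: the contents are
// copied to a backup and the original file is truncated in place (same inode),
// which is the case most likely to trip up offset tracking.
func copyTruncate(t *testing.T, path string) {
	data, err := os.ReadFile(path)
	require.NoError(t, err)
	require.NoError(t, os.WriteFile(path+".1", data, 0o600))
	require.NoError(t, os.Truncate(path, 0))
}
```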

cc: @djaglowski @ycombinator

@ChrsMark added the needs triage label on Mar 27, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski (Member) commented:

In this regard, we should consider implementing e2e tests for at least the following scenarios:

I agree in principle that we should have robust coverage for almost all these scenarios, but it's not clear to me whether we need these to be more than unit tests at the receiver level. For the most part, I'd like to try to find ways to localize the tests as much as possible without compromising the value they would provide. For example, I'd rather we develop a mock consumer.Logs which can start/stop/error at a unit test level.
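A minimal sketch of such a mock is below. All names are hypothetical (consumertest.LogsSink in collector core could also be a starting point, extended with an error toggle); the idea is just a consumer.Logs test double that can be flipped into a failing state so a receiver-level test can assert that offsets are not advanced for rejected data.

```go
package mockconsumer

import (
	"context"
	"errors"
	"sync"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// flakyLogsSink is a consumer.Logs test double that can be toggled between
// accepting and rejecting data, simulating a temporarily failing downstream.
type flakyLogsSink struct {
	mu        sync.Mutex
	rejecting bool
	received  []plog.Logs
}

var _ consumer.Logs = (*flakyLogsSink)(nil)

func (s *flakyLogsSink) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (s *flakyLogsSink) ConsumeLogs(_ context.Context, ld plog.Logs) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rejecting {
		// The receiver must not advance/persist offsets for rejected data.
		return errors.New("simulated downstream failure")
	}
	s.received = append(s.received, ld)
	return nil
}

// setRejecting flips the sink between healthy and failing.
func (s *flakyLogsSink) setRejecting(reject bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rejecting = reject
}

// logRecordCount reports how many records were accepted, for exactly-once assertions.
func (s *flakyLogsSink) logRecordCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, ld := range s.received {
		n += ld.LogRecordCount()
	}
	return n
}
```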

  1. The collector must track file offsets across restarts and continue where it left off, without any data loss or duplication

This is relatively well tested in unit tests. What do we gain by making this an e2e test?

  1. The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"

We have decent coverage for non-duplication after restarting in unit tests. However, we don't have coverage which also includes backpressure from downstream consumers. This seems like a good candidate for a receiver unit test, given a mock consumer which applies backpressure.
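One way such a backpressure-applying mock could look is sketched below. The names are hypothetical, and whether blocking or erroring is the better simulation depends on how the test is wired; this variant simply holds every ConsumeLogs call until the test releases it.

```go
package mockconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// gatedLogsSink blocks every ConsumeLogs call until the gate is opened,
// approximating downstream backpressure while a receiver restart happens.
type gatedLogsSink struct {
	next consumer.Logs // e.g. a recording sink that captures what got through
	gate chan struct{}
}

func (g *gatedLogsSink) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (g *gatedLogsSink) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	select {
	case <-g.gate:
		// Gate is open: pass the data through to the recording sink.
		return g.next.ConsumeLogs(ctx, ld)
	case <-ctx.Done():
		// The receiver is shutting down while we are still blocking it.
		return ctx.Err()
	}
}

// open releases all pending and future ConsumeLogs calls.
func (g *gatedLogsSink) open() { close(g.gate) }
```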

  1. We should be able to update/downgrade/re-install without losing the offset data

How would you test this in this repo?

  1. We should correctly track offsets even in the case of a hard crash

This likely does require an e2e test. Do you have ideas how this would be implemented in this repo?

  1. We should be able to rotate files with move-create strategy without losing or duplicating data
  2. We should be able to rotate files with copy-truncate strategy without losing or duplicating data
  3. We should be able to keep reading from a file even after it's moved and there should not be duplicate entries from the moved file (which could be caused due to erroneously re-reading it from the beginning)

These are all relatively well tested in unit tests. What do we gain by making this an e2e test?

  1. We should be able to read a file to the end even after it's deleted

I don't think this is tested, but could be a unit test similar to the above.

@ChrsMark (Member, Author) commented:

Thanks @djaglowski!

It's true that B1 and B2 would require some extra effort; that's why I listed them as "nice-to-have"s. We can think about those in the future.

For A1, do you mean the testbed, or do we have another unit test? Does this also test correctness, or just the performance part?

For A2, agreed.

For C1, C2, C3 (once added) and C4, I wonder if we should also combine these cases with a restart so as to cover the persistence part as well. If that can be done with a unit test, that would also be fine. Would that make sense, or does a restart not really affect these cases?

This test case covers C1 and this one covers C2, right? I guess we can consider C4 covered by C1, right?

@djaglowski (Member) commented Mar 28, 2024

For A1, do you mean the testbed, or do we have another unit test? Does this also test correctness, or just the performance part?

We have this unit test which tests correctness.

For C1, C2, C3 (once added) and C4, I wonder if we should also combine these cases with a restart so as to cover the persistence part as well. If that can be done with a unit test, that would also be fine. Would that make sense, or does a restart not really affect these cases?

It would likely be good to involve a restart in these tests. Ideally, if there are exact scenarios which could be problematic, we would cover them precisely in unit tests. However, rotation often comes down to timing, and this can be difficult to test. I'd be curious to see what you can come up with though.

This test-case covers C1 and this covers C2, right?

I believe so

I guess we can consider that C4 is covered by C1, right?

This test is aimed at C4

@ChrsMark changed the title from "Introduce e2e CI testing for filelog receiver's offset tracking" to "Consider testing enhancements for filelog receiver" on Mar 29, 2024
@ChrsMark (Member, Author) commented Mar 29, 2024

The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"

We have decent coverage for non-duplication after restarting in unit tests. However, we don't have coverage which also includes backpressure from downstream consumers. This seems like a good candidate for a receiver unit test, given a mock consumer which applies backpressure.

Thinking about A2 again and looking into the code (correct me if I'm missing anything), I wonder whether that's actually a useful unit test for the filelog receiver: from what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data, so if the backend is unreachable it's the exporter's responsibility to persist the data properly, for example using a persistent queue. I see this is covered by unit tests, which is great.
That's actually the case I had in mind for the e2e testing scenario, but isolated unit testing can also cover it for now, I think.

So, back to your proposal for the backpressure unit test: do you mean that we could have a test similar to TestRestartOffsets with an extra delay (i.e. backpressure) in the sink.Callback, just to verify the Reader's part?
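If that route is taken, the idea might look roughly like the wrapper below. This is only a sketch: emitFunc is a placeholder type, not the real pkg/stanza callback signature (which lives in the package internals and has changed over time); the point is simply delaying each token to approximate a slow downstream.

```go
package stanzatest

import (
	"context"
	"time"
)

// emitFunc stands in for whatever emit/sink callback the test installs.
type emitFunc func(ctx context.Context, token []byte) error

// withDelay wraps an emit function so every token is held back for a while,
// approximating backpressure in a TestRestartOffsets-style test.
func withDelay(next emitFunc, d time.Duration) emitFunc {
	return func(ctx context.Context, token []byte) error {
		select {
		case <-time.After(d):
			return next(ctx, token)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```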

@djaglowski (Member) commented:

From what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data

I think what may be missing in our coverage is that when the reader emits a token, it is processed within pkg/stanza/adapter before being emitted from the receiver. We have pretty solid test coverage in pkg/stanza/fileconsumer, but I think A2 would test the combination of fileconsumer -> adapter. This kind of test may have caught #31074, for example.

@crobert-1 (Member) commented:

It looks like there's been some good discussion here, and a general direction is agreed upon. Removing needs triage.

@crobert-1 added the test coverage label and removed the needs triage label on Mar 29, 2024
@ChrsMark (Member, Author) commented Apr 2, 2024

From what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data

I think what may be missing in our coverage is that when the reader emits a token, it is processed within pkg/stanza/adapter before being emitted from the receiver. We have pretty solid test coverage in pkg/stanza/fileconsumer, but I think A2 would test the combination of fileconsumer -> adapter. This kind of test may have caught #31074, for example.

Cool, I think something like TestShutdownFlush from swiatekm@9747d92, which tried to prove #31074, is really close to what we need. Along with solving #31074, we should include a test like this as well.

github-actions bot (Contributor) commented Jun 3, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jun 3, 2024
@ChrsMark (Member, Author) commented Jun 3, 2024

/label -Stale

@djaglowski removed the Stale label on Jun 3, 2024
@ycombinator (Contributor) commented:

Reading the discussion, it seems like the pending work at this point is #31074. @ChrsMark @djaglowski can we close this issue here then?

@ChrsMark (Member, Author) commented:

From my perspective I'm fine with closing this. Most of the cases were already covered and some extra ones were added as well. Only the issue described in #31074 seems to not be covered.

It can be re-opened if there is a tangible need for the B) Nice to have(s) part.
