
Consider testing enhancements for filelog receiver #32001

Closed · ChrsMark opened this issue Mar 27, 2024 · 12 comments

@ChrsMark (Member) commented Mar 27, 2024

Component(s)

receiver/filelog

Describe the issue you're reporting

As of today, the OTel Collector can handle log collection without loss or duplication across restarts.
This is possible by using the storage extension to persist the receiver's state and a persistent queue to persist the exporter's state.
This is documented in the offset tracking section and the fault-tolerant-logs-collection example.

However, there are reports of data loss, such as #31074, where it is not straightforward to tell where the issue comes from. This also indicates a lack of e2e testing for this important use case.

The need for e2e tests is already mentioned in #20552, but since that issue is more generic, I would like to propose a more specific scope here to cover use cases like #31074.

In this regard, we should consider implementing e2e tests for at least the following scenarios:

A) Basic restart scenarios (offset tracking)

  1. The collector must track file offsets across restarts and continue where it left off, without any data loss or duplication (see the test skeleton after this list)
  2. The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"
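For scenario A1, a rough Go test skeleton could look like the sketch below. This is only an outline under assumed names: the package name, the file layout, and the "start/stop the receiver" steps (left as comments) are placeholders for whatever harness ends up being used (filelog receiver factory plus the file_storage extension, or pkg/stanza/fileconsumer directly), not existing code in the repo.

```go
package filelogtest

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
)

// TestOffsetsSurviveRestart sketches scenario A1: offsets must survive a
// receiver restart with no loss and no duplication.
func TestOffsetsSurviveRestart(t *testing.T) {
	dir := t.TempDir()
	logFile := filepath.Join(dir, "app.log")
	require.NoError(t, os.WriteFile(logFile, []byte("line-1\nline-2\n"), 0o600))

	// 1. Start the receiver and assert that line-1 and line-2 arrive exactly once.
	// 2. Shut the receiver down so offsets are persisted through the storage extension.

	// 3. Append a new line while the receiver is down.
	f, err := os.OpenFile(logFile, os.O_APPEND|os.O_WRONLY, 0o600)
	require.NoError(t, err)
	_, err = f.WriteString("line-3\n")
	require.NoError(t, err)
	require.NoError(t, f.Close())

	// 4. Start the receiver again and assert that only line-3 is received:
	//    no loss and no duplication across the restart.
}
```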

B) Nice to have(s):

  1. We should be able to update/downgrade/re-install without losing the offset data
  2. We should correctly track offsets even in the case of a hard crash

C) Changes on target files

  1. We should be able to rotate files with the move-create strategy without losing or duplicating data (a rotation helper sketch follows this list)
  2. We should be able to rotate files with the copy-truncate strategy without losing or duplicating data
  3. We should be able to read a file to the end even after it's deleted
  4. We should be able to keep reading from a file even after it's moved (e.g. foo.log -> foo.log.1), and there should not be duplicate entries from the moved file (which could be caused by erroneously re-reading it from the beginning)
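To make C1 and C2 concrete, the two rotation strategies could be simulated in tests with small helpers like the ones below. These are a hypothetical sketch (the existing fileconsumer rotation tests may already have equivalents); the point is that copy-truncate keeps the inode while move-create does not, which is exactly what offset tracking has to cope with.

```go
package filelogtest

import (
	"os"
	"testing"

	"github.com/stretchr/testify/require"
)

// moveCreate simulates logrotate's default strategy: the active file is
// renamed and a fresh empty file is created at the original path (new inode).
func moveCreate(t *testing.T, path string) {
	require.NoError(t, os.Rename(path, path+".1"))
	f, err := os.Create(path)
	require.NoError(t, err)
	require.NoError(t, f.Close())
}

// copyTruncate simulates logrotate's copytruncate option: the contents are
// copied to a backup and the original file is truncated in place (same inode),
// which is the case most likely to trip up offset tracking.
func copyTruncate(t *testing.T, path string) {
	data, err := os.ReadFile(path)
	require.NoError(t, err)
	require.NoError(t, os.WriteFile(path+".1", data, 0o600))
	require.NoError(t, os.Truncate(path, 0))
}
```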

cc: @djaglowski @ycombinator

@ChrsMark added the needs triage label on Mar 27, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski (Member) commented:

In this regard, we should consider implementing e2e tests for at least the following scenarios:

I agree in principle that we should have robust coverage for almost all these scenarios, but it's not clear to me whether we need these to be more than unit tests at the receiver level. For the most part, I'd like to try to find ways to localize the tests as much as possible without compromising the value they would provide. For example, I'd rather we develop a mock consumer.Logs which can start/stop/error at a unit test level.
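A minimal sketch of such a mock is below. All names are hypothetical (consumertest.LogsSink in collector core could also be a starting point, extended with an error toggle); the idea is just a consumer.Logs test double that can be flipped into a failing state so a receiver-level test can assert that offsets are not advanced for rejected data.

```go
package mockconsumer

import (
	"context"
	"errors"
	"sync"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// flakyLogsSink is a consumer.Logs test double that can be toggled between
// accepting and rejecting data, simulating a temporarily failing downstream.
type flakyLogsSink struct {
	mu        sync.Mutex
	rejecting bool
	received  []plog.Logs
}

var _ consumer.Logs = (*flakyLogsSink)(nil)

func (s *flakyLogsSink) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (s *flakyLogsSink) ConsumeLogs(_ context.Context, ld plog.Logs) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rejecting {
		// The receiver must not advance/persist offsets for rejected data.
		return errors.New("simulated downstream failure")
	}
	s.received = append(s.received, ld)
	return nil
}

// setRejecting flips the sink between healthy and failing.
func (s *flakyLogsSink) setRejecting(reject bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rejecting = reject
}

// logRecordCount reports how many records were accepted, for exactly-once assertions.
func (s *flakyLogsSink) logRecordCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, ld := range s.received {
		n += ld.LogRecordCount()
	}
	return n
}
```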

  1. The collector must track file offsets across restarts and continue where it left off, without any data loss or duplication

This is relatively well tested in unit tests. What do we gain by making this an e2e test?

  1. The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"

We have decent coverage for non-duplication after restarting in unit tests. However, we don't have coverage which also includes backpressure from downstream consumers. This seems like a good candidate for a receiver unit test, given a mock consumer which applies backpressure.
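One way such a backpressure-applying mock could look is sketched below. The names are hypothetical, and whether blocking or erroring is the better simulation depends on how the test is wired; this variant simply holds every ConsumeLogs call until the test releases it.

```go
package mockconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// gatedLogsSink blocks every ConsumeLogs call until the gate is opened,
// approximating downstream backpressure while a receiver restart happens.
type gatedLogsSink struct {
	next consumer.Logs // e.g. a recording sink that captures what got through
	gate chan struct{}
}

func (g *gatedLogsSink) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (g *gatedLogsSink) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	select {
	case <-g.gate:
		// Gate is open: pass the data through to the recording sink.
		return g.next.ConsumeLogs(ctx, ld)
	case <-ctx.Done():
		// The receiver is shutting down while we are still blocking it.
		return ctx.Err()
	}
}

// open releases all pending and future ConsumeLogs calls.
func (g *gatedLogsSink) open() { close(g.gate) }
```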

  1. We should be able to update/downgrade/re-install without losing the offset data

How would you test this in this repo?

  1. We should correctly track offsets even in the case of a hard crash

This likely does require an e2e test. Do you have ideas how this would be implemented in this repo?

  1. We should be able to rotate files with move-create strategy without losing or duplicating data
  2. We should be able to rotate files with copy-truncate strategy without losing or duplicating data
  3. We should be able to keep reading from a file even after it's moved and there should not be duplicate entries from the moved file (which could be caused due to erroneously re-reading it from the beginning)

These are all relatively well tested in unit tests. What do we gain by making this an e2e test?

  1. We should be able to read a file to the end even after it's deleted

I don't think this is tested, but could be a unit test similar to the above.

@ChrsMark (Member, Author) commented:

Thanks @djaglowski!

It's true that B1 and B2 would require some extra effort; that's why I listed them as "nice-to-have"s. We can think about those in the future.

For A1, do you mean the testbed, or do we have another unit test? Does this also test correctness, or just the performance part?

For A2, agreed.

For C1, C2, C3 (once added) and C4, I wonder if we should also combine these cases with a restart so as to cover the persistence part as well. If that can be done with a unit test, that would also be fine. Would that make sense, or does a restart not really affect these cases?

This test case covers C1 and this one covers C2, right? I guess we can consider C4 covered by C1, right?

@djaglowski (Member) commented Mar 28, 2024

For A1, do you mean the testbed, or do we have another unit test? Does this also test correctness, or just the performance part?

We have this unit test which tests correctness.

For C1, C2, C3 (once added) and C4, I wonder if we should also combine these cases with a restart so as to cover the persistence part as well. If that can be done with a unit test, that would also be fine. Would that make sense, or does a restart not really affect these cases?

It would likely be good to involve a restart in these tests. Ideally, if there are exact scenarios which could be problematic, we would cover them precisely in unit tests. However, rotation often comes down to timing, and this can be difficult to test. I'd be curious to see what you can come up with though.

This test-case covers C1 and this covers C2, right?

I believe so

I guess we can consider that C4 is covered by C1, right?

This test is aimed at C4

@ChrsMark changed the title from "Introduce e2e CI testing for filelog receiver's offset tracking" to "Consider testing enhancements for filelog receiver" on Mar 29, 2024
@ChrsMark (Member, Author) commented Mar 29, 2024

The collector must ship logs exactly once when the backend is temporarily unavailable, even if Collector restarts occur during the "backend outage"

We have decent coverage for non-duplication after restarting in unit tests. However, we don't have coverage which also includes backpressure from downstream consumers. This seems like a good candidate for a receiver unit test, given a mock consumer which applies backpressure.

Thinking about A2 again and looking into the code (correct me if I'm missing anything), I wonder whether that's actually a useful unit test for the filelog receiver: from what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data, so if the backend is unreachable it's the exporter's responsibility to persist the data properly, for example using a persistent queue. I see this is covered by unit tests, which is great.
That's actually the case I had in mind for the e2e testing scenario, but isolated unit testing can also cover it for now, I think.

So, back to your proposal for the backpressure unit test: do you mean that we could have a test similar to TestRestartOffsets with an extra delay (i.e. backpressure) in the sink.Callback, just to verify the Reader's part?
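If that route is taken, the idea might look roughly like the wrapper below. This is only a sketch: emitFunc is a placeholder type, not the real pkg/stanza callback signature (which lives in the package internals and has changed over time); the point is simply delaying each token to approximate a slow downstream.

```go
package stanzatest

import (
	"context"
	"time"
)

// emitFunc stands in for whatever emit/sink callback the test installs.
type emitFunc func(ctx context.Context, token []byte) error

// withDelay wraps an emit function so every token is held back for a while,
// approximating backpressure in a TestRestartOffsets-style test.
func withDelay(next emitFunc, d time.Duration) emitFunc {
	return func(ctx context.Context, token []byte) error {
		select {
		case <-time.After(d):
			return next(ctx, token)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```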

@djaglowski (Member) commented:

From what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data

I think what may be missing in our coverage is that when the reader emits a token, it is processed within pkg/stanza/adapter before being emitted from the receiver. We have pretty solid test coverage in pkg/stanza/fileconsumer, but I think A2 would test the combination of fileconsumer -> adapter. This kind of test may have caught #31074, for example.

@crobert-1 (Member) commented:

It looks like there's been some good discussion here, and a general direction is agreed upon. Removing needs triage.

@crobert-1 added the test coverage label and removed the needs triage label on Mar 29, 2024
@ChrsMark (Member, Author) commented Apr 2, 2024

From what I can see, the Reader tries to send the log token downstream by calling the emitFunc/processFunc and only updates the offset on success. It is then the downstream's responsibility to take care of the data

I think what may be missing in our coverage is that when the reader emits a token, it is processed within pkg/stanza/adapter before being emitted from the receiver. We have pretty solid test coverage in pkg/stanza/fileconsumer, but I think A2 would test the combination of fileconsumer -> adapter. This kind of test may have caught #31074, for example.

Cool, I think something like TestShutdownFlush from swiatekm@9747d92, which tried to prove #31074, is really close to what we need. Along with solving #31074, we should include a test like this as well.

github-actions bot (Contributor) commented Jun 3, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jun 3, 2024
@ChrsMark (Member, Author) commented Jun 3, 2024

/label -Stale

@djaglowski removed the Stale label on Jun 3, 2024
@ycombinator (Contributor) commented:

Reading the discussion, it seems like the pending work at this point is #31074. @ChrsMark @djaglowski can we close this issue here then?

@ChrsMark (Member, Author) commented:

From my perspective I'm fine with closing this. Most of the cases were already covered and some extra ones were added as well. Only the issue described in #31074 seems to not be covered.

It can be re-opened if there is a tangible need for the B) Nice to have(s) part.
