Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: logging deadlock #2346

Merged
merged 2 commits into from
Apr 4, 2024
Merged

Conversation

stevenh
Copy link
Collaborator

@stevenh stevenh commented Mar 12, 2024

What does this PR do?

Fix logging deadlock, causing lots of test timeouts.

This refactors how logging shutdown is handled, eliminating unnecessary captures, use idiomatic wait group to signal processor completion and remove unnecessary nil initialisation.

Fix race condition in log testing which was reading Msg while the processor was still running.

Switch to checking GITHUB_RUN_ID environment variable to detect GitHub as XDG_RUNTIME_DIR can be present in other situations.

@mdelapenya edit:

Capture all errors in the lifecycle hooks, refactoring the code in a single apply function that will be used in all the lifecycle hook methods.

Why is it important?

Tests and runtime is currently flakey due to race condition during start and stop of logging.

How to test this PR

Testing is hard as it's fixing a random race condition, but tests should now be more reliable.

Copy link

netlify bot commented Mar 12, 2024

Deploy Preview for testcontainers-go ready!

Name Link
🔨 Latest commit 35cb59d
🔍 Latest deploy log https://app.netlify.com/sites/testcontainers-go/deploys/660da0524323120008414c0e
😎 Deploy Preview https://deploy-preview-2346--testcontainers-go.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@stevenh
Copy link
Collaborator Author

stevenh commented Mar 13, 2024

Looks like 95% of tests are in the green, Test with reaper off seems to be a common failure, but unlike other runs is now just an issue with dangling containers, tests themselves are passing, is that a know issue?

@stevenh stevenh marked this pull request as ready for review March 13, 2024 09:14
@stevenh stevenh requested a review from a team as a code owner March 13, 2024 09:14
@stevenh stevenh force-pushed the fix/logging-deadlock branch from b2cc7d7 to bdd8886 Compare March 13, 2024 09:23
@stevenh stevenh marked this pull request as draft March 13, 2024 09:25
@stevenh stevenh marked this pull request as ready for review March 13, 2024 17:20
@mdelapenya mdelapenya self-assigned this Mar 25, 2024
@mdelapenya mdelapenya added the bug An issue with the library label Mar 25, 2024
@mdelapenya
Copy link
Member

Looks like 95% of tests are in the green, Test with reaper off seems to be a common failure, but unlike other runs is now just an issue with dangling containers, tests themselves are passing, is that a know issue?

I've taken a deep look into this error, and as you pointed out, it seems consistent for the reaper-off mode: so there are containers that are not manually cleaned by the tests. And the test that leaks that container is building it from a Dockerfile. So it makes me think the culprit is Test_MultiContainerLogConsumer_CancelledContext, not killing the first container that is created.

Could you please take a look? 🙏

@bearrito
Copy link
Contributor

bearrito commented Mar 25, 2024

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met. We hit the GOTO
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches, we return. I don't have the error off hand by I believe it was context cancelled
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 Defer runs
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop. This deadlocks here, as that channel is unbuffered.

@stevenh
Copy link
Collaborator Author

stevenh commented Apr 2, 2024

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met. We hit the GOTO
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches, we return. I don't have the error off hand by I believe it was context cancelled
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 Defer runs
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop. This deadlocks here, as that channel is unbuffered.

That seems to be an analysis on the old code, that shouldn't be possible with the new version?

@bearrito
Copy link
Contributor

bearrito commented Apr 2, 2024

@stevenh Probably not, I was more or less replying to #2346 (comment).

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

If you like, I can commit my reduced test case where I was able to replicate this ( at least with high-probability), then test against that?

@stevenh
Copy link
Collaborator Author

stevenh commented Apr 2, 2024

The Test with reaper off looks like a secondary issue.

What's happening is the test cancels the main context causing the logger to stop. This returns context cancelled error which is subsequently returned by stopLogProduction in terminatingHook during Test_MultiContainerLogConsumer_CancelledContext deferred call to c.Terminate and c2.Terminate which prevents Terminate from cleaning up properly as it returns early.

There's a few ways to fix:

  1. Don't return context cancelled error from stopLogProduction.
  2. Don't return context cancelled error from defaultLogConsumersHook.PreTerminates.
  3. Ignore context cancelled errors from hooks in Terminate.
  4. Perform best effort clean up in Terminate and return a multi error if needed.

It feels to me like 4. would be the best solution as it puts the caller in control of the behaviour.

A tertiary issue then becomes should all lifecycleHooks iterators e.g. terminatedHook and hence containerHookFn also be best effort?

Thoughts?

@stevenh
Copy link
Collaborator Author

stevenh commented Apr 2, 2024

@stevenh Probably not, I was more or less replying to #2346 (comment).

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

best-effort.patch

If you like, I can commit my reduced test case where I was able to replicate this ( at least with high-probability), then test against that?

If you have a reproduction and want to try with this branch + the attached patch which makes lifecycles and Terminate best effort that would be helpful to confirm it fixes your case too.

@mdelapenya
Copy link
Member

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if we failed the removal of the container, right?

@stevenh
Copy link
Collaborator Author

stevenh commented Apr 3, 2024

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if we failed the removal of the container, right?

Yes that makes sense to me, I have a version of this from last night I'll tweak and put up see what folks think.

stevenh added 2 commits April 3, 2024 19:30
Fix logging deadlock, causing lots of test timeouts.

This refactors how logging shutdown is handled, eliminating unnecessary
captures, use idiomatic wait group to signal processor completion and
remove unnecessary nil initialisation.

Fix race condition in log testing which was reading Msg while the
processor was still running.

Switch to checking GITHUB_RUN_ID environment variable to detect GitHub
as XDG_RUNTIME_DIR can be present in other situations.
Fix Docker container Terminate returning early when context is
cancelled, which was leaving orphaned running containers.

This ensures that all life cycle hooks are run even if one errors
returning a multi error if needed.
@stevenh stevenh force-pushed the fix/logging-deadlock branch from f5edc6b to 35cb59d Compare April 3, 2024 18:30
Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for dedicating time to resolve concurrency issues, that are always hard to detect and fix, even more with legacy code, so I appreciate your efforts here 🙇

}

// applyLifecycleHooks applies all lifecycle hooks reporting the container logs on error if logError is true.
func (c *DockerContainer) applyLifecycleHooks(ctx context.Context, logError bool, hooks func(lifecycleHooks ContainerLifecycleHooks) []ContainerHook) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@mdelapenya mdelapenya merged commit e8b8e55 into testcontainers:main Apr 4, 2024
102 checks passed
@stevenh stevenh deleted the fix/logging-deadlock branch April 4, 2024 09:30
mdelapenya added a commit to coffeegoddd/testcontainers-go that referenced this pull request Apr 12, 2024
* main: (115 commits)
  chore: create TLS certs in a consistent manner (testcontainers#2478)
  chore(deps): bump idna from 3.6 to 3.7 (testcontainers#2480)
  Elasticsearch disable CA retrieval when ssl is disabled (testcontainers#2475)
  fix: handle dockerignore exclusions properly (testcontainers#2476)
  chore: prepare for next minor development cycle (0.31.0)
  chore: use new version (v0.30.0) in modules and examples
  Fix url creation to handle query params when using HTTP wait strategy (testcontainers#2466)
  fix: data race on container run (testcontainers#2345)
  fix: logging deadlock (testcontainers#2346)
  feat(k6):Add remote test scripts (testcontainers#2350)
  feat: optimizes file copies to and from containers (testcontainers#2450)
  fix(exec): updates the `Multiplexed` opt to combine stdout and stderr (testcontainers#2452)
  Upgrade neo4j module to use features from v0.29.1 of testcontainers-go (testcontainers#2463)
  bug:Fix AMQPS url (testcontainers#2462)
  chore: more compose updates in comments
  chore: use "docker compose" (v2) instead of "docker-compose" (v1) (testcontainers#2464)
  chore(deps): bump github/codeql-action from 2.22.12 to 3.24.9 (testcontainers#2459)
  refactor: Add Weaviate modules tests (testcontainers#2447)
  feat(exitcode): Add exit code sugar method (testcontainers#2342)
  feat: add module to support InfluxDB v1.x (testcontainers#1703)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants