Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: wait workers to start before draining parent. #14319

Merged
merged 15 commits into from
Dec 19, 2020

Conversation

caitong93
Copy link

@caitong93 caitong93 commented Dec 8, 2020

Signed-off-by: Tong Cai [email protected]

Commit Message: server: wait workers to start before draining parent.

Additional Description:

  • Add test.
  • The initialization process takes longer. What's the impact ?
    Manual test pass. Inject 5s delay in socket() call, keep sending traffic to Envoy during hot restarts. Everything seems good.

Risk Level: medium
Testing:
Docs Changes:
Release Notes:
[Optional Fixes #Issue]

Fixes #14295

@caitong93
Copy link
Author

Trying to fix. It seems more complex than I thought, and i will spent sometime to figure out the test logic of hot restart.

@mattklein123 mattklein123 self-assigned this Dec 8, 2020
@mattklein123
Copy link
Member

At a high level this looks correct, so let me know if you have any questions or want me to do a further review.

/wait

Signed-off-by: Tong Cai <[email protected]>
@caitong93
Copy link
Author

@mattklein123

Basically ready for review.

@caitong93 caitong93 marked this pull request as ready for review December 10, 2020 14:35
@caitong93
Copy link
Author

caitong93 commented Dec 10, 2020

The new stage WorkerStarted is added for testing. I'm still trying to add code to check restarter_.drainParentListeners will not be called before workers started.(there is a deadlock problem to solve)

Signed-off-by: Tong Cai <[email protected]>
@caitong93
Copy link
Author

caitong93 commented Dec 11, 2020

Updated. Tests pass.

Signed-off-by: Tong Cai <[email protected]>
server_ = nullptr;
thread_local_ = nullptr;
});

started.WaitForNotification();
EXPECT_TRUE(startup);
EXPECT_FALSE(shutdown);
EXPECT_TRUE(TestUtility::findGauge(stats_store_, "server.state")->used());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove server.state check here because it's non deterministic.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this looks mostly correct but I'm confused about why some of the changes were made. Thank you!

/wait

Comment on lines 377 to 378
// startup_ is true means Startup notifications have been called.
bool startup_{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this up into the variables section. Also startup_lifecycle_event_raised_ or something like that?

Comment on lines 720 to 723
if (!startup_) {
notifyCallbacksForStage(Stage::Startup);
startup_ = true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain these changes? It's not clear to my why they were made. Please add more comments both here and below if they are necessary.

Copy link
Author

@caitong93 caitong93 Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures Startup notifications to be sent first. Otherwise at LifecycleNotifications test , the notification order will be PostInit, WorkerStarted, Startup(because in static configuration, post_init_cb will be called immediately , before main thread dispatcher start), and deadlock will happen.(because we block callback at WorkerStarted stage)

/**
* All workers have started.
*/
WorkerStarted,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used? If this is needed WorkersStarted

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. My preference would be to not make prod changes like this just for tests. If you need synchronization hooks can you use https://github.com/envoyproxy/envoy/blob/master/source/server/listener_hooks.h instead? Thank you.

/wait

@@ -296,6 +296,7 @@ class ListenerManagerImplTest : public testing::Test {
std::unique_ptr<Network::MockConnectionSocket> socket_;
uint64_t listener_tag_{1};
bool enable_dispatcher_stats_{false};
std::function<void()> callback_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a mock and actually verify correct calls.

Signed-off-by: Tong Cai <[email protected]>
@mattklein123
Copy link
Member

Please merge main and check format.

/wait

Signed-off-by: Tong Cai <[email protected]>
@caitong93
Copy link
Author

@mattklein123 Updated, PTAL.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this LGTM. Just a few test questions.

/wait-any

@@ -423,6 +440,37 @@ TEST_P(ServerInstanceImplTest, LifecycleNotifications) {
server_thread->join();
}

TEST_P(ServerInstanceImplTest, DrainParentListenerAfterWorkersStarted) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run this with --runs_per_test=1000 to make sure it doesn't flake?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

 Target //test/server:server_test up-to-date:
  bazel-bin/test/server/server_test
INFO: Elapsed time: 4724.860s, Critical Path: 22.74s
INFO: 1001 processes: 1 internal, 1000 processwrapper-sandbox.
INFO: Build completed successfully, 1001 total actions
//test/server:server_test                                                PASSED in 21.8s
  Stats over 1000 runs: max = 21.8s, min = 18.5s, avg = 18.8s, dev = 0.3s

Comment on lines +614 to +616
if (isShutdown()) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have test coverage of this case? (Put an ASSERT in there and see if it's hit or look at coverage report)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test for this in the new commit.

@mattklein123
Copy link
Member

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14319 (comment) was created by @mattklein123.

see: more, trace.

@mattklein123
Copy link
Member

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14319 (comment) was created by @mattklein123.

see: more, trace.

@caitong93
Copy link
Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14319 (comment) was created by @caitong93.

see: more, trace.

@caitong93
Copy link
Author

Seems not like a network problem, will check what cause CI to fail when I got time.

@mattklein123
Copy link
Member

It's an unrelated OSX issue. Will just merge.

@mattklein123 mattklein123 merged commit 867b9e2 into envoyproxy:master Dec 19, 2020
@caitong93
Copy link
Author

Thanks for the review!

mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 22, 2020
* master: (30 commits)
  Deflaked: Guarddog_impl_test (envoyproxy#14475)
  [fuzz] add fuzz tests for hpack encoding and decoding (envoyproxy#13315)
  [filters] Prevent a filter from sending local reply and continue (envoyproxy#14416)
  oauth2: improving coverage (envoyproxy#14479)
  owners: Change dio email address (envoyproxy#14498)
  macos build: Fix ninja install (envoyproxy#14495)
  http: use OptRef helper to reduce some boilerplate (envoyproxy#14361)
  doc: update test/integration/README.md (envoyproxy#14485)
  server: wait workers to start before draining parent. (envoyproxy#14319)
  api: relax inline_string length limitation in DataSource (envoyproxy#14461)
  oauth: properly stop filter chain when a response was sent (envoyproxy#14476)
  listener: deprecate use_proxy_proto (envoyproxy#14406)
  deps: update cel and remove a patch (envoyproxy#14473)
  preconnect: rename: (envoyproxy#14474)
  coverage: ratcheting limits (envoyproxy#14472)
  grpc mux: fix sending node again after stream is reset (envoyproxy#14080)
  [test] Replace printers_include with printers_lib. (envoyproxy#14442)
  tcp: nodelay in the new pool (envoyproxy#14453)
  test: replace mock_methodn macros with mock_method (envoyproxy#14450)
  tcp: extending tcp integration test (envoyproxy#14451)
  ...

Signed-off-by: Michael Puncel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hot restart: child should wait sockets to be created when reuse_port is enabled
2 participants