server: wait workers to start before draining parent. #14319
Signed-off-by: Tong Cai <[email protected]>
Trying to fix. It seems more complex than I thought, and I will spend some time figuring out the test logic of hot restart.
At a high level this looks correct, so let me know if you have any questions or want me to do a further review. /wait
Basically ready for review.
The new stage
Updated. Tests pass.
test/server/server_test.cc
  server_ = nullptr;
  thread_local_ = nullptr;
});

started.WaitForNotification();
EXPECT_TRUE(startup);
EXPECT_FALSE(shutdown);
EXPECT_TRUE(TestUtility::findGauge(stats_store_, "server.state")->used());
Remove the server.state check here because it's non-deterministic.
At a high level this looks mostly correct but I'm confused about why some of the changes were made. Thank you!
/wait
source/server/server.h
// startup_ is true means Startup notifications have been called.
bool startup_{};
Move this up into the variables section. Also, rename it to startup_lifecycle_event_raised_ or something like that?
source/server/server.cc
if (!startup_) {
  notifyCallbacksForStage(Stage::Startup);
  startup_ = true;
}
Can you explain these changes? It's not clear to me why they were made. Please add more comments both here and below if they are necessary.
This ensures that Startup notifications are sent first. Otherwise, in the LifecycleNotifications test, the notification order would be PostInit, WorkerStarted, Startup (because with a static configuration, post_init_cb is called immediately, before the main thread dispatcher starts), and a deadlock would happen (because we block the callback at the WorkerStarted stage).
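To make the ordering argument concrete, here is a minimal sketch of the "raise Startup at most once, from whichever path gets there first" idea. It is not the actual server code: `LifecycleNotifier`, `ensureStartupRaised`, `onPostInit`, and `onDispatcherStart` are hypothetical names chosen for illustration, and the real implementation dispatches through the server's own callback machinery.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the server's lifecycle stages.
enum class Stage { Startup, PostInit };

// Hypothetical notifier that guarantees Startup is raised before PostInit.
class LifecycleNotifier {
public:
  void registerCallback(Stage stage, std::function<void()> cb) {
    callbacks_.push_back({stage, std::move(cb)});
  }

  // Raises Startup at most once, no matter which path calls it first.
  void ensureStartupRaised() {
    if (!startup_raised_) {
      startup_raised_ = true;
      notify(Stage::Startup);
    }
  }

  // Assumed to run when initialization completes. With a static
  // configuration this can happen before the dispatcher loop starts.
  void onPostInit() {
    ensureStartupRaised();
    notify(Stage::PostInit);
  }

  // Assumed to run when the main dispatcher loop starts.
  void onDispatcherStart() { ensureStartupRaised(); }

private:
  struct Entry {
    Stage stage;
    std::function<void()> cb;
  };

  void notify(Stage stage) {
    for (auto& entry : callbacks_) {
      if (entry.stage == stage) {
        entry.cb();
      }
    }
  }

  std::vector<Entry> callbacks_;
  bool startup_raised_{false};
};

// Records the order seen when post-init fires before the dispatcher starts.
std::vector<std::string> observedOrder() {
  std::vector<std::string> order;
  LifecycleNotifier notifier;
  notifier.registerCallback(Stage::Startup, [&order] { order.push_back("Startup"); });
  notifier.registerCallback(Stage::PostInit, [&order] { order.push_back("PostInit"); });
  notifier.onPostInit();
  notifier.onDispatcherStart();
  return order;
}
```

Without the guard in `onPostInit`, the same sequence would record PostInit before Startup, which is the ordering problem described above.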
/**
 * All workers have started.
 */
WorkerStarted,
This is not used? If this is needed, WorkersStarted
It's used here.
I see. My preference would be to not make prod changes like this just for tests. If you need synchronization hooks can you use https://github.com/envoyproxy/envoy/blob/master/source/server/listener_hooks.h instead? Thank you.
/wait
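As a rough illustration of the suggestion above, a test-only synchronization hook can be injected instead of adding a new production lifecycle stage. The sketch below is an assumption-laden stand-in, not the contents of listener_hooks.h: `Hooks`, `DefaultHooks`, `RecordingHooks`, `MiniServer`, and the method name `onWorkersStarted` are all hypothetical here.

```cpp
#include <cassert>

// Hypothetical hook interface; the method name is an assumption.
class Hooks {
public:
  virtual ~Hooks() = default;
  virtual void onWorkersStarted() = 0;
};

// What production code would pass in: a no-op implementation.
class DefaultHooks : public Hooks {
public:
  void onWorkersStarted() override {}
};

// What a test would pass in: records calls so they can be verified.
class RecordingHooks : public Hooks {
public:
  void onWorkersStarted() override { ++workers_started_calls_; }
  int workersStartedCalls() const { return workers_started_calls_; }

private:
  int workers_started_calls_{0};
};

// Hypothetical server fragment that receives its hooks by reference.
class MiniServer {
public:
  explicit MiniServer(Hooks& hooks) : hooks_(hooks) {}

  void startWorkers() {
    // ... worker threads would be started here ...
    hooks_.onWorkersStarted();
  }

private:
  Hooks& hooks_;
};

// Returns how many times the hook fired for one startWorkers() call.
int callsSeenByTest() {
  RecordingHooks hooks;
  MiniServer server(hooks);
  server.startWorkers();
  return hooks.workersStartedCalls();
}
```

The production path only ever sees the no-op hook, so the test gets its synchronization point without a user-visible stage being added.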
@@ -296,6 +296,7 @@ class ListenerManagerImplTest : public testing::Test {
  std::unique_ptr<Network::MockConnectionSocket> socket_;
  uint64_t listener_tag_{1};
  bool enable_dispatcher_stats_{false};
  std::function<void()> callback_;
This should be a mock and actually verify correct calls.
Please merge main and check format. /wait
@mattklein123 Updated, PTAL.
Thanks this LGTM. Just a few test questions.
/wait-any
@@ -423,6 +440,37 @@ TEST_P(ServerInstanceImplTest, LifecycleNotifications) {
  server_thread->join();
}

TEST_P(ServerInstanceImplTest, DrainParentListenerAfterWorkersStarted) {
Can you run this with --runs_per_test=1000 to make sure it doesn't flake?
Done.
Target //test/server:server_test up-to-date:
bazel-bin/test/server/server_test
INFO: Elapsed time: 4724.860s, Critical Path: 22.74s
INFO: 1001 processes: 1 internal, 1000 processwrapper-sandbox.
INFO: Build completed successfully, 1001 total actions
//test/server:server_test PASSED in 21.8s
Stats over 1000 runs: max = 21.8s, min = 18.5s, avg = 18.8s, dev = 0.3s
if (isShutdown()) {
  return;
}
Do you have test coverage of this case? (Put an ASSERT in there and see if it's hit or look at coverage report)
Added test for this in the new commit.
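For readers following along, the guard under discussion can be exercised on both paths with something like the sketch below. It is only an illustration under assumptions: `GuardedServer`, `makeWorkersStartedCallback`, and `parentDrained` are hypothetical names, and the real callback does more than set a flag.

```cpp
#include <cassert>
#include <functional>

// Hypothetical, heavily simplified server used only to show the guard.
class GuardedServer {
public:
  void shutdown() { shutdown_ = true; }
  bool isShutdown() const { return shutdown_; }
  bool parentDrained() const { return parent_drained_; }

  // Returns the callback that would run once all workers have started.
  std::function<void()> makeWorkersStartedCallback() {
    return [this] {
      // If the server was shut down before the workers finished starting,
      // skip the drain step entirely.
      if (isShutdown()) {
        return;
      }
      parent_drained_ = true; // stands in for draining the parent's listeners
    };
  }

private:
  bool shutdown_{false};
  bool parent_drained_{false};
};

// Runs the callback, optionally after shutdown, and reports whether the
// drain step executed.
bool drainedAfter(bool shutdown_first) {
  GuardedServer server;
  std::function<void()> cb = server.makeWorkersStartedCallback();
  if (shutdown_first) {
    server.shutdown();
  }
  cb();
  return server.parentDrained();
}
```

A test that covers the early-return branch needs the shutdown to happen between scheduling the callback and running it, which is what `drainedAfter(true)` simulates.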
/retest
Retrying Azure Pipelines:
/retest
Retrying Azure Pipelines:
/retest
Retrying Azure Pipelines:
This doesn't seem to be a network problem; I will check what caused CI to fail when I have time.
It's an unrelated OSX issue. Will just merge.
Thanks for the review!
* master: (30 commits)
  Deflaked: Guarddog_impl_test (envoyproxy#14475)
  [fuzz] add fuzz tests for hpack encoding and decoding (envoyproxy#13315)
  [filters] Prevent a filter from sending local reply and continue (envoyproxy#14416)
  oauth2: improving coverage (envoyproxy#14479)
  owners: Change dio email address (envoyproxy#14498)
  macos build: Fix ninja install (envoyproxy#14495)
  http: use OptRef helper to reduce some boilerplate (envoyproxy#14361)
  doc: update test/integration/README.md (envoyproxy#14485)
  server: wait workers to start before draining parent. (envoyproxy#14319)
  api: relax inline_string length limitation in DataSource (envoyproxy#14461)
  oauth: properly stop filter chain when a response was sent (envoyproxy#14476)
  listener: deprecate use_proxy_proto (envoyproxy#14406)
  deps: update cel and remove a patch (envoyproxy#14473)
  preconnect: rename: (envoyproxy#14474)
  coverage: ratcheting limits (envoyproxy#14472)
  grpc mux: fix sending node again after stream is reset (envoyproxy#14080)
  [test] Replace printers_include with printers_lib. (envoyproxy#14442)
  tcp: nodelay in the new pool (envoyproxy#14453)
  test: replace mock_methodn macros with mock_method (envoyproxy#14450)
  tcp: extending tcp integration test (envoyproxy#14451)
  ...
Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Tong Cai [email protected]
Commit Message: server: wait workers to start before draining parent.
Additional Description:
Manual test passes: injected a 5s delay in the socket() call and kept sending traffic to Envoy during hot restarts. Everything seems good.
Risk Level: medium
Testing:
Docs Changes:
Release Notes:
Fixes #14295
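The core idea of the change, deferring the parent drain until every worker has reported that it started, can be sketched as a countdown that fires a completion callback. This is a single-threaded illustration under assumptions, not the merged implementation: `WorkerStartTracker` and `drainRanAfter` are hypothetical, and real code would have each worker post its notification back to the main thread rather than call in directly.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical tracker: runs `on_all_started` once the expected number of
// workers have reported in. Not thread-safe; assumes all reports arrive on
// the main thread.
class WorkerStartTracker {
public:
  WorkerStartTracker(int expected, std::function<void()> on_all_started)
      : remaining_(expected), on_all_started_(std::move(on_all_started)) {}

  // Called once per worker after it has started.
  void workerStarted() {
    if (remaining_ > 0 && --remaining_ == 0) {
      on_all_started_();
    }
  }

private:
  int remaining_;
  std::function<void()> on_all_started_;
};

// Simulates `reported` out of `workers` start notifications and returns
// whether the drain step ran.
bool drainRanAfter(int workers, int reported) {
  bool parent_drained = false;
  WorkerStartTracker tracker(workers, [&parent_drained] {
    parent_drained = true; // stands in for asking the parent to drain
  });
  for (int i = 0; i < reported; ++i) {
    tracker.workerStarted();
  }
  return parent_drained;
}
```

With this shape, a slow `socket()` call in one worker delays only the drain request; the parent keeps accepting traffic until the last worker reports in.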