flakey OS X integration test #2428

Closed
zuercher opened this issue Jan 22, 2018 · 14 comments

@zuercher
Member

Description:
test/integration:http2_integration_test is flaky on OS X.

I believe the exact test that hangs varies, so it's probably a timing issue in the test harness that OS X exposes and Linux does not. I've not had any luck reproducing this outside of CI.

@alyssawilk
Contributor

Tracking more data here (and @envoyproxy/maintainers please continue as you see more flakes)

https://pastebin.com/NJu7N5ri shows that Http2IntegrationTest took ~40s and the test timed out after 315 seconds, so it's either some generic "bad expectation" or a CookieRoutingNoCookieNoTtl failure. I recall other tests flaking too, so I think it's a bad expectation, but it'd be nice to confirm.

@zuercher
Member Author

zuercher commented Feb 5, 2018

Got that one to fail once (after a few thousand repetitions). It's stuck waiting in Envoy::FakeUpstream::waitForHttpConnection, but I don't know that this tells us much. I can reproduce it pretty reliably by running the test in a loop while building an unrelated project.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7d68acee libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff7d7c7662 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x00007fff7b570d43 libc++.1.dylib`std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) + 93
    frame #3: 0x0000000105fb693b http2_integration_test`std::__1::cv_status std::__1::condition_variable::wait_for<long long, std::__1::ratio<1l, 1000000l> >(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l> > const&) + 3771
    frame #4: 0x0000000105f9e68f http2_integration_test`std::__1::cv_status std::__1::condition_variable::wait_until<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l> > >(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l> > > const&) + 479
    frame #5: 0x0000000105fa9a19 http2_integration_test`Envoy::FakeUpstream::waitForHttpConnection(Envoy::Event::Dispatcher&, std::__1::vector<std::__1::unique_ptr<Envoy::FakeUpstream, std::__1::default_delete<Envoy::FakeUpstream> >, std::__1::allocator<std::__1::unique_ptr<Envoy::FakeUpstream, std::__1::default_delete<Envoy::FakeUpstream> > > >&) + 1609
    frame #6: 0x0000000105e5a274 http2_integration_test`Envoy::Http2RingHashIntegrationTest::sendMultipleRequests(int, Envoy::Http::TestHeaderMapImpl, std::__1::function<void (Envoy::IntegrationStreamDecoder&)>) + 4772
    frame #7: 0x0000000105e5cd03 http2_integration_test`Envoy::Http2RingHashIntegrationTest_CookieRoutingNoCookieNoTtl_Test::TestBody() + 2675
    frame #8: 0x0000000105e5d1fc http2_integration_test`non-virtual thunk to Envoy::Http2RingHashIntegrationTest_CookieRoutingNoCookieNoTtl_Test::TestBody() + 28
    frame #9: 0x000000010783ed6e http2_integration_test`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 126
    frame #10: 0x0000000107814e9b http2_integration_test`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 123
    frame #11: 0x0000000107814dc6 http2_integration_test`testing::Test::Run() + 198
    frame #12: 0x000000010781630d http2_integration_test`testing::TestInfo::Run() + 221
    frame #13: 0x00000001078178cc http2_integration_test`testing::TestCase::Run() + 236
    frame #14: 0x00000001078272db http2_integration_test`testing::internal::UnitTestImpl::RunAllTests() + 1019
    frame #15: 0x00000001078419be http2_integration_test`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 126
    frame #16: 0x0000000107826c9b http2_integration_test`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 123
    frame #17: 0x0000000107826b68 http2_integration_test`testing::UnitTest::Run() + 408
    frame #18: 0x0000000106eda9a1 http2_integration_test`RUN_ALL_TESTS() + 17
    frame #19: 0x0000000106eda814 http2_integration_test`Envoy::TestRunner::RunTests(int, char**) + 1076
    frame #20: 0x0000000106eda29f http2_integration_test`main + 1983
    frame #21: 0x00007fff7d53b115 libdyld.dylib`start + 1
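
The trace above shows the main thread parked in a timed condition-variable wait beneath waitForHttpConnection. A minimal sketch of that waiting pattern follows, assuming a hypothetical ConnectionWaiter class (this is not Envoy's FakeUpstream, and the method names are illustrative only). It adds an overall deadline so that a connection that never arrives yields a failed wait rather than a hang that lasts until the test runner's timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a fake upstream that a test waits on.
// Not Envoy's FakeUpstream; names are illustrative only.
class ConnectionWaiter {
public:
  // Called by whichever thread accepts the connection.
  void notifyConnected() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      connected_ = true;
    }
    cv_.notify_one();
  }

  // Blocks until a connection has been signalled or the overall deadline
  // passes. Returns true on success and false on timeout.
  bool waitForConnection(std::chrono::milliseconds overall_timeout,
                         std::chrono::milliseconds poll_interval) {
    assert(poll_interval.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + overall_timeout;
    std::unique_lock<std::mutex> lock(mutex_);
    while (!connected_) {
      if (std::chrono::steady_clock::now() >= deadline) {
        return false; // Give up instead of waiting forever.
      }
      // Timed wait: wakes on notify, on timeout, or spuriously, then the
      // loop re-checks the predicate.
      cv_.wait_for(lock, poll_interval);
    }
    return true;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool connected_{false};
};
```

With a bounded wait like this, a missed connection surfaces as a specific failed expectation in the test log, which is easier to attribute than a whole-target timeout.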

@alyssawilk
Contributor

Internally we have an injector for trickle-write tests one could (fairly) easily plug into general integration tests. I'm thinking of adding one to Envoy since I think the packetization has caused most of the macos behavioral differences and I'd be happy to work on debugging these if I could only repro.

Semi-related, we were discussing internally whether there is any way to set things up, for macOS and for other build systems as we add them, so that other folks can repro. Currently the only way is to send out a PR with debug logging that you tell folks to ignore, and to open and close the issue a whole lot, which is clearly suboptimal :-P

@alyssawilk
Contributor

Well, interesting. I decided to take the hacky approach and overwrote

-int OwnedImpl::write(int fd) { return evbuffer_write(buffer_.get(), fd); }
+int OwnedImpl::write(int fd) { return evbuffer_write_atmost(buffer_.get(), fd, 10); }

This causes a LOT of tests to fail. Sadly, http2_integration_test is still pretty solid for me. How many runs does it take, on average, before it fails for you? I can up my --runs_per_test if I'm not being aggressive enough.
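
To illustrate why capping each write at 10 bytes can break tests, here is a minimal sketch that simulates trickle writes on plain strings. It does not use libevent or Envoy's buffer classes; trickleWrite and the two reader functions are hypothetical names introduced only for this example. A reader that assumes a whole message arrives in one read fails once the data is split, while a reader that accumulates data does not.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Drains `pending` in pieces of at most `max_bytes`, one piece per simulated
// write call. Each element models what a single read on the peer might see.
std::vector<std::string> trickleWrite(std::string pending, std::size_t max_bytes) {
  assert(max_bytes > 0);
  std::vector<std::string> chunks;
  while (!pending.empty()) {
    const std::size_t n = std::min(max_bytes, pending.size());
    chunks.push_back(pending.substr(0, n));
    pending.erase(0, n);
  }
  return chunks;
}

// A reader that wrongly assumes the whole message arrives in the first read.
bool naiveReaderSeesMessage(const std::vector<std::string>& chunks,
                            const std::string& message) {
  return !chunks.empty() && chunks.front() == message;
}

// A reader that accumulates reads until the full message is present.
bool bufferingReaderSeesMessage(const std::vector<std::string>& chunks,
                                const std::string& message) {
  std::string received;
  for (const auto& chunk : chunks) {
    received += chunk;
    if (received.size() >= message.size()) {
      return received.compare(0, message.size(), message) == 0;
    }
  }
  return false;
}
```

An 18-byte request written with a 10-byte cap arrives as two chunks, so any expectation written against "one write, one read" no longer holds.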

@htuch
Member

htuch commented Feb 6, 2018

@htuch
Member

htuch commented Feb 7, 2018

@mattklein123 added the "help wanted" label and removed the "bug" label on Feb 8, 2018
@zuercher
Member Author

zuercher commented Feb 9, 2018

@alyssawilk to get the stack traces I ran the test directly from the command line in a loop, so I wasn't using --runs_per_test. I tried to reproduce it today with bazel and --runs_per_test, and it's not as reliable as it was before.

@htuch
Member

htuch commented Mar 5, 2018

@htuch
Member

htuch commented Mar 7, 2018

@zuercher any thoughts on how we can resolve this? We had some active confusion in #2688 which led to the author rolling back a change due to perceived introduction of OS X hangs.

@zuercher
Member Author

zuercher commented Mar 7, 2018

Every time I try to add extra logging to help me understand the problem, it disappears. I can try throwing dtrace at it. Maybe that will turn up something.

@mattklein123
Member

@zuercher I hate to propose this, but what if on OSX we retry the bazel test command once if it fails? (Or maybe there is something built-in to bazel to retest). This is obviously not great, but would probably cause almost all of the flakes to go away.

@ambuc
Contributor

ambuc commented May 2, 2018

I think the //test/integration:integration_test test is still flaky on mac: https://circleci.com/gh/envoyproxy/envoy/48033

htuch pushed a commit that referenced this issue May 2, 2018
Adds the --flaky_test_attempts flag for OS X CI builds to paper over the flaky tests noted in #2428. Tests with "integration" in their names will be retried on failure, and the CI will succeed if the retry succeeds.

Risk Level: Low
Testing: n/a
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <[email protected]>
@zuercher
Member Author

zuercher commented May 2, 2018

Per @mattklein123's suggestion, we now retry integration tests on OS X once if they fail, to make the flakiness less disruptive.
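
The retry policy described here can be sketched as below. This is only an illustration of the policy, not Bazel's implementation of --flaky_test_attempts; runWithFlakyRetries and its parameters are hypothetical names. A test whose name matches the filter gets up to max_attempts runs and counts as passing if any run passes, while every other test runs exactly once.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>

// Runs `test` up to `max_attempts` times when `test_name` matches
// `retry_filter`, otherwise exactly once. Returns true if any permitted
// attempt passes. Hypothetical helper, for illustration only.
bool runWithFlakyRetries(const std::string& test_name,
                         const std::function<bool()>& test,
                         const std::regex& retry_filter, int max_attempts) {
  assert(max_attempts >= 1);
  const int attempts =
      std::regex_search(test_name, retry_filter) ? max_attempts : 1;
  for (int i = 0; i < attempts; ++i) {
    if (test()) {
      return true; // A single passing attempt is treated as success.
    }
  }
  return false;
}
```

The trade-off is the one noted in the thread: retries hide the flake rather than fix it, so a real regression that fails only intermittently can also be masked.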

ramaraochavali pushed a commit to ramaraochavali/envoy that referenced this issue May 3, 2018
Adds the --flaky_test_attempts flag for OS X CI builds to paper over the flaky tests noted in envoyproxy#2428. Tests with "integration" in their names will be retried on failure, and the CI will succeed if the retry succeeds.

Risk Level: Low
Testing: n/a
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <[email protected]>
Signed-off-by: Rama <[email protected]>
@mattklein123
Member

I'm closing this as it seems pretty stable w/ the retry.

Shikugawa pushed a commit to Shikugawa/envoy that referenced this issue Mar 28, 2020
jpsim pushed a commit that referenced this issue Nov 28, 2022
Signed-off-by: GitHub Action <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit that referenced this issue Nov 29, 2022
Signed-off-by: GitHub Action <[email protected]>
Signed-off-by: JP Simard <[email protected]>