Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: connection pool - Allow a cancel callback of a request cancel other requests #7345

Merged
merged 9 commits into from
Jul 29, 2019

Conversation

yuval-k
Copy link
Contributor

@yuval-k yuval-k commented Jun 20, 2019

Fixes an ASSERT crash with the connection pool in the following scenario:

  • while using the JWT filter
  • with two providers (under requires_all) set to the same remote JWKS
  • remote JWKS server was down
  • sending a request to with a JWT triggers an ASSERT in envoy (specifically in LinkedObject::removeFromList).

After a gdb session I found out the following:

  • the two providers try to fetch the key. There is no connection to the host, so two pending requests are created with ConnPoolImplBase::newPendingRequest
  • the host is down so ConnPoolImplBase::purgePendingRequests is called
  • request->callbacks_.onPoolFailure is called for the first request
  • this propagates to the JWT filter, that issues a failure back to the client (sendLocalReply in Filter::onComplete)
  • sendLocalReply ends the stream which calls the JWT filter onDestory
  • onDestory (eventually) cancels the pending requests
  • the pending requests calls ConnPoolImplBase::onPendingRequestCancel (all of this from the middle of ConnPoolImplBase::purgePendingRequests's call to UpstreamRequest::onPoolFailure)
  • ConnPoolImplBase::onPendingRequestCancel tries to remove the pending request for the list, but it is not there any-more as purgePendingRequests moved the list to a stack variable
  • this triggers an ASSERT in LinkedObject::removeFromList as it cannot find the element in the list.

The fix is to move the cancelling requests their own list, and remove pending requests from that new list as long as it is not empty.

Risk Level: Mid\High
Testing: Added unit test
Docs Changes: N\A
Release Notes: N\A

stacktrace.txt

Signed-off-by: Yuval Kohavi <[email protected]>
@yuval-k yuval-k force-pushed the fix-conn-pool-cancel-crash branch from a6c35d3 to b76a968 Compare June 20, 2019 22:46
@yuval-k yuval-k changed the title http: fix conpool crash http: fix conpool ASSERT Jun 20, 2019
@yuval-k yuval-k changed the title http: fix conpool ASSERT http: fix conpool crash due to ASSERT Jun 20, 2019
@snowp
Copy link
Contributor

snowp commented Jun 20, 2019

This seems reasonable to me, @mattklein123 WDYT?

@mattklein123 mattklein123 self-assigned this Jun 21, 2019
@mattklein123
Copy link
Member

From a very quick look I'm not convinced this is the fix we want, but at a high level I think I understand what is going on and I think this is on the right track. Thanks a ton for digging in so deeply.

I think the next step is can you write a test that crashes? I think that will help better understand the scenario and then we can walk through what the right solution is? Thank you!

/wait

Signed-off-by: Yuval Kohavi <[email protected]>
@yuval-k
Copy link
Contributor Author

yuval-k commented Jun 21, 2019

@mattklein123 - added a test that crashes.
I tried to extend the tests in http1/conn_pool_test.cc, but it was a bit complicated. so instead created a new test file for the base class.
also undid my changes so the test will indeed crash.

@mattklein123
Copy link
Member

@yuval-k sorry why can't you add an HTTP/1 test? I understand the problem and it seems like you should be able to setup a test that repros the problem without going directly into the base class?

/wait-any

@yuval-k
Copy link
Contributor Author

yuval-k commented Jun 21, 2019

Hi @mattklein123 , I will be out of the office (big family event) this coming week so I'll have to get back to this on the following week (week of June30th).

@mattklein123 mattklein123 added no stalebot Disables stalebot from closing an issue waiting labels Jun 22, 2019
yuval-k added 2 commits July 5, 2019 11:32
Signed-off-by: Yuval Kohavi <[email protected]>
Signed-off-by: Yuval Kohavi <[email protected]>
@yuval-k
Copy link
Contributor Author

yuval-k commented Jul 6, 2019

Hi @mattklein123, I added a failing test via the http1 conn-pool. would love your insight on how to fix this bug.
My initial idea (in the first commit) was to move the cancelling handles to a separate list pending_requests_to_purge_, and remove from this new list if it isn't empty.
The idea being that if the new list isn't empty, it must be because it was called in the loop in purgePendingRequests, so it must be for one of the existing handles.

@mattklein123
Copy link
Member

@yuval-k I think my original comment still stands: can you get a test for the HTTP/1.1 that fails? I don't think we should be testing the base class. Also, once that works, feel free to put your proposed fix back and then we can look at it all together. Thank you!

/wait

@yuval-k
Copy link
Contributor Author

yuval-k commented Jul 10, 2019

Hi @mattklein123 See my last commit - I added a failing http1 unit test here: 458c4fb - Did you mean adding an integration test?
In the mean time i'll remove the base class test (The reason I left it there was thinking that we want to verify this behaviour across different sub classes).

Signed-off-by: Yuval Kohavi <[email protected]>
@mattklein123
Copy link
Member

Ah OK sorry, I missed that. Can you add your fix now? The I can take a look. Thank you!

/wait

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, at a high level this makes sense to me but some more comments and cleanups will help me solidify my understanding. Thanks for working on this!

/wait

ConnPoolCallbacks callbacks1;
uint32_t pool_failure_calls{};
EXPECT_CALL(callbacks1.pool_failure_, ready())
.Times(AtMost(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use WillOnce here and below? This should be deterministic?

Copy link
Contributor Author

@yuval-k yuval-k Jul 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

essentially one and only one of those of callbacks should be called (as one of them cancels the other).
it depends on the implementation of the conn pool which one is first; while i can check the impl to see which one it will be, i thought it is better that the test covers the accepted behaviour to allow the impl to change without breaking the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken, but my preference would be simplify the test for now. Feel free to leave a comment about how it's tied to the implementation.

// Simulate connection failure.
EXPECT_CALL(conn_pool_, onClientDestroy());
conn_pool_.test_clients_[0].connection_->raiseEvent(Network::ConnectionEvent::RemoteClose);
EXPECT_EQ(1, pool_failure_calls);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I asked this above. Why is this not deterministic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, see my answer above; essentially we need to see that only one of the callbacks was called.

source/common/http/conn_pool_base.h Show resolved Hide resolved
@mattklein123 mattklein123 removed the no stalebot Disables stalebot from closing an issue label Jul 11, 2019
Signed-off-by: Yuval Kohavi <[email protected]>
@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Jul 18, 2019
@mattklein123
Copy link
Member

Sorry am traveling will get to this next week.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this LGTM. Please update the PR title (and description if needed) to be more descriptive of the actual problem.

/wait

source/common/http/conn_pool_base.cc Outdated Show resolved Hide resolved
source/common/http/conn_pool_base.cc Outdated Show resolved Hide resolved
source/common/http/conn_pool_base.cc Outdated Show resolved Hide resolved
@mattklein123 mattklein123 removed the no stalebot Disables stalebot from closing an issue label Jul 22, 2019
Signed-off-by: Yuval Kohavi <[email protected]>
@yuval-k yuval-k changed the title http: fix conpool crash due to ASSERT http: connection pool - Allow a cancel callback of a request cancel other requests Jul 23, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, one follow up question.

/wait-any

source/common/http/conn_pool_base.cc Show resolved Hide resolved
Signed-off-by: Yuval Kohavi <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 33f3757 into envoyproxy:master Jul 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants