Do not duplicate connections in connection pool after rebuild #591
It seems connection pool reloading does not work properly in `elasticsearch-transport`. I am using version 6.1.0. My client config:
I did some debugging on retry logic and I found this in my logs:
After more logging:
So basically, after a failure (connection error or timeout), `elasticsearch-transport` tries to reload connections. The sniffer returns the same set of hosts (as expected), but the reload method does not remove the "dead" connections and still adds new ones, so the pool keeps growing (containing the same 3 hosts over and over).

This gets really problematic when the ES cluster goes down. Retry logic kicks in at line
elasticsearch-ruby/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb
Line 304 in 8372d37
and it just retries too many times (since the pool has grown too big).
The expected result would be to try each connection once (3 in my case), or as many times as the `retry_on_failure` option allows, and raise an error after that.

The fix leaves all existing connections (even "dead" ones) in the pool if the hosts are the same, and lets the resurrection process kick in and handle things properly.
Another approach could be removing "dead" connections before adding new ones, but I don't think it's a good idea (some reasons here: 331e4ee).
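
The idea behind the fix can be illustrated with a minimal, standalone sketch. This is not the code from this PR or the `elasticsearch-transport` API: `Connection`, `rebuild_pool`, and the host hash shape are hypothetical names made up for illustration. Dropping connections for hosts that are missing from the sniffed list is also an assumption of the sketch.

```ruby
# Hypothetical stand-in for a pooled connection (not the library's class).
Connection = Struct.new(:host, :port, :dead) do
  # Identity used to decide whether two connections point at the same node.
  def key
    "#{host}:#{port}"
  end
end

# Rebuild the pool from a freshly sniffed host list without duplicating entries.
# Existing connections to still-present hosts are kept, even if marked dead,
# so that resurrection can deal with them later.
def rebuild_pool(pool, sniffed_hosts)
  sniffed_keys = sniffed_hosts.map { |h| "#{h[:host]}:#{h[:port]}" }

  # Keep connections whose host is still reported by the sniffer.
  kept = pool.select { |c| sniffed_keys.include?(c.key) }
  kept_keys = kept.map(&:key)

  # Create connections only for hosts that are not in the pool yet.
  added = sniffed_hosts
          .reject { |h| kept_keys.include?("#{h[:host]}:#{h[:port]}") }
          .map { |h| Connection.new(h[:host], h[:port], false) }

  kept + added
end

hosts = [
  { host: 'es1', port: 9200 },
  { host: 'es2', port: 9200 },
  { host: 'es3', port: 9200 }
]

pool = rebuild_pool([], hosts)
pool.first.dead = true                      # simulate a failed node
3.times { pool = rebuild_pool(pool, hosts) } # repeated reloads, same hosts

puts pool.size           # the pool stays at 3 entries
puts pool.count(&:dead)  # the dead connection is kept, not replaced
```

With this approach the pool size is bounded by the number of distinct hosts, so the number of retries no longer depends on how many reloads have happened.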
After the fix, the production logs look normal: