Do not duplicate connections in connection pool after rebuild #591

deees · 2019-01-10T10:00:29Z

It seems connection pool reloading does not work properly for elasticsearch-transport. Using 6.1.0 version.

My client config:

hosts: ['http://1.2.3.4:9200', 'http://1.2.3.5:9200', 'http://1.2.3.6:9200'],
transport_options: ...,
reload_on_failure: true

I did some debugging on retry logic and I found this in my logs:

W, [2019-01-09T05:08:03.423829 #16537]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 8)
W, [2019-01-09T05:08:44.248169 #20150]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 8)
W, [2019-01-09T05:09:22.165731 #17094]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 11)
W, [2019-01-09T05:10:04.635238 #17891]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 5)
W, [2019-01-09T05:10:11.609438 #17465]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 12)
W, [2019-01-09T05:11:27.627451 #20064]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 4)
W, [2019-01-09T05:13:10.793077 #19312]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 10)
W, [2019-01-09T05:14:19.425391 #20661]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 7)
W, [2019-01-09T05:14:29.728092 #20909]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 7)
W, [2019-01-09T05:15:10.218102 #17891]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 6)

After more logging:

W, [2019-01-09T08:45:05.159219 #1749]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 4)
W, [2019-01-09T08:45:05.167816 #1749]  WARN -- : [1749] Dumpling connections
W, [2019-01-09T08:45:05.167991 #1749]  WARN -- : [1749] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:39:33 -0500, failures: 1)>
W, [2019-01-09T08:45:05.168128 #1749]  WARN -- : [1749] * host: http://1.2.3.5:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.168252 #1749]  WARN -- : [1749] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.168377 #1749]  WARN -- : [1749] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.775172 #1846]  WARN -- : [Faraday::TimeoutError] Reloading connections (attempt 1 of 5)
W, [2019-01-09T08:45:05.780262 #1671]  WARN -- : [Faraday::TimeoutError] Reloading connections (attempt 1 of 8)
W, [2019-01-09T08:45:05.789611 #1846]  WARN -- : [1846] Dumpling connections
W, [2019-01-09T08:45:05.789911 #1846]  WARN -- : [1846] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 07:24:12 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790083 #1846]  WARN -- : [1846] * host: http://1.2.3.5:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.790258 #1846]  WARN -- : [1846] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 08:12:59 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790449 #1846]  WARN -- : [1846] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790655 #1846]  WARN -- : [1846] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.791722 #1671]  WARN -- : [1671] Dumpling connections
W, [2019-01-09T08:45:05.791928 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:03:51 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792127 #1671]  WARN -- : [1671] * host: http://1.2.3.5:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792331 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 07:28:49 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792497 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 07:34:53 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792691 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 08:05:24 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792860 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:42:47 -0500, failures: 1)>
W, [2019-01-09T08:45:05.793089 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.793244 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (alive, failures: 0)>

So basically after some failure (connection error or timeout), elasticsearch-transport tries to reload connections, sniffer returns same set of hosts (as expected), but reload method does not really remove "dead" connections, and adds new connections, the pool keeps getting bigger and bigger (containing same 3 hosts over and over).
This gets really problematic when ES cluster goes down. Retry logic kicks in at line

elasticsearch-ruby/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb

Line 304 in 8372d37

if @options[:reload_on_failure] and tries < connections.all.size

and it just retries too many times (since the pool have grown too big).

Expected result would be to just try all (3 in my case) connections (or as many times as retry_on_failure option is set) and throw error after that.

The fix leaves all existing connections (even "dead" ones) in the pool if hosts are the same. And lets resurrection process to kick in and handle things properly.

Another approach could be removing "dead" connections before adding new ones, but I don't think it's a good idea (some reasons here: 331e4ee).

After the fix in production logs look normal:

W, [2019-01-10T04:05:41.884891 #10627]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:06:06.540159 #10531]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:06:14.701250 #10034]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:07:53.245979 #9107]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:08:14.118875 #11710]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)

elasticmachine · 2019-01-10T10:00:32Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

estolfo · 2019-01-15T11:25:38Z

Hi @deees Thanks very much for reporting this and for the pull request. What you've proposed makes sense, I'll merge this and also backport to the 6.x branch. Thanks!

#591)

deees · 2019-01-15T15:05:15Z

Great, thanks!

Do not duplicate connections in connection pool after rebuild

f1cf92e

estolfo merged commit 16887e5 into elastic:master Jan 15, 2019

estolfo pushed a commit that referenced this pull request Jan 15, 2019

[CLIENT] Do not duplicate connections in connection pool after rebuild (

e917f2b

#591)

deees deleted the fix-infinite-connection-pool branch January 15, 2019 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not duplicate connections in connection pool after rebuild #591

Do not duplicate connections in connection pool after rebuild #591

deees commented Jan 10, 2019

elasticmachine commented Jan 10, 2019

estolfo commented Jan 15, 2019

deees commented Jan 15, 2019

Do not duplicate connections in connection pool after rebuild #591

Do not duplicate connections in connection pool after rebuild #591

Conversation

deees commented Jan 10, 2019

elasticmachine commented Jan 10, 2019

estolfo commented Jan 15, 2019

deees commented Jan 15, 2019