Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not duplicate connections in connection pool after rebuild #591

Merged
merged 1 commit into from
Jan 15, 2019

Conversation

deees
Copy link
Contributor

@deees deees commented Jan 10, 2019

It seems connection pool reloading does not work properly for elasticsearch-transport. Using 6.1.0 version.

My client config:

hosts: ['http://1.2.3.4:9200', 'http://1.2.3.5:9200', 'http://1.2.3.6:9200'],
transport_options: ...,
reload_on_failure: true

I did some debugging on retry logic and I found this in my logs:

W, [2019-01-09T05:08:03.423829 #16537]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 8)
W, [2019-01-09T05:08:44.248169 #20150]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 8)
W, [2019-01-09T05:09:22.165731 #17094]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 11)
W, [2019-01-09T05:10:04.635238 #17891]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 5)
W, [2019-01-09T05:10:11.609438 #17465]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 12)
W, [2019-01-09T05:11:27.627451 #20064]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 4)
W, [2019-01-09T05:13:10.793077 #19312]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 10)
W, [2019-01-09T05:14:19.425391 #20661]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 7)
W, [2019-01-09T05:14:29.728092 #20909]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 7)
W, [2019-01-09T05:15:10.218102 #17891]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 6)

After more logging:

W, [2019-01-09T08:45:05.159219 #1749]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 4)
W, [2019-01-09T08:45:05.167816 #1749]  WARN -- : [1749] Dumpling connections
W, [2019-01-09T08:45:05.167991 #1749]  WARN -- : [1749] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:39:33 -0500, failures: 1)>
W, [2019-01-09T08:45:05.168128 #1749]  WARN -- : [1749] * host: http://1.2.3.5:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.168252 #1749]  WARN -- : [1749] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.168377 #1749]  WARN -- : [1749] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.775172 #1846]  WARN -- : [Faraday::TimeoutError] Reloading connections (attempt 1 of 5)
W, [2019-01-09T08:45:05.780262 #1671]  WARN -- : [Faraday::TimeoutError] Reloading connections (attempt 1 of 8)
W, [2019-01-09T08:45:05.789611 #1846]  WARN -- : [1846] Dumpling connections
W, [2019-01-09T08:45:05.789911 #1846]  WARN -- : [1846] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 07:24:12 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790083 #1846]  WARN -- : [1846] * host: http://1.2.3.5:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.790258 #1846]  WARN -- : [1846] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 08:12:59 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790449 #1846]  WARN -- : [1846] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.790655 #1846]  WARN -- : [1846] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.791722 #1671]  WARN -- : [1671] Dumpling connections
W, [2019-01-09T08:45:05.791928 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:03:51 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792127 #1671]  WARN -- : [1671] * host: http://1.2.3.5:9200/ (dead since 2019-01-09 08:45:05 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792331 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 07:28:49 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792497 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 07:34:53 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792691 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (dead since 2019-01-09 08:05:24 -0500, failures: 1)>
W, [2019-01-09T08:45:05.792860 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (dead since 2019-01-09 08:42:47 -0500, failures: 1)>
W, [2019-01-09T08:45:05.793089 #1671]  WARN -- : [1671] * host: http://1.2.3.6:9200/ (alive, failures: 0)>
W, [2019-01-09T08:45:05.793244 #1671]  WARN -- : [1671] * host: http://1.2.3.4:9200/ (alive, failures: 0)>

So basically after some failure (connection error or timeout), elasticsearch-transport tries to reload connections, sniffer returns same set of hosts (as expected), but reload method does not really remove "dead" connections, and adds new connections, the pool keeps getting bigger and bigger (containing same 3 hosts over and over).
This gets really problematic when ES cluster goes down. Retry logic kicks in at line

if @options[:reload_on_failure] and tries < connections.all.size

and it just retries too many times (since the pool have grown too big).

Expected result would be to just try all (3 in my case) connections (or as many times as retry_on_failure option is set) and throw error after that.


The fix leaves all existing connections (even "dead" ones) in the pool if hosts are the same. And lets resurrection process to kick in and handle things properly.

Another approach could be removing "dead" connections before adding new ones, but I don't think it's a good idea (some reasons here: 331e4ee).

After the fix in production logs look normal:

W, [2019-01-10T04:05:41.884891 #10627]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:06:06.540159 #10531]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:06:14.701250 #10034]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:07:53.245979 #9107]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)
W, [2019-01-10T04:08:14.118875 #11710]  WARN -- : [Faraday::ConnectionFailed] Reloading connections (attempt 1 of 3)

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@estolfo
Copy link
Contributor

estolfo commented Jan 15, 2019

Hi @deees Thanks very much for reporting this and for the pull request. What you've proposed makes sense, I'll merge this and also backport to the 6.x branch. Thanks!

@estolfo estolfo merged commit 16887e5 into elastic:master Jan 15, 2019
@deees
Copy link
Contributor Author

deees commented Jan 15, 2019

Great, thanks!

@deees deees deleted the fix-infinite-connection-pool branch January 15, 2019 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants