Make REST API check stricter #882
Conversation
So far we have used the info API to determine whether the REST API of Elasticsearch is available. However, it might be that only a quorum (but not all) of the target hosts is available yet. While certain nodes would then respond to HTTP requests, others might not, which can lead to situations where the REST API check succeeds but we run into connection errors later on (because we hit a different host from the connection pool). With this commit we make this check stricter by using the cluster health API and blocking until the cluster contains at least as many nodes as we have target hosts.
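A minimal sketch of the stricter check described above, assuming a plain elasticsearch-py client; the function name wait_for_rest_api and the parameter expected_node_count are illustrative, not Rally's actual implementation:

import logging
import time

import elasticsearch

logger = logging.getLogger(__name__)


def wait_for_rest_api(es, expected_node_count, max_attempts=20):
    """Block until the cluster reports at least expected_node_count nodes."""
    for attempt in range(max_attempts):
        try:
            # The cluster health API itself blocks until the wait condition is met
            # (or its timeout expires), so a single successful call is sufficient.
            es.cluster.health(wait_for_nodes=">={}".format(expected_node_count))
            return True
        except elasticsearch.TransportError as e:
            # cluster block, x-pack not initialized yet, or wait condition not reached
            if e.status_code in (503, 401, 408):
                logger.debug("Got status code [%s] on attempt [%s]. Sleeping...", e.status_code, attempt)
                time.sleep(3)
            else:
                raise
    return False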
Thanks! I left two questions; the main one is about the wait period.
""" | ||
# assume that at least the hosts that we expect to contact should be available. Note that this is not 100% | ||
# bullet-proof as a cluster could have e.g. dedicated masters which are not contained in our list of target hosts | ||
# but this is still better than just checking for any random node's REST API being reachable. |
This is a very useful comment to have.
Question: would it be dangerous to trigger a sniff of the eligible HTTP hosts (e.g. via a helper method in our EsClientFactory invoking elasticsearch.Transport#sniff_hosts())? I was thinking that if we explicitly ask for a fresh list of hosts before the check, then no unavailable hosts should be reachable. The same call could then be invoked before the load driver starts. The caveat with this approach is that it could potentially override the explicit list provided by --target-hosts. Thoughts?
I'd try not to build any smartness into this. Without reading all of the involved code I don't think we can reason about which nodes would be returned by the sniff_hosts call on cluster bootstrap (let's assume not all nodes are up yet, or not all of them have opened the HTTP port yet). I was even considering exposing an explicit command line parameter but thought that this would be a good compromise.
Agreed. Let's keep things simple here.
For the record, I had a look at what the elasticsearch-py client does when sniffing gets invoked here: it collects a list of eligible HTTP hosts via /_nodes/_all/http.
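For illustration only (this is not part of the change): with elasticsearch-py 7.x (the transport API changed in 8.x), the sniffing behaviour discussed above could be reproduced manually roughly like this; the host address is an assumption:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])

# Same endpoint that the client's sniffing logic uses: GET /_nodes/_all/http
http_info = es.nodes.info(node_id="_all", metric="http")
print([n["http"]["publish_address"] for n in http_info["nodes"].values() if "http" in n])

# Asking the transport to sniff replaces the connection pool with the discovered
# hosts, which is why it would override an explicit --target-hosts list.
es.transport.sniff_hosts()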
            # cluster block, x-pack not initialized yet, our wait condition is not reached
            if e.status_code in (503, 401, 408):
                logger.debug("Got status code [%s] on attempt [%s]. Sleeping...", e.status_code, attempt)
                time.sleep(3)
With the default max_attempts=20 the worst-case scenario would mean waiting 3*20=60s, plus whatever time is spent executing the 20 API calls. Given the (potential) difference in performance between different hosts building from source, should we increase the sleep period, e.g. to 6s (2 minutes in total)?
I'm fine with increasing this, although I'd opt for more retries instead of a larger sleep period.
More retries is fine by me too.
I've increased the number of retries to 40 now in cbf6dec.
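For reference, the worst-case wait (sleep time only, excluding the time spent on the API calls themselves) under the two settings discussed above:

sleep_seconds = 3
print(sleep_seconds * 20)  # previous default: 20 attempts -> 60s
print(sleep_seconds * 40)  # after cbf6dec: 40 attempts -> 120s, i.e. roughly the 2 minutes discussed above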
LGTM, thanks for fixing this.