Make REST API check stricter #882

Merged

Conversation

danielmitterdorfer
Member

So far we have used the info API to determine whether the REST API of
Elasticsearch is available. However, we might just get lucky in that a quorum
(but not all) of the target hosts is already available. While those nodes
respond to HTTP requests, others might not, which can lead to situations where
the REST API check succeeds but we run into connection errors later on
(because we hit a different host from the connection pool).

With this commit we make this check stricter by using the cluster health API
and blocking until the cluster reports at least as many nodes as there are
target hosts.
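
For illustration, a minimal sketch of the stricter check using the elasticsearch-py client; the hosts and variable names below are assumptions for illustration, not necessarily the exact code in this PR:

from elasticsearch import Elasticsearch

# illustrative target hosts
es = Elasticsearch(["http://node-1:9200", "http://node-2:9200", "http://node-3:9200"])

# Instead of the info API (GET /), use the cluster health API and block until the
# cluster reports at least as many nodes as there are target hosts.
expected_node_count = 3
es.cluster.health(wait_for_nodes=">={}".format(expected_node_count))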

@danielmitterdorfer danielmitterdorfer added the bug (Something's wrong) and :Load Driver (Changes that affect the core of the load driver such as scheduling, the measurement approach etc.) labels Jan 29, 2020
@danielmitterdorfer danielmitterdorfer added this to the 1.4.1 milestone Jan 29, 2020
@danielmitterdorfer danielmitterdorfer self-assigned this Jan 29, 2020
Contributor

@dliappis dliappis left a comment

Thanks! I left two questions; the main one is about the wait period.

"""
# assume that at least the hosts that we expect to contact should be available. Note that this is not 100%
# bullet-proof as a cluster could have e.g. dedicated masters which are not contained in our list of target hosts
# but this is still better than just checking for any random node's REST API being reachable.
Contributor

This is a very useful comment to have.

Question: would it be dangerous to trigger a sniff of the eligible HTTP hosts (e.g. via a helper method in our EsClientFactory invoking elasticsearch.Transport#sniff_hosts())? I was thinking that if we explicitly ask for a fresh list of hosts before the check, no unavailable hosts should remain in the connection pool. The same call could then be invoked before the load driver starts. The caveat with this approach is that it could potentially override the explicit list provided by --target-hosts. Thoughts?

Member Author

@danielmitterdorfer danielmitterdorfer Jan 29, 2020

I'd rather not build any smartness into this. Without reading all of the involved code I don't think we can reason about which nodes will be returned by the sniff_hosts call during cluster bootstrap (assume that not all nodes are up yet, or that not all of them have opened the HTTP port). I was even considering exposing an explicit command line parameter but thought that this would be a good compromise.

Contributor

Agreed. Let's keep things simple here.

For the record, I had a look at what the elasticsearch-py client does when sniff gets invoked here: it collects a list of eligible HTTP hosts via /_nodes/_all/http.
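
For reference, the same information can be fetched manually; a small illustrative snippet (the client setup and field access below are assumptions, not code from this PR):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# /_nodes/_all/http returns, per node, the publish address of its HTTP layer
info = es.nodes.info(node_id="_all", metric="http")
for node_id, node in info["nodes"].items():
    print(node_id, node.get("http", {}).get("publish_address"))

This is the same endpoint the client's sniffing logic queries, so the output shows which hosts a sniff would consider eligible.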

# cluster block, x-pack not initialized yet, our wait condition is not reached
if e.status_code in (503, 401, 408):
    logger.debug("Got status code [%s] on attempt [%s]. Sleeping...", e.status_code, attempt)
    time.sleep(3)
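
For context, a hedged sketch of how a surrounding retry loop could look; with max_attempts=20 and a 3-second sleep the worst case is bounded at roughly 60s plus the duration of the API calls themselves (names and structure are illustrative, not necessarily the merged code):

import logging
import time

import elasticsearch

logger = logging.getLogger(__name__)


def wait_for_rest_layer(es, expected_node_count, max_attempts=20):
    for attempt in range(max_attempts):
        try:
            # block until at least as many nodes as we have target hosts have joined the cluster
            es.cluster.health(wait_for_nodes=">={}".format(expected_node_count))
            return True
        except elasticsearch.ConnectionError:
            # the node we happened to hit has not opened its HTTP port yet
            time.sleep(3)
        except elasticsearch.TransportError as e:
            # cluster block, x-pack not initialized yet, our wait condition is not reached
            if e.status_code in (503, 401, 408):
                logger.debug("Got status code [%s] on attempt [%s]. Sleeping...", e.status_code, attempt)
                time.sleep(3)
            else:
                raise
    return False

With this structure, bumping max_attempts (as done later in this PR) extends the total wait without changing the per-attempt sleep.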
Contributor

With the default max_attempts=20 the worst case means waiting 3*20=60s plus whatever time is spent executing the 20 API calls. Given the (potential) difference in performance between different hosts building from source, should we increase this, e.g. the sleep to 6s (2 mins in total)?

Member Author

I'm fine increasing this, although I'd opt for more retries instead of a larger sleep period.

Contributor

More retries is fine by me too.

Member Author

I've increased the number of retries to 40 now in cbf6dec.

@dliappis dliappis self-requested a review January 29, 2020 12:09
Contributor

@dliappis dliappis left a comment

LGTM, thanks for fixing this.

@danielmitterdorfer danielmitterdorfer merged commit f145325 into elastic:master Jan 29, 2020
@danielmitterdorfer danielmitterdorfer deleted the robust-api-check branch January 29, 2020 14:09