
Introduce an unauthenticated endpoint for readiness checks #81168

Closed
jasontedor opened this issue Nov 30, 2021 · 65 comments · Fixed by #84375
Labels
:Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@jasontedor
Member

In the context of Kubernetes, readiness probes are used to determine whether or not to route traffic to a pod. In the context of Elasticsearch, this is useful since Elasticsearch takes some time after startup before it is ready to receive requests, and when it is shutting down, it would be better to route those requests elsewhere.

Today there is no dedicated endpoint for this, which means that operators have to simulate it by relying on an endpoint like the main endpoint (/). This is less than ideal since the behavior is not dedicated, and also because this endpoint requires authentication. Authentication presents a challenge since it requires that credentials be available to the probe that executes the readiness check. Making those credentials available means that an operator such as ECK cannot rely on native HTTP probes in Kubernetes. Finally, we would want a solution to this in 7.x since we would like to preserve the option of simplifying our approach to managing 7.x deployments via Kubernetes.

Ideally, there would be an endpoint that:

  • is dedicated to readiness checks
  • is not authenticated
  • is available in 7.x

As far as the behavior of this endpoint, having discussed this with @dakrone, we have settled on the following proposal:

  • at startup, after the node is ready to receive HTTP requests, the node returns 503 until it has joined the cluster and recovered the initial cluster state
  • after the node has joined the cluster and recovered the initial cluster state, the node returns 200
  • when the node enters the shutdown state, the node returns 503

Note that it would be fine for the response bodies to these requests to be empty (JSON: {}), so that this endpoint does not otherwise reveal information.
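The three states proposed above can be sketched with the JDK's built-in HTTP server. This is only an illustration of the semantics; the class name, the `/_readiness` path, and the `ready` flag are invented here, not what Elasticsearch actually implements:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: 503 before the node has joined the cluster and recovered the
// initial cluster state, 200 while ready, 503 again once shutdown begins.
// In Elasticsearch the flag would be flipped by real lifecycle events.
public class ReadinessSketch {
    static final AtomicBoolean ready = new AtomicBoolean(false);

    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/_readiness", exchange -> {
            // Empty JSON body so the endpoint reveals nothing beyond the status code.
            byte[] body = "{}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(ready.get() ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A probe would then simply GET the path and treat anything other than 200 as "do not route traffic here".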

(@dakrone, please feel free if I got any of the details of our conversation wrong. @tvernum feel free to comment on the unauthenticated aspects, which we discussed previously.)

Lastly, this would be a precursor to removing curl from the Docker image, which I've previously discussed with @pugnascotia, but I will open a separate issue to capture our discussion.

Relates #50187

@jasontedor jasontedor added >enhancement needs:triage Requires assignment of a team area label labels Nov 30, 2021
@jasontedor
Member Author

I opened #81177 to capture the request to remove curl from the Docker image.

@dakrone dakrone added the :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown label Nov 30, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Nov 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@dakrone dakrone removed the needs:triage Requires assignment of a team area label label Nov 30, 2021
@tvernum
Contributor

tvernum commented Dec 6, 2021

Even though this only proposes to return a status code and no further details about the cluster state, I think some admins would feel uncomfortable having an unauthenticated API exposed by default.
It feels like the sort of thing that would come up on security scans, and we'd spend time explaining why it's safe.

I think my preference would be either:

  1. Truly unauthenticated, but disabled by default. It can then be enabled automatically by ECK (and our Helm charts?) and manually by anyone else who wants to use it.
  2. Statically authenticated by a URL-based credential (e.g. GET /_node/ready/{some-token-value}) with the credential value managed on the node's filesystem (e.g. in elasticsearch.keystore).

I believe the ECK team would opt for (1), which seems fine to me.

The 3rd option would be to try and roll this into anonymous access. I think that's do-able, but it has some tricky bits.

  1. Currently, if anonymous access is enabled, then any REST handler can be accessed anonymously. Authorization happens on the transport layer, so most REST actions would fail to do anything, but they would be accessible. That forces quite a wide attack surface, so we'd probably also want to introduce a new setting to control which REST endpoints can be accessed anonymously.
  2. Anonymous access is a user feature not an operator feature. That is, people running on ECK should be free to make use of ES's anonymous access if they want. If we use anonymous for the readiness probe, then we need to be able to support cluster admins configuring the anonymous roles alongside ECK's configuration of the anonymous readiness probe. Mixing admin and operator features like that can be tricky.
    • For example, the readiness probe needs to work (but return 503) when the node isn't connected to the cluster. Which means we'd need anonymous to be reliable even if the node cannot resolve index (API) based security roles. But admins should be free to define their anonymous access via the API if that suits them.

I don't think it's worth trying to make this part of anonymous. I think we want to say "anonymous is for admins to configure" and "readiness probes are for operators" and keep them separate.
And like operator privileges, it's fine to say that the implicitly-authenticated probe user is defined in code and can be enabled/disabled but cannot be configured, because its scope is tightly defined by the needs of our orchestration features.


I proposed this in a conversation elsewhere, but copying it here in order to keep things in 1 place...

There are 2 related but distinct parts of the problem that need to be solved:

  1. Authentication
  2. Authorization

The latter issue matters because just trying to avoid authentication is going to end up backfiring - if you have no user then nothing will work. It's impossible to call an action if there's no user, and I'd be very strongly against relaxing that constraint.

So, what we could do instead is automatically authenticate the request as an internal user if it matches the right endpoint. I would approach that as a new Authenticator, at the top of the chain, that only looks at the URL and, if it matches, authenticates as a new internal user (like AsyncSearchUser).
That new user would have exactly 1 privilege, which would be the action that determines node readiness.

I propose the top of the chain so that the actions always run as this internal user and we never process any credentials that might have been provided in the request. That makes the URL as close to "unauthenticated" as we can get, and it doesn't fail (on security grounds) even if you happen to provide credentials that authenticate as a user with no privileges.
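The chain-head idea described here can be sketched as follows. Everything in this snippet is hypothetical for illustration (the `Authenticator` shape, the `_readiness_probe` user, and the privilege name do not exist in Elasticsearch under these names; only the `/_node/ready` path is borrowed from the options above):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposal: an authenticator at the head of the
// chain that matches only the readiness URL and short-circuits with a fixed
// internal user, so credentials on the request are never even inspected.
public class ProbeAuthSketch {
    public record User(String name, List<String> privileges) {}

    public interface Authenticator {
        Optional<User> authenticate(String path, String credentials);
    }

    // Head of the chain: looks only at the URL and ignores any credentials.
    public static final Authenticator READINESS = (path, credentials) ->
        "/_node/ready".equals(path)
            ? Optional.of(new User("_readiness_probe", List.of("cluster:monitor/readiness")))
            : Optional.empty();

    public static Optional<User> runChain(List<Authenticator> chain, String path, String credentials) {
        for (Authenticator authenticator : chain) {
            Optional<User> user = authenticator.authenticate(path, credentials);
            if (user.isPresent()) {
                return user;   // first authenticator to match wins
            }
        }
        return Optional.empty();
    }
}
```

Because the readiness authenticator runs first and never reads the credentials argument, a request with bogus credentials still succeeds on the readiness path and falls through to the normal chain everywhere else.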

@henningandersen
Contributor

I think I would prefer a 200 OK status as long as a status can be derived. Instead we can include a ready flag in the response: { "ready" : false }. This makes it possible for the readiness probe to fail with a 503 for other reasons (like a proxy failing to connect to ES). And it avoids frequent 503s that need to be filtered out by tools inspecting response codes.

This also seems more correct, in that the node could happily figure out whether it is ready to accept requests or not, which is the question asked when using the API.

Can you elaborate on the reasons for using a 503 status response to signal "not ready" from a healthy node?

@colings86
Contributor

I think I would prefer a 200 OK status as long as a status can be derived. Instead we can include a ready flag in the response: { "ready" : false }. This makes it possible for the readiness probe to fail with a 503 for other reasons (like a proxy failing to connect to ES). And it avoids frequent 503s that need to be filtered out by tools inspecting response codes.

This from the docs is relevant here:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

Also, from a more philosophical point of view, the purpose of a readiness check is to determine whether the receiver should have requests routed to it. Therefore any error code from anywhere in the chain the request passes through is an indication that the service is not ready to serve requests. I don't think the caller needs to know where the error code came from, just that it should not route requests to that node. Debugging what went wrong is not for the caller to do, and can be satisfied by observability and other debugging (such as testing the request at different points in the chain).

@ChrisHegarty
Contributor

This issue specifically mentions "readiness probes". There are clearly other kinds of probes, e.g. liveness probes, startup probes. Is there any interest in these probes? Can they be served from the same endpoint?

@colings86
Contributor

colings86 commented Dec 8, 2021

From what I have read (though I suspect others will be better authorities on this), startup probes are often configured to call the same endpoint as the readiness probe but with a longer timeout before declaring the node unhealthy.

I'm not sure, but given that there is more than likely full control of the container in these kinds of deployments, I wonder if liveness is satisfied by probing whether the process is running, particularly as Elasticsearch nodes take the approach of exiting when they encounter unrecoverable errors. Liveness would therefore not require an HTTP API.

@DaveCTurner
Contributor

Note also that today we don't even start listening on the HTTP port until we've applied a cluster state from an elected master, so simply checking whether the HTTP port is open is also a reasonable startup check. The wait is currently bounded with a default timeout of 30s which starts counting fairly late in the startup process anyway; arguably we could just wait indefinitely here, since I don't think there's much you can do over HTTP before joining the cluster. Alternatively we could consider a 30s wait to be long enough that we've "started up" even if we're still not really "ready".

@DaveCTurner
Contributor

A few of us discussed this today and one option we considered was opening a distinct HTTP port specifically for this API. AIUI Kubernetes hits the readiness endpoint in a way that means the port wouldn't need to be exposed outside the container, so a distinct port would be a neat way to avoid inappropriate access from clients without needing authentication. It also means that we wouldn't need to worry about authentication or TLS; in particular, if Elasticsearch is configured to require TLS client certificates then authentication happens at handshake time, which makes exposing an unauthenticated endpoint rather tricky. It's still technically possible, since TLS has the facility to redo its handshake after receiving the HTTP request, but it's not pretty and I worry that it'd be a pretty substantial change to support this flow.

We were not convinced this is the right approach but thought it worth mentioning as a possible option.

@henningandersen
Contributor

This from the docs is relevant here:

Yeah, if that is the k8s way, I see no issue in using 503 here.

@jasontedor
Member Author

It feels like the sort of thing that would come up on security scans, and we'd spend time explaining why it's safe.

I'm not sure I understand. An HTTP scanner would have to be taught about this endpoint. But since this endpoint effectively does nothing (no request to parse, empty response, etc.), either there isn't a reason to teach an HTTP scanner about this endpoint, or, if a scanner is taught about it, it's a non-issue?

Truly unauthenticated, but disabled by default. It can then be enabled automatically by ECK (and our Helm charts?) and manually by anyone else who wants to use it.

I'd rather we go the other way. The simplest solution is that this endpoint exists and is unauthenticated, and that's it. If we have to, we can layer on additional options (e.g., to require authn/authz for this endpoint), but from my perspective the default option should be the endpoint exists, and is unauthenticated.

@jasontedor
Member Author

A few of us discussed this today and one option we considered was opening a distinct HTTP port specifically for this API.

Thanks, I considered this option as well and ultimately I convinced myself it wasn't worth the complexity of supporting binding on another port, serving some endpoints from that port and not the default, etc.

@tvernum
Contributor

tvernum commented Dec 13, 2021

It feels like the sort of thing that would come up on security scans, and we'd spend time explaining why it's safe.

I'm not sure I understand. An HTTP scanner would have to be taught about this endpoint. But since this endpoint effectively does nothing (no request to parse, empty response, etc.), either there isn't a reason to teach an HTTP scanner about this endpoint, or, if a scanner is taught about it, it's a non-issue?

We would get security reports that Elasticsearch includes an unauthenticated URL that discloses information.
We would then point out that this works as documented and it exists for readiness checks.
The reporter would repeat their stance that this is an unauthenticated URL that discloses information and they don't want it to exist.
We would point to documentation about why it's not a problem.
The reporter would repeat their stance that this is an unauthenticated URL that discloses information and they don't want it to exist and we need to fix it.

All of which takes time from people who have far more useful things to do.

I think secure by default should mean no unauthenticated APIs by default, and they need to be opt-in.

Otherwise we end up in a (totally hypothetical) position where one of our dependencies has a vulnerability, and we'd have been protected if all our APIs required authentication, but they don't, because we opened up 1 unauthenticated API to support an orchestrator, and we need to explain to users that they have an unauthenticated access path to their cluster that they don't need, but that we enabled automatically.

@pugnascotia
Contributor

I was reading the k8s probe docs, and I noticed that while HTTP probes don't allow authentication credentials to be specified, you can set arbitrary HTTP headers. I don't know how helpful it would be, but we could do something like:

httpHeaders:
  - name: Authorization
    value: ${SOME SECRET VALUE}

@colings86
Contributor

Though it is possible to add headers, an issue here is operational complexity. To use an authenticated endpoint, a secret needs to be maintained and distributed for each deployment. This adds significant complexity in Cloud. Also bear in mind that it's not just the ECK operator that needs to access this endpoint, but also observability tools such as heartbeat (which may or may not be running near the deployment itself). Having to maintain credentials and distribute them to heartbeat instances is certainly possible, but it adds quite a bit of operational complexity for an endpoint that does not emit any information apart from the fact that it's ready to receive requests.

@grcevski
Contributor

Would it work if we added the readiness endpoint, authenticated by default just like any other API, along with a boot setting to disable authentication for this endpoint if the customer wants that? E.g. Cloud, or anyone else running Kubernetes on their own, or another platform where they would benefit from having this endpoint unauthenticated.

@jasontedor
Member Author

It feels like the sort of thing that would come up on security scans, and we'd spend time explaining why it's safe.

I'm not sure I understand. An HTTP scanner would have to be taught about this endpoint. But since this endpoint effectively does nothing (no request to parse, empty response, etc.), either there isn't a reason to teach an HTTP scanner about this endpoint, or, if a scanner is taught about it, it's a non-issue?

We would get security reports that Elasticsearch includes an unauthenticated URL that discloses information.

My comment was strictly about scans, in response to your statement that's where they'd come up.

Let's take this entire thread to a high-bandwidth discussion and come back to this thread to communicate decisions. I'll reach out to set that up.

@rdnm rdnm added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Dec 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

A few weeks ago we had the high-bandwidth discussion that Jason mentioned and concluded:

  • exposing a new port is preferable to trying to add a completely unauthenticated endpoint to either of the existing interfaces. We will only listen on localhost since there's no need to expose the port outside the container. Listening for readiness checks will require specific config: by default we won't make Elasticsearch listen for readiness checks.

  • we will replace our use of curl in the readiness check with a dedicated readiness-check client, written in Java. We will strive to limit the memory footprint and startup time for this client (within reason given that it'll be in Java).

  • we did not reach a conclusion about the wire protocol that we'll use. Since we control the client and the server there's no need to use HTTP, and a simpler custom protocol likely has lower client-side overheads.

@fcofdez
Contributor

fcofdez commented Jan 11, 2022

exposing a new port is preferable to trying to add a completely unauthenticated endpoint to either of the existing interfaces. We will only listen on localhost since there's no need to expose the port outside the container. Listening for readiness checks will require specific config: by default we won't make Elasticsearch listen for readiness checks.

Wouldn't it make sense to use a unix domain socket for this?

@ChrisHegarty
Contributor

Now that ES is compiled with --release 17 we can make use of Java 17 APIs, so we can use Unix-domain socket channels from Java itself, which should simplify things a lot (especially since the wire protocol does not need to be HTTP).

@DaveCTurner
Contributor

Wouldn't it make sense to use a unix domain socket for this?

We did mention that idea but we didn't want to make this feature Unix-only, nor really to have to maintain a separate implementation for Windows. However, a Unix domain socket does have advantages, such as even tighter access controls without needing authentication in the wire protocol. Nobody mentioned in the meeting that this is much better supported in the JDK than I thought (TIL), which pretty much invalidates the arguments against it. We'll be discussing this topic again soon to work out the division of labour, but let's consider this question again; I am now +1 on using a Unix domain socket instead.

@mark-vieira
Contributor

We did mention that idea but we didn't want to make this feature Unix-only, nor really to have to maintain a separate implementation for Windows.

If the main use-case here is container health then we don't really have to worry about Windows. Sure, a "generic" readiness API might be helpful in other contexts, but I wouldn't necessarily rule out a Unix-only approach here if it's the most appropriate solution for the use case at hand, which is container orchestration. For at least the foreseeable future, in that world, the underlying system is Linux.

@mark-vieira
Contributor

In this situation will there still be a way to send requests directly to a node?

Readiness checks are per-node. The notion of a healthy "cluster" is orthogonal to this. The point of the endpoint should not be to determine cluster health but whether or not an individual node is capable of responding to user requests. That response might be "the cluster is borked". If a node cannot join the cluster, to me that means it's not "ready".

@DaveCTurner
Contributor

DaveCTurner commented Feb 12, 2022

This is kind of my point tho: "has joined a cluster" is a health property. Nodes can respond to requests as soon as the HTTP port is open whether they are in a cluster or not. Nodes cannot independently distinguish between "the entire cluster is borked" and "I cannot join an otherwise perfectly-healthy cluster", and it is much more common that the problem is "the entire cluster is borked". In that situation if we require nodes to join a cluster before they claim to be ready we would have zero ready nodes and therefore no way for clients to get information out of the cluster about its health status.

It's extremely unusual, and definitely a bug, for a single node to fail to join an otherwise-healthy cluster within the discovery.initial_state_timeout timeout.

@mark-vieira
Contributor

In that situation if we require nodes to join a cluster before they claim to be ready we would have zero ready nodes and therefore no way for clients to get information out of the cluster about its health status.

How can a node that cannot join the cluster report on the cluster health status though? It can only really tell you its own status.

It's extremely unusual, and definitely a bug, for a single node to fail to join an otherwise-healthy cluster within the discovery.initial_state_timeout timeout.

Fair enough, but the question is whether we should consider a node "ready" until it has done so. I presume if we start routing requests to nodes that haven't joined the cluster yet the result is going to be failures (depending on the nature of the request). As mentioned above, what useful information can we get diagnostically from nodes that can't talk to the cluster?

@DaveCTurner
Contributor

How can a node that cannot join the cluster report on the cluster health status though? It can only really tell you its own status.

With no cluster there's no cluster health to report of course, but we can still determine a bunch of information in this situation: is it unable to discover other nodes? If there are other nodes out there, what are their statuses? Did we lose ≥½ of the masters? Is one node winning elections and then failing? Is there a stable master rejecting its join requests?

The only non-bug in that list is "we lost ≥½ of the masters" but this is something that the orchestrator can't accurately determine itself. It needs insight into the state of the nodes that are trying (and failing) to form the cluster.

In particular, if a node simply can't talk to the rest of the cluster, but the rest of the cluster is healthy, then this could well be an orchestrator bug.

@DaveCTurner
Contributor

I presume if we start routing requests to nodes that haven't joined the cluster yet the result is going to be failures (depending on the nature of the request)

Yes, but those failures will not be substantially different from the failures you get sending requests to nodes that joined the cluster and then fell out of it again, and yet we consider this latter group of nodes to be ready.

@mark-vieira
Contributor

and yet we consider this latter group of nodes to be ready

IMO, we shouldn't. Readiness checks continue even after the container identifies as ready. If at some point the container loses connection with the cluster the readiness check will then fail and the orchestrator will stop directing traffic to that node.

@DaveCTurner
Contributor

DaveCTurner commented Feb 14, 2022

With that model, what would you expect to happen if the cluster loses quorum? My understanding is that all nodes would fail their readiness checks and be removed from routing, leaving no nodes to which to direct any requests. Bear in mind that this is a failure mode that I see more often than a single node being isolated from the cluster.

I would rather have Elasticsearch be able to report failures to clients in this case, rather than sending them a generic "there are no ready nodes" error at the orchestration layer.

@mark-vieira
Contributor

mark-vieira commented Feb 14, 2022

I would rather have Elasticsearch be able to report failures to clients in this case, rather than sending them a generic "there are no ready nodes" error at the orchestration layer.

Yeah, we may be struggling with a lack of flexibility here, and maybe we are getting too lost in the weeds of how to define "ready". The reality is we can tweak the definition over time in a way that doesn't impact how it integrates with the orchestrator. As you say, in practice it might make more sense to adopt a looser definition of "ready", as reporting something to the user is generally better than nothing.

I'll just reiterate a few things:

  1. We want "ready" to mean more than just "I can open a socket". Or at the very least implement the endpoint in such a way as the definition of "ready" can evolve over time.
  2. This check doesn't stop once the node identifies itself as ready. At any point a node could "die" or any number of things can happen after startup that would cause the ready check to subsequently fail at which point the orchestrator would take the appropriate action (stop routing requests to it).
  3. Readiness checks are at the node level, which as David brings up makes it difficult to use them to infer cluster health. It's possible we might need to supplement node readiness endpoints with something else to get a state of the cluster in the case where nodes can't join the cluster, or a quorum cannot be established. Essentially, can we avoid both routing requests to non-ready nodes and finding ourselves in a scenario in which the cluster responds to zero requests due to an inability to form a cluster?

@DaveCTurner
Contributor

I'll just reiterate a few things:

💯 from me.

@sebgl

sebgl commented Mar 8, 2022

Double-checking my understanding of the PR (just from a very quick look): for this to work as a TCP readiness check in K8s, we want the kubelet to connect to the TCP port and just check whether it can connect. Either the connection succeeds or it fails.

Do I understand correctly that we would instead invoke a Java client (a new JVM process) as an exec readiness probe that checks the string response status? That feels quite heavy-handed (though I may be misjudging the overhead of invoking a Java process), as opposed to an HTTP probe / TCP probe / exec command that checks for the presence of a file.

@grcevski
Contributor

grcevski commented Mar 8, 2022

Hi @sebgl, as the PR is implemented right now, we'd need to use the Java command-line client. It takes about 300-400ms on my Intel-based Mac to verify the readiness status. If this is too slow, I think I can try an approach where we keep the TCP port open or closed instead to signify the readiness status: being able to connect would mean ES is ready; the port being down would mean it is not.

You are correct, the overhead of invoking the Java process is not negligible, it would definitely consume CPU cycles if it's invoked too often.

@mark-vieira
Contributor

I continue to be in favor of a CLI-based solution here, even if it is a simple wrapper around what is effectively a TCP readiness check, because it allows us to evolve the conditions that determine "readiness" over time w/o affecting the orchestrator.

I don't think latency is a concern, but as stated, launching a JVM is non-zero overhead so we certainly don't want the polling interval here to be too aggressive. @ChrisHegarty might have some ideas to minimize this overhead.
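A CLI wrapper over what is effectively a TCP readiness check can be very small: attempt a connect and let the exit status carry the result. This is only a sketch of the idea being discussed, not the actual readiness-probe-cli from the PR; the default port number is invented:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of a minimal CLI readiness check: "ready" iff a TCP connect to the
// readiness port succeeds within the timeout. The exit code carries the
// result, so orchestrators and scripts need no output parsing.
public class ReadyCheck {
    static boolean isReady(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;      // port open: the node is signalling readiness
        } catch (IOException e) {
            return false;     // connection refused or timed out: not ready
        }
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 9400; // hypothetical default port
        System.exit(isReady("127.0.0.1", port, 1000) ? 0 : 1);
    }
}
```

Keeping the connect behind a CLI entry point is what lets the definition of "ready" evolve (different protocol, extra checks) without the orchestrator's exec probe ever changing.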

@sebgl

sebgl commented Mar 8, 2022

It takes about 300-400ms on my Intel based Mac to verify the readiness status

Fwiw we're currently running the readiness probe every 5sec: https://github.com/elastic/cloud-on-k8s/blob/0264584373d3d2d7d5f80fbdd6fdf5fb853150e1/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L17-L30 (we can definitely tweak these settings)

it would definitely consume CPU cycles if it's invoked too often

Considering users may run small ES instances on small CPUs (e.g. "0.5 cpu", whatever that means), I suspect that, looking at the CPU graphs for the Pod, we may clearly see the impact of the readiness probe running every 5 seconds.

If this is too slow, I think I can try an approach

It's hard to conclude anything without real numbers in a real environment, but my intuition is that the overhead here might be a bit too much for something users shouldn't even notice when looking at the container CPU graphs.

@sebgl

sebgl commented Mar 8, 2022

a simple wrapper around what is effectively a TCP readiness check because it allows us to evolve the conditions on what determine "readiness" over time w/o affecting the orchestrator

The orchestrator needs something that either succeeds or fails; there's no condition here. Either it's ready or it's not. If we were to expose a TCP probe that succeeds or fails, a file that exists or doesn't, or an HTTP endpoint that returns ok: true or not, the details of how that boolean value is computed would remain in Elasticsearch and not impact the orchestrator.

@mark-vieira
Contributor

That's fair, but it doesn't solve the authentication issue which I think is the other primary motivation here. My point is the "wrapper" could be backed by any of the things you describe w/o affecting the orchestrator. There are also other benefits to having such a CLI tool, as it makes it easier to integrate into other systems/tools.

I do share your concerns on the cost of running a Java console app every 5 seconds though. This is likely going to incur a non-trivial cost. We could poll less often but I'm not sure what the convention is here. Certainly we don't want to do it too infrequently or we defeat the purpose and end up sending requests to dead nodes for up to whatever that interval is.

@mark-vieira
Contributor

To add, I think we ruled out the naive "file exists" check because we can't easily guarantee that we clean the file up between restarts, or in the case of an unexpected crash, so we would likely end up with false-positive readiness readings.

@grcevski
Contributor

grcevski commented Mar 8, 2022

I have some more data and I personally think that we should follow the TCP approach that @sebgl suggested.

The readiness-probe-cli takes 300-400ms, with about 1.5 CPUs used. We can reduce this overhead by limiting the JIT to tier-1 compilation only and turning class data sharing on, but in my experience we'll still be in the same ballpark, perhaps reducing the CPU to 1.25.

On a small container with 1 CPU or less, running the probe every 5 seconds will create significant overhead, e.g. 10% or more.

Given that we expect that we won't be changing ready/not-ready state that often on the Elasticsearch side, I think it's reasonable to accept the overhead of bringing up and dropping the listener thread, and using the port open as a signal of ES being ready or not ready.

@mark-vieira
Contributor

Given that we expect that we won't be changing ready/not-ready state that often on the Elasticsearch side, I think it's reasonable to accept the overhead of bringing up and dropping the listener thread, and using the port open as a signal of ES being ready or not ready.

Can you clarify here? So we'd just be using the fact that we can connect on that port as the indicator that ES is ready? And on the ES side of things we'd only open the port when ES is indeed ready?

If we are going through the work of opening up a completely separate port for this purpose, could we not make it HTTP as well? That way our response could be more than just binary. Or is providing an unauthenticated HTTP endpoint troublesome given our current architecture?

FYI, I still think it's worth providing a CLI tool even if it's not used by the k8s orchestrator so long as it's trivial to implement. It helps in other use cases.

@rjernst
Copy link
Member

rjernst commented Mar 8, 2022

I still think it's worth providing a CLI tool even if it's not used by the k8s orchestrator so long as it's trivial to implement

It's trivial in that it isn't complicated code, but there is a lot of boilerplate for something that we have no concrete uses for. I would be more inclined to drop the cli if we are going to use tcp. There are other ways to check if a tcp connection can be made locally, like netcat.

@ChrisHegarty
Copy link
Contributor

Dropping a few notes to this already lengthy discussion.

I assume that whatever solution we arrive at should interoperate with the native k8s readiness probe, without any customisation. This imposes some constraints on any possible ES solution - which are not unreasonable.

  • TCP - From what I can glean from the k8s code, to determine readiness the TCP probe simply opens and connects a new socket to a specified network address - no data is exchanged. With this in mind, from the ES point of view, ES can create and retain a strong reference to a listening TCP socket when ES determines that the server is "ready" (for some definition of ready). When ES determines it is no longer ready, the TCP listening socket is closed. Note: no explicit action on this listening socket is required from ES - the Linux kernel will implicitly eagerly accept new connections, which should later be dropped from the kernel's accept queue once the client closes. Additionally, the socket port number needs to be configured to be reusable, since ES could be creating/binding and closing a listening socket on that port a number of times.

  • Exec - Starting up a JVM for such a trivial task is quite an overhead. There are ways to somewhat reduce this overhead, but it is unlikely that it can be reduced to an acceptable level. Just write a Go client! Whatever binary protocol is used, if any, is irrelevant to the k8s probe client.

  • HTTP - Opening an HTTP endpoint over a plain (non-TLS) connection is likely not acceptable. Processing the TLS handshake and even a simple HTTP request will require "some" resources from ES. From my reading of the k8s code, it is not clear whether the HTTP probe client reuses the underlying network connection, which would mitigate the overhead of the TLS handshake per probe request.

Given the above, I don't see how any existing open port or service that ES currently offers can really satisfy the readiness behaviour - as we would want it from the perspective of k8s.
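A minimal sketch of the bind-when-ready / close-when-not approach described above (illustrative only, not the actual Elasticsearch implementation; the class name and port are arbitrary examples):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;

// Readiness expressed purely as "is a TCP port bound?": the k8s TCP probe
// only needs connect() to succeed, so no accept loop is required -- the
// kernel completes handshakes into the accept queue on our behalf.
class ReadinessListener {
    private ServerSocketChannel channel;

    synchronized void markReady(int port) throws IOException {
        if (channel != null) {
            return; // already listening
        }
        ServerSocketChannel ch = ServerSocketChannel.open();
        // SO_REUSEADDR: the port may be bound and closed repeatedly as the
        // node flips between ready and not-ready.
        ch.setOption(StandardSocketOptions.SO_REUSEADDR, true);
        ch.bind(new InetSocketAddress("127.0.0.1", port));
        channel = ch;
    }

    synchronized void markNotReady() throws IOException {
        if (channel != null) {
            channel.close(); // probes now see "connection refused"
            channel = null;
        }
    }

    synchronized boolean isReady() {
        return channel != null;
    }
}
```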

@grcevski
Copy link
Contributor

grcevski commented Mar 9, 2022

So to clarify my proposal here. I'm proposing changing my PR in the following way:

  • Instead of a permanent TCP socket listener which responds with true or false, depending on whether ES is ready, I intend to keep the listener up while ready and closed when not.
  • Instead of a connect overhead on the client side, we'll have an overhead to bring up the listener every time the state changes from not-ready to ready. Given that we don't expect the ready/not-ready state to change often it's a more acceptable tradeoff.
  • Since the listener is off when not ready, instead of being on and saying not-ready, the TCP probe will work really fast. We'll be able to tell if ES is ready in a few milliseconds instead of hundreds of milliseconds, with essentially no CPU usage either.
  • For now I'll simply close the socket channel as soon as it's connected, but nothing prevents us from extending the TCP response to send something back in the future, if we wanted to build on top of this beyond the k8s TCP probe.
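For context, the native Kubernetes TCP probe such a listener is meant to serve looks roughly like this (the port number matches the example used later in this thread; the timing values are illustrative, not a recommendation):

```yaml
readinessProbe:
  tcpSocket:
    port: 9400        # dedicated readiness port, open only while ready
  initialDelaySeconds: 10
  periodSeconds: 5    # the 5-second cadence discussed above
  failureThreshold: 3
```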

@mark-vieira
Copy link
Contributor

I'm wondering if this is something that should be explicitly enabled/configured as well. I'm just anticipating that naive security scanners (all of them) will flag an additional open port listening w/o TLS as an issue, regardless of its purpose or the fact that it exposes no sensitive data. We can pass the requisite config by default in our Docker image so that it's always enabled there (or just go based on es.distribution.type) and then folks can choose to expose that container port only if they wish.

@grcevski
Copy link
Contributor

grcevski commented Mar 9, 2022

I'm wondering if this is something that should be explicitly enabled/configured as well. I'm just anticipating that naive security scanners (all of them) will flag an additional open port listening w/o TLS as an issue, regardless of its purpose or the fact that it exposes no sensitive data. We can pass the requisite config by default in our Docker image so that it's always enabled there (or just go based on es.distribution.type) and then folks can choose to expose that container port only if they wish.

Currently we listen only on localhost, so this port will never show as open on external scanners, but it will be seen by scanners installed on the same host. It's easy to add a flag to enable the feature, my concern would be that it will add usage complexity. When the port isn't open we can't tell if Elasticsearch is down or the feature wasn't enabled.

I updated the PR to respond to the TCP based probe, performance is very good as expected (~10ms per call, a lot less CPU):

% time nc localhost 9400
nc localhost 9400  0.00s user 0.01s system 61% cpu 0.011 total

@mark-vieira
Copy link
Contributor

Currently we listen only on localhost, so this port will never show as open on external scanners, but it will be seen by scanners installed on the same host.

I have a suspicion that's commonly the case. At the very least we should expect this to come up, and we certainly would have to have a way to explicitly disable it, as many deployments will see a non-secure port as an absolute no-go.

It's easy to add a flag to enable the feature, my concern would be that it will add usage complexity.

I think we can have this enabled by default for Docker, which is the primary use-case here. We don't have the same problem of exposing a port there, as we could just omit it from the EXPOSE directive in our Dockerfile. "Using" this means configuring some kind of readiness probe anyway, so I don't think that's much of a concern.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Mar 9, 2022

I'm wondering if this is something that should be explicitly enabled/configured as well. I'm just anticipating naive security scanners (all of them) to see an additional open port listening w/o TLS to be flagged as an issue, regardless of it's purpose or that fact that it exposes no sensitive data.

... particularly since port 9400 happens to be the port used by a piece of malware known as Incommand. Sure it's ancient and Windows-only but these security scanners can be incredibly dumb.

Instead of a permanent TCP socket listener which responds with true or false, depending on whether ES is ready, I intend to keep the listener up while ready and closed when not.

It looks like by default we plan to handle a bind failure (e.g. because another node got there first) by scanning for the next available port. I don't think that makes sense in this context - it means that clients have to go and read the ports file to work out what port to use for each node, but then there's a race condition between checking the ports file and the node restarting. Can we just use a single port and consider it fatal if we can't bind to it? We agreed in #54354 that this behaviour is trappy; although it's a big deal to remove it where it already exists, it'd be great not to add a new case to fix.

Also it's possible that something else is already using this port on the machine on which Elasticsearch is running. If so, requiring this additional port to be free by default would prevent Elasticsearch from starting after the upgrade.

I therefore also think this should be disabled by default (outside of a container at least).

@mark-vieira
Copy link
Contributor

Can we just use a single port and consider it to be fatal if we can't bind to it?

Yes, especially since the kubernetes probe cannot be that dynamic, it's going to expect this thing to be exposed by a static port. There's also basically no chance that port will be taken by something else in a containerized scenario so there's no point in the extra complexity given that's the primary use case.

@grcevski
Copy link
Contributor

I've updated the PR so that the port must be manually enabled and configured. I've also tested with the k8s TCP probe, and it works as expected. The overhead of the Go-based probe is tiny: 9ms total to run and for us to respond.
