Introduce an unauthenticated endpoint for readiness checks #81168
Comments
I opened #81177 to capture the request to remove |
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Even though this only proposes to return a status code and no further details about the cluster state, I think some admins would feel uncomfortable having an unauthenticated API exposed by default. I think my preference would be either:
I believe the ECK team would opt for (1), which seems fine to me. The 3rd option would be to try and roll this into anonymous access. I think that's do-able, but it has some tricky bits.
I don't think it's worth trying to make this part of anonymous. I think we want to say "anonymous is for admins to configure" and "readiness probes are for operators" and keep them separate. I proposed this in a conversation elsewhere, but copying it here in order to keep things in 1 place... There are 2 related but distinct parts of the problem that need to be solved:
The latter issue matters because just trying to avoid authentication is going to end up backfiring - if you have no user then nothing will work. It's impossible to call an action if there's no user, and I'd be very strongly against relaxing that constraint. So, what we could do instead is automatically authenticate the request as an internal user if it matches the right endpoint. I would approach that as a new Authenticator at the top of the chain that only looks at the URL and, if it matches, authenticates as a new internal user (like AsyncSearchUser). I propose the top of the chain so that the actions always run as this internal user and we never process any credentials that might have been provided in the request. That makes the URL as close to "unauthenticated" as we can get, and it doesn't fail (on security grounds) even if you happen to provide credentials that authenticate as a user with no privileges. |
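For illustration, a minimal sketch of the idea of a URL-matching authenticator at the head of the chain. The interface, class name, path, and internal user below are hypothetical simplifications for this sketch, not the actual Elasticsearch security internals:

```java
import java.util.Optional;

// Hypothetical simplification: an authenticator that runs first in the chain,
// matches only on the request path, and short-circuits the rest of the chain
// by authenticating the request as a dedicated internal user.
interface Authenticator {
    // Returns a principal if this authenticator handles the request, empty otherwise.
    Optional<String> authenticate(String path, Optional<String> credentials);
}

final class ReadinessAuthenticator implements Authenticator {
    static final String READINESS_PATH = "/_internal/readiness"; // illustrative path
    static final String READINESS_USER = "_readiness";           // illustrative internal user

    @Override
    public Optional<String> authenticate(String path, Optional<String> credentials) {
        // Any credentials that were sent are ignored: they are never parsed or
        // validated here, so the endpoint behaves as if it were unauthenticated.
        if (READINESS_PATH.equals(path)) {
            return Optional.of(READINESS_USER);
        }
        return Optional.empty(); // fall through to the normal authenticator chain
    }
}
```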
I think I would prefer a 200 OK status as long as a status can be derived. Instead we can include a […]. This also seems more correct, in that the node could happily figure out whether it is ready to accept requests or not, which is the question asked when using the API. Can you elaborate on the reasons for using a 503 status response to signal "not ready" from a healthy node? |
This from the docs is relevant here:
Also, from a more philosophical point of view, the purpose of a readiness check is to determine whether the receiver should be used to route requests to. Therefore any error code from anywhere in the chain the request is sent through is an indication that the service is not ready to serve requests. I don't think the caller needs to know where the error code came from, just that it should not route requests to that node. Debugging what went wrong is not for the caller to do and can be satisfied by observability and other debugging (such as testing the request at different points in the chain). |
This issue specifically mentions "readiness probes". There are clearly other kinds of probes, e.g. liveness probes, startup probes. Is there any interest in these probes? Can they be served from the same endpoint? |
From what I have read (though I suspect others will be better authorities on this), startup probes are often configured to call the same endpoint as the readiness probe but with a longer timeout before calling the node unhealthy. I'm not sure, but given there is more than likely full control of the container in these kinds of deployment, I wonder if liveness is satisfied by probing whether the process is running, particularly as Elasticsearch nodes take the approach of exiting if they encounter unrecoverable errors, and therefore do not require an HTTP API. |
Note also that today we don't even start listening on the HTTP port until we've applied a cluster state from an elected master, so simply checking whether the HTTP port is open is also a reasonable startup check. The wait is currently bounded with a default timeout of 30s which starts counting from fairly late on in the startup process anyway; arguably we could just wait indefinitely here, I don't think there's much you can do over HTTP before joining the cluster. Alternatively we could consider a 30s wait to be long enough that we've "started up" even if we're still not really "ready". |
A few of us discussed this today and one option we considered was opening a distinct HTTP port specifically for this API. AIUI Kubernetes hits the readiness endpoint in a way that means the port wouldn't need to be exposed outside the container, so a distinct port would be a neat way to avoid inappropriate access from clients without needing authentication. It also means that we wouldn't need to worry about authentication or TLS - in particular if Elasticsearch is configured to require TLS client certificates then authentication happens at handshake time which makes exposing an unauthenticated endpoint rather tricky. It's still technically possible, TLS has the facility to redo its handshakes after receiving the HTTP request, but it's not pretty and I worry that it'd be a pretty substantial change to support this flow. We were not convinced this is the right approach but thought it worth mentioning as a possible option. |
Yeah, if that is the k8s way, I see no issue in using 503 here. |
I'm not sure I understand. An HTTP scanner would have to be taught about this endpoint. But since this endpoint effectively does nothing (no request to parse, empty response, etc.), there isn't a reason to teach an HTTP scanner about this endpoint, or, if it is taught about it, it's a non-issue?
I'd rather we go the other way. The simplest solution is that this endpoint exists and is unauthenticated, and that's it. If we have to, we can layer on additional options (e.g., to require authn/authz for this endpoint), but from my perspective the default option should be the endpoint exists, and is unauthenticated. |
Thanks, I considered this option as well and ultimately I convinced myself it wasn't worth the complexity of supporting binding on another port, serving some endpoints from that port and not the default, etc. |
We would get security reports that Elasticsearch includes an unauthenticated URL that discloses information, all of which takes time from people who have far more useful things to do. I think secure by default should mean no unauthenticated APIs by default, and they need to be opt-in. Otherwise we end up in a totally hypothetical position where one of our dependencies has a vulnerability, and we'd be protected if all our APIs required authentication, but they don't because we opened up 1 unauthenticated API to support an orchestrator, and we need to explain to users that they have an unauthenticated access path to their cluster that they don't need, but we enabled automatically. |
I was reading the k8s probe docs, and I noticed that while HTTP probes don't allow authentication credentials to be specified, you can set arbitrary HTTP headers. I don't know how helpful it would be, but we could do something like:

```yaml
httpHeaders:
  - name: Authorization
    value: ${SOME SECRET VALUE}
```
Though it is possible to add headers, an issue here is operational complexity. To use an authenticated endpoint, a secret needs to be maintained and distributed for each deployment. This adds significant complexity in Cloud. Also bear in mind that it's not just the ECK operator that needs to access this endpoint but also observability tools such as heartbeat (which may or may not be running near the deployment itself). Having to maintain credentials and distribute them to heartbeat instances is certainly possible, but it adds quite a bit of operational complexity for an endpoint that does not emit any information apart from the fact that it's ready to receive requests. |
Would it work if we added the readiness endpoint, authenticated by default just like any other API, and added a boot setting option to disable authentication for this endpoint if the customer wants it, e.g. Cloud, or anyone else running Kubernetes on their own or another platform where they would benefit from having this endpoint unauthenticated? |
My comment was strictly about scans, in response to your statement that's where they'd come up. Let's take this entire thread to a high-bandwidth discussion and come back to this thread to communicate decisions. I'll reach out to set that up. |
Pinging @elastic/es-distributed (Team:Distributed) |
A few weeks ago we had the high-bandwidth discussion that Jason mentioned and concluded:
|
Wouldn't it make sense to use a unix domain socket for this? |
Now that ES is compiled with --release 17 we can make use of Java 17 APIs, so we can use Unix-domain socket channels from Java itself, which should simplify things a lot (and the wire protocol does not need to be HTTP). |
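For reference, a minimal sketch of the JDK's built-in Unix-domain socket support (available since Java 16). The socket path and the accept-and-close behaviour as a bare "ready" signal are illustrative assumptions, not the actual implementation:

```java
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadinessSocket {
    public static void main(String[] args) throws Exception {
        Path socketPath = Path.of("/tmp/es-readiness.sock"); // illustrative path
        Files.deleteIfExists(socketPath);
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);

        // The socket exists only while the node considers itself ready; a probe
        // that can connect treats the node as ready, with no HTTP and no
        // authentication in the wire protocol.
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(address);
            while (true) {
                try (SocketChannel probe = server.accept()) {
                    // Accept and immediately close: the successful connection is the signal.
                }
            }
        } finally {
            Files.deleteIfExists(socketPath);
        }
    }
}
```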
We did mention that idea but we didn't want to make this feature Unix-only, nor really to have to maintain a separate implementation for Windows. However a Unix domain socket does have advantages such as even tighter access controls without needing authentication in the wire protocol. Nobody mentioned in the meeting that this is much better supported than I thought (TIL), which pretty much invalidates the arguments against it. We'll be discussing this topic again soon to work out the division of labour, but let's consider this question again: I am now +1 on using a Unix domain socket instead. |
If the main use-case here is container health then we don't really have to worry about Windows. Sure, a "generic" readiness API might be helpful in other contexts but I wouldn't necessarily rule it out here if it's the most appropriate solution for the use case at hand, which is container orchestration. For at least the foreseeable future, in that world, the underlying system is Linux. |
Readiness checks are per-node. The notion of a healthy "cluster" is orthogonal to this. The point of the endpoint should not be to determine cluster health but whether or not an individual node is capable of responding to user requests. That response might be "the cluster is borked". If a node cannot join the cluster, to me that means it's not "ready". |
This is kind of my point tho: "has joined a cluster" is a health property. Nodes can respond to requests as soon as the HTTP port is open whether they are in a cluster or not. Nodes cannot independently distinguish between "the entire cluster is borked" and "I cannot join an otherwise perfectly-healthy cluster", and it is much more common that the problem is "the entire cluster is borked". In that situation if we require nodes to join a cluster before they claim to be ready we would have zero ready nodes and therefore no way for clients to get information out of the cluster about its health status. It's extremely unusual, and definitely a bug, for a single node to fail to join an otherwise-healthy cluster within the |
How can a node that cannot join the cluster report on the cluster health status though? It can only really tell you its own status.
Fair enough, but the question is whether we should consider a node "ready" until it has done so. I presume if we start routing requests to nodes that haven't joined the cluster yet the result is going to be failures (depending on the nature of the request). As mentioned above, what useful information can we get diagnostically from nodes that can't talk to the cluster? |
With no cluster there's no cluster health to report of course, but we can still determine a bunch of information in this situation: is it unable to discover other nodes? If there are other nodes out there, what are their statuses? Did we lose ≥½ of the masters? Is one node winning elections and then failing? Is there a stable master rejecting its join requests? The only non-bug in that list is "we lost ≥½ of the masters" but this is something that the orchestrator can't accurately determine itself. It needs insight into the state of the nodes that are trying (and failing) to form the cluster. In particular, if a node simply can't talk to the rest of the cluster, but the rest of the cluster is healthy, then this could well be an orchestrator bug. |
Yes, but those failures will not be substantially different from the failures you get sending requests to nodes that joined the cluster and then fell out of it again, and yet we consider this latter group of nodes to be |
IMO, we shouldn't. Readiness checks continue even after the container identifies as ready. If at some point the container loses connection with the cluster the readiness check will then fail and the orchestrator will stop directing traffic to that node. |
With that model, what would you expect to happen if the cluster loses quorum? My understanding is that all nodes would fail their readiness checks and be removed from routing, leaving no nodes to which to direct any requests. Bear in mind that this is a failure mode that I see more often than a single node being isolated from the cluster. I would rather have Elasticsearch be able to report failures to clients in this case, rather than sending them a generic "there are no ready nodes" error at the orchestration layer. |
Yeah, we may be struggling with a lack of flexibility here and maybe we are getting too lost in the weeds of how to define "ready". The reality is we can tweak the definition over time in a way that doesn't impact how it integrates with the orchestrator. As you say, in practice it might make more sense to adopt a looser definition of "ready", as reporting something to the user is generally better than nothing. I'll just reiterate a few things:
|
💯 from me. |
Double-checking my understanding of the PR (just from a very quick look): for this to work as a TCP readiness check in K8s, we want the kubelet to connect to the TCP port and just check whether it can connect or not. Do I get correctly that we would instead invoke a Java Client (new JVM process) as an |
Hi @sebgl, as the PR is implemented right now, we'd need to use the Java command line client. It takes about 300-400ms on my Intel-based Mac to verify the readiness status. If this is too slow, I think I can try an approach where we keep the TCP port open/closed instead to signify the readiness status: being able to connect would mean ES is ready, port down would mean not ready. You are correct, the overhead of invoking the Java process is not negligible; it would definitely consume CPU cycles if it's invoked too often. |
I continue to be in favor of a CLI-based solution here, even if it is a simple wrapper around what is effectively a TCP readiness check, because it allows us to evolve the conditions that determine "readiness" over time w/o affecting the orchestrator. I don't think latency is a concern, but as stated, launching a JVM is non-zero overhead so we certainly don't want the polling interval here to be too aggressive. @ChrisHegarty might have some ideas to minimize this overhead. |
Fwiw we're currently running the readiness probe every 5sec: https://github.com/elastic/cloud-on-k8s/blob/0264584373d3d2d7d5f80fbdd6fdf5fb853150e1/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L17-L30 (we can definitely tweak these settings)
Considering users may run small ES instances on small CPUs (e.g. "0.5 cpu", whatever that means), I suspect that when looking at the CPU graphs for the Pod we may clearly see the impact of the readiness probe running every 5 sec?
It's hard to conclude anything without real numbers in a real environment, but my intuition is that the overhead here, for something users shouldn't even notice when looking at the container CPU graphs, might be a bit too much. |
The orchestrator needs something that either succeeds or fails; there's no condition here. Either it's ready, or it's not. If we were to expose a TCP probe that succeeds or fails, a file that exists or doesn't, an http endpoint that returns |
That's fair, but it doesn't solve the authentication issue which I think is the other primary motivation here. My point is the "wrapper" could be backed by any of the things you describe w/o affecting the orchestrator. There are also other benefits to having such a CLI tool, as it makes it easier to integrate into other systems/tools. I do share your concerns on the cost of running a Java console app every 5 seconds though. This is likely going to incur a non-trivial cost. We could poll less often but I'm not sure what the convention is here. Certainly we don't want to do it too infrequently or we defeat the purpose and end up sending requests to dead nodes for up to whatever that interval is. |
To add, I think we ruled out the naive "file exists" check because we can't easily guarantee that we clean it up between restarts, or in the case of an unexpected crash, so we would likely end up with false-positive readiness readings. |
I have some more data and I personally think that we should follow the TCP approach that @sebgl suggested. The readiness-probe-cli takes 300-400ms, but with about 1.5 CPU used. We can reduce this overhead by limiting the JIT to Tier 1 compilation only and turning class sharing on, but in my experience we'll still be in the same ballpark, perhaps reducing the CPU to 1.25. On a small container with 1 CPU or less, using the probe every 5 seconds will create a significant overhead, e.g. 10% or more. Given that we expect we won't be changing ready/not-ready state that often on the Elasticsearch side, I think it's reasonable to accept the overhead of bringing up and dropping the listener thread, and to use the port being open as the signal of ES being ready or not ready. |
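A minimal sketch of that approach, assuming a standalone helper where the listener is bound only while the node is ready (the class, method names, and loopback-only bind are illustrative, not the actual PR):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Illustrative helper: the readiness port is bound only while the node is ready,
// so a plain TCP connect (e.g. a k8s tcpSocket probe) is the entire protocol.
public class ReadinessService {
    private final int port; // illustrative, e.g. a dedicated readiness port
    private ServerSocketChannel listener;

    public ReadinessService(int port) {
        this.port = port;
    }

    public synchronized void markReady() throws IOException {
        if (listener != null) return;
        final ServerSocketChannel channel = ServerSocketChannel.open();
        channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        listener = channel;
        Thread acceptor = new Thread(() -> {
            try {
                while (channel.isOpen()) {
                    try (SocketChannel probe = channel.accept()) {
                        // Accept and close immediately: connecting successfully is the signal.
                    }
                }
            } catch (IOException e) {
                // Listener was closed by markNotReady(); exit the accept loop.
            }
        }, "readiness-listener");
        acceptor.start();
    }

    public synchronized void markNotReady() throws IOException {
        if (listener == null) return;
        listener.close(); // closing the channel ends the accept loop and drops the port
        listener = null;
    }
}
```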
Can you clarify here? So we'd be going to just use the fact that we can connect on that port as the indicator that ES is ready? And on the ES side of things we'd only open the port when ES is indeed ready? If we are going through the work of opening up a completely separate port for this purpose, could we not make it HTTP as well? That way our response could be more than just binary? Or is providing an unauthenticated HTTP endpoint troublesome given our current architecture? FYI, I still think it's worth providing a CLI tool even if it's not used by the k8s orchestrator, so long as it's trivial to implement. It helps in other use cases. |
It's trivial in that it isn't complicated code, but there is a lot of boilerplate for something that we have no concrete uses for. I would be more inclined to drop the cli if we are going to use tcp. There are other ways to check if a tcp connection can be made locally, like netcat. |
Dropping a few notes to this already lengthy discussion. I assume that whatever solution we arrive at should interoperate with the native k8s readiness probe, without any customisation. This imposes some constraints on any possible ES solution - which are not unreasonable.
Given the above, I don't see how any existing open port or service that ES currently offers can really satisfy the readiness behaviour - as we would want it from the perspective of k8s. |
So, to clarify my proposal here, I'm proposing changing my PR in the following way:
|
I'm wondering if this is something that should be explicitly enabled/configured as well. I'm just anticipating that naive security scanners (all of them) will flag an additional open port listening w/o TLS as an issue, regardless of its purpose or the fact that it exposes no sensitive data. We can pass the requisite config by default in our Docker image so that it's always enabled there (or just go based on |
Currently we listen only on localhost, so this port will never show as open on external scanners, but it will be seen by scanners installed on the same host. It's easy to add a flag to enable the feature; my concern would be that it will add usage complexity: when the port isn't open we can't tell if Elasticsearch is down or the feature wasn't enabled. I updated the PR to respond to the TCP-based probe; performance is very good as expected (~10ms per call, a lot less CPU):
|
I have a suspicion that's commonly the case. At the very least we should expect this to come up, and we certainly would have to have a way to explicitly disable it, as many deployments will see a non-secure port as an absolute no-go.
I think we can have this enabled by default for Docker, which is the primary use-case here. We don't have the same problem of exposing a port there as we could just omit it from the |
... particularly since port 9400 happens to be the port used by a piece of malware known as Incommand. Sure it's ancient and Windows-only but these security scanners can be incredibly dumb.
It looks like by default we plan to handle a bind failure (e.g. because another node got there first) by scanning for the next available socket. I don't think that makes sense in this context: it means that clients have to go and read the ports file to work out what port to use for each node, but that means there's a race condition between checking the ports file and the node restarting. Can we just use a single port and consider it to be fatal if we can't bind to it? We agreed that this behaviour is trappy in #54354; although it's a big deal to remove it where it exists already, it'd be great not to add a new case to fix. Also it's possible that something else is already using this port on the machine on which Elasticsearch is running. If so, requiring this additional port to be free by default would prevent Elasticsearch from starting after the upgrade. I therefore also think this should be disabled by default (outside of a container at least). |
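A small sketch of the suggested bind behaviour, assuming a helper method where an unavailable port is fatal rather than triggering a scan (the method name and exception handling are illustrative):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class FixedReadinessPort {
    // Bind exactly the configured port; do not fall back to scanning for a free one.
    public static ServerSocketChannel bindOrFail(int configuredPort) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        try {
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), configuredPort));
            return channel;
        } catch (IOException e) {
            channel.close();
            // Treat an unavailable port as fatal so probes never need to discover
            // which port a given node ended up on.
            throw new IllegalStateException("readiness port [" + configuredPort + "] is not free", e);
        }
    }
}
```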
Yes, especially since the kubernetes probe cannot be that dynamic, it's going to expect this thing to be exposed by a static port. There's also basically no chance that port will be taken by something else in a containerized scenario so there's no point in the extra complexity given that's the primary use case. |
I've updated the PR so that the port has to be manually enabled and configured. I've also tested with the k8s TCP probe; it works as expected. The Go language overhead is tiny: 9ms total to run and for us to respond. |
In the context of Kubernetes, readiness probes are used to determine whether or not to route traffic to a pod. In the context of Elasticsearch, this is useful since Elasticsearch takes some time after startup before it is ready to receive requests, and when it is shutting down, it would be better to route those requests elsewhere.
Today there is not a dedicated endpoint for this, which means that operators have to simulate it by relying on an endpoint like the main endpoint (/). This is less-than-ideal since the behavior is not dedicated, and also because this endpoint requires authentication. Authentication presents a challenge since it requires that credentials be available to the probe that executes the readiness check. Making those credentials available means that an operator such as ECK can not rely on native HTTP probes in Kubernetes. Finally, we would want a solution to this in 7.x since we would like to preserve the option of simplifying our approach to managing 7.x deployments via Kubernetes.

Ideally, there would be an endpoint that:
As far as the behavior of this endpoint, having discussed this with @dakrone, we have settled on the following proposal:
Note that it would be fine if response bodies to these requests were empty (JSON: {}), so that this endpoint is not otherwise revealing information.

(@dakrone, please feel free to correct me if I got any of the details of our conversation wrong. @tvernum feel free to comment on the unauthenticated aspects, which we discussed previously.)
Lastly, this would be a precursor to removing curl from the Docker image, which I've previously discussed with @pugnascotia, but I will open a separate issue to capture our discussion.

Relates #50187