Readiness Mechanism #175
Yes, and it's not precisely the most orthodox procedure.
Using tags as a liveness/readiness check doesn't seem like a widely adopted practice, and might not be especially idiomatic with some cloud vendors, even if all of them support metadata. |
If logs are the "official" liveness report method for CML and it outputs machine-readable logs in JSON as part of iterative/cml#22, using containers as per #146 and parsing container logs could be fine. |
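If that route were taken, the readiness poll could look roughly like the sketch below. Everything in it is an assumption for illustration only: the container name `cml-runner` and the `{"msg": "runner ready"}` entry are not the actual CML log schema.

```bash
#!/bin/sh
# Hedged sketch: poll a container's JSON logs for a readiness message.
# The container name and the log field checked here are illustrative
# assumptions, not the actual CML log schema.
set -eu

CONTAINER="cml-runner"

# jq -R reads raw lines; fromjson? silently skips lines that are not JSON.
until docker logs "$CONTAINER" 2>&1 \
      | jq -R 'fromjson? | objects | select(.msg == "runner ready")' \
      | grep -q .; do
  echo "still waiting for the readiness log entry..."
  sleep 5
done
echo "runner reported readiness via its JSON logs"
```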
@DavidGOrtega, please keep in mind that we already use SSH outside of Terraform (DVC) and that tags require cloud–specific code in this repository for each vendor. SSH seems like a good polling mechanism until we have a clearer view of the machine API. |
Yeah, we can do SSH, but I would prefer tagging because we would not have to rely on SSH. Additionally, it's one of the ways that many vendors actually use; e.g. AWS marks spot instance termination this way, although unfortunately that is not the case for Azure. The mechanism would be an internal API call that creates the tag upon script termination, while the TPI waits for that flag to be ready. The other solution is, as you say, to wait for a flag via SSH, but I find that weaker. Let's discuss this ASAP; we could set up a quick meeting. |
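For AWS specifically, the "internal API call that creates the tag upon script termination" could be as small as the following sketch appended to the startup script. It is only an illustration under assumptions: the tag key `tpi-ready` and the script path are invented here, the AWS CLI must be installed, the instance profile must allow `ec2:CreateTags`, and IMDSv1 must be reachable (IMDSv2 would need a token first). Every other vendor would need its own equivalent.

```bash
#!/bin/bash
# Hedged sketch (AWS only): the startup script ends by tagging its own
# instance so the provider can poll for that tag.
set -uo pipefail   # deliberately no -e: tag even if the user script fails

bash /tmp/user_script.sh    # the user-provided script (path is an assumption)
STATUS=$?

# Discover the instance id and region from the instance metadata service.
INSTANCE_ID="$(curl -sf http://169.254.169.254/latest/meta-data/instance-id)"
REGION="$(curl -sf http://169.254.169.254/latest/meta-data/placement/region)"

# "tpi-ready" is an arbitrary tag key chosen for this example; storing the
# exit code as the value lets the waiting side tell success from failure.
aws ec2 create-tags \
  --region "$REGION" \
  --resources "$INSTANCE_ID" \
  --tags "Key=tpi-ready,Value=${STATUS}"
```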
I would prefer keeping SSH (more generically, running commands) until we have a better alternative, mainly because:
By readiness, I mean that the user–specified script starts correctly and does what it's intended to do; by liveness, that it's operational over time (e.g. generating a heartbeat); by termination, that it has stopped (gracefully or not); and by untethering, that it wishes to report the deployment as complete and will self–destroy without external aid.

It looks like @pmrowla wants to subscribe to the termination event and, no matter where and how this is implemented, somebody has to wrap the user script and do something like this:

Server side

```bash
bash user_script.sh
touch /tmp/done # or aws ··· "ready"
```

Client side
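A minimal client–side polling sketch, assuming SSH access to the machine and the `/tmp/done` marker written by the wrapper above; the host, key path and timeout below are made-up values for illustration:

```bash
#!/bin/sh
# Hedged client-side sketch: poll over SSH until the marker file exists.
set -eu

HOST="ubuntu@203.0.113.10"
KEY="$HOME/.ssh/tpi.pem"

for _ in $(seq 1 120); do      # roughly 10 minutes at 5-second intervals
  if ssh -i "$KEY" -o StrictHostKeyChecking=no "$HOST" test -f /tmp/done; then
    echo "user script finished"
    exit 0
  fi
  sleep 5
done

echo "timed out waiting for /tmp/done" >&2
exit 1
```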
Recapitulation: when to consider each approach (tags, ssh, or ssh-with-user_script). |
Are you implying that AWS tags spot instances that are being terminated with “real” resource tags? I wasn't aware of that feature and haven't found it in the documentation, but it's definitely interesting. 🤔 |
This works fine from the DVC standpoint, and I would not expect ….

For the DVC use case, it would be better if the script wrapping was done in TPI (and was always written to some standard location by TPI) instead of relying on the user to put it in their own ….

Basically, we have a default … |
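One way such TPI-owned wrapping could look, as a sketch under assumed paths (the `/tmp/tpi` directory and the `user_script.sh`, `log`, `exit_code` and `done` file names are inventions of this example, not actual TPI behaviour):

```bash
#!/bin/sh
# Hedged sketch of a wrapper that TPI itself could render to a fixed path
# (say, /tmp/tpi/main.sh), so that DVC never has to add its own bookkeeping.
set -u

mkdir -p /tmp/tpi

# Run the user-provided script, capturing its combined output and exit code
# in well-known locations that a client can later fetch over SSH.
sh /tmp/tpi/user_script.sh > /tmp/tpi/log 2>&1
echo "$?" > /tmp/tpi/exit_code

# Marker polled by the client to detect termination/readiness.
touch /tmp/tpi/done
```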
Terraform is a declarative way of creating resources, where a resource is an abstract concept. Anything can be a resource: a full web app can be a resource in Terraform, as the Terraform tutorial teaches you. When creating a provider, you are hiding all the internals from the user. We are not giving them a Packer config, but a provider with resources whose implementation is hidden from the users. The benefit and value of giving them within a resource is, for me, invaluable and pretty clear. TPI could offer Jupyter on demand, or hopefully, with MLEM, serve models that can scale up with Consul. |
Also, you are assuming that TPI in the DVC use case has to be waiting for the output, and that's not the case. THIS is something that we are fixing properly with the executor. The executor takes a piece of script and is able to execute it, exposing logs that can be gathered via resourceRead at any point of the lifecycle. |
I consider this to be perfect and desirable. We launch the machine, provision it with the runner, and check that the runner is OK; if that does not happen, the machine gets destroyed. And this is a must, because the machine is not able to destroy itself, and a machine without a valid runner is nothing but a waste. |
I will only say one word. Executor. |
TPI.py needs (2), (3), and (4), depending on the user requirements. (2) is like ….

TPI.go only needs to implement (2), I think; the rest is "easy" to implement in TPI.py over SSH, right? TPI.py can define ….

I definitely think (4) is beyond the scope of TPI.go. |
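One of those checks, liveness in the sense defined earlier in the thread (a heartbeat), could be implemented over SSH roughly like the sketch below; the host, key, heartbeat file and staleness threshold are all assumptions made up for this example:

```bash
#!/bin/sh
# Hedged sketch of a liveness check run over SSH: the wrapped script is
# assumed to touch /tmp/tpi/heartbeat periodically, and the task is
# considered dead once that file goes stale.
set -eu

HOST="ubuntu@203.0.113.10"
KEY="$HOME/.ssh/tpi.pem"
MAX_AGE=120   # seconds without a heartbeat before declaring the task dead

# Compute the heartbeat age on the remote side (GNU stat assumed).
AGE="$(ssh -i "$KEY" "$HOST" \
  'echo $(( $(date +%s) - $(stat -c %Y /tmp/tpi/heartbeat 2>/dev/null || echo 0) ))')"

if [ "$AGE" -gt "$MAX_AGE" ]; then
  echo "no heartbeat for ${AGE}s: task does not look live" >&2
  exit 1
fi
echo "last heartbeat ${AGE}s ago: task is live"
```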
I have been unable to find that tutorial, but such a tutorial seems to have an interesting take on resources. While resources can be as abstract (?) and all–encompassing as we want to define them, we should also consider that Terraform's power relies on small, indivisible resources (usually mapped 1:1 to cloud resources) that can be arranged into a dependency graph and managed individually. Of course, we can do this on our own and offer the same level of resiliency and optimization, but we would be (re)writing a considerable part of Terraform Core.
Perhaps “hidden” isn't the most appropriate word and “resource” isn't the most adequate abstraction, but I definitely see the value of simplifying infrastructure so users don't have to define everything by themselves. See also Terraform Modules for a widely adopted way of doing something pretty similar.
While this deviates considerably from the initial issue topic, it's definitely worth another thread. It would be awesome to use this provider (or whatever it becomes in the mid term) for all our cloud computing needs, both for training and serving. Assuming you mean docker–machine (?), containers versus virtual machines is still open to debate, especially after we chose to support Kubernetes.
If we used metadata to wait for completion (the original DVC requirement), the provider would indeed have to wait until the script finishes, short of implementing vendor–specific metadata support in DVC or PyTPI. If we choose SSH as a generic method, that's not the case, as PyTPI can be in charge of polling the resource without having to implement any vendor–specific code or forcing the provider to wait for 4 (completion) as per the definitions above.
Not sure if …
Isn't this what container orchestrators do?
Agreed, but then we're resorting to stateless provisioning and Terraform wouldn't be of any use. 🤔 |
I do not understand what you mean |
Isn't the cml_runner resource a perfect example of that definition? Can I not launch X runners with a Terraform file on any vendor? |
Terraform was designed to manage infrastructure in a declarative way. Users specify the final state they want to achieve, and the Terraform Core engine takes care of performing and retrying lots of requests in the least possible amount of time until it either achieves the desired state, times out, or encounters a non–retriable failure.

If we create a single "black box" resource that takes care internally of all the API calls, we can't take advantage of Terraform to manage drift and creation errors. If our provider gets interrupted somewhere in between the network creation and the instance creation, we have no way of knowing what has been created and what has failed, and no way of fixing it. Moreover, we aren't taking advantage of the Terraform state file either, and we're treating it as something ephemeral that can be thrown away after the destroy process, whose success we're taking for granted.

Terraform was designed to manage static infrastructure, and wasn't designed to tolerate interruptions during the …
Yes, if it were just a resource in the GitHub Terraform provider whose sole responsibilities were registering a runner on a repository/organization and returning the runner token as a computed field.
Here is the discrepancy: now we're performing a dozen API calls from a single resource against a different service (a user–chosen public cloud), and, if something goes awry, we have to handle it manually with lots of code, because Terraform can't do anything about it, nor can it know whether the failure was in one resource or another, because we're packing everything together. |
Posted the previous message just to avoid losing it. I'm putting this conversation on hold because there are so many … |
Closing in favor of #315 because:
|
As stated in #315, that's not true. The main reason has always been to be able to log the process of registering the runner. Aside from that, this feature was created so we can be sure that we control the finalisation of the startup script; it's indeed a feature already requested in clouds like Azure. |
After #460, it looks like the current readiness check is effective enough and able to handle an unexpected failure of the script. This feature requires polling anyway, so the current SSH approach is good enough. |
Right now we have a very limited mechanism to determine the readiness of a machine: we parse `journalctl` via SSH.

Cons

…

Proposal

We could add, as the final stage of the startup script, a tagging process, so that the TPI waits for that tag instead of checking the vendor readiness. Instead of parsing `journalctl`, we would wait for and check the existence of such a tag.
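On AWS, the waiting side could look roughly like the CLI sketch below (the provider itself would do this through the vendor SDK; the tag key `tpi-ready` and the instance id are assumptions invented for this example, matching the tagging sketch earlier in the thread):

```bash
#!/bin/bash
# Hedged sketch of "wait and check the existence of such tag" on AWS.
set -euo pipefail

INSTANCE_ID="i-0123456789abcdef0"
REGION="us-east-1"

# describe-tags reports "None" while the startup script has not tagged the
# instance yet; keep polling until any value shows up.
while true; do
  VALUE="$(aws ec2 describe-tags \
    --region "$REGION" \
    --filters "Name=resource-id,Values=${INSTANCE_ID}" \
              "Name=key,Values=tpi-ready" \
    --query 'Tags[0].Value' --output text)"
  [ "$VALUE" != "None" ] && break
  echo "startup script still running..."
  sleep 10
done
echo "tpi-ready=${VALUE}: startup script finished"
```

Other vendors would need their own equivalent call, which is the cloud–specific code mentioned earlier in the thread.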