Readiness Mechanism #175

Closed
DavidGOrtega opened this issue Aug 9, 2021 · 22 comments · Fixed by #460
Labels
cloud-common (Applies to every cloud vendor), discussion (Waiting for team decision), resource-runner (iterative_runner TF resource)

Comments

@DavidGOrtega
Contributor

DavidGOrtega commented Aug 9, 2021

Right now we have a very limited mechanism to determine the readiness of a machine: We parse journalctl via ssh.

Cons

  • In the machine resource, the startup scripts are processed after the machine is already marked as ready by the vendor. So in cases where we need our stack to be ready (created via the startup script), we have to add an SSH wait mechanism to determine whether it is really ready. This behaviour is inherited from the vendors themselves, which report the ready state before executing the startup script.
  • For every resource we need to maintain extra code, and in the machine resource this is not feasible.

Proposal

We could add in the final stage of the startup script a tagging process so the TPI will wait for that tag instead of checking the vendor readiness. So instead of having

statusInput := ec2.DescribeInstancesInput{
	InstanceIds: []*string{aws.String(instanceID)},
	Filters: []*ec2.Filter{
		{
			Name:   aws.String("instance-state-name"),
			Values: []*string{aws.String("running")},
		},
	},
}

we would wait and check for the existence of such a tag
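For illustration, a minimal sketch of such a wait on AWS, assuming a hypothetical tpi-ready tag written by the startup script (shown with the AWS CLI for brevity; inside the provider it would be the equivalent Go SDK call):

# Sketch: poll until the startup script has tagged the instance as ready.
# "tpi-ready" is a placeholder key, not an agreed-upon name.
until aws ec2 describe-instances \
    --instance-ids "$INSTANCE_ID" \
    --filters "Name=tag:tpi-ready,Values=true" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text | grep --quiet .; do
  sleep 10
done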

@DavidGOrtega added the enhancement (New feature or request), p1-important (High priority) and cloud-new (New cloud support request) labels on Aug 9, 2021
@0x2b3bfa0
Member

Right now we have a very limited mechanism to determine the readiness of a machine: We parse journalctl via ssh.

Yes, and it's not precisely the most orthodox procedure.

We could add in the final stage of the startup script a tagging process so the TPI will wait for that tag instead of checking the vendor readiness. So instead of having [...] we would wait and check for the existence of such a tag.

Using tags as a liveness/readiness check doesn't seem like a widely adopted practice, and might not be especially idiomatic with some cloud vendors, even if all of them support metadata.

@0x2b3bfa0
Member

0x2b3bfa0 commented Aug 25, 2021

If logs are the "official" liveness report method for CML and it outputs machine-readable logs in JSON as part of iterative/cml#22, using containers as per #146 and parsing container logs could be fine.

@DavidGOrtega added the p0-critical (Max priority, ASAP) label and removed the p1-important (High priority) label on Sep 15, 2021
@0x2b3bfa0
Member

@DavidGOrtega, please keep in mind that we already use SSH outside of Terraform (DVC) and tags require cloud–specific code on this repository for each vendor. SSH seems like a good polling mechanism until we have a clearer view of the machine API.

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 15, 2021

@DavidGOrtega, please keep in mind that we already use SSH outside of Terraform (DVC) and tags require cloud–specific code on this repository for each vendor. SSH seems like a good polling mechanism until we have a clearer view of the machine API.

Yeah, we can do SSH, but I would prefer tagging because we would not have to rely on SSH. Additionally, it's one of the ways that many vendors actually use themselves; e.g. AWS marks spot instance termination, though unfortunately this is not the case for Azure.

The mechanism would be an internal API call that creates the tag upon script termination, while TPI waits for that tag to appear.
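A minimal sketch of that tagging step on AWS, assuming the same hypothetical tpi-ready key and an instance profile that allows ec2:CreateTags (these would be the final lines of the startup script):

# Sketch: the instance tags itself once the startup script reaches its end.
# IMDSv1 shown for brevity; the tag key is a placeholder.
INSTANCE_ID="$(curl --silent http://169.254.169.254/latest/meta-data/instance-id)"
aws ec2 create-tags --resources "$INSTANCE_ID" --tags Key=tpi-ready,Value=true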

The other solution is, as you say, to wait for a flag via SSH, but I find this weaker. Let's discuss this ASAP; we could set up a quick meeting.

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 15, 2021

I would prefer keeping SSH (more generically, running commands) until we have a better alternative, mainly because:

  • Currently, we already use SSH extensively, both internally and in the projects that consume our library: iterative/tpi, iterative/dvc.

  • Terraform doesn't expose any kind of API beyond command line, and using cloud–specific metadata for readiness, liveness, termination and untethering events can complicate the interface with the other tools, like in the tpi package. They would have to either query the cloud directly or parse the TF_LOG=info terraform apply standard output.

  • If we have to define those events and map them to the status of the custom script, it doesn't really matter – to my mind – if we use SSH or not. Polling is more or less the same and similarly reliable, and most orchestrators use a command as readiness/liveness check anyway.


By readiness, I mean that the user–specified script starts correctly and does what it's intended to do; by liveness that it's operational over time (e.g. generating a heartbeat); by termination that it has stopped (gracefully or not); and by untethering that it wishes to report the deployment as complete and will self–destroy without external aid (e.g. what cml-runner does).

It looks like @pmrowla wants to subscribe to the termination event and, no matter where and how this is implemented, somebody has to wrap the user script and do something like this:

Server side

bash user_script.sh
touch /tmp/done # or aws ··· "ready"

Client side

while ! ssh ··· stat /tmp/done; do sleep 10; done # or aws ··· "==ready"

@casperdcl
Contributor

tags

  • pro: client doesn't have to write config
  • con: "broken" user_script.sh will mean silently broken stack
  • con: not supported by every vendor

ssh

  • con: client needs to poll
  • pro: "broken" user_script.sh will mean no connection will happen
  • pro: could address pulling logs for free

ssh-with-user_script

  • pro: client doesn't have to write config (TPI.go initiates an SSH connection to run user_script.sh)
  • pro: "broken" user_script.sh will be handled by TPI.go
  • pro: could address pulling logs for free
  • likely "best" approach for https://github.com/iterative/tpi & https://github.com/iterative/dvc /CC @pmrowla?

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 15, 2021

Recapitulation

When to consider terraform apply successful

  1. When the cloud vendor acknowledges the creation of the machine (creation)
  2. When the startup_script starts running and declares itself ready (readiness)
  3. When the startup_script emits a heartbeat or responds to a periodic query (liveness)
  4. When the startup_script finishes the execution (completion)

Terraform was originally designed to deal only with 1 (creation), and save the internal state to a file as soon as it matches the desired one.

What happens if we consider 2 (readiness) or 3 (liveness)

[Ab]using the provider to consider starting the startup_script as part of the desired state considerably increases the risk of infrastructure drift for every second that terraform apply keeps running on unreliable equipment.

We're creating all the resources at once, sequentially, without taking advantage of Terraform's dependency graph and planner — which, incidentally, are the only core features of Terraform beyond printing colored text — so we can't detect (nor correct) any drift produced by interrupted runs.

Note: written by the criminal hand who merged #70 to support SSH readiness checks for the cml_runner resource, which always runs in CI/CD environments that are pretty reliable; not the case with DVC machines, launched from personal computers.

What happens if we consider 4 (completion)

What do we need Terraform for? If we keep the infrastructure tethered to a client process during all its lifecycle, we don't need state management beyond the process' memory.

This already exists and it's called Hashicorp Packer: it provisions a machine in the cloud, runs a user–specified script, streams its output with SSH, and destroys all the resources once it finishes. It just happens that it was designed to create disk images of the final state, but that functionality can be disabled.

Note: as suggested in #164 (comment), it would be possible to adopt a hybrid approach, letting cloud vendors create the resources for us and offering a resilient client that can be interrupted without consequences, but this is a broader, unrelated topic that deviates from our current code base and implies a Copernican Turn.

Suggested solution for DVC machines

It looks like DVC machines are intended to run a script, wait until it finishes running — again, 4 (completion) as per the list above — and return/stream the output or, at least, know its exit code.

is there an standard/optimal way to determine whether a terraform startup_script has finished running on some remote instance

And the obvious answer is, of course, wrapping the user–provided script as suggested on #175 (comment) and touching a flag file as soon as it stops running. Then, the client can use SSH to poll the script status by reading the flag file.

Deployment

resource "iterative_machine" "machine" {
  name = "machine"
  cloud = "aws"

  startup_script = <<-END
    #!/bin/bash
    touch /tmp/script.started
    bash &> /tmp/script.log <<< "echo this is the user script"
    echo $? > /tmp/script.exit.code
  END
}

Polling

echo waiting for the script to start running...
while ! ssh ··· stat /tmp/script.started; do sleep 10; done
echo waiting for the script to finish running...
while ! ssh ··· tail --follow /tmp/script.log; do sleep 10; done
echo retrieving the script exit status...
while ! ssh ··· cat /tmp/script.exit.code; do sleep 10; done

Note: recommending custom scripts and flag files over systemd because the latter won't run on Kubernetes and other container orchestrators.

Using cloud resource metadata instead of SSH

I like the @DavidGOrtega proposal of using metadata (tags, labels, annotations, ...) to signal the machine status. Not for internal readiness checks, but rather for users looking at the cloud console or command–line tool, and definitely not to be implemented in the short term.

Why it could not be a good fit for the DVC use case

Unfortunately, it doesn't seem especially appropriate for @pmrowla's DVC machine use case, because querying metadata requires vendor–specific code, and we would be forced to use the provider to poll for 4 (completion).

Implementing the vendor–specific polling code outside the provider (e.g. in DVC or PyTPI) doesn't seem a realistic alternative, both because of the complexity and because of the effort duplication.

Disadvantages over command execution (e.g. SSH) for internal use

Metadata is not intended to transfer logs, and we would need an external logging service or a separate SSH channel. Equally, it's not intended to run commands nor perform any kind of complex tasks.

After RFC 1149, I don't see it as impossible to implement a tag-based communication protocol, but it doesn't seem especially practical either.

Understanding ssh-with-user_script from #175 (comment)

- pro: client doesn't have to write config (TPI.go initiates an SSH connection to run user_script.sh)
- pro: "broken" user_script.sh will be handled by TPI.go
- pro: could address pulling logs for free
- likely "best" approach for https://github.com/iterative/tpi & https://github.com/iterative/dvc /CC @pmrowla?

If this provider spawns an instance first, waits for it to accept SSH (generic command execution) connections, runs the user script and streams the standard input/output/error channels back to the client, we would again be waiting for 4 (completion).

Additionally, running long, non–retriable tasks through SSH would add a new single point of failure, as SSH commands get killed as soon as the network connection fails.

If we choose this approach, we would need to nohup the process and, basically, replicate the startup_script functionality.
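Roughly, that replication would look like the sketch below, where ··· stands for the usual SSH options as in the earlier examples and the paths are placeholders:

# Sketch: detach the user script from the SSH session so that a dropped
# connection doesn't kill it, then poll/stream /tmp/script.log as shown above.
ssh ··· 'nohup bash user_script.sh > /tmp/script.log 2>&1 < /dev/null & echo $! > /tmp/script.pid'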

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 16, 2021

Additionally, it's one of the ways that many vendors actually use themselves; e.g. AWS marks spot instance termination

Are you implying that AWS tags spot instances that are being terminated with “real” resource tags? I wasn't aware of that feature and haven't found it in the documentation, but it's definitely interesting. 🤔

@pmrowla

pmrowla commented Sep 16, 2021

And the obvious answer is, of course, wrapping the user–provided script as suggested on #175 (comment) and touching a flag file as soon as it stops running. Then, the client can use SSH to poll the script status by reading the flag file.

This works fine from the DVC standpoint, and I would not expect startup_script to block terraform apply until the script is completed. (My original question in slack was mainly intended to double check whether or not something like this existed already.)

For the DVC use case, it would be better if the script wrapping was done in TPI (and was always written to some standard location by TPI) instead of relying on the user to put it in their own startup_script code.

Basically, we have a default startup_script that we will use in dvc machine, and we can do this kind of flag-file signaling on the DVC side in that default script. But we also want to support users providing their own scripts (and ideally we would not require users to always copy/paste the proper flag-file generation bits into their own scripts).
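For illustration, the wrapper generated by TPI could look roughly like this (file names and paths are placeholders, not a settled interface), so users never have to add the flag-file logic to their own scripts:

#!/bin/bash
# Hypothetical wrapper written by TPI around the user-provided script.
touch /tmp/tpi.script.started
bash /tmp/tpi.user_script.sh &> /tmp/tpi.script.log
echo $? > /tmp/tpi.script.exit.code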

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 16, 2021

What do we need Terraform for? If we keep the infrastructure tethered to a client process during all its lifecycle, we don't need state management beyond the process' memory.

Terraform is a declarative way of creating resources, where a resource is an abstract concept. Anything can be a resource; a full web app can be a resource in Terraform, as the Terraform tutorial teaches you. When creating a provider you are hiding all the internals from the user. We are not giving them a Packer config but a provider with resources whose implementation is hidden from the users. The benefit and value of giving them this within a resource is, for me, invaluable and pretty clear.

TPI could offer Jupyter on demand or, hopefully, serve models with MLEM that can scale up with Consul.
All that exposed through a declarative language. Why this approach vs., e.g., docker compose? I think it's a matter of flavour, but we could expose resources on bare-metal machines with non-Linux OSes, something impossible with the other approach.

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 16, 2021

Also, you are assuming that TPI in the DVC use case has to be waiting for the output, and that's not the case.
DVC, like CML, should create, run and leave, implementing resourceRead to be able to retrieve the output at any state (missing in the cml_runner resource).

THIS is something that we are fixing properly with the executor. The executor takes a piece of script and is able to execute it exposing the logs that could be gathered via resourceRead at any point of the lifecycle.

@DavidGOrtega
Contributor Author

Note: written by the criminal hand who merged #70 to support SSH readiness checks for the cml_runner resource

I consider this to be perfect and desirable. We launch the machine, provision it with the runner, and check that the runner is OK; if that does not happen, the machine gets destroyed. And this is a must, because the machine is not able to destroy itself and a machine without a valid runner is nothing but a waste.

@DavidGOrtega
Contributor Author

nohup

I will only say one word. Executor.

@casperdcl
Contributor

casperdcl commented Sep 16, 2021

TPI.py needs (2), (3), and (4) depending on the user requirements. (2) is like docker run/ssh -T &/nohup, while (3)-(4) are like docker run -it/ssh.

TPI.go only needs to implement (2) I think - the rest is "easy" to implement in TPI.py over ssh, right? TPI.py can define startup.sh (which in turn does interesting things, sets up polling, calls dvc_machine.sh etc.)

I definitely think (4) is beyond the scope of TPI.go

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 16, 2021

Terraform is a declarative way of creating resources, where a resource is an abstract concept. Anything can be a resource; a full web app can be a resource in Terraform, as the Terraform tutorial teaches you. [citation needed]

I have been unable to find that tutorial, but such a tutorial seems to have an interesting take on resources. While resources can be as abstract (?) and all–encompassing as we want to define them, we should also consider that Terraform's power relies on small, indivisible resources (usually 1:1 to cloud resources) that can be mapped into a dependency graph and managed individually.

Of course, we can do this on our own and offer the same level of resiliency and optimization, but we would be (re)writing a considerable part of Terraform Core.

“Providers define individual units of infrastructure, for example compute instances or private networks, as resources. You can compose resources from different providers into reusable Terraform configurations called modules, and manage them with a consistent language and workflow” (emphasis mine)

When creating a provider you are hiding all the internals from the user. We are not giving them a Packer config but a provider with resources whose implementation is hidden from the users. The benefit and value of giving them this within a resource is, for me, invaluable and pretty clear.

Perhaps “hidden” isn't the most appropriate word and “resource” isn't the most adequate abstraction, but I definitely see the value of simplifying infrastructure so users don't have to define everything by themselves. See also Terraform Modules for a widely adopted way of doing something pretty similar.

TPI could offer Jupyter on demand or, hopefully, serve models with MLEM that can scale up with Consul.
All that exposed through a declarative language. Why this approach vs., e.g., docker compose? I think it's a matter of flavour, but we could expose resources on bare-metal machines with non-Linux OSes, something impossible with the other approach.

While this deviates considerably from the initial issue topic, it's definitely worth another thread. It would be awesome to use this provider (or whatever it becomes in the mid term) for all our cloud computing needs, both for training and serving.

Assuming you mean docker–machine (?), containers versus virtual machines is still open to debate, especially after we chose to support Kubernetes.

Also, you are assuming that TPI in the DVC use case has to be waiting for the output, and that's not the case.

If we used metadata to wait for completion (the original DVC requirement), the provider would indeed have to wait until the script finishes, or else we would have to implement vendor–specific metadata support in DVC or PyTPI.

If we choose SSH as a generic method, that's not the case, as PyTPI can be in charge of polling the resource without having to implement any vendor–specific code nor forcing the provider to wait for 4 (completion) as per the definitions above.

DVC, like CML, should create, run and leave, implementing resourceRead to be able to retrieve the output at any state (missing in the cml_runner resource).

Not sure if ReadFunc is appropriate to retrieve the script output. 🤔 Nevertheless, we probably should implement it for cloud resources.

THIS is something that we are fixing properly with the executor. The executor takes a piece of script and is able to execute it exposing the logs that could be gathered via resourceRead at any point of the lifecycle.

Isn't this what container orchestrators do?

I consider this to be perfect and desirable. We launch the machine, provision it with the runner, and check that the runner is OK; if that does not happen, the machine gets destroyed. And this is a must, because the machine is not able to destroy itself and a machine without a valid runner is nothing but a waste.

Agreed, but then we're resorting to stateless provisioning and Terraform wouldn't be of any use. 🤔

@DavidGOrtega
Contributor Author

Agreed, but then we're resorting to stateless provisioning and Terraform wouldn't be of any use.

I do not understand what you mean

@DavidGOrtega
Contributor Author

small, manageable resources (usually 1:1 to cloud resources) that can be mapped into a dependency graph and managed individually.

Is not cml_runner resource a perfect example of that definition? Can I not launch X runners with a Terraform file in any vendor?

@0x2b3bfa0
Member

I do not understand what you mean

Terraform was designed to manage infrastructure in a declarative way. Users specify the final state they want to achieve, and the Terraform Core engine takes care of performing and retrying lots of requests in the least possible amount of time until it either achieves the desired state, times out, or encounters a non–retriable failure.

If we create a single “black box” resource that takes care internally of all the API calls, we can't take advantage of Terraform to manage drift and creation errors. If our provider gets interrupted somewhere in–between the network creation and the instance creation, we have no way of knowing what has been created and what has failed, and no way of fixing it.

Moreover, we aren't taking advantage of the Terraform state file either, and treating it as something ephemeral that can be thrown away after the destroy process, whose success we're taking for granted. Terraform was designed to manage static infrastructure, and wasn't designed to tolerate interruptions during the apply process nor extreme infrastructure drift.

Is not cml_runner resource a perfect example of that definition? Can I not launch X runners with a Terraform file

Yes, if it were just a resource on the GitHub Terraform provider and its only responsibilities were registering a runner on a repository/organization and returning the runner token as a computed field.

in any vendor?

Here is the discrepancy: now we're performing a dozen API calls from a single resource to a different service (a user–chosen public cloud) and, if something goes awry, we have to handle it manually with lots of code, because Terraform can't do anything about it, nor know whether the failure was in one resource or another, because we're packing everything together.

@0x2b3bfa0
Member

Posted the previous message just to avoid losing it. I'm putting this conversation on hold because there are so many p0 issues waiting for me.

@casperdcl added the cloud-common (Applies to every cloud vendor) label and removed the cloud-new (New cloud support request) label on Nov 24, 2021
@0x2b3bfa0
Member

Closing in favor of #315 because:

  1. GitHub runners no longer need a readiness waiter; see Remove useless runner waiting code #315.
  2. After Integrate task resource #237, the “obvious” way of communicating the runner status is through the object storage.

@DavidGOrtega
Contributor Author

GitHub runners no longer need a readiness waiter

As stated in #315, that's not true. The main reason has always been to be able to log the process of registering the runner.

Aside from that, this feature was created to make sure that we control the finalisation of the startup script; it's indeed a feature already requested for clouds like Azure.

@DavidGOrtega reopened this on Nov 30, 2021
@casperdcl added the p1-important (High priority) label and removed the p0-critical (Max priority, ASAP) label on Apr 1, 2022
@0x2b3bfa0 added the discussion (Waiting for team decision) and resource-runner (iterative_runner TF resource) labels and removed the p1-important (High priority) label on Apr 24, 2022
@0x2b3bfa0 linked a pull request on Apr 24, 2022 that will close this issue
@0x2b3bfa0 removed the enhancement (New feature or request) label on Apr 24, 2022
@0x2b3bfa0
Member

After #460, it looks like the current readiness check is effective enough and able to handle an unexpected failure of the script. This feature requires polling anyway, so the current SSH approach is good enough.
