Readiness Mechanism #175

Closed
DavidGOrtega opened this issue Aug 9, 2021 · 22 comments · Fixed by #460
Labels
cloud-common (Applies to every cloud vendor), discussion (Waiting for team decision), resource-runner (iterative_runner TF resource)

Comments

@DavidGOrtega
Contributor

DavidGOrtega commented Aug 9, 2021

Right now we have a very limited mechanism to determine the readiness of a machine: We parse journalctl via ssh.

Cons

  • In the machine resource, the startup scripts are processed after the machine is already marked as ready by the vendor. So in cases where we need our stack to be ready (created via the startup script), we have to add an SSH wait mechanism to determine whether it is really ready. This behaviour is inherited from the vendors themselves, which report the ready state before executing the startup script.
  • For every resource we need to maintain extra code, and in the machine resource this is not feasible.

Proposal

We could add in the final stage of the startup script a tagging process so the TPI will wait for that tag instead of checking the vendor readiness. So instead of having

statusInput := ec2.DescribeInstancesInput{
	InstanceIds: []*string{aws.String(instanceID)},
	Filters: []*ec2.Filter{
		{
			Name:   aws.String("instance-state-name"),
			Values: []*string{aws.String("running")},
		},
	},
}

we would wait and check for the existence of such a tag
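For illustration, a minimal sketch of such a wait on AWS, assuming a hypothetical tpi-ready tag written by the startup script (shown with the AWS CLI for brevity; inside the provider it would be the equivalent Go SDK call):

# Sketch: poll until the startup script has tagged the instance as ready.
# "tpi-ready" is a placeholder key, not an agreed-upon name.
until aws ec2 describe-instances \
    --instance-ids "$INSTANCE_ID" \
    --filters "Name=tag:tpi-ready,Values=true" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text | grep --quiet .; do
  sleep 10
done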

@DavidGOrtega added the enhancement (New feature or request), p1-important (High priority) and cloud-new (New cloud support request) labels on Aug 9, 2021
@0x2b3bfa0
Member

Right now we have a very limited mechanism to determine the readiness of a machine: We parse journalctl via ssh.

Yes, and it's not precisely the most orthodox procedure.

We could add in the final stage of the startup script a tagging process so the TPI will wait for that tag instead of checking the vendor readiness. So instead of having [...] we would wait and check for the existence of such a tag.

Using tags as a liveness/readiness check doesn't seem like a widely adopted practice, and might not be especially idiomatic with some cloud vendors, even if all of them support metadata.

@0x2b3bfa0
Member

0x2b3bfa0 commented Aug 25, 2021

If logs are the "official" liveness report method for CML and it outputs machine-readable logs in JSON as part of iterative/cml#22, using containers as per #146 and parsing container logs could be fine.

@DavidGOrtega added the p0-critical (Max priority, ASAP) label and removed the p1-important (High priority) label on Sep 15, 2021
@0x2b3bfa0
Member

@DavidGOrtega, please keep in mind that we already use SSH outside of Terraform (DVC) and tags require cloud–specific code on this repository for each vendor. SSH seems like a good polling mechanism until we have a clearer view of the machine API.

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 15, 2021

@DavidGOrtega, please keep in mind that we already use SSH outside of Terraform (DVC) and tags require cloud–specific code on this repository for each vendor. SSH seems like a good polling mechanism until we have a clearer view of the machine API.

Yeah, we can do SSH, but I would prefer tagging because we would not have to rely on SSH. Additionally, it's one of the ways that many vendors actually use themselves; e.g. AWS marks spot instance termination, though unfortunately this is not the case for Azure.

The mechanism would be an internal API call that creates the tag upon script termination, while TPI waits for that tag to appear.
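A minimal sketch of that tagging step on AWS, assuming the same hypothetical tpi-ready key and an instance profile that allows ec2:CreateTags (these would be the final lines of the startup script):

# Sketch: the instance tags itself once the startup script reaches its end.
# IMDSv1 shown for brevity; the tag key is a placeholder.
INSTANCE_ID="$(curl --silent http://169.254.169.254/latest/meta-data/instance-id)"
aws ec2 create-tags --resources "$INSTANCE_ID" --tags Key=tpi-ready,Value=true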

The other solution is, as you say, to wait for a flag via SSH, but I find this weaker. Let's discuss this ASAP; we could set up a quick meeting.

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 15, 2021

I would prefer keeping SSH (more generically, running commands) until we have a better alternative, mainly because:

  • Currently, we already use SSH extensively, both internally and in the projects that consume our library: iterative/tpi, iterative/dvc.

  • Terraform doesn't expose any kind of API beyond command line, and using cloud–specific metadata for readiness, liveness, termination and untethering events can complicate the interface with the other tools, like in the tpi package. They would have to either query the cloud directly or parse the TF_LOG=info terraform apply standard output.

  • If we have to define those events and map them to the status of the custom script, it doesn't really matter – to my mind – if we use SSH or not. Polling is more or less the same and similarly reliable, and most orchestrators use a command as readiness/liveness check anyway.


By readiness, I mean that the user–specified script starts correctly and does what it's intended to do; by liveness that it's operational over time (e.g. generating a heartbeat); by termination that it has stopped (gracefully or not); and by untethering that it wishes to report the deployment as complete and will self–destroy without external aid (e.g. what cml-runner does).

It looks like @pmrowla wants to subscribe to the termination event and, no matter where and how this is implemented, somebody has to wrap the user script and do something like this:

Server side

bash user_script.sh
touch /tmp/done # or aws ··· "ready"

Client side

while ! ssh ··· stat /tmp/done; do sleep 10; done # or aws ··· "==ready"

@casperdcl
Contributor

tags

  • pro: client doesn't have to write config
  • con: "broken" user_script.sh will mean silently broken stack
  • con: not supported by every vendor

ssh

  • con: client needs to poll
  • pro: "broken" user_script.sh will mean no connection will happen
  • pro: could address pulling logs for free

ssh-with-user_script

  • pro: client doesn't have to write config (TPI.go initiates an SSH connection to run user_script.sh)
  • pro: "broken" user_script.sh will be handled by TPI.go
  • pro: could address pulling logs for free
  • likely "best" approach for https://github.com/iterative/tpi & https://github.com/iterative/dvc /CC @pmrowla?

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 15, 2021

Recapitulation

When to consider terraform apply successful

  1. When the cloud vendor acknowledges the creation of the machine (creation)
  2. When the startup_script starts running and declares itself ready (readiness)
  3. When the startup_script emits a heartbeat or responds to a periodic query (liveness)
  4. When the startup_script finishes the execution (completion)

Terraform was originally designed to deal only with 1 (creation), and save the internal state to a file as soon as it matches the desired one.

What happens if we consider 2 (readiness) or 3 (liveness)

[Ab]using the provider to consider starting the startup_script as part of the desired state considerably increases the risk of infrastructure drift for every second that terraform apply keeps running on unreliable equipment.

We're creating all the resources at once, sequentially, without taking advantage of Terraform's dependency graph and planner — which, incidentally, are the only core features of Terraform beyond printing colored text — so we can't detect (nor correct) any drift produced by interrupted runs.

Note: written by the criminal hand who merged #70 to support SSH readiness checks for the cml_runner resource, which always runs in CI/CD environments that are pretty reliable; not the case with DVC machines, launched from personal computers.

What happens if we consider 4 (completion)

What do we need Terraform for? If we keep the infrastructure tethered to a client process during all its lifecycle, we don't need state management beyond the process' memory.

This already exists and it's called Hashicorp Packer: it provisions a machine in the cloud, runs a user–specified script, streams its output with SSH, and destroys all the resources once it finishes. It just happens that it was designed to create disk images of the final state, but that functionality can be disabled.

Note: as suggested in #164 (comment), it would be possible to adopt a hybrid approach, letting cloud vendors create the resources for us and offering a resilient client that can be interrupted without consequences, but this is a broader, unrelated topic that deviates from our current code base and implies a Copernican Turn.

Suggested solution for DVC machines

It looks like DVC machines are intended to run a script, wait until it finishes running — again, 4 (completion) as per the list above — and return/stream the output or, at least, know its exit code.

is there an standard/optimal way to determine whether a terraform startup_script has finished running on some remote instance

And the obvious answer is, of course, wrapping the user–provided script as suggested on #175 (comment) and touching a flag file as soon as it stops running. Then, the client can use SSH to poll the script status by reading the flag file.

Deployment

resource "iterative_machine" "machine" {
  name = "machine"
  cloud = "aws"

  startup_script = <<-END
    #!/bin/bash
    touch /tmp/script.started
    bash &> /tmp/script.log <<< "echo this is the user script"
    echo $? > /tmp/script.exit.code
  END
}

Polling

echo waiting for the script to start running...
while ! ssh ··· stat /tmp/script.started; do sleep 10; done
echo waiting for the script to finish running...
while ! ssh ··· tail --follow /tmp/script.log; do sleep 10; done
echo retrieving the script exit status...
while ! ssh ··· cat /tmp/script.exit.code; do sleep 10; done

Note: recommending custom scripts and flag files over systemd because the latter won't run on Kubernetes and other container orchestrators.

Using cloud resource metadata instead of SSH

I like the @DavidGOrtega proposal of using metadata (tags, labels, annotations, ...) to signal the machine status. Not for internal readiness checks, but rather for users looking at the cloud console or command–line tool, and definitely not to be implemented in the short term.

Why it could not be a good fit for the DVC use case

Unfortunately, it doesn't seem especially appropriate for @pmrowla's DVC machine use case, because querying metadata requires vendor–specific code, and we would be forced to use the provider to poll for 4 (completion).

Implementing the vendor–specific polling code outside the provider (e.g. in DVC or PyTPI) doesn't seem a realistic alternative, both because of the complexity and because of the effort duplication.

Disadvantages over command execution (e.g. SSH) for internal use

Metadata is not intended to transfer logs, and we would need an external logging service or a separate SSH channel. Equally, it's not intended to run commands nor perform any kind of complex tasks.

After RFC 1149, I don't see it as impossible to implement a tag-based communication protocol, but it doesn't seem especially practical either.

Understanding ssh-with-user_script from #175 (comment)

- pro: client doesn't have to write config (TPI.go initiates an SSH connection to run user_script.sh)
- pro: "broken" user_script.sh will be handled by TPI.go
- pro: could address pulling logs for free
- likely "best" approach for https://github.com/iterative/tpi & https://github.com/iterative/dvc /CC @pmrowla?

If this provider spawns an instance first, waits for it to accept SSH (generic command execution) connections, runs the user script and streams the standard input/output/error channels back to the client, we would again be waiting for 4 (completion).

Additionally, running long, non–retriable tasks through SSH would add a new single point of failure, as SSH commands get killed as soon as the network connection fails.

If we choose this approach, we would need to nohup the process and, basically, replicate the startup_script functionality.
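Roughly, that replication would look like the sketch below, where ··· stands for the usual SSH options as in the earlier examples and the paths are placeholders:

# Sketch: detach the user script from the SSH session so that a dropped
# connection doesn't kill it, then poll/stream /tmp/script.log as shown above.
ssh ··· 'nohup bash user_script.sh > /tmp/script.log 2>&1 < /dev/null & echo $! > /tmp/script.pid'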

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 16, 2021

Additionally, it's one of the ways that many vendors actually use themselves; e.g. AWS marks spot instance termination

Are you implying that AWS tags spot instances that are being terminated with “real” resource tags? I wasn't aware of that feature and haven't found it in the documentation, but it's definitely interesting. 🤔

@pmrowla

pmrowla commented Sep 16, 2021

And the obvious answer is, of course, wrapping the user–provided script as suggested on #175 (comment) and touching a flag file as soon as it stops running. Then, the client can use SSH to poll the script status by reading the flag file.

This works fine from the DVC standpoint, and I would not expect startup_script to block terraform apply until the script is completed. (My original question in slack was mainly intended to double check whether or not something like this existed already.)

For the DVC use case, it would be better if the script wrapping was done in TPI (and was always written to some standard location by TPI) instead of relying on the user to put it in their own startup_script code.

Basically, we have a default startup_script that we will use in dvc machine, and we can do this kind of flag-file signaling on the DVC side in that default script. But we also want to support users providing their own scripts (and ideally we would not require users to always copy/paste the proper flag-file generation bits into their own scripts).
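For illustration, the wrapper generated by TPI could look roughly like this (file names and paths are placeholders, not a settled interface), so users never have to add the flag-file logic to their own scripts:

#!/bin/bash
# Hypothetical wrapper written by TPI around the user-provided script.
touch /tmp/tpi.script.started
bash /tmp/tpi.user_script.sh &> /tmp/tpi.script.log
echo $? > /tmp/tpi.script.exit.code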

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 16, 2021

What do we need Terraform for? If we keep the infrastructure tethered to a client process during all its lifecycle, we don't need state management beyond the process' memory.

Terraform is a declarative way of creating resources, where a resource is an abstract concept. Anything can be a resource; a full web app can be a resource in Terraform, as the Terraform tutorial teaches you. When creating a provider you are hiding all the internals from the user. We are not giving them a Packer config but a provider with resources whose implementation is hidden from the users. The benefit and value of giving them this within a resource is, for me, invaluable and pretty clear.

TPI could offer Jupyter on demand or, hopefully, serve models with MLEM that can scale up with Consul.
All that exposed through a declarative language. Why this approach vs., e.g., docker compose? I think it's a matter of flavour, but we could expose resources on bare-metal machines with non-Linux OSes, something impossible with the other approach.

@DavidGOrtega
Contributor Author

DavidGOrtega commented Sep 16, 2021

Also, you are assuming that TPI in the DVC use case has to be waiting for the output, and that's not the case.
DVC, like CML, should create, run and leave, implementing resourceRead to be able to retrieve the output at any state (missing in the cml_runner resource).

THIS is something that we are fixing properly with the executor. The executor takes a piece of script and is able to execute it exposing the logs that could be gathered via resourceRead at any point of the lifecycle.

@DavidGOrtega
Contributor Author

Note: written by the criminal hand who merged #70 to support SSH readiness checks for the cml_runner resource

I consider this to be perfect and desirable. We launch the machine, provision it with the runner, and check that the runner is OK; if that does not happen, the machine gets destroyed. And this is a must, because the machine is not able to destroy itself and a machine without a valid runner is nothing but a waste.

@DavidGOrtega
Contributor Author

nohup

I will only say one word. Executor.

@casperdcl
Contributor

casperdcl commented Sep 16, 2021

TPI.py needs (2), (3), and (4) depending on the user requirements. (2) is like docker run/ssh -T &/nohup, while (3)-(4) are like docker run -it/ssh.

TPI.go only needs to implement (2) I think - the rest is "easy" to implement in TPI.py over ssh, right? TPI.py can define startup.sh (which in turn does interesting things, sets up polling, calls dvc_machine.sh etc.)

I definitely think (4) is beyond the scope of TPI.go

@0x2b3bfa0
Member

0x2b3bfa0 commented Sep 16, 2021

Terraform is a declarative way of creating resources, where a resource is an abstract concept. Anything can be a resource; a full web app can be a resource in Terraform, as the Terraform tutorial teaches you. [citation needed]

I have been unable to find that tutorial, but such a tutorial seems to have an interesting take on resources. While resources can be as abstract (?) and all–encompassing as we want to define them, we should also consider that Terraform's power relies on small, indivisible resources (usually 1:1 to cloud resources) that can be mapped into a dependency graph and managed individually.

Of course, we can do this on our own and offer the same level of resiliency and optimization, but we would be (re)writing a considerable part of Terraform Core.

“Providers define individual units of infrastructure, for example compute instances or private networks, as resources. You can compose resources from different providers into reusable Terraform configurations called modules, and manage them with a consistent language and workflow” (emphasis mine)

When creating a provider you are hiding all the internals from the user. We are not giving them a Packer config but a provider with resources whose implementation is hidden from the users. The benefit and value of giving them this within a resource is, for me, invaluable and pretty clear.

Perhaps “hidden” isn't the most appropriate word and “resource” isn't the most adequate abstraction, but I definitely see the value of simplifying infrastructure so users don't have to define everything by themselves. See also Terraform Modules for a widely adopted way of doing something pretty similar.

TPI could offer Jupyter on demand or, hopefully, serve models with MLEM that can scale up with Consul.
All that exposed through a declarative language. Why this approach vs., e.g., docker compose? I think it's a matter of flavour, but we could expose resources on bare-metal machines with non-Linux OSes, something impossible with the other approach.

While this deviates considerably from the initial issue topic, it's definitely worth another thread. It would be awesome to use this provider (or whatever it becomes in the mid term) for all our cloud computing needs, both for training and serving.

Assuming you mean docker–machine (?), containers versus virtual machines is still open to debate, especially after we chose to support Kubernetes.

Also, you are assuming that TPI in the DVC use case has to be waiting for the output, and that's not the case.

If we used metadata to wait for completion (the original DVC requirement), the provider would indeed have to wait until the script finishes, or else we would have to implement vendor–specific metadata support in DVC or PyTPI.

If we choose SSH as a generic method, that's not the case, as PyTPI can be in charge of polling the resource without having to implement any vendor–specific code nor forcing the provider to wait for 4 (completion) as per the definitions above.

DVC, like CML, should create, run and leave, implementing resourceRead to be able to retrieve the output at any state (missing in the cml_runner resource).

Not sure if ReadFunc is appropriate to retrieve the script output. 🤔 Nevertheless, we probably should implement it for cloud resources.

THIS is something that we are fixing properly with the executor. The executor takes a piece of script and is able to execute it exposing the logs that could be gathered via resourceRead at any point of the lifecycle.

Isn't this what container orchestrators do?

I consider this to be perfect and desirable. We launch the machine, provision it with the runner, and check that the runner is OK; if that does not happen, the machine gets destroyed. And this is a must, because the machine is not able to destroy itself and a machine without a valid runner is nothing but a waste.

Agreed, but then we're resorting to stateless provisioning and Terraform wouldn't be of any use. 🤔

@DavidGOrtega
Contributor Author

Agreed, but then we're resorting to stateless provisioning and Terraform wouldn't be of any use.

I do not understand what you mean

@DavidGOrtega
Contributor Author

small, manageable resources (usually 1:1 to cloud resources) that can be mapped into a dependency graph and managed individually.

Is not cml_runner resource a perfect example of that definition? Can I not launch X runners with a Terraform file in any vendor?

@0x2b3bfa0
Member

I do not understand what you mean

Terraform was designed to manage infrastructure in a declarative way. Users specify the final state they want to achieve, and the Terraform Core engine takes care of performing and retrying lots of requests in the least possible amount of time until it either achieves the desired state, times out, or encounters a non–retriable failure.

If we create a single “black box” resource that takes care internally of all the API calls, we can't take advantage of Terraform to manage drift and creation errors. If our provider gets interrupted somewhere in–between the network creation and the instance creation, we have no way of knowing what has been created and what has failed, and no way of fixing it.

Moreover, we aren't taking advantage of the Terraform state file either, and treating it as something ephemeral that can be thrown away after the destroy process, whose success we're taking for granted. Terraform was designed to manage static infrastructure, and wasn't designed to tolerate interruptions during the apply process nor extreme infrastructure drift.

Is not cml_runner resource a perfect example of that definition? Can I not launch X runners with a Terraform file

Yes, if it were just a resource on the GitHub Terraform provider and its only responsibilities were registering a runner on a repository/organization and returning the runner token as a computed field.

in any vendor?

Here is the discrepancy: now we're performing a dozen API calls from a single resource to a different service (a user–chosen public cloud) and, if something goes awry, we have to handle it manually with lots of code, because Terraform can't do anything about it, nor know whether the failure was in one resource or another, because we're packing everything together.

@0x2b3bfa0
Member

Posted the previous message just to avoid losing it. I'm putting this conversation on hold because there are so many p0 issues waiting for me.

@casperdcl added the cloud-common (Applies to every cloud vendor) label and removed the cloud-new (New cloud support request) label on Nov 24, 2021
@0x2b3bfa0
Member

Closing in favor of #315 because:

  1. GitHub runners no longer need a readiness waiter; see Remove useless runner waiting code #315.
  2. After Integrate task resource #237, the “obvious” way of communicating the runner status is through the object storage.

@DavidGOrtega
Contributor Author

GitHub runners no longer need a readiness waiter

As stated in #315, that's not true. The main reason has always been to be able to log the process of registering the runner.

Aside from that, this feature was created to make sure that we control the finalisation of the startup script; it's indeed a feature already requested for clouds like Azure.

@DavidGOrtega reopened this on Nov 30, 2021
@casperdcl added the p1-important (High priority) label and removed the p0-critical (Max priority, ASAP) label on Apr 1, 2022
@0x2b3bfa0 added the discussion (Waiting for team decision) and resource-runner (iterative_runner TF resource) labels and removed the p1-important (High priority) label on Apr 24, 2022
@0x2b3bfa0 linked a pull request on Apr 24, 2022 that will close this issue
@0x2b3bfa0 removed the enhancement (New feature or request) label on Apr 24, 2022
@0x2b3bfa0
Member

After #460, it looks like the current readiness check is effective enough and able to handle an unexpected failure of the script. This feature requires polling anyway, so the current SSH approach is good enough.
