Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc machine: implement initial terraform support #6462

Merged
merged 20 commits into from
Aug 27, 2021

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Aug 19, 2021

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Related to #6440
Related to: #6263

To test:

Currently the only supported TPI config options are name and cloud (both required as a part of dvc executor add). Eventually all of the TPI options should be supported in .dvc/config

@pmrowla pmrowla added the A: executors Related to the executors feature label Aug 19, 2021
@pmrowla pmrowla requested a review from a team as a code owner August 19, 2021 08:12
@pmrowla pmrowla requested a review from efiop August 19, 2021 08:12
@pmrowla pmrowla self-assigned this Aug 19, 2021
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 19, 2021

asciicast

@pmrowla pmrowla requested a review from dberenbaum August 19, 2021 08:27
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 19, 2021

Not really sure what we want to do with the terraform CLI output. It seems like something we should suppress, but there's no other useful way for us to display progress since we have to wrap their CLI (and operations to manage cloud machines will generally be slow)

I left out a dvc executor list implementation for now since I'm not sure how we want to differentiate between listing what's in the DVC repo config (like remote list) and actually listing state for whether or not instances of machines from the config have been started/destroyed


@dberenbaum take a look and see what you think so far

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 19, 2021

I'm also not sold on the executor naming here. Something like machine (like how docker uses docker-machine) might be better, since as it stands right now there is overlap between dvc.executor as in "cloud machine executor instance" and dvc.repo.experiments.executor as in "experiment pipeline executor"

I think the eventual workflow will basically be something along the lines of

machine = Executor(Machine)Backend.init(machine_name)
executor = machine.get_executor()  # return a pipeline/experiment/generic command executor instance that will run on the desired machine
executor.reproduce()

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 19, 2021

Also we probably don't actually need the jinja dependency, since the tf files we will generating are simple enough, but I left it in for now since the initial dvc-tpi example also used it.

@pmrowla pmrowla marked this pull request as draft August 19, 2021 08:40
@pmrowla pmrowla removed the request for review from efiop August 19, 2021 08:40
@dberenbaum
Copy link
Collaborator

Great stuff!

Responses to your comments:

Not really sure what we want to do with the terraform CLI output. It seems like something we should suppress, but there's no other useful way for us to display progress since we have to wrap their CLI (and operations to manage cloud machines will generally be slow)

Not sure either. Should it take a long time to create/destroy machines? Maybe @skshetry can weigh in here.

I left out a dvc executor list implementation for now since I'm not sure how we want to differentiate between listing what's in the DVC repo config (like remote list) and actually listing state for whether or not instances of machines from the config have been started/destroyed

Why not start with just the repo config and we can add more later? Even if the output format has to change, I don't see much downside unless I'm missing something. See also https://www.notion.so/iterative/Remote-Executors-215a0767aa0341d892b52767e58e2ac6#2281a6b533e4433496d0dfb8db6bf96b for ideas.

I'm also not sold on the executor naming here. Something like machine (like how docker uses docker-machine) might be better, since as it stands right now there is overlap between dvc.executor as in "cloud machine executor instance" and dvc.repo.experiments.executor as in "experiment pipeline executor"

No strong feelings here. We can get some other opinions on this before we commit to the name.

Also we probably don't actually need the jinja dependency, since the tf files we will generating are simple enough, but I left it in for now since the initial dvc-tpi example also used it.

Okay. We might need jinja anyway when we introduce dvc exp init. I forgot that it's not already used for plots.


Other thoughts:

  • python-terraform is only available up to 0.10.1 on pypi (https://pypi.org/simple/python-terraform/), meaning that the pip install fails for me. It's not hard to work around, but it raises questions about this library, which I know you mentioned as well. 0.10.1 was released in June 2019 (https://github.com/beelit94/python-terraform/tags) and there's only one other release since then, which wasn't even uploaded to pypi.
  • Do we need dvc executor default? How would this work?
  • If a user has an existing terraform resource or directory with a main.tf (and possibly state files), is there a way for us to manage this? Not sure we need to support this yet, but we should consider whether it will be easy to implement.
  • Why make cloud a positional argument? This might relate to the previous bullet and making sure we retain flexibility.
  • Should we add a rename subcommand to be consistent with remote?
  • Is init a good name for what it does? It looks like it does terraform init && terraform apply. Is this going to be confusing for terraform users to have init actually create instances? Maybe start or create make sense?

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 20, 2021

Why not start with just the repo config and we can add more later? Even if the output format has to change, I don't see much downside unless I'm missing something. See also https://www.notion.so/iterative/Remote-Executors-215a0767aa0341d892b52767e58e2ac6#2281a6b533e4433496d0dfb8db6bf96b for ideas.

This is fine with me, but this type of stuff can also be split out into separate tasks

  • Do we need dvc executor default? How would this work?

I'm not really sure, it will depend on where we want to go with usage I think. I was thinking along the lines of local is basically a default (and currently the only option in dvc), but users might not want that behavior. But it may make more sense for users to be specifying alternative default executors on a per-pipeline or per-stage basis in dvc.yaml like we've discussed before.

  • If a user has an existing terraform resource or directory with a main.tf (and possibly state files), is there a way for us to manage this? Not sure we need to support this yet, but we should consider whether it will be easy to implement.

We will probably want a executor.terraform_dir or similar config option, that would allow users to specify where their existing terraform environment is (and override the use of .dvc/tmp/exec/terraform

  • Why make cloud a positional argument? This might relate to the previous bullet and making sure we retain flexibility.

Mostly just for config schema validation reasons. It's possible to modify it with dvc executor modify myexecutor cloud azure after the fact (the same way you can modify remote URLs after the fact). But I'm not opposed to changing the usage around.

  • Should we add a rename subcommand to be consistent with remote?

It seems like we will need one, but I left it out for now for the same reasons as list. I don't think executor rename can be a pure config command - it will probably need to modify terraform state to rename existing machines that have already been started (via terraform mv)

  • Is init a good name for what it does? It looks like it does terraform init && terraform apply. Is this going to be confusing for terraform users to have init actually create instances? Maybe start or create make sense?

init and destroy make sense to me from a "design a generic workflow that in theory could run on a backend other than terraform" standpoint, but start or create would work too.

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 20, 2021

Also, I should note that for anyone testing this, I'm pretty sure that creating more than one executor will not work properly since the current main.tf generation just replaces your entire main.tf file instead of properly adding/removing additional resources to an existing tf file.

(this will need to be addressed in the future)

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 20, 2021

Also, regarding jinja, we could also drop it if we opt to make DVC-generated terraform configs non-human-readable (or at least less readable) and output them as JSON (.tf.json) instead of terraform HCL (.tf).

I guess the relevant question here would be whether or not we intentionally want the DVC-generated terraform configs to be user-editable. Users are always going to be able to manually modify the files, but writing them in JSON will make it more obvious that they are not supposed to be manually edited (and that users should use dvc executor ... to manage them instead)

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 20, 2021

Regarding managing multiple executors, it seems like what we will need is separate terraform directories per executor (so .dvc/tmp/exec/terraform/<executor_name>/main.tf instead of a single top level main.tf containing multiple resources.

We want to individually provision/destroy the machines per DVC executor (i.e. a single terraform resource), but terraform is intended to be used on a per-configuration (encompassing all possible resources) basis rather than a per-resource basis. apply/destroy -target=... does support targeting individual resources, but the docs and the CLI include warnings about how you aren't supposed to use terraform in that way.

The thing to note here would be that doing separate configurations per executor means we will have multiple terraform states to manage rather than a single state w/all of our possible resources.

@dberenbaum
Copy link
Collaborator

This is fine with me, but this type of stuff can also be split out into separate tasks

Sounds good.

I'm not really sure, it will depend on where we want to go with usage I think. I was thinking along the lines of local is basically a default (and currently the only option in dvc), but users might not want that behavior. But it may make more sense for users to be specifying alternative default executors on a per-pipeline or per-stage basis in dvc.yaml like we've discussed before.

Consistent with above, let's leave it out for now and come back to it.

Mostly just for config schema validation reasons. It's possible to modify it with dvc executor modify myexecutor cloud azure after the fact (the same way you can modify remote URLs after the fact). But I'm not opposed to changing the usage around.

My concern with the usage is that there could be times when cloud is not a relevant option (such as connecting to an existing resource like an internal host).

It seems like we will need one, but I left it out for now for the same reasons as list. I don't think executor rename can be a pure config command - it will probably need to modify terraform state to rename existing machines that have already been started (via terraform mv)

We can leave rename for a separate future task.

Regarding managing multiple executors, it seems like what we will need is separate terraform directories per executor (so .dvc/tmp/exec/terraform/<executor_name>/main.tf instead of a single top level main.tf containing multiple resources.

πŸ‘

@casperdcl
Copy link
Contributor

casperdcl commented Aug 23, 2021

/CC @iterative/cml

@0x2b3bfa0
Copy link
Member

  • pip install tpi could handle downloading tpi.go & terraform.go binaries (or even compiling them)

If we plan to stick with Go, as per iterative/terraform-provider-iterative#164 (comment), we could provide a Python wrapper package including precompiled wheels.

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 24, 2021

My concern with the usage is that there could be times when cloud is not a relevant option (such as connecting to an existing resource like an internal host).

In this case I think cloud would be ssh, and then we would have ssh options specific to DVC for hostname/port (since it's not an terraform-provider-iterative "cloud")

@@ -224,10 +225,16 @@ class RelPath(str):
"row_limit": All(Coerce(int), Range(1)), # obsoleted
"row_cleanup_quota": All(Coerce(int), Range(0, 100)), # obsoleted
},
"machine": {
str: {
"cloud": All(Lower, Choices("aws", "azure")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe

Suggested change
"cloud": All(Lower, Choices("aws", "azure")),
"cloud": All(Lower, Choices("aws", "azure", "gcp", "kubernetes")),

/CC @iterative/cml

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also

CLOUD_BACKENDS = {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do gcp and kubernetes work with the existing TPI? I was just going off of what's documented in the readme right now

The Iterative Provider currently supports AWS and Azure. Google Cloud Platform is not currently supported.

Copy link
Contributor Author

@pmrowla pmrowla Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, this kind of thing can also be addressed later (#6480)

This PR is not meant to be a complete/finished implementation in any way

The main purpose here is to get the general API/workflow laid out so we aren't blocking other DVC team members that will be filling out the actual implementation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes gcp & kubernetes are in the same category as dvclive - they should work even though the docs aren't there yet.

setup.py Show resolved Hide resolved
@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Aug 24, 2021

In this case I think cloud would be ssh, and then we would have ssh options specific to DVC for hostname/port (since it's not an terraform-provider-iterative "cloud")

The question is, should we support DVC logging in through SSH to an unknown environment and acting like it knows what is there and how to find it? Cloud execution is "easy" because we provide the system images and know how to provision them, but accessing a custom system is a completely different scenario. We can provide reference Ansible roles for do it yourself users, though.

I'm inclined to think that Kubernetes should be our go–to abstraction layer for on–premises executor support; trying to operate at lower levels could be risky for maintenance, support and user experience.

@dberenbaum
Copy link
Collaborator

We aren't going to ssh into an existing machine anytime soon most likely, but keeping it in mind as a possible option is what matters for now.

Both ssh and kubernetes seem like odd values to provide for the cloud argument IMO. I guess we are copying that from the terraform-provider-iterative syntax. It seems more natural to call it something more generic like provider since that's the equivalent in terraform. Not a blocker, maybe it's just me 🀷 .

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 25, 2021

Both ssh and kubernetes seem like odd values to provide for the cloud argument IMO. I guess we are copying that from the terraform-provider-iterative syntax. It seems more natural to call it something more generic like provider since that's the equivalent in terraform. Not a blocker, maybe it's just me 🀷 .

Yeah, it's cloud because that's the name of the TPI option. Basically I think the .dvc/config options should 1:1 map with the TPI ones. It doesn't really make sense for us to say "dvc config should have provider = aws but if you are using TPI yourself you need to use cloud = aws"

@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 25, 2021

I'm inclined to think that Kubernetes should be our go–to abstraction layer for on–premises executor support; trying to operate at lower levels could be risky for maintenance, support and user experience.

It would be better to have this kind of discussion in notion - in this case there's a dedicated story for managing existing/custom resources, but it's not a requirement that we are planning on addressing any time in the near future.

@pmrowla pmrowla marked this pull request as ready for review August 25, 2021 08:15
@pmrowla pmrowla changed the title [WIP] dvc machine: implement initial terraform support dvc machine: implement initial terraform support Aug 25, 2021
@casperdcl
Copy link
Contributor

windows tests should pass now in theory :) iterative/terraform-provider-iterative#179 (comment)

@pmrowla pmrowla requested a review from efiop August 27, 2021 03:41
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 27, 2021

This initial PR should be ready to merge, parts of dvc/tpi and dvc/machine/backend/... will likely be split into a separate python package at some point (as a general python wrapper for TPI/terraform) but for now it can stay in DVC so the rest of the remote execution work isn't blocked

@efiop efiop merged commit 4ee9829 into iterative:master Aug 27, 2021
@efiop efiop added the feature is a feature label Aug 27, 2021
@dberenbaum
Copy link
Collaborator

How should we document dvc machine while its in development? Do we want another wiki?

@pmrowla pmrowla deleted the terraform-poc branch August 30, 2021 00:46
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 30, 2021

Using a wiki page makes the most sense for now I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: executors Related to the executors feature feature is a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants