Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Deep Learning VM (DLVM) images as a base image option for Caliban. #20

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

sagravat
Copy link

This contribution enables Deep Learning VM (DLVM) images from the Google Cloud AI Platform to be used as base images for execution in local, cloud, shell, and notebook modes. The notebook mode includes a Jupyterlab extension widget that allows the user to directly submit the notebook for training on the AI Platform with configurable hardware accelerators like GPUs and TPUs.

@codecov
Copy link

codecov bot commented Jun 19, 2020

Codecov Report

Merging #20 into master will decrease coverage by 0.69%.
The diff coverage is 10.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #20      +/-   ##
==========================================
- Coverage   45.67%   44.97%   -0.70%     
==========================================
  Files          17       17              
  Lines        2728     2777      +49     
==========================================
+ Hits         1246     1249       +3     
- Misses       1482     1528      +46     
Impacted Files Coverage Δ
caliban/history/utils.py 29.12% <0.00%> (-0.33%) ⬇️
caliban/main.py 22.72% <0.00%> (-1.09%) ⬇️
caliban/docker.py 23.16% <7.14%> (-2.08%) ⬇️
caliban/cloud/core.py 21.60% <20.00%> (-0.22%) ⬇️
caliban/config.py 59.87% <40.00%> (-0.66%) ⬇️
caliban/gke/utils.py 71.53% <0.00%> (-0.39%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a912775...3a417e9. Read the comment docs.

@sagravat sagravat changed the title Added Deep Learning VM (DLVM) images as a base image option for Caliban. Add Deep Learning VM (DLVM) images as a base image option for Caliban. Jun 20, 2020
Copy link
Collaborator

@sritchie sritchie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sagravat , this is great! I want to see if we can separate the logic here a bit more. Ideally, we would be able to reuse the existing run_interactive and run_notebook commands (I think that's the names), and make this PR into TWO changes.

The first is the --base_image support, and the second is the GCP notebook submitter.

The second is the GCP scheduler, which I think will work without the DLVM base images.

I'm having trouble parsing the new entrypoint that the second change requires. Is it really different than how we launch jupyterlab normally? If not, I wonder if we could actually enable this by making it easier for folks to install ANY jupyterlab extension.

Happy to chat more. Take a look and let me know if these notes make sense.

"--dlvm",
help="DLVM base image type. Must be one of "
"{}".format(dlvm_types) + ". If supplied, "
"Caliban will skip the build and push steps and use this image tag.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey, I'm going to add comments as I read, so this may be clear later! Naively I would have assumed that the DLVM would be a base image, and that you could still install dependencies on top, right? If that is true, can we call this argument --base_image?

Right now, --image_id removes any requirement.txt installation; it's truly a flag to skip any build, including even getting your code into the image.

Really, the syntax should be caliban run e2a4af785bdb...

@@ -486,7 +486,7 @@ def build_job_specs(
experiments=experiments)


def generate_image_tag(project_id, docker_args, dry_run: bool = False):
def generate_image_tag(project_id, docker_args, dlvm: str = None, dry_run: bool = False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we call it --base_image we can add that key to the generate_docker_args function here: https://github.com/google/caliban/blob/master/caliban/cli.py#L536

return _dlvm_config(job_mode).get(dlvm_arg)


def _dlvm_config(job_mode: JobMode) -> Dict[str, str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this is good. I think I see now why you have --dlvm vs --base_image. What we COULD do is have a prefix for --base_image, something like --base_image dlvm:pytorch, that would force a lookup here. That would let this feature give us general base images too.

@@ -391,9 +391,17 @@ def _notebook_entries(lab: bool = False, version: Optional[str] = None) -> str:

library = "jupyterlab" if lab else "jupyter"

return """
if not dlvm:
return """
RUN pip install {}{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, wait, good catch that we need up date this pip too! I can do that separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this --lab flag. I'm thinking that we should only do this special installation if --lab is specified. If not, even with a dlvm base image, we should just install normal jupyter. wdyt?

RUN pip install {}{}
""".format(library, version_suffix)
else:
return """
RUN /opt/conda/bin/pip install \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this only work on the deep learning VMs? and does it work on ALL the deep learning VMs?

If it works without the dlvms, maybe it needs its own flag.

@@ -511,7 +526,10 @@ def _dockerfile_template(
if base_image_fn is None:
base_image_fn = base_image_id

base_image = base_image_fn(job_mode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you instead pass

base_image_fn = lambda job_mode: _dlvm_id(job_mode, dlvm)

Then we can keep to the API.

@@ -378,7 +378,7 @@ def _credentials_entries(user_id: int,
return ret


def _notebook_entries(lab: bool = False, version: Optional[str] = None) -> str:
def _notebook_entries(lab: bool = False, version: Optional[str] = None, dlvm: bool = False) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this flag called scheduler instead? I think it doesn't depend on dlvm and COULD be a separate flag.

if inject_notebook.value != 'none':
install_lab = inject_notebook == NotebookInstall.lab
if dlvm is None:
dockerfile += _notebook_entries(lab=install_lab, version=jupyter_version, dlvm=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about remove the if/else and make the line

 dockerfile += _notebook_entries(lab=install_lab, version=jupyter_version, dlvm=bool(dlvm))


dockerfile += """

USER {uid}:{gid}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! Now that we're on python 3.6, we can actually make this

dockerfile += f"""
USER {uid}:{gid}
"""

using f-strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants