docs: add k8s sec diagram (#9593)
tara-hpe authored Jul 1, 2024
1 parent 7e2e3f6 commit 462ed58
Showing 7 changed files with 146 additions and 27 deletions.
Binary file added docs/assets/images/_det-ai-sys-k8s-01-dark.png
Binary file added docs/assets/images/_det-ai-sys-k8s-01-light.png
66 changes: 39 additions & 27 deletions docs/setup-cluster/k8s/_index.rst
@@ -4,11 +4,10 @@
Deploy on Kubernetes
######################

This document describes how the Determined runs on `Kubernetes <https://kubernetes.io/>`__. For
instructions on installing Determined on Kubernetes, see the :ref:`installation guide
<install-on-kubernetes>`.
This document describes how Determined runs on Kubernetes. For instructions on installing Determined
on Kubernetes, see the :ref:`installation guide <install-on-kubernetes>`.

In this topic guide, we will cover:
This guide covers:

#. How Determined works on Kubernetes.
#. Limitations of Determined on Kubernetes.
@@ -19,44 +18,57 @@ In this topic guide, we will cover:
************************************

:ref:`Installing Determined on Kubernetes <install-on-kubernetes>` deploys an instance of the
Determined master and a Postgres database in the Kubernetes cluster. Once the master is up and
running, you can launch :ref:`experiments <experiments>`, :ref:`notebooks <notebooks>`,
:ref:`TensorBoards <tensorboards>`, :ref:`commands <commands-and-shells>`, and :ref:`shells
<commands-and-shells>`. When new workloads are submitted to the Determined master, the master
launches jobs and config maps on the Kubernetes cluster to execute those workloads. Users of
Determined shouldn't need to interact with Kubernetes directly after installation, as Determined
handles all the necessary interaction with the Kubernetes cluster. Kubernetes creates and cleans up
pods for all jobs that Determined may request.

It is also important to note that when running Determined on Kubernetes, a higher priority value
means a higher priority (e.g. a priority 50 task will run before a priority 40 task). This is
different from priority scheduling in non-Kubernetes deployments, where lower priority values mean a
higher priority (e.g. a priority 40 task will run before a priority 50 task).
Determined master and a Postgres database in the Kubernetes cluster.

.. image:: /assets/images/_det-ai-sys-k8s-01-dark.png
   :class: only-dark
   :alt: Determined AI system architecture diagram describing how the master node works on Kubernetes in dark mode

.. image:: /assets/images/_det-ai-sys-k8s-01-light.png
   :class: only-light
   :alt: Determined AI system architecture diagram describing how the master node works on Kubernetes in light mode

|
Once the master is running, you can launch :ref:`experiments <experiments>`, :ref:`notebooks
<notebooks>`, :ref:`TensorBoards <tensorboards>`, :ref:`commands <commands-and-shells>`, and
:ref:`shells <commands-and-shells>`. When new workloads are submitted to the Determined master, the
master launches jobs and config maps on the Kubernetes cluster to execute those workloads. Users do
not need to interact with Kubernetes directly after installation, as Determined handles all the
necessary interaction with the Kubernetes cluster. Kubernetes creates and cleans up pods for all
jobs requested by Determined.

.. note::

   When running Determined on Kubernetes, a higher priority value means a higher priority (e.g., a
   priority 50 task will run before a priority 40 task). This is different from non-Kubernetes
   deployments, where lower priority values mean higher priority (e.g., a priority 40 task will run
   before a priority 50 task).
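
The inverted priority rule in the note can be illustrated with a small sketch. ``run_order`` is a hypothetical helper written for this illustration and is not part of Determined; it only models the ordering rule stated above and ignores resource availability, gang scheduling, and preemption.

```python
from typing import List, Tuple


def run_order(tasks: List[Tuple[str, int]], on_kubernetes: bool) -> List[str]:
    """Return task names in the order the stated priority rule favors them.

    Hypothetical helper for illustration only; not a Determined API.
    """
    # On Kubernetes a larger value wins; on other deployments a smaller value wins.
    ordered = sorted(tasks, key=lambda task: task[1], reverse=on_kubernetes)
    return [name for name, _ in ordered]


tasks = [("task-a", 40), ("task-b", 50)]
print(run_order(tasks, on_kubernetes=True))   # ['task-b', 'task-a']
print(run_order(tasks, on_kubernetes=False))  # ['task-a', 'task-b']
```

The same pair of tasks runs in opposite order depending on the deployment type, which is worth checking when moving experiment configurations between clusters.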

.. _limitations-on-kubernetes:

***************************
Limitations on Kubernetes
***************************

This section outlines the current limitations of Determined on Kubernetes.

Scheduling
==========

By default, the Kubernetes scheduler does not support gang scheduling or preemption. This can be
problematic for distributed deep learning workloads that require multiple pods to be scheduled
before execution starts. Determined includes built-in support for the `lightweight coscheduling
plugin <https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__,
which extends the default Kubernetes scheduler to support gang scheduling. Determined also includes
support for priority-based preemption scheduling. Neither are enabled by default. For more details
and instructions on how to enable the coscheduling plugin, refer to
:ref:`gang-scheduling-on-kubernetes` and :ref:`priority-scheduling-on-kubernetes`.
before execution starts.

Determined includes built-in support for the `lightweight coscheduling plugin
<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__, which
extends the default Kubernetes scheduler to support gang scheduling. Determined also supports
priority-based preemption scheduling. Neither feature is enabled by default. For more details and
instructions on how to enable the coscheduling plugin, refer to :ref:`gang-scheduling-on-kubernetes`
and :ref:`priority-scheduling-on-kubernetes`.

Dynamic Agents
==============

Determined is not able to autoscale your cluster, but equivalent functionality is available by using
Determined cannot autoscale your cluster. However, equivalent functionality is available by using
the `Kubernetes Cluster Autoscaler
<https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler>`_, which is supported on
`GKE <https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler>`_ and `EKS
@@ -78,7 +90,7 @@ root. For more information, see: :ref:`run-as-user`.

`kubectl <https://kubernetes.io/docs/tasks/tools/>`_ is a command-line tool for interacting with a
Kubernetes cluster. `Helm <https://helm.sh/docs/helm/helm_install/>`_ is used to install and upgrade
Determined on Kubernetes. This section covers some of the useful kubectl and helm commands when
Determined on Kubernetes. This section covers some useful ``kubectl`` and ``helm`` commands when
:ref:`running Determined on Kubernetes <install-on-kubernetes>`.

For all the commands listed below, include ``-n <kubernetes namespace name>`` if running Determined
Empty file.
1 change: 1 addition & 0 deletions model_hub/model_hub/__version__ 2.py
@@ -0,0 +1 @@
__version__ = "0.34.1-dev0"
Empty file added model_hub/model_hub/py 2.typed
Empty file.
106 changes: 106 additions & 0 deletions model_hub/model_hub/utils 2.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
import logging
import os
import typing
import urllib.parse
from typing import Any, Dict, List, Union

import filelock
import numpy as np
import requests
import torch


def expand_like(arrays: List[np.ndarray], fill: float = -100) -> np.ndarray:
    """
    Stacks a list of arrays along the first dimension; the arrays are allowed to differ in
    the second dimension but should match for dim > 2.
    The output will have dimension
    (sum([l.shape[0] for l in arrays]), max([l.shape[1] for l in arrays]), ...)
    For arrays that have fewer entries in the second dimension than the max, we will
    pad with the fill value.
    Args:
        arrays: List of np.ndarray to stack along the first dimension
        fill: Value to fill in when padding to max size in the second dimension
    Returns:
        stacked array
    """
    full_shape = list(arrays[0].shape)
    if len(full_shape) == 1:
        return np.concatenate(arrays)
    full_shape[0] = sum(a.shape[0] for a in arrays)
    full_shape[1] = max(a.shape[1] for a in arrays)
    result = np.full(full_shape, fill)
    row_offset = 0
    for a in arrays:
        result[row_offset : row_offset + a.shape[0], : a.shape[1]] = a
        row_offset += a.shape[0]
    return result


def numpify(x: Union[List, np.ndarray, torch.Tensor]) -> np.ndarray:
    """
    Converts List or torch.Tensor to numpy.ndarray.
    """
    if isinstance(x, np.ndarray):
        return x
    if isinstance(x, List):
        return np.array(x)
    if isinstance(x, torch.Tensor):
        return x.cpu().numpy()  # type: ignore
    raise TypeError("Expected input of type List, np.ndarray, or torch.Tensor.")


def download_url(download_directory: str, url: str) -> str:
    url_path = urllib.parse.urlparse(url).path
    basename = url_path.rsplit("/", 1)[1]

    os.makedirs(download_directory, exist_ok=True)
    filepath = os.path.join(download_directory, basename)
    lock = filelock.FileLock(filepath + ".lock")

    with lock:
        if not os.path.exists(filepath):
            logging.info("Downloading {} to {}".format(url, filepath))

            r = requests.get(url, stream=True)
            with open(filepath, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
    return filepath


def compute_num_training_steps(experiment_config: Dict, global_batch_size: int) -> int:
    max_length_unit = list(experiment_config["searcher"]["max_length"].keys())[0]
    max_length: int = experiment_config["searcher"]["max_length"][max_length_unit]
    if max_length_unit == "batches":
        return max_length
    if max_length_unit == "epochs":
        if "records_per_epoch" in experiment_config:
            return max_length * int(experiment_config["records_per_epoch"] / global_batch_size)
        raise Exception(
            "Missing num_training_steps hyperparameter in the experiment "
            "configuration, which is needed to configure the learning rate scheduler."
        )
    # Otherwise, max_length_unit=='records'
    return int(max_length / global_batch_size)


class AttrDict(dict):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.__dict__ = self
        for key in self.keys():
            if isinstance(self[key], dict):
                self[key] = AttrDict(self[key])

    if typing.TYPE_CHECKING:

        def __getattr__(self, item: Any) -> Any:
            return True

        def __setattr__(self, item: Any, value: Any) -> None:
            return None
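
A minimal usage sketch for `AttrDict`, assuming the class body shown in the diff above. The class is copied here without the `typing.TYPE_CHECKING` stubs (which only affect static type checkers) so the snippet runs on its own; the `cfg` contents are made-up illustration values.

```python
from typing import Any


class AttrDict(dict):
    """Copy of the class from the diff, minus the type-checking-only stubs."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        # Share storage between item access and attribute access.
        self.__dict__ = self
        for key in self.keys():
            if isinstance(self[key], dict):
                self[key] = AttrDict(self[key])


# Illustrative values only.
cfg = AttrDict({"model": {"name": "bert", "layers": 12}, "lr": 0.001})

print(cfg.model.name)       # nested dicts passed to the constructor are converted
print(cfg["lr"] == cfg.lr)  # item access and attribute access see the same data

cfg.epochs = 3              # attribute assignment also writes the dict entry
print(cfg["epochs"])

cfg.extra = {"a": 1}        # dicts assigned later are NOT converted
print(isinstance(cfg.extra, AttrDict))
```

Conversion to `AttrDict` happens only in `__init__`, so a plain `dict` assigned after construction keeps item-style access only. Dicts nested inside lists are likewise left unconverted.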

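A worked example for `compute_num_training_steps`, assuming the function body shown in the diff above (copied here so the snippet is self-contained). The experiment configurations and batch size are made-up illustration values, not taken from any real experiment.

```python
from typing import Dict


def compute_num_training_steps(experiment_config: Dict, global_batch_size: int) -> int:
    """Copy of the function from the diff, reproduced for a standalone example."""
    max_length_unit = list(experiment_config["searcher"]["max_length"].keys())[0]
    max_length: int = experiment_config["searcher"]["max_length"][max_length_unit]
    if max_length_unit == "batches":
        return max_length
    if max_length_unit == "epochs":
        if "records_per_epoch" in experiment_config:
            return max_length * int(experiment_config["records_per_epoch"] / global_batch_size)
        raise Exception(
            "Missing num_training_steps hyperparameter in the experiment "
            "configuration, which is needed to configure the learning rate scheduler."
        )
    # Otherwise, max_length_unit=='records'
    return int(max_length / global_batch_size)


# Illustrative configurations, one per supported max_length unit.
batches_cfg = {"searcher": {"max_length": {"batches": 1000}}}
epochs_cfg = {"searcher": {"max_length": {"epochs": 3}}, "records_per_epoch": 50000}
records_cfg = {"searcher": {"max_length": {"records": 64000}}}

print(compute_num_training_steps(batches_cfg, 32))  # 1000: batches are returned unchanged
print(compute_num_training_steps(epochs_cfg, 32))   # 4686: 3 * int(50000 / 32) = 3 * 1562
print(compute_num_training_steps(records_cfg, 32))  # 2000: int(64000 / 32)
```

In the `epochs` case the per-epoch batch count is truncated before it is multiplied, so a partial final batch in each epoch is not counted. An `epochs` configuration without `records_per_epoch` raises the exception shown in the function.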