docs: add k8s sec diagram (#9593)

determined-ai · Jul 1, 2024 · 462ed58
1 parent 7e2e3f6
commit 462ed58

Showing 7 changed files with 146 additions and 27 deletions.
diff --git a/docs/assets/images/_det-ai-sys-k8s-01-dark.png b/docs/assets/images/_det-ai-sys-k8s-01-dark.png
diff --git a/docs/assets/images/_det-ai-sys-k8s-01-light.png b/docs/assets/images/_det-ai-sys-k8s-01-light.png
diff --git a/docs/setup-cluster/k8s/_index.rst b/docs/setup-cluster/k8s/_index.rst
@@ -4,11 +4,10 @@
  Deploy on Kubernetes
 ######################
 
-This document describes how the Determined runs on `Kubernetes <https://kubernetes.io/>`__. For
-instructions on installing Determined on Kubernetes, see the :ref:`installation guide
-<install-on-kubernetes>`.
+This document describes how Determined runs on Kubernetes. For instructions on installing Determined
+on Kubernetes, see the :ref:`installation guide <install-on-kubernetes>`.
 
-In this topic guide, we will cover:
+This guide covers:
 
 #. How Determined works on Kubernetes.
 #. Limitations of Determined on Kubernetes.
@@ -19,44 +18,57 @@ In this topic guide, we will cover:
 ************************************
 
 :ref:`Installing Determined on Kubernetes <install-on-kubernetes>` deploys an instance of the
-Determined master and a Postgres database in the Kubernetes cluster. Once the master is up and
-running, you can launch :ref:`experiments <experiments>`, :ref:`notebooks <notebooks>`,
-:ref:`TensorBoards <tensorboards>`, :ref:`commands <commands-and-shells>`, and :ref:`shells
-<commands-and-shells>`. When new workloads are submitted to the Determined master, the master
-launches jobs and config maps on the Kubernetes cluster to execute those workloads. Users of
-Determined shouldn't need to interact with Kubernetes directly after installation, as Determined
-handles all the necessary interaction with the Kubernetes cluster. Kubernetes creates and cleans up
-pods for all jobs that Determined may request.
-
-It is also important to note that when running Determined on Kubernetes, a higher priority value
-means a higher priority (e.g. a priority 50 task will run before a priority 40 task). This is
-different from priority scheduling in non-Kubernetes deployments, where lower priority values mean a
-higher priority (e.g. a priority 40 task will run before a priority 50 task).
+Determined master and a Postgres database in the Kubernetes cluster.
+
+.. image:: /assets/images/_det-ai-sys-k8s-01-dark.png
+   :class: only-dark
+   :alt: Determined AI system architecture diagram describing how the master node works on Kubernetes in dark mode
+
+.. image:: /assets/images/_det-ai-sys-k8s-01-light.png
+   :class: only-light
+   :alt: Determined AI system architecture diagram describing how the master node works on Kubernetes in light mode
+
+|
+
+Once the master is running, you can launch :ref:`experiments <experiments>`, :ref:`notebooks
+<notebooks>`, :ref:`TensorBoards <tensorboards>`, :ref:`commands <commands-and-shells>`, and
+:ref:`shells <commands-and-shells>`. When new workloads are submitted to the Determined master, the
+master launches jobs and config maps on the Kubernetes cluster to execute those workloads. Users do
+not need to interact with Kubernetes directly after installation, as Determined handles all the
+necessary interaction with the Kubernetes cluster. Kubernetes creates and cleans up pods for all
+jobs requested by Determined.
+
+.. note::
+
+   When running Determined on Kubernetes, a higher priority value means a higher priority (e.g., a
+   priority 50 task will run before a priority 40 task). This is different from non-Kubernetes
+   deployments, where lower priority values mean higher priority (e.g., a priority 40 task will run
+   before a priority 50 task).
 
 .. _limitations-on-kubernetes:
 
 ***************************
  Limitations on Kubernetes
 ***************************
 
-This section outlines the current limitations of Determined on Kubernetes.
-
 Scheduling
 ==========
 
 By default, the Kubernetes scheduler does not support gang scheduling or preemption. This can be
 problematic for distributed deep learning workloads that require multiple pods to be scheduled
-before execution starts. Determined includes built-in support for the `lightweight coscheduling
-plugin <https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__,
-which extends the default Kubernetes scheduler to support gang scheduling. Determined also includes
-support for priority-based preemption scheduling. Neither are enabled by default. For more details
-and instructions on how to enable the coscheduling plugin, refer to
-:ref:`gang-scheduling-on-kubernetes` and :ref:`priority-scheduling-on-kubernetes`.
+before execution starts.
+
+Determined includes built-in support for the `lightweight coscheduling plugin
+<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__, which
+extends the default Kubernetes scheduler to support gang scheduling. Determined also supports
+priority-based preemption scheduling. Neither feature is enabled by default. For more details and
+instructions on how to enable the coscheduling plugin, refer to :ref:`gang-scheduling-on-kubernetes`
+and :ref:`priority-scheduling-on-kubernetes`.
 
 Dynamic Agents
 ==============
 
-Determined is not able to autoscale your cluster, but equivalent functionality is available by using
+Determined cannot autoscale your cluster. However, equivalent functionality is available by using
 the `Kubernetes Cluster Autoscaler
 <https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler>`_, which is supported on
 `GKE <https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler>`_ and `EKS
@@ -78,7 +90,7 @@ root. For more information, see: :ref:`run-as-user`.
 
 `kubectl <https://kubernetes.io/docs/tasks/tools/>`_ is a command-line tool for interacting with a
 Kubernetes cluster. `Helm <https://helm.sh/docs/helm/helm_install/>`_ is used to install and upgrade
-Determined on Kubernetes. This section covers some of the useful kubectl and helm commands when
+Determined on Kubernetes. This section covers some useful ``kubectl`` and ``helm`` commands when
 :ref:`running Determined on Kubernetes <install-on-kubernetes>`.
 
 For all the commands listed below, include ``-n <kubernetes namespace name>`` if running Determined
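
The priority note added in the hunk above says that priority values are read in opposite directions on Kubernetes and on non-Kubernetes deployments. The following is a minimal sketch of that ordering rule only, not Determined's scheduler; `run_order`, its `on_kubernetes` flag, and the task names are hypothetical and exist purely for illustration.

```python
from typing import List, Tuple


def run_order(tasks: List[Tuple[str, int]], on_kubernetes: bool) -> List[str]:
    """Return task names in the order the priority rule would favor them.

    Hypothetical helper: on Kubernetes a larger value wins, elsewhere a
    smaller value wins, as described in the documentation note.
    """
    ordered = sorted(tasks, key=lambda task: task[1], reverse=on_kubernetes)
    return [name for name, _ in ordered]


# Two illustrative tasks using the priorities from the note (40 and 50).
tasks = [("task-40", 40), ("task-50", 50)]

k8s_order = run_order(tasks, on_kubernetes=True)
other_order = run_order(tasks, on_kubernetes=False)

print(k8s_order)    # priority 50 first on Kubernetes
print(other_order)  # priority 40 first elsewhere
```

The same pair of tasks comes out in opposite order depending on the deployment type, which is the behavior the note warns about.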

diff --git a/model_hub/model_hub/__init__ 2.py b/model_hub/model_hub/__init__ 2.py
diff --git a/model_hub/model_hub/__version__ 2.py b/model_hub/model_hub/__version__ 2.py
@@ -0,0 +1 @@
+__version__ = "0.34.1-dev0"
diff --git a/model_hub/model_hub/py 2.typed b/model_hub/model_hub/py 2.typed
diff --git a/model_hub/model_hub/utils 2.py b/model_hub/model_hub/utils 2.py
@@ -0,0 +1,106 @@
+import logging
+import os
+import typing
+import urllib.parse
+from typing import Any, Dict, List, Union
+
+import filelock
+import numpy as np
+import requests
+import torch
+
+
+def expand_like(arrays: List[np.ndarray], fill: float = -100) -> np.ndarray:
+    """
+    Stacks a list of arrays along the first dimension; the arrays are allowed to differ in
+    the second dimension but should match for dim > 2.
+
+    The output will have dimension
+    (sum([l.shape[0] for l in arrays]), max([l.shape[1] for l in arrays]), ...)
+    For arrays that have fewer entries in the second dimension than the max, we will
+    pad with the fill value.
+
+    Args:
+        arrays: List of np.ndarray to stack along the first dimension
+        fill: Value to fill in when padding to max size in the second dimension
+
+    Returns:
+        stacked array
+    """
+    full_shape = list(arrays[0].shape)
+    if len(full_shape) == 1:
+        return np.concatenate(arrays)
+    full_shape[0] = sum(a.shape[0] for a in arrays)
+    full_shape[1] = max(a.shape[1] for a in arrays)
+    result = np.full(full_shape, fill)
+    row_offset = 0
+    for a in arrays:
+        result[row_offset : row_offset + a.shape[0], : a.shape[1]] = a
+        row_offset += a.shape[0]
+    return result
+
+
+def numpify(x: Union[List, np.ndarray, torch.Tensor]) -> np.ndarray:
+    """
+    Converts List or torch.Tensor to numpy.ndarray.
+    """
+    if isinstance(x, np.ndarray):
+        return x
+    if isinstance(x, List):
+        return np.array(x)
+    if isinstance(x, torch.Tensor):
+        return x.cpu().numpy()  # type: ignore
+    raise TypeError("Expected input of type List, np.ndarray, or torch.Tensor.")
+
+
+def download_url(download_directory: str, url: str) -> str:
+    url_path = urllib.parse.urlparse(url).path
+    basename = url_path.rsplit("/", 1)[1]
+
+    os.makedirs(download_directory, exist_ok=True)
+    filepath = os.path.join(download_directory, basename)
+    lock = filelock.FileLock(filepath + ".lock")
+
+    with lock:
+        if not os.path.exists(filepath):
+            logging.info("Downloading {} to {}".format(url, filepath))
+
+            r = requests.get(url, stream=True)
+            with open(filepath, "wb") as f:
+                for chunk in r.iter_content(chunk_size=8192):
+                    if chunk:
+                        f.write(chunk)
+    return filepath
+
+
+def compute_num_training_steps(experiment_config: Dict, global_batch_size: int) -> int:
+    max_length_unit = list(experiment_config["searcher"]["max_length"].keys())[0]
+    max_length: int = experiment_config["searcher"]["max_length"][max_length_unit]
+    if max_length_unit == "batches":
+        return max_length
+    if max_length_unit == "epochs":
+        if "records_per_epoch" in experiment_config:
+            return max_length * int(experiment_config["records_per_epoch"] / global_batch_size)
+        raise Exception(
+            "Missing num_training_steps hyperparameter in the experiment "
+            "configuration, which is needed to configure the learning rate scheduler."
+        )
+    # Otherwise, max_length_unit=='records'
+    return int(max_length / global_batch_size)
+
+
+class AttrDict(dict):
+    def __init__(self, *args: Any, **kwargs: Any) -> None:
+        super().__init__(*args, **kwargs)
+        self.__dict__ = self
+        for key in self.keys():
+            if isinstance(self[key], dict):
+                self[key] = AttrDict(self[key])
+
+    if typing.TYPE_CHECKING:
+
+        def __getattr__(self, item: Any) -> Any:
+            return True
+
+        def __setattr__(self, item: Any, value: Any) -> None:
+            return None
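
As a rough illustration of how `compute_num_training_steps` from the `utils 2.py` hunk above treats the three `max_length` units, the sketch below copies the function verbatim so it runs without `model_hub` installed and calls it on three small configs. The config dictionaries and the batch size of 32 are made up for this example and are not taken from the commit.

```python
from typing import Dict


def compute_num_training_steps(experiment_config: Dict, global_batch_size: int) -> int:
    # Copied from the diff above so this example is self-contained.
    max_length_unit = list(experiment_config["searcher"]["max_length"].keys())[0]
    max_length: int = experiment_config["searcher"]["max_length"][max_length_unit]
    if max_length_unit == "batches":
        return max_length
    if max_length_unit == "epochs":
        if "records_per_epoch" in experiment_config:
            return max_length * int(experiment_config["records_per_epoch"] / global_batch_size)
        raise Exception(
            "Missing num_training_steps hyperparameter in the experiment "
            "configuration, which is needed to configure the learning rate scheduler."
        )
    # Otherwise, max_length_unit=='records'
    return int(max_length / global_batch_size)


# Hypothetical experiment configs, one per max_length unit.
batches_cfg = {"searcher": {"max_length": {"batches": 1000}}}
epochs_cfg = {"searcher": {"max_length": {"epochs": 3}}, "records_per_epoch": 50000}
records_cfg = {"searcher": {"max_length": {"records": 6400}}}

steps_from_batches = compute_num_training_steps(batches_cfg, 32)
steps_from_epochs = compute_num_training_steps(epochs_cfg, 32)
steps_from_records = compute_num_training_steps(records_cfg, 32)

print(steps_from_batches)  # batches are already steps
print(steps_from_epochs)   # epochs * whole batches per epoch
print(steps_from_records)  # records divided by the batch size
```

Note that the epochs branch truncates the batches-per-epoch value before multiplying, so a partial final batch in each epoch is not counted.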