Merge pull request #227 from lsst-sqre/tickets/DM-38408

DM-38408: Add support for the new Nublado lab controller
lsst-sqre · Mar 23, 2023 · 9d98985
2 parents dbacdf8 + 3b7947b
commit 9d98985

Showing 18 changed files with 502 additions and 195 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,39 @@
+# Change log
+
+Versioning follows [semver](https://semver.org/).
+
+Dependencies are updated to the latest available version during each release. Those changes are not noted here explicitly.
+
+## 5.0.0 (2023-03-22)
+
+### Backwards-incompatible changes
+
+- Settings are now handled with Pydantic and undergo much stricter validation. In particular, the Slack web hook URL must now be a valid URL if provided.
+- In order to enable stricter and more useful Pydantic validation of flock specifications, the syntax for creating a flock has changed. `business` is now a dictionary, the `restart` option has been moved under it, the type of business is specified with `type`, and the business configuration options have moved under that key as `options`. Options that are not applicable to a given business type are now rejected.
+- The `jupyter.url_prefix` option is now just `url_prefix`, and `jupyter.image` is now just `image`. The names of the settings under `image` have changed.
+- The `TAPQueryRunner` options `tap_sync` and `tap_query_set` are now just `sync` and `query_set`.
+- `lab_settle_time` is no longer supported as a configuration option for the businesses that spawn a Nublado lab. It defaulted to 0 and we never set it.
+- `JupyterJitterLoginLoop` has been retired. Instead, set the `jitter` option on `JupyterPythonLoop`.
+- `JupyterLoginLoop` has been merged with `JupyterPythonLoop`. The only difference in the former is that no lab session was created and no code was run, which seems pointless and not worth the distinction. `JupyterPythonLoop` runs a simple addition by default, which should be an improvement over `JupyterLoginLoop` in every likely situation.
+
+### New features
+
+- When the production logging profile is used, the messages from monkeys are no longer reported to the main mobu log, only to the individual monkey logs. This should produce considerably less noise in external log aggregators.
+- The notebook being run is now included in all Slack error reports, not just for code execution failures.
+- The API documentation now shows only the relevant options for the type of business when showing how to create a flock.
+- Add support for running a business once and returning its results, via a POST to the new `/run` endpoint.
+- Add support for the new Nublado lab controller (see [SQR-066](https://sqr-066.lsst.io/)).
+- The time a business pauses after a failure before it is restarted is now configurable with the `error_idle_time` option and defaults to 10 minutes (instead of 1 minute) for Nublado businesses, since this is how long JupyterHub will wait for a lab to spawn before giving up.
+
+### Bug fixes
+
+- The `dp0.2` `TAPQueryRunner` query set is now lighter-weight and will consume less memory and CPU to execute, hopefully reducing timeout errors.
+- Cell numbering in error reports is now across all cells, not just code cells.
+- `TAPQueryRunner` no longer creates a TAP client in its `__init__` method, since creating a TAP client makes HTTP requests to the TAP server that can fail and failure would potentially crash mobu. Instead, it creates the TAP client in `startup` and handles exceptions properly so that they're reported to Slack.
+- Business failures during `startup` are now counted as a failed execution so that a business that fails repeatedly in `startup` doesn't report 100% success in the flock summary.
+- The code run by `JupyterPythonLoop` and `NotebookRunner` to get the Kubernetes node on which the lab is running now uses `lsst.rsp.get_node` instead of the deprecated `rubin_jupyter_utils.lab.notebook.utils.get_node`.
+
+### Other changes
+
+- Slightly improve logging when monkeys are shut down due to errors.
+- mobu's internals have been extensively refactored following the design in [SQR-072](https://sqr-072.lsst.io/) to hopefully make future maintenance easier.
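As a rough illustration of the new flock syntax described in the change log above, the following sketch builds a flock specification as a plain Python dictionary. Only the nesting of `business`, `type`, `restart`, and `options`, and the `url_prefix`, `jitter`, and `image` option names, come from this pull request. The flock name, `count`, `user_spec`, and `scopes` fields and all of their values are hypothetical placeholders, not a confirmed schema.

```python
import json

# Hypothetical flock specification in the 5.0.0 layout. The top-level
# "name", "count", "user_spec", and "scopes" entries are assumptions made
# for illustration; only the shape of "business" follows the change log.
flock = {
    "name": "example-flock",
    "count": 1,
    "user_spec": {"username_prefix": "bot-mobu-example"},
    "scopes": ["exec:notebook"],
    "business": {
        # The business type is now given by "type" inside the dictionary.
        "type": "JupyterPythonLoop",
        # "restart" has moved under "business".
        "restart": True,
        # Business-specific settings live under "options".
        "options": {
            # Formerly jupyter.url_prefix.
            "url_prefix": "/nb/",
            # Replaces the retired JupyterJitterLoginLoop business.
            "jitter": 60,
            # Formerly jupyter.image, with renamed settings.
            "image": {"image_class": "latest-weekly", "size": "Medium"},
        },
    },
}

print(json.dumps(flock, indent=2))
```

Because options that do not apply to the chosen business type are now rejected, a specification like this should list only the options that the selected `type` supports.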
diff --git a/src/mobu/config.py b/src/mobu/config.py
@@ -61,6 +61,17 @@ class Configuration(BaseSettings):
         example="https://data.example.org/",
     )
 
+    use_cachemachine: bool = Field(
+        True,
+        title="Whether to use cachemachine to look up an image",
+        description=(
+            "Set this to false in environments using the new Nublado lab"
+            " controller."
+        ),
+        env="USE_CACHEMACHINE",
+        example=False,
+    )
+
     cachemachine_image_policy: CachemachinePolicy = Field(
         CachemachinePolicy.available,
         field="Class of cachemachine images to use",
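The `use_cachemachine` setting added above defaults to true and is read from the `USE_CACHEMACHINE` environment variable. The sketch below is a standard-library stand-in for that behavior, not the Pydantic settings code itself. The helper name `read_use_cachemachine` and the accepted spellings of true and false are assumptions of this sketch; the settings library applies its own parsing rules.

```python
from __future__ import annotations

from typing import Mapping

# Spellings accepted by this sketch only (an assumption, not taken from
# the diff). A real settings library defines its own accepted values.
TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def read_use_cachemachine(environ: Mapping[str, str]) -> bool:
    """Return the USE_CACHEMACHINE flag, defaulting to True as in the diff."""
    raw = environ.get("USE_CACHEMACHINE")
    if raw is None:
        # Unset means the default: keep using cachemachine.
        return True
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Invalid USE_CACHEMACHINE value: {raw!r}")


# Environments running the new Nublado lab controller would set it to false.
print(read_use_cachemachine({}))
print(read_use_cachemachine({"USE_CACHEMACHINE": "false"}))
```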

diff --git a/src/mobu/models/business/base.py b/src/mobu/models/business/base.py
@@ -12,11 +12,13 @@
 
 
 class BusinessOptions(BaseModel):
-    """Options for monkey business.
+    """Options for monkey business."""
 
-    Each type of business should create its own options class that inherits
-    from this class and adds any additional options that it supports.
-    """
+    error_idle_time: int = Field(
+        60,
+        title="How long to wait after an error before restarting",
+        example=600,
+    )
 
     idle_time: int = Field(
         60,

diff --git a/src/mobu/models/business/jupyterpythonloop.py b/src/mobu/models/business/jupyterpythonloop.py
@@ -31,6 +31,7 @@ class JupyterPythonLoopOptions(NubladoBusinessOptions):
             "The number of code snippets to execute before restarting the lab."
         ),
         example=25,
+        ge=1,
     )
 
 

diff --git a/src/mobu/models/business/notebookrunner.py b/src/mobu/models/business/notebookrunner.py
@@ -33,6 +33,7 @@ class NotebookRunnerOptions(NubladoBusinessOptions):
             " left off."
         ),
         example=25,
+        ge=1,
     )
 
     repo_branch: str = Field(

diff --git a/src/mobu/models/business/nublado.py b/src/mobu/models/business/nublado.py
@@ -2,11 +2,12 @@
 
 from __future__ import annotations
 
-from typing import Optional
+from abc import ABCMeta, abstractmethod
+from enum import Enum
+from typing import Literal, Optional
 
-from pydantic import Field
+from pydantic import BaseModel, Field
 
-from ..jupyter import JupyterConfig, JupyterImage
 from .base import BusinessData, BusinessOptions
 
 __all__ = [
@@ -15,6 +16,105 @@
 ]
 
 
+class NubladoImageClass(str, Enum):
+    """Possible ways of selecting an image."""
+
+    RECOMMENDED = "recommended"
+    LATEST_RELEASE = "latest-release"
+    LATEST_WEEKLY = "latest-weekly"
+    LATEST_DAILY = "latest-daily"
+    BY_REFERENCE = "by-reference"
+    BY_TAG = "by-tag"
+
+
+class NubladoImageSize(Enum):
+    """Acceptable sizes of images to spawn."""
+
+    Fine = "Fine"
+    Diminutive = "Diminutive"
+    Tiny = "Tiny"
+    Small = "Small"
+    Medium = "Medium"
+    Large = "Large"
+    Huge = "Huge"
+    Gargantuan = "Gargantuan"
+    Colossal = "Colossal"
+
+
+class NubladoImage(BaseModel, metaclass=ABCMeta):
+    """Base class for different ways of specifying the lab image to spawn."""
+
+    # Ideally this would just be class, but it is a keyword and adding all the
+    # plumbing to correctly serialize Pydantic models by alias instead of
+    # field name is tedious and annoying. Live with the somewhat verbose name.
+    image_class: NubladoImageClass = Field(
+        ...,
+        title="Class of image to spawn",
+    )
+
+    size: NubladoImageSize = Field(
+        NubladoImageSize.Large,
+        title="Size of image to spawn",
+        description="Must be one of the sizes understood by Nublado.",
+    )
+
+    @abstractmethod
+    def to_spawn_form(self) -> dict[str, str]:
+        """Convert to data suitable for posting to JupyterHub's spawn form.
+
+        Returns
+        -------
+        dict of str to str
+            Post data to send to the JupyterHub spawn page.
+        """
+
+
+class NubladoImageByClass(NubladoImage):
+    """Spawn an image selected by class (recommended or a latest build)."""
+
+    image_class: Literal[
+        NubladoImageClass.RECOMMENDED,
+        NubladoImageClass.LATEST_RELEASE,
+        NubladoImageClass.LATEST_WEEKLY,
+        NubladoImageClass.LATEST_DAILY,
+    ] = Field(
+        NubladoImageClass.RECOMMENDED,
+        title="Class of image to spawn",
+    )
+
+    def to_spawn_form(self) -> dict[str, str]:
+        return {
+            "image_class": self.image_class.value,
+            "size": self.size.value,
+        }
+
+
+class NubladoImageByReference(NubladoImage):
+    """Spawn an image by full Docker reference."""
+
+    image_class: Literal[NubladoImageClass.BY_REFERENCE] = Field(
+        NubladoImageClass.BY_REFERENCE, title="Class of image to spawn"
+    )
+
+    reference: str = Field(..., title="Docker reference of lab image to spawn")
+
+    def to_spawn_form(self) -> dict[str, str]:
+        return {"image_list": self.reference, "size": self.size.value}
+
+
+class NubladoImageByTag(NubladoImage):
+    """Spawn an image by image tag."""
+
+    image_class: Literal[NubladoImageClass.BY_TAG] = Field(
+        NubladoImageClass.BY_TAG, title="Class of image to spawn"
+    )
+
+    tag: str = Field(..., title="Tag of image to spawn")
+
+    def to_spawn_form(self) -> dict[str, str]:
+        return {"image_tag": self.tag, "size": self.size.value}
+
+
 class NubladoBusinessOptions(BusinessOptions):
     """Options for any business that runs code in a Nublado lab."""
 
@@ -33,6 +133,17 @@ class NubladoBusinessOptions(BusinessOptions):
         60, title="Timeout for deleting a lab in seconds", example=60
     )
 
+    # Zero-to-JupyterHub forces the spawner timeout to 10 minutes, and this is
+    # not unreasonable if the user is trying to spawn an image that isn't
+    # prepulled. Since JupyterHub won't let us delete a lab while it's waiting
+    # for spawn, increase the error idle time to long enough that JupyterHub
+    # will have timed out.
+    error_idle_time: int = Field(
+        600,
+        title="How long to wait after an error before restarting",
+        example=600,
+    )
+
     execution_idle_time: int = Field(
         1,
         title="How long to wait between cell executions in seconds",
@@ -50,6 +161,12 @@ class NubladoBusinessOptions(BusinessOptions):
         ),
     )
 
+    image: (
+        NubladoImageByClass | NubladoImageByReference | NubladoImageByTag
+    ) = Field(
+        default_factory=NubladoImageByClass, title="Nublado lab image to use"
+    )
+
     jitter: int = Field(
         0,
         title="Maximum random time to pause",
@@ -64,11 +181,6 @@ class NubladoBusinessOptions(BusinessOptions):
         example=60,
     )
 
-    jupyter: JupyterConfig = Field(
-        default_factory=JupyterConfig,
-        title="Jupyter lab spawning configuration",
-    )
-
     spawn_settle_time: int = Field(
         10,
         title="How long to wait before polling spawn progress in seconds",
@@ -85,17 +197,33 @@ class NubladoBusinessOptions(BusinessOptions):
         610, title="Timeout for spawning a lab in seconds", example=610
     )
 
+    url_prefix: str = Field("/nb/", title="URL prefix for JupyterHub")
+
     working_directory: Optional[str] = Field(
         None,
         title="Working directory when running code",
         example="notebooks/tutorial-notebooks",
     )
 
 
+class RunningImage(BaseModel):
+    """Information about the running JupyterLab image."""
+
+    reference: Optional[str] = Field(
+        None,
+        title="Docker reference for the image",
+    )
+
+    description: Optional[str] = Field(
+        None,
+        title="Human-readable description of the image",
+    )
+
+
 class NubladoBusinessData(BusinessData):
     """Status of a running Nublado business."""
 
-    image: Optional[JupyterImage] = Field(
+    image: Optional[RunningImage] = Field(
         None,
         title="JupyterLab image information",
         description="Will only be present when there is an active Jupyter lab",
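The three `NubladoImage` subclasses in this file each map an `image` option onto different spawn form fields. The following standard-library sketch mirrors that mapping with a single function instead of Pydantic models, so it shows the selection logic only and performs none of the validation the real models do. The function name `spawn_form` and the example tag `w_2023_12` are hypothetical; the enum values, form keys, and the `recommended` and `Large` defaults are copied from the diff.

```python
from __future__ import annotations

from enum import Enum
from typing import Mapping


class ImageClass(str, Enum):
    """Image selection methods, copied from NubladoImageClass in the diff."""

    RECOMMENDED = "recommended"
    LATEST_RELEASE = "latest-release"
    LATEST_WEEKLY = "latest-weekly"
    LATEST_DAILY = "latest-daily"
    BY_REFERENCE = "by-reference"
    BY_TAG = "by-tag"


def spawn_form(image: Mapping[str, str]) -> dict[str, str]:
    """Build spawn form data from an ``image`` option dictionary."""
    # Defaults match the models: the recommended image at size Large.
    image_class = ImageClass(image.get("image_class", "recommended"))
    size = image.get("size", "Large")
    if image_class is ImageClass.BY_REFERENCE:
        # A full Docker reference is sent in the image_list field.
        return {"image_list": image["reference"], "size": size}
    if image_class is ImageClass.BY_TAG:
        return {"image_tag": image["tag"], "size": size}
    # Every remaining class is sent by name.
    return {"image_class": image_class.value, "size": size}


print(spawn_form({}))
print(spawn_form({"image_class": "by-tag", "tag": "w_2023_12", "size": "Medium"}))
```

In the actual models, the `Literal` annotations on `image_class` let Pydantic pick the matching subclass of the union, which is what this sketch approximates with explicit branches.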

diff --git a/src/mobu/models/jupyter.py b/src/mobu/models/jupyter.py