From eb1b0de39cf28abffa696393b9c2d21fe9de384f Mon Sep 17 00:00:00 2001
From: Tara
Date: Thu, 19 Sep 2024 15:21:37 -0500
Subject: [PATCH] docs: Add workload alerting (#9938)

---
 docs/integrations/_index.rst                  |  62 +++++--
 docs/integrations/notification/_index.rst     | 118 +++-----------
 docs/integrations/notification/slack.rst      |   2 +
 .../notification/workload-alerting.rst        | 151 ++++++++++++++++++
 docs/integrations/notification/zapier.rst     |   2 +
 5 files changed, 223 insertions(+), 112 deletions(-)
 create mode 100644 docs/integrations/notification/workload-alerting.rst

diff --git a/docs/integrations/_index.rst b/docs/integrations/_index.rst
index 8f3ec193ace..dc969b9d6e7 100644
--- a/docs/integrations/_index.rst
+++ b/docs/integrations/_index.rst
@@ -7,23 +7,51 @@
 .. meta::
    :description: Discover how Determined integrates with other popular machine learning ecosystem tools.
 
-Determined is designed to easily integrate with other popular ML ecosystem tools for tasks that are
-related to model training, such as ETL, ML pipelines, and model serving. It is recommended to use
-the :ref:`python-sdk` to interact with Determined.
-
-- :ref:`data-transformers`: Dive into how Determined integrates with data transformation tools such
-  as :ref:`pachyderm-integration`.
-- :ref:`ides-index`: Determined shells can be used in the popular IDEs similarly to a common remote
-  SSH host.
-- :ref:`notifications`: Make use of webhooks to integrate Determined into your existing workflows.
-- :ref:`prometheus-grafana`: Discover how to enable a Grafana dashboard to monitor Determined
-  hardware and system metrics on a cloud cluster, such as AWS or Kubernetes.
-
-Learn more:
-
-Visit the `Works with Determined `__
-repository to find examples of how to use Determined with a variety of ML ecosystem tools, including
-Pachyderm, DVC, Delta Lake, Seldon, Spark, Argo, Airflow, and Kubeflow.
+Determined seamlessly integrates with popular ML ecosystem tools to enhance your model training
+workflow. From data transformation to monitoring and alerting, our integrations help streamline your
+ML pipeline.
+
+******************
+ Key Integrations
+******************
+
+- **Data Transformation**: Integrate with tools like :ref:`pachyderm-integration` to streamline
+  your data preprocessing.
+
+- **Development Environments**: Use Determined shells in popular IDEs, similar to remote SSH hosts.
+  Learn more at :ref:`ides-index`.
+
+- **Workload Alerting**: Set up :ref:`workload-alerting` through webhooks to stay informed about
+  your experiments in real time. For a comprehensive overview of notification options, see
+  :ref:`notifications`.
+
+- **Monitoring**: Enable Grafana dashboards to monitor hardware and system metrics on cloud
+  clusters. See :ref:`prometheus-grafana` for details.
+
+*****************
+ Getting Started
+*****************
+
+To make the most of these integrations, we recommend using the :ref:`python-sdk` to interact with
+Determined.
+
+**************
+ Explore More
+**************
+
+Visit our `Works with Determined `__
+repository for examples of using Determined with various ML ecosystem tools, including:
+
+- Pachyderm
+- DVC
+- Delta Lake
+- Seldon
+- Spark
+- Argo
+- Airflow
+- Kubeflow
+
+These examples demonstrate how Determined can enhance your existing ML workflows and tools.
 
 .. toctree::
    :hidden:
diff --git a/docs/integrations/notification/_index.rst b/docs/integrations/notification/_index.rst
index 34c6305de27..8f14f2258ac 100644
--- a/docs/integrations/notification/_index.rst
+++ b/docs/integrations/notification/_index.rst
@@ -4,9 +4,14 @@
  Notifications
 ###############
 
-Monitoring experiment status is a vital part of working with Determined. In order to integrate
-Determined into your existing workflows, you can make use of webhooks to update other systems,
-receive emails, slack messages, and more when an experiment is updated.
+Monitoring experiment status is crucial when working with Determined. To integrate Determined into
+your existing workflows, you can use :ref:`workload-alerting` through webhooks. This feature allows
+you to receive timely updates about your experiments via various channels such as email, Slack
+messages, or other systems.
+
+Workload alerting is particularly useful for real-time monitoring, debugging, and custom
+notifications. For example, you can configure alerts to trigger as soon as specific events occur in
+your experiments, rather than waiting for tasks to reach final states like "Completed" or "Error".
 
 Webhooks such as tasklog webhooks are useful for real-time monitoring, debugging, custom
 notifications, and integration with other systems. For example, using ``Tasklog``, you could get
@@ -127,15 +132,17 @@ Below is an example of handling a signed payload in Python.
 Supported Triggers
 ==================
 
-``Completed`` or ``Error`` will be triggered when an experiment in scope is completed or errored.
+Determined supports the following webhook trigger types:
+
+``COMPLETED`` or ``ERROR`` will be triggered when an experiment in scope is completed or errored.
 
-``Tasklog`` will be triggered when a task matching regex is detected.
+``TASKLOG`` will be triggered when a task log matching the configured regex is detected.
 
-``Custom`` will only be triggered from experiment code.
+``CUSTOM`` will only be triggered from experiment code.
 
 .. code::
 
-   # Here is an example code to trigger a custom trigger.
+   # Example code to trigger a custom trigger.
 
    # config.yaml
    integrations:
@@ -147,99 +154,20 @@ Supported Triggers
    with det.core.init() as core_context:
        core_context.alert(title="some title", description="some description", level="info")
 
-*******************
- Creating Webhooks
-*******************
-
-To create a webhook, follow these steps:
-
-- Navigate to ``/det/webhooks`` or select **Webhooks** in the left-side navigation pane.
-- Choose **New Webhook**.
-
-.. image:: /assets/images/webhook.png
-   :width: 100%
-   :alt: Webhooks interface showing New Webhook button.
-
-.. note::
-
-   If you do not have sufficient permissions to view and create webhooks, consult with a systems
-   administrator.
-
-- Workspace: Select a workspace where you have permission to create webhooks.
-- Name: Supply a unique identifier for referencing the webhook in the experiment configuration.
-- URL: Enter the webhook URL.
-- Type: Choose either ``Default`` or ``Slack``. The ``Slack`` type automatically formats message
-  content for better readability on Slack.
-- Trigger: Select the event you want to monitor. See the list of supported triggers in the
-  :ref:`supported-webhook-triggers` section.
-- Triggered by: Choose whether to monitor all experiments within the workspace. For the ``Custom``
-  option, the trigger applies only to specific experiments.
-
-.. code::
-
-   # Example of an experiment configuration with webhooks
-
-   integrations:
-     webhooks:
-       webhook_name:
-         -
-
-- Regex: If the webhook is configured to trigger on Tasklog, define a regex using `Golang Regex
-  Syntax `_.
-
-.. image:: /assets/images/webhook_modal.png
-   :width: 100%
-   :alt: Webhook user interface showing the fields you will interact with.
-
-Once created, your webhook will automatically execute for the selected events within the specified
-experiments.
-
-******************
- Testing Webhooks
-******************
-
-To test a webhook, select the more-options menu to the right of the webhook record to access
-available actions.
-
-.. image:: /assets/images/webhook_action.png
-   :width: 100%
-   :alt: Webhooks interface showing where to find the actions menu
-
-Select **Test Webhook** to trigger a test event to be sent to the defined webhook URL with a mock
-payload as stated below:
+****************
+ Using Webhooks
+****************
 
-.. code::
-
-   {
-       "event_id": "b8667b8a-e14d-40e5-83ee-a64e31bdc5f4",
-       "event_type": "EXPERIMENT_STATE_CHANGE",
-       "timestamp": 1665695871,
-       "condition": {
-           "state": "COMPLETED"
-       },
-       "event_data": {
-           "data": "test"
-       }
-   }
-
-*******************
- Deleting Webhooks
-*******************
+To get started with webhooks in Determined:
 
-To delete a webhook, select the more-options menu to the right of the webhook record to expand
-available actions.
+#. For step-by-step instructions on creating webhooks, see :ref:`creating-webhooks`.
 
-******************
- Editing Webhooks
-******************
+#. For use cases and best practices, see the :ref:`workload-alerting` guide.
 
-To edit a webhook, select the more-options menu to the right of the webhook record to expand
-available actions.
-
-.. note::
+#. For platform-specific integration guides, see:
 
-   Determined only supports editing the URL of webhooks. To modify other attributes, delete and
-   recreate the webhook.
+   - :ref:`slack-integration`
+   - :ref:`zapier-integration`
 
 .. toctree::
    :caption: Notification
diff --git a/docs/integrations/notification/slack.rst b/docs/integrations/notification/slack.rst
index dc4599dd8b6..2e3d89589a1 100644
--- a/docs/integrations/notification/slack.rst
+++ b/docs/integrations/notification/slack.rst
@@ -1,3 +1,5 @@
+.. _slack-integration:
+
 #######
   Slack
 #######
diff --git a/docs/integrations/notification/workload-alerting.rst b/docs/integrations/notification/workload-alerting.rst
new file mode 100644
index 00000000000..5b0df0b5cf7
--- /dev/null
+++ b/docs/integrations/notification/workload-alerting.rst
@@ -0,0 +1,151 @@
+.. _workload-alerting:
+
+###################
+ Workload Alerting
+###################
+
+Workload alerting allows you to monitor the state of your experiments and share important
+information with your team members. This feature enables proactive issue detection while maintaining
+a good signal-to-noise ratio.
+
+.. note::
+
+   To use this experimental feature, enable "Webhook Improvement" in :ref:`user settings
+   `.
+
+**************
+ Key Concepts
+**************
+
+- Webhook Trigger options: "All experiments in Workspace" and "Specific experiment(s) with matching
+  configuration"
+- Webhook Exclusion
+- Trigger Types: COMPLETED, ERROR, TASKLOG, CUSTOM
+- Alert Levels: INFO, WARN, DEBUG, ERROR
+
+For detailed information on supported triggers and example usage, see :ref:`notifications`.
+
+.. _creating-webhooks:
+
+*******************
+ Creating Webhooks
+*******************
+
+As a non-admin user with Editor or higher permissions, you can configure webhooks within your
+workspace. Here's how to create webhooks:
+
+#. Navigate to the **Webhooks** section in the WebUI.
+
+#. Select **New Webhook**.
+
+#. In the New Webhook dialogue:
+
+   - Select your Workspace
+   - Name your webhook
+   - Paste the webhook URL (e.g., from Zapier)
+   - Set Type to either Default or Slack
+   - Select the Trigger event (COMPLETED, ERROR, TASKLOG, or CUSTOM)
+   - Choose the Trigger by option: "All experiments in Workspace" or "Specific experiment(s) with
+     matching configuration"
+   - If "Specific experiment(s) with matching configuration", note the Webhook Name for use in
+     experiment configurations
+
+#. Click **Create Webhook**.
+
+*******************
+ Deleting Webhooks
+*******************
+
+To delete a webhook, select the more-options menu to the right of the webhook record to expand
+available actions.
+
+******************
+ Editing Webhooks
+******************
+
+To edit a webhook, select the more-options menu to the right of the webhook record to expand
+available actions.
+
+.. note::
+
+   Determined only supports editing the URL of webhooks. To modify other attributes, delete and
+   recreate the webhook.
+
+***********
+ Use Cases
+***********
+
+Webhooks in Determined offer versatile solutions for various monitoring and alerting needs. Let's
+explore some common use cases to help you leverage this powerful feature effectively.
+
+Case 1: Share Simple State on All Experiments in Workspace
+==========================================================
+
+This use case is ideal for teams that want to maintain a broad overview of all experiments running
+in a workspace, ensuring that no important updates are missed.
+
+#. Create a webhook with the "All experiments in Workspace" option.
+#. Select the desired trigger events (COMPLETED, ERROR, TASKLOG).
+#. All experiments in the workspace will now trigger this webhook unless explicitly excluded.
+
+Case 2: Exclude Specific Experiments from Triggering Webhooks
+=============================================================
+
+During active development or debugging, you may want to prevent certain experiments from triggering
+alerts to reduce noise and focus on specific tasks.
+
+#. Edit the experiment configuration:
+
+   .. code:: yaml
+
+      integrations:
+        webhooks:
+          exclude: true
+
+#. Run the experiment and verify that no webhooks are triggered.
+
+Case 3: Customizable Monitoring for Specific Experiments
+========================================================
+
+For critical experiments or those requiring special attention, you can set up custom monitoring to
+receive tailored alerts based on specific conditions or events in your code.
+
+#. Create a webhook with the "Specific experiment(s) with matching configuration" option and
+   "CUSTOM" trigger type.
+
+#. Note the Webhook Name.
+
+#. In the experiment configuration, reference the webhook:
+
+   .. code:: yaml
+
+      integrations:
+        webhooks:
+          webhook_name:
+            -
+
+#. In your experiment code, use the ``core_context.alert()`` function to trigger the webhook:
+
+   .. code:: python
+
+      with det.core.init() as core_context:
+          core_context.alert(
+              title="Custom Alert",
+              description="This is a custom alert",
+              level="INFO"
+          )
+
+#. Run the experiment and check the event log in your webhook service for the custom data.
+
+For more details on custom triggers, see :ref:`notifications`.
+
+****************
+ Best Practices
+****************
+
+- Use the "All experiments in Workspace" option for general monitoring across a workspace.
+- Use the "Specific experiment(s) with matching configuration" option and custom triggers for
+  fine-grained control over alerts for critical experiments.
+- Use webhook exclusion for experiments under active iteration to reduce noise.
+- Regularly review and update your webhook configurations to ensure they remain relevant and
+  useful.
diff --git a/docs/integrations/notification/zapier.rst b/docs/integrations/notification/zapier.rst
index cdaeb4a8721..57c6e1a0f37 100644
--- a/docs/integrations/notification/zapier.rst
+++ b/docs/integrations/notification/zapier.rst
@@ -1,3 +1,5 @@
+.. _zapier-integration:
+
 ########
   Zapier
 ########