From 3d2b9bc18c3b8c7bad7e4410cae0e5c2ad40690e Mon Sep 17 00:00:00 2001
From: Tara Charter
Date: Thu, 24 Oct 2024 11:38:53 -0500
Subject: [PATCH] docs: log policies restore exp config

restore experiment configuration file content original #10098
---
 .../reference/experiment-config-reference.rst | 24 +++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/docs/reference/experiment-config-reference.rst b/docs/reference/experiment-config-reference.rst
index 4374e0e399a..e4b489d927d 100644
--- a/docs/reference/experiment-config-reference.rst
+++ b/docs/reference/experiment-config-reference.rst
@@ -315,9 +315,21 @@ language syntax). For more information about the syntax, you can visit this `RE2

 - ``action``: Optional. The action to take when the pattern is matched. Actions include:

-  - ``exclude_node``: Excludes a failed trial's restart attempts from being scheduled on nodes
-    with matching error logs.
-  - ``cancel_retries``: Prevents a trial from restarting if it reports a matching log.
+  - ``exclude_node``: Excludes a failed trial's restart attempts (due to its ``max_restarts``
+    policy) from being scheduled on nodes with matched error logs. This is useful for bypassing
+    nodes with hardware issues, such as uncorrectable GPU ECC errors.
+
+    .. note::
+
+       This option is not supported on PBS systems.
+
+    For the agent resource manager, if a trial becomes unschedulable due to enough node
+    exclusions, and ``launch_error`` in the master config is set to true (default), the trial will
+    fail.
+
+  - ``cancel_retries``: Prevents a trial from restarting if a log matches the pattern, even if the
+    trial has remaining max_restarts. This avoids using resources retrying a trial that encounters
+    failures unlikely to be resolved by retrying, such as CUDA memory issues.

 Example configuration:

@@ -334,12 +346,14 @@ Example configuration:
          type: cancel_retries

 When a log policy matches, its name (if provided) will be displayed as a label in the WebUI,
-allowing for easy identification of specific issues or events during a run.
+allowing for easy identification of specific issues or events during a run. These labels will appear
+in both the run table and run detail views.

 These settings may also be specified at the cluster or resource pool level through task container
 defaults.

-To find out more about log management, visit :ref:`Log Management `.
+To find out more about log management features like **Log Search** and **Log Signal**, visit
+:ref:`Log Management `.

 .. _log-retention-days:
