toitware · floitsch · Sep 10, 2024 · Sep 10, 2024 · Sep 10, 2024
diff --git a/public/docs/fleet/index.mdx b/public/docs/fleet/index.mdx
@@ -10,6 +10,11 @@ in the cloud &mdash; and the developer tooling to help orchestrate the devices.
 possible to host your own broker, so all your data and code remains under your
 control.
 
+Artemis is designed to be [reliable](./reliability) and robust. Configuration is
+treated as code, and none of the servers are critical. As long as
+the configuration is stored in a reliable system (typically Git), Artemis
+can recover from a full loss of all servers.
+
 ---------------------
 
 # Installation

diff --git a/public/docs/fleet/reliability.mdx b/public/docs/fleet/reliability.mdx
@@ -0,0 +1,134 @@
+# Reliability
+
+The most important job of Artemis is to keep devices updatable. Since updates
+are delivered through the broker, it is critical that the device can reach the broker.
+
+## Strategy
+
+The Artemis service that is installed on the devices periodically downloads its
+configuration from the broker.
+
+If the device can't reach the broker, it will retry more and more aggressively.
+First it will reduce the interval between
+check-ins. At some point, it will turn off non-critical containers during connection
+attempts. Eventually, it will even disable critical containers for a short period of time.
+
+All of these measures don't help if the broker is not available anymore. In that case,
+the device attempts to reach a recovery URL. This URL can provide a new broker
+configuration, allowing the device to connect again.
+
+Being able to contact the broker means that the device can fetch new configurations,
+and, hopefully, also firmware. However, any new firmware isn't guaranteed to be able
+to do the same. As such, Artemis only commits to a new firmware after it has
+successfully connected to the broker with the new firmware.
+
+As a last resort, Artemis also uses a watchdog. If the device can't connect to the
+broker for a long time, the watchdog will trigger and reboot the device. This also
+guards against the Artemis service itself getting into a bad state.
+
+## Max offline
+
+The frequency at which Artemis contacts the broker can be configured using the
+`max-offline` setting in the pod specification. A value of `0s` means that the
+device stays connected to the broker and polls it continuously (but at most
+every 20 seconds). A value of `1h` means that the device can be offline for
+up to an hour before it tries to reconnect.
+
+Here is a pod specification with a `max-offline` value of `1h`:
+
+```yaml
+# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json
+
+$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
+name: my-pod
+sdk-version: v2.0.0-alpha.160
+artemis-version: v0.24.0
+max-offline: 1h
+firmware-envelope: x64-linux
+containers: {}
+```
+
+The max-offline value can also be set using the `toit` CLI:
+
+```bash
+toit device -d DEVICE-ID set-max-offline 1h
+```
+
+This configuration change is not permanent and will be lost with the next firmware update.
+
+## Status states
+
+The Artemis service on the device keeps track of its connection status. It is either
+green, yellow, orange or red. As of September 2024, the following heuristics are used
+to determine the status:
+- green: the device is within the `max-offline` window.
+- yellow: the device is outside the `max-offline` window, but within a factor of 2. For
+  example, if `max-offline` is set to 1h, then the device is in yellow state if it hasn't
+  connected for more than 1h, but less than 2h.
+- orange: the device is outside the `max-offline` window, but within a factor of 3.
+- red: the device is outside the `max-offline` window, and has been for more than 3 times
+  the `max-offline` value.
+
+The following actions (again, as of September 2024) are taken based on the status.
+
+### Orange
+
+When a device is in orange state, Artemis reduces the interval between reconnection
+attempts by 50%. For example, if `max-offline` is set to 1h, then the device will
+try to reconnect every 30 minutes.
+
+Artemis also disables all non-critical containers during reconnection attempts.
+
+In this state, Artemis always tries a random entry from the recovery-URLs list to
+get a new broker configuration.
+
+### Red
+
+Devices that can't connect to the broker for a long time are put into red state.
+
+The red state is a more extreme version of the orange state. Artemis will reduce the
+reconnect interval by a factor of 4. For example, if `max-offline` is
+set to 1h, then the device will try to reconnect every 15 minutes.
+
+As for the orange state, Artemis disables all non-critical containers during
+reconnection attempts. However, now Artemis also disables critical containers
+15% of the time.
+
+Contrary to the orange state, Artemis does not always contact the recovery URL.
+In case the recovery-URL connection makes things worse Artemis only fetches
+recovery URLs 20% of the time.
+
+## Recovery URLs
+
+Even though recovery URLs are baked into pods, they are stored as properties of
+the fleet. We don't expect recovery URLs to change often, and having them in the
+fleet configuration avoids duplication.
+
+A recovery URL can be added to an existing fleet with the following command:
+
+```bash
+artemis fleet recovery add RECOVERY-URL
+```
+
+If the existing broker for a device doesn't exist anymore, then the device can
+use the recovery URL to get a new broker configuration. The new broker configuration
+can be generated by running `artemis fleet recovery export -o recovery.json` and
+serving the resulting `recovery.json` file at the recovery URL.
+
+We recommend to use a different domain as recovery URL. This way, the recovery URL
+is not affected by the same issues as the main broker.
+
+Note that recovery URLs don't need to be online until they are needed.
+
+## Watchdogs
+
+As mentioned earlier, Artemis uses watchdogs to monitor its own health. Given a
+certain `max-offline` value, the watchdog will trigger if the device hasn't connected
+to the broker for more than 5 times the `max-offline` value (or at least 2 hours, if
+`max-offline` is small).
+
+This watchdog is a powerful last resort in that it handles many different failures. If
+Artemis itself gets into a bad state the watchdog will eventually help to recover.
+Similarly, user programs can communicate with the Artemis service through the [Artemis
+package](https://pkg.toit.io/package/github.com%2Ftoitware%2Ftoit-artemis) and a bug
+in the user program could accidentally disallow Artemis to connect to the broker.