Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reliability information. #1155

Merged
merged 2 commits into from
Sep 10, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions public/docs/fleet/index.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@ in the cloud — and the developer tooling to help orchestrate the devices.
possible to host your own broker, so all your data and code remains under your
control.

Artemis is designed to be [reliable](./reliability) and robust. Configuration is
treated as code, and none of the servers are critical. As long as
the configuration is stored in a reliable system (typically Git), Artemis
can recover from a full loss of all servers.

---------------------

# Installation
Expand Down
134 changes: 134 additions & 0 deletions public/docs/fleet/reliability.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
# Reliability

The most important job of Artemis is to keep devices updatable. Since updates
are delivered through the broker, it is critical that the device can reach the broker.

## Strategy

The Artemis service that is installed on the devices periodically downloads its
configuration from the broker.

If the device can't reach the broker, it will retry more and more aggressively.
First it will reduce the interval between
check-ins. At some point, it will turn off non-critical containers during connection
attempts. Eventually, it will even disable critical containers for a short period of time.

All of these measures don't help if the broker is not available anymore. In that case,
the device attempts to reach a recovery URL. This URL can provide a new broker
configuration, allowing the device to connect again.

Being able to contact the broker means that the device can fetch new configurations,
and, hopefully, also firmware. However, any new firmware isn't guaranteed to be able
to do the same. As such, Artemis only commits to a new firmware after it has
successfully connected to the broker with the new firmware.

As a last resort, Artemis also uses a watchdog. If the device can't connect to the
broker for a long time, the watchdog will trigger and reboot the device. This also
guards against the Artemis service itself getting into a bad state.

## Max offline

The frequency at which Artemis contacts the broker can be configured using the
`max-offline` setting in the pod specification. A value of `0s` means that the
device stays connected to the broker and polls it continuously (but at most
every 20 seconds). A value of `1h` means that the device can be offline for
up to an hour before it tries to reconnect.

Here is a pod specification with a `max-offline` value of `1h`:

```yaml
# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json

$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
name: my-pod
sdk-version: v2.0.0-alpha.160
artemis-version: v0.24.0
max-offline: 1h
firmware-envelope: x64-linux
containers: {}
```

The max-offline value can also be set using the `toit` CLI:

```bash
toit device -d DEVICE-ID set-max-offline 1h
```

This configuration change is not permanent and will be lost with the next firmware update.

## Status states

The Artemis service on the device keeps track of its connection status. It is either
green, yellow, orange or red. As of September 2024, the following heuristics are used
to determine the status:
- green: the device is within the `max-offline` window.
- yellow: the device is outside the `max-offline` window, but within a factor of 2. For
example, if `max-offline` is set to 1h, then the device is in yellow state if it hasn't
connected for more than 1h, but less than 2h.
- orange: the device is outside the `max-offline` window, but within a factor of 3.
- red: the device is outside the `max-offline` window, and has been for more than 3 times
the `max-offline` value.

The following actions (again, as of September 2024) are taken based on the status.

### Orange

When a device is in orange state, Artemis reduces the interval between reconnection
attempts by 50%. For example, if `max-offline` is set to 1h, then the device will
try to reconnect every 30 minutes.

Artemis also disables all non-critical containers during reconnection attempts.

In this state, Artemis always tries a random entry from the recovery-URLs list to
get a new broker configuration.

### Red

Devices that can't connect to the broker for a long time are put into red state.

The red state is a more extreme version of the orange state. Artemis will reduce the
reconnect interval by a factor of 4. For example, if `max-offline` is
set to 1h, then the device will try to reconnect every 15 minutes.

As for the orange state, Artemis disables all non-critical containers during
reconnection attempts. However, now Artemis also disables critical containers
15% of the time.

Contrary to the orange state, Artemis does not always contact the recovery URL.
In case the recovery-URL connection makes things worse Artemis only fetches
recovery URLs 20% of the time.

## Recovery URLs

Even though recovery URLs are baked into pods, they are stored as properties of
the fleet. We don't expect recovery URLs to change often, and having them in the
fleet configuration avoids duplication.

A recovery URL can be added to an existing fleet with the following command:

```bash
artemis fleet recovery add RECOVERY-URL
```

If the existing broker for a device doesn't exist anymore, then the device can
use the recovery URL to get a new broker configuration. The new broker configuration
can be generated by running `artemis fleet recovery export -o recovery.json` and
serving the resulting `recovery.json` file at the recovery URL.

We recommend to use a different domain as recovery URL. This way, the recovery URL
is not affected by the same issues as the main broker.

Note that recovery URLs don't need to be online until they are needed.

## Watchdogs

As mentioned earlier, Artemis uses watchdogs to monitor its own health. Given a
certain `max-offline` value, the watchdog will trigger if the device hasn't connected
to the broker for more than 5 times the `max-offline` value (or at least 2 hours, if
`max-offline` is small).

This watchdog is a powerful last resort in that it handles many different failures. If
Artemis itself gets into a bad state the watchdog will eventually help to recover.
Similarly, user programs can communicate with the Artemis service through the [Artemis
package](https://pkg.toit.io/package/github.com%2Ftoitware%2Ftoit-artemis) and a bug
in the user program could accidentally disallow Artemis to connect to the broker.
Loading