parent | nav_order |
---|---|
Development |
4 |
This document describes an HTTP-based protocol for orchestrating fleet-wide reboots, used by Zincati. It is modeled after a distributed counting semaphore with recursive locking and lock-ownership.
The FleetLock protocol is a request-response protocol where operations are always initiated by the client (i.e. Zincati). Each operation consists of a JSON payload sent as a POST request to the server.
At an high level, the client can perform two operations:
RecursiveLock
: try to reserve (lock) a slot for rebootingUnlockIfHeld
: try to release (unlock) a slot that it was previously holding
Semaphore locks are owned, so that only the client that created a lock can release it. All operations are recursive, meaning that multiple unbalanced lock/unlock actions by a client are allowed.
Clients start off in one of two states based on the system condition: "initialization" or "finalization". There are a number of states between "initialization" and "finalization" as well. In the "initialization" state, the client tries to release any reboot slot it may have previously held. A successful unlock operation means that the client can proceed into its "steady" state and look for further updates. When an update is found and locally staged, the client proceed into its "pre-reboot" state and tries to lock a reboot slot. A successful lock operation means that the client can proceed into its "finalization" state and finalize a pending update, then reboot.
All endpoints defined below are relative to a common deployment-specific base URL:
/v1/pre-reboot
: reserve/lock a reboot slot/v1/steady-state
: release/unlock a reboot slot
All POST requests contain well-formed JSON body according to the following schema:
client_params
(object, mandatory)id
(string, mandatory, non-empty): client identifier (e.g. node name or UUID)group
(string, mandatory, non-empty): reboot-group of the client
Client ID is a case-sensitive textual label that uniquely identifies a lock holder. It is generated and persisted by each client.
Client group is a mandatory textual label, conforming to the regexp ^[a-zA-Z0-9.-]+$
. This labels can be configured on each client. A server SHOULD check this value and MAY use it to provide multiple reboot buckets (sorting a fleet of nodes into reboot tiers).
By default, Zincati uses the group name "default
" unless explicitly configured otherwise.
Locking and unlocking requests must contain a fleet-lock-protocol
header with a fixed value of true
to ensure that the actual request was directly intended and not a part of unintentional redirection.
If the operation is succesful, a 200 status code is returned. Every other code is considered as a failed operation.
A client with UUID c988d2509fdf4cdcbed39037c56406fb
and group workers
can try to acquire a reboot slot from https://example.com/base
in a way which is conceptually similar to the following:
Request body:
{
"client_params": {
"group": "workers",
"id": "c988d2509fdf5cdcbed39037c56406fb"
}
}
POST request:
curl -H "fleet-lock-protocol: true" -d @body.json http://example.com/base/v1/pre-reboot
Errors on the service endpoints SHOULD be returned to the client as JSON objects, with a 4xx or 5xx HTTP status code. Error values carry a type-identifier and a textual description, according to the following schema:
Key | Optional | Description |
---|---|---|
kind | required | error type identifier, as a non-empty JSON string |
value | required | human-friendly error description, as a non-empty JSON string |
This allows clients to show more specific error details to cluster administrators, instead of generic HTTP errors.
For example, an error value like the following could be returned on /v1/pre-reboot
when all available slots are already in use:
{
"kind": "failed_lock_semaphore_full",
"value": "semaphore currently full, all slots are locked already"
}
Zincati will log this error using the content of value
, and it will track the kind
label in metrics.
A server MUST ensure that possible values for kind
have a bounded/small cardinality.