This PR contains a revised and expanded implementation of the `servo check` functionality.

Background context (skip down if you can't be bothered)
For the uninitiated, a `Check` object represents a runtime verification that a particular dependency is available or that configuration is correct. Examples include "Can I talk to Redis?", "Does my service account in Kubernetes have sufficient permissions?", and "Are my Prometheus queries well-formed and returning results?"

As I have been building out Opsani Dev and doing a ton of setup and tear-down work, I have become increasingly convinced that checks represent a key driver of deployment efficiency and user experience. The most difficult aspects of getting a deployment live are gathering information (a social problem) and verifying requirements and configuration (a technical problem). Checks provide a workflow and feedback loop that enables us to move rapidly between data gathering and verification in a sustainable and repeatable way.
The original check implementation provided a dead simple mechanism for servo connectors to run a self-survey and return a pass/fail value with an accompanying descriptive message. This quickly proved to be insufficient because trying to aggregate all the dimensions of connector status into a single boolean and string description became incomprehensible.
The next implementation expanded from a single check value into a list of checks, enabling checks to become atomic and better model the internal realities of connectors that may be partially working. This model carried us for quite a while.
As the collection of connectors continued to grow and configuration became more complex, cracks started to form in the list-of-checks approach. For one thing, the check methods started to grow very large by virtue of the amount of boilerplate code necessary to run checks -- it typically requires rescuing and introspecting exceptions, building a collection of object literals, and then actually implementing the logic.
But then, the real problems started to show. Because checks were written linearly within the `check` event handler, checks deep in the list would not execute until everything before them had completed, so failures surfaced late. For some situations, this is just some annoying latency that slows down the feedback loop. But in others, where you have long-running check implementations, it is a deal-breaker. For example, when implementing checks for Kubernetes there are a series of things that you want to run, starting from seeing if you can connect to the API server and then scaling up to "Can I deploy a canary?", "Can I scale this service to the maximum replica count?", "Can a Pod I create pull an image?", "Will the scheduler schedule a Pod with the lowest and highest guardrails?", etc. All of these operations are potentially long-running. This creates developer and operator tension (in this case me being mad at myself) because on the one hand you want to focus on completeness and correctness, so you want to be exhaustive in the things you check, while on the other you want a high-throughput workflow to focus only on the problems and get shit done.

After pondering these challenges for a while and sketching out a number of designs, I finally landed on something that I think is really solid.
Checks 2.0
Features:
Creating a Check
Checks can now be created with a decorator:
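(A sketch: the decorator import path and its arguments are illustrative rather than copied from the diff.)

```python
import socket

from servo.checks import check  # import path illustrative

class RedisChecksExample:
    @check("Connect to Redis")
    def check_connection(self) -> None:
        # Raising here marks the check as failed; returning normally passes.
        socket.create_connection(("localhost", 6379), timeout=5).close()
```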
The decorator transforms the function into a method that returns a `Check` object. When called, it invokes the original method implementation and determines success/failure based on the return type. Exceptions are guarded for you. Write the shortest code possible that can check the condition.
You can also return a message that will be displayed in the CLI (more on this later), or return a bool and message to do both at once. A check that returns `None` and doesn't raise is a success.
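For example, putting the return-type handling together in one sketch (the names, messages, and import path are illustrative):

```python
import os
from typing import Tuple

from servo.checks import check  # import path illustrative

class ReturnTypeExamples:
    @check("Config file exists")
    def check_bool(self) -> bool:
        # a bool maps directly to success/failure
        return os.path.exists("servo.yaml")

    @check("Measure round trip latency")
    def check_message(self) -> str:
        # a str return is a success carrying a message for the CLI
        return "round trip completed in 31ms"

    @check("Query Prometheus")
    def check_both(self) -> Tuple[bool, str]:
        # a (bool, str) tuple sets success/failure and the message at once
        return (True, "query returned 3 results")

    @check("API token is present")
    def check_none(self) -> None:
        # returning None without raising is a success
        assert "OPSANI_TOKEN" in os.environ
```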
Check metadata

Checks can be enriched with metadata:
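(Sketch; the keyword names follow the description below and may not match the final spelling exactly.)

```python
from servo.checks import check  # import path illustrative

class MetadataExample:
    @check(
        "Verify Prometheus queries",   # human readable name
        id="queries",                  # short unique identifier, auto-assigned if omitted
        tags={"prometheus", "slow"},   # lightweight descriptors for filtering
    )
    def check_queries(self) -> None:
        ...
```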
Metadata comes into play a bit later. But for now, keep in mind that the `id` is a short unique identifier that will be auto-assigned if unspecified, and `tags` is a set of lightweight descriptors about check behavior and context.

Creating Checks
Checking several conditions in one connector can be verbose even with the decorator:
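(Sketch; the connector scaffolding and import paths are illustrative.)

```python
from typing import List

from servo.checks import Check, check  # import path illustrative

class VerboseConnector:  # connector base class omitted
    @check("Connect to Prometheus")
    def _check_connection(self) -> None:
        ...

    @check("Queries are well-formed")
    def _check_queries(self) -> None:
        ...

    @check("Targets are being scraped")
    def _check_targets(self) -> None:
        ...

    # The check event handler calls every check explicitly and always
    # returns the whole list.
    def check(self) -> List[Check]:
        return [
            self._check_connection(),
            self._check_queries(),
            self._check_targets(),
        ]
```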
but more importantly, there is no way to work with the collection. It's an all-or-nothing operation where all the checks are run and returned every time you call `servo check`.

We can do better on both fronts:
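(Sketch; the `run()` entry point on the checks class is an assumption about the runner API, and the import paths are illustrative.)

```python
from typing import List

from servo.checks import BaseChecks, Check, check  # import paths illustrative

class AppChecks(BaseChecks):
    @check("Connect to Prometheus")
    async def check_connection(self) -> None:
        ...

    @check("Queries are well-formed")
    async def check_queries(self) -> None:
        ...

    # Helpers start with an underscore and are exempt from the check_ policy.
    def _base_url(self) -> str:
        return "http://prometheus:9090"

class AppConnector:  # connector base class omitted
    async def check(self) -> List[Check]:
        # The event handler just delegates to the checks class.
        return await AppChecks.run(self.config)
```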
The checks are now encapsulated into a standalone class that can be tested in isolation. The check event handler is now nice and tidy.
The `BaseChecks` class has some interesting capabilities. It enforces a policy that all instance methods are prefixed with `check_` and return a `Check` object, or are designated as helper methods by starting with an underscore.

Checks are always executed in method definition order (or top-to-bottom if you prefer).
This becomes important in a second.
Required checks
Not all checks are created equal. There are some checks that, upon failure, imply that all following checks will already have implicitly failed. Such checks can be described as required.
Consider the example of implementing checks for Kubernetes. The very first thing that it makes sense to do is check if you can connect to the API server (or run `kubectl` in a subprocess). If this check fails, then it makes zero sense to even attempt to check if you can create a Pod, read Deployments, have the required secrets, etc.

To handle these cases, we can combine the notion of a required check with the guarantee of checks executing in method definition order to express these relationships between checks:
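(Sketch; `required=True` is shorthand for the flag, whatever its final spelling.)

```python
from servo.checks import BaseChecks, check  # import paths illustrative

class KubernetesChecks(BaseChecks):
    @check("Connect to the Kubernetes API", required=True)
    async def check_api(self) -> None:
        ...

    @check("Read Deployments", required=True)
    async def check_read_deployments(self) -> None:
        ...

    @check("Resource requests and limits are configured")
    async def check_resource_requirements(self) -> None:
        ...
```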
In this example, we have two required checks that act as circuit breakers to halt execution upon failure. If `check_api` fails, then no other checks will be run (more on this in a minute) and you will get a single error to debug. If `check_api` succeeds but `check_read_deployments` fails, then the resource limits won't be checked, because if you can't see the Deployment you can't get to its containers and the requests/limits values.

New event handler
The check metadata mentioned earlier combines with required checks and the execution order guarantee to provide some very nice capabilities for controlling check execution.
To support these enhancements, the method signature of the `check` event handler has changed:
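(Sketch; the parameter names come from this PR, while the defaults and the delegation are illustrative, continuing the Kubernetes example above.)

```python
from typing import List, Optional

from servo.checks import Check, Filter, HaltOnFailed  # import paths illustrative

class KubernetesConnector:  # connector base class omitted
    async def check(
        self,
        filter_: Optional[Filter] = None,
        halt_on: Optional[HaltOnFailed] = None,
    ) -> List[Check]:
        # Forward the filtering and halting behavior to the checks class
        # defined in the previous example.
        return await KubernetesChecks.run(self.config, filter_, halt_on=halt_on)
```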
There are a few things going on here. We have two new positional parameters: `filter_` and `halt_on`. Let's look at these one at a time.

Filtering checks
The `filter_` argument is an instance of `servo.checks.Filter`, which looks like this (edited down for brevity and clarity):
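(Reconstructed sketch: the field types follow the description below, and the pydantic base class is an assumption.)

```python
from typing import Optional, Pattern, Sequence, Set, Union

import pydantic

class Filter(pydantic.BaseModel):
    # None means "match any value" for that constraint.
    name: Union[None, str, Sequence[str], Pattern[str]] = None
    id: Union[None, str, Sequence[str], Pattern[str]] = None
    tags: Optional[Set[str]] = None
```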
These are the same attributes discussed earlier in the check metadata section. The filter matches against checks using AND semantics (all constraints must be satisfied for a match to occur).

The `name` and `id` attributes can be compared against an exact value (type `str`), a set of possible values (type `Sequence[str]`, which includes lists, sets, and tuples of strings), or evaluated against a regular expression pattern (type `Pattern[str]`). Values are compared case-sensitively. `id` values are always lowercase alphanumeric characters or `_`.

Tags are evaluated with set intersection semantics (the constraint is satisfied if the check has any tags in common with the filter).
A value of `None` always evaluates positively for the particular constraint.

Halting check execution
Depending on what you are doing, it can be desirable to handle failing checks differently. You may wish to fail fast to identify a blocked requirement, or you may want to run every check and get a sense of how broken your setup is overall.
This is where the `halt_on` parameter comes in. `halt_on` is a value of the `HaltOnFailed` enumeration, which looks like:
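(Sketch; the member names are placeholders for the behaviors described and may be spelled differently in the code.)

```python
import enum

class HaltOnFailed(enum.Enum):
    requirement = "requirement"  # halt only when a required check fails
    check = "check"              # halt on the first failure of any kind
    never = "never"              # run every check regardless of failures
```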
Selecting the appropriate `halt_on` value lets you decide how much feedback you want to gather in a given check run.

CLI upgrades
All of the above changes are pretty hard to utilize without an interface. As such, the servo CLI has been upgraded with some new tricks:
Results get aggregated and summarized by connector:
We can run a check by name:
Or a set of IDs comma separated:
Or every check that contains "exec" (strings in slashes "/like this/" are compiled as regex):
And set the halting behavior in the face of failures:
Creating Checks from an Iterable
Sometimes you have a collection of homogeneous items that need to be checked. A common example is a list of queries for a metrics provider like Prometheus.
We don't want to handwrite a method for each of these, and if we just loop over them, we can't use filters to focus on the failure cases -- making debugging slower and noisier.
What we want is the ability to synthesize a checks class without having to write the code by hand:
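(Sketch; the exact signature of `create_checks_from_iterable`, the `_query` helper, and the `run()` call are illustrative. `check_query` and `self.config.metrics` are the names discussed below.)

```python
from typing import List

from servo.checks import Check, create_checks_from_iterable  # import path illustrative

class PrometheusConnector:  # connector base class omitted
    async def check(self) -> List[Check]:  # filter_/halt_on parameters elided
        def check_query(metric) -> str:
            # Called once per item in self.config.metrics, exactly as if it
            # had been written as a decorated check_ method by hand.
            results = self._query(metric)
            return f"query returned {len(results)} results"

        # Synthesize a BaseChecks subclass with a check_ method per metric.
        PrometheusChecks = create_checks_from_iterable(check_query, self.config.metrics)
        return await PrometheusChecks.run(self.config)

    def _query(self, metric) -> list:
        ...  # execute the metric's query against Prometheus
```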
Here the `check_query` inner function is going to be used just like the earlier examples that were "checkified" via the `@check` decorator, and the `self.config.metrics` collection is going to be treated like a list of methods in a `BaseChecks` subclass.

The call to `create_checks_from_iterable` returns a new dynamically created subclass of `BaseChecks` with `check_` instance methods attached for every item in the `self.config.metrics` collection.

The `PrometheusChecks` class behaves exactly like a manually coded checks subclass and can be filtered, etc.

Checkable Protocol
Protocols are a relatively recent addition to Python that support structural subtyping. This is basically the idea that a class does not have to explicitly inherit from another class in order to be considered its subtype. It is an extension of the concept of duck typing in dynamic languages to the typing system (sometimes called "Static Duck Typing", see [PEP 544](https://www.python.org/dev/peps/pep-0544/)).
The `servo.checks.Checkable` protocol defines a single method called `__check__` that returns a `Check` object. The protocol is used extensively in the internals but can be used as a public API to provide check implementations for arbitrary objects.
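For instance, an arbitrary object can participate simply by implementing the method (the `Check` constructor fields shown here are illustrative):

```python
from servo.checks import Check  # import path illustrative

class PrometheusQuery:
    """An arbitrary object that knows how to check itself."""

    def __init__(self, query: str) -> None:
        self.query = query

    def __check__(self) -> Check:
        # Structurally satisfies the Checkable protocol -- no inheritance needed.
        return Check(name=f"Run query {self.query!r}", success=True)
```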
Safe and productive by default

The checks subsystem works really hard to make the easy thing delightful and the wrong thing impossible. There is extensive enforcement around type hint contracts to avoid typo bugs. The code is extensively documented and covered with tests.
Open Questions
Support async checks in parallel?
The predictable execution path of the revised implementation opens the door to executing more checks in parallel. Right now check events for connectors are parallelized while individual checks are executed serially. Required checks effectively let you partition groups of checks and run them in parallel since you know that they have no interdependencies and their parent dependencies have already been met.
Does it make sense to promote checks to a top-level concept like configuration?
The benefit is eliminating additional boilerplate, with the trade-off that the design becomes de facto more rigid and magical, as most folks will never realize that checks are built on top of eventing and can be directly customized with an event handler method.
But if I have now covered 80%+ of cases, then most folks would never even need to know.