Enhanced checks architecture #29

Merged
blakewatters merged 12 commits into master from feature/enhanced-checks on Aug 30, 2020
Conversation

blakewatters (Contributor) commented Aug 26, 2020

This PR contains a revised and expanded implementation of the servo check functionality.

Background context (skip down if you can't be bothered)

For the uninitiated, a Check object represents a runtime verification that a particular dependency is available or that configuration is correct. Examples include "Can I talk to Redis?", "Does my service account in Kubernetes have sufficient permissions?", and "Are my Prometheus queries well-formed and returning results?".

As I have been building out Opsani Dev and doing a ton of setup and tear-down work, I have become increasingly convinced that checks represent a key driver of deployment efficiency and user experience. The most difficult aspects of getting a deployment live are gathering information (a social problem) and verifying requirements and configuration (a technical problem). Checks provide a workflow and feedback loop that enables us to move rapidly between data gathering and verification in a sustainable and repeatable way.

The original check implementation provided a dead simple mechanism for servo connectors to run a self-survey and return a pass/fail value with an accompanying descriptive message. This quickly proved to be insufficient because trying to aggregate all the dimensions of connector status into a single boolean and string description became incomprehensible.

The next implementation expanded from a single check value into a list of checks, enabling checks to become atomic and better model the internal realities of connectors that may be partially working. This model carried us for quite a while.

As the collection of connectors continued to grow and configuration became more complex, cracks started to form in the list-of-checks approach. For one thing, the check methods started to grow very large by virtue of the amount of boilerplate code necessary to run checks -- it typically requires rescuing and introspecting exceptions, building a collection of object literals, and then actually implementing the logic.

But then the real problems started to show. Because checks were written linearly within the check event handler, failures that occur deep in the list of checks would not surface until everything before them had completed. In some situations, this is just annoying latency that slows down the feedback loop. But in others, where you have long-running check implementations, it is a deal-breaker. For example, when implementing checks for Kubernetes there is a series of things that you want to run, starting from seeing if you can connect to the API server and then scaling up to "Can I deploy a canary?", "Can I scale this service to the maximum replica count?", "Can a Pod I create pull an image?", "Will the scheduler schedule a Pod with the lowest and highest guardrails?", etc. All of these operations are potentially long-running. This creates developer and operator tension (in this case me being mad at myself) because on the one hand you want completeness and correctness, so you want to be exhaustive in the things you check, while on the other you want a high-throughput workflow that focuses only on the problems so you can get shit done.

After pondering these challenges for a while and sketching out a number of designs, I finally landed on something that I think is really solid.

Checks 2.0

Features:

  • Create check from method
  • Create multiple checks in a class
  • Create checks from an iterable (e.g. a list)
  • Required checks
  • Check metadata
  • Filtering checks
  • Halting runs
  • Instrumentation
  • Enhanced CLI
  • Checkable Protocol
  • Safe and productive by default

Creating a Check

Checks can now be created with a decorator:

@check("Something...")
def check_something() -> None:
  ...

The decorator transforms the function into a method that returns a Check object. When called, it invokes the original implementation and determines success or failure based on the return value:

@check("Success")
def check_success() -> bool:
  return True # the check succeeded

@check("Failure")
def check_failure() -> bool:
  return False # failed

@check("Fail by exception")
def check_exception() -> None:
  raise RuntimeError("Something went wrong")

Exceptions are guarded for you. Write the shortest code possible that can check the condition.

You can also return a message that will be displayed in the CLI (more on this later):

@check("Message")
def check_message() -> str:
  return "Success message"

or return a bool and message to do both at once:

@check("Tuple outcome")
def check_tuple() -> Tuple[bool, str]:
  return (True, "Returning a tuple value works fine")

A check that returns None and doesn't raise is a success:

@check("None")
def check_none() -> None:
  print("Whatever I do here is a success unless I raise an error.")

Check metadata

Checks can be enriched with metadata:

@check("Metadata",
    description="Include a longer detailed description here...",
    id="metadata",
    tags=("fast", "low_priority"),
)
def check_metadata() -> None:
  ...

Metadata comes into play a bit later. But for now, keep in mind that the id is a short unique identifier that will be auto-assigned if unspecified, and tags is a set of lightweight descriptors about check behavior and context.

Creating Checks

Checking several conditions in one connector can be verbose even with the decorator:

class SomeConnector(BaseConnector):
    @event()
    async def check(self) -> List[Check]:
        @check("one")
        def check_one() -> None:
            ...
        
        @check("two")
        def check_two() -> None:
            ...
        
        @check("three")
        def check_three() -> None:
            ...
        
        return [check_one(), check_two(), check_three()]

but more importantly, there is no way to work with the collection. It's an all-or-nothing operation where every check is run and returned each time you call servo check.

We can do better on both fronts:

class SomeChecks(BaseChecks):
    @check("one")
    async def check_one(self) -> None:
        ...
    
    @check("two")
    async def check_two(self) -> None:
        ...
    
    @check("three")
    async def check_three(self) -> None:
        ...

class SomeConnector(BaseConnector):
    @event()
    async def check(self) -> List[Check]:
        return await SomeChecks.run(self.config)

The checks are now encapsulated into a standalone class that can be tested in isolation. The check event handler is now nice and tidy.
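Encapsulation also means the checks can be exercised directly in a test without standing up the connector. A minimal pytest-style sketch (assuming pytest-asyncio and a config fixture supplying the connector configuration; the success attribute is an assumption):

import pytest

@pytest.mark.asyncio
async def test_checks_pass(config) -> None:
    checks = await SomeChecks.run(config)
    assert all(check.success for check in checks)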

The BaseChecks class has some interesting capabilities. It enforces a policy that all instance methods are either prefixed with check_ and return a Check object, or are designated as helper methods by a leading underscore:

class ValidExampleChecks(BaseChecks):
    @check("valid")
    async def check_valid(self) -> None:
        ...
    
    def _reverse_string(self, input: str) -> str:
        return input[::-1]

class InvalidExampleChecks(BaseChecks):
    # Forgot to decorate -- wrong return value
    async def check_valid(self) -> None:
        ...
    
    def not_a_check(self) -> None:
        ...
    
    @check("not checkable return value")
    async def check_invalid_return(self) -> int:
        # Cannot be coerced into a check result
        123

Checks are always executed in method definition order (or top-to-bottom if you prefer). This becomes important in a second.

Required checks

Not all checks are created equal. Some checks, upon failure, imply that all subsequent checks will implicitly fail as well. Such checks can be described as required.

Consider the example of implementing checks for Kubernetes. The very first thing that it makes sense to do is check if you can connect to the API server (or run kubectl in a subprocess). If this check fails, then it makes zero sense to even attempt to check if you can create a Pod, read Deployments, have the required secrets, etc.

To handle these cases, we can combine the notion of a required check with the guarantee of checks executing in method definition order to express these relationships between checks:

class KubernetesChecks(BaseChecks):
    @check("API connectivity", required=True)
    def check_api(self) -> None:
        raise RuntimeError("can't reach API")
    
    @check("read namespace")
    def check_read_namespace(self) -> None:
        ...
    
    @check("has running Pods")
    def check_pods(self) -> None:
        ...
    
    @check("read deployments", required=True)
    def check_read_deployments(self) -> None:
        raise RuntimeError("can't read Deployments")
    
    @check("containers have resource limits")
    def check_resource_limits(self) -> None:
        ...

In this example, we have two required checks that act as circuit breakers to halt execution upon failure. If check_api fails, then no other checks will be run (more on this in a minute) and you will get a single error to debug. If check_api succeeds but check_read_deployments fails, then the resource limits won't be checked, because if you can't read the Deployment you can't get to its containers and their resource requests/limits values.

New event handler

The check metadata mentioned earlier combines with required checks and the execution order guarantee to provide some very nice capabilities for controlling check execution.

To support these enhancements, the method signature of the check event handler has changed:

class NewEventConnector(BaseConnector):
    @on_event()
    async def check(self, 
        filter_: Optional[Filter] = None, 
        halt_on: HaltOnFailed = HaltOnFailed.requirement
    ) -> List[Check]:
        ...

There are a few things going on here. We have two new parameters: filter_ and halt_on. Let's look at them one at a time.

Filtering checks

The filter_ argument is an instance of servo.checks.Filter which looks like this (edited down for brevity and clarity):

class Filter(BaseModel):
    name: Union[None, str, Sequence[str], Pattern[str]] = None
    id: Union[None, str, Sequence[str], Pattern[str]] = None
    tags: Optional[Set[str]] = None

    def matches(self, check: Check) -> bool:
        ...

These are the same attributes discussed earlier in the check metadata section. The filter matches against checks using AND semantics (all constraints must be satisfied for a match to occur).

The name and id attributes can be compared against an exact value (type str), a set of possible values (type Sequence[str], which includes lists, sets, and tuples of strings), or evaluated against a regular expression pattern (type Pattern[str]). Values are compared case-sensitively. id values are always lowercase alphanumeric characters or _.

Tags are evaluated with set intersection semantics (the constraint is satisfied if the check has any tags in common with the filter).

A value of None always evaluates positively for the particular constraint.
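Putting these semantics together, here are a few illustrative filters (a sketch; the compiled-regex form corresponds to the /pattern/ CLI syntax shown later):

import re
from servo.checks import Filter

Filter(name="API connectivity")        # exact name match
Filter(id=["metadata", "api"])         # any of several ids
Filter(name=re.compile(r".*exec.+"))   # regex match against names
Filter(tags={"fast"})                  # any tag in common with the check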

Halting check execution

Depending on what you are doing, it can be desirable to handle failing checks differently. You may wish to fail fast to identify a blocked requirement, or you may want to run every check and get a full picture of how broken your setup is.

This is where the halt_on parameter comes in. halt_on is a value of the HaltOnFailed enumeration which looks like:

class HaltOnFailed(str, enum.Enum):
    """HaltOnFailed is an enumeration that describes how to handle check failures.
    """

    requirement = "requirement"
    """Halt running when a required check has failed.
    """

    check = "check"
    """Halt running when any check has failed.
    """

    never = "never"
    """Never halt running regardless of check failures.
    """

Selecting the appropriate halt_on value lets you decide how much feedback you want to gather in a given check run.
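For example, reusing the KubernetesChecks class from above (a sketch, following the run signature used elsewhere in this PR):

# audit mode: run everything and collect the full list of failures
checks = await KubernetesChecks.run(config, halt_on=HaltOnFailed.never)

# fail fast: stop at the first failure of any kind
checks = await KubernetesChecks.run(config, halt_on=HaltOnFailed.check)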

CLI upgrades

All of the above changes are pretty hard to utilize without an interface. As such, the servo CLI has been upgraded with some new tricks:

$ servo check --help
Usage: servo check [OPTIONS] [CONNECTORS]...

  Check that the servo is ready to run

Options:
  [CONNECTORS]...                 Connectors to check
  -n, --name TEXT                 Filter by name  [default: False]
  -i, --id TEXT                   Filter by ID  [default: False]
  -t, --tag TEXT                  Filter by tag  [default: False]
  -h, --halt-on-failed [requirement|check|never]
                                  Halt running checks on a failure condition
                                  [default: requirement]

  -v, --verbose                   Display verbose output  [default: False]
  -q, --quiet                     Do not echo generated output to stdout
                                  [default: False]

  --help                          Show this message and exit.

Results get aggregated and summarized by connector:

CONNECTOR    STATUS    ERRORS
servo        X FAILED  (1/1) Opsani API connectivity: Response status code: 500
prometheus   √ PASSED
vegeta       √ PASSED

We can run a check by name:

$ servo check --name "Check throughput"
CONNECTOR    STATUS    ERRORS
prometheus   √ PASSED

Or a set of IDs comma separated:

$ servo check -i "fae728a9, 09b17996" -v
CONNECTOR    CHECK             ID        TAGS    STATUS    MESSAGE
prometheus   Check throughput  fae728a9  -       √ PASSED  returned 1 TimeSeries readings
             Check error_rate  09b17996  -       √ PASSED  returned 0 TimeSeries readings

Or every check that contains "exec" (strings in slashes "/like this/" are compiled as regex):

$ servo check -n "/.*exec.+/" -v
CONNECTOR    CHECK             ID               TAGS    STATUS    MESSAGE
vegeta       Vegeta execution  check_execution  -       √ PASSED  Vegeta exit code: 0

And set the halting behavior in the face of failures:

$ servo check --halt-on-failed=requirement prometheus

CONNECTOR    STATUS    ERRORS
prometheus   X FAILED  (1/2) Check throughput: caught exception: ConnectError(OSError("Multiple exceptions: [Errno 61] Connect call failed ('::1', 9091, 0, 0), [Errno 61] Connect call failed ('127.0.0.1', 9091)"))
                       (2/2) Check error_rate: caught exception: ConnectError(OSError("Multiple exceptions: [Errno 61] Connect call failed ('::1', 9091, 0, 0), [Errno 61] Connect call failed ('127.0.0.1', 9091)"))

$ servo check --halt-on-failed=check prometheus
CONNECTOR    STATUS    ERRORS
prometheus   X FAILED  (1/1) Check throughput: caught exception: ConnectError(OSError("Multiple exceptions: [Errno 61] Connect call failed ('::1', 9091, 0, 0), [Errno 61] Connect call failed ('127.0.0.1', 9091)"))

Creating Checks from an Iterable

Sometimes you have a collection of homogeneous items that need to be checked. A common example is a list of queries for a metrics provider like Prometheus:

prometheus:
  base_url: http://localhost:9091/
  metrics:
  - name: throughput
    query: rate(http_requests_total[1s])[3m]
    unit: rps
  - name: error_rate
    query: rate(errors)
    unit: '%'
  - name: go_threads
    query: gc_info
    unit: threads
  - name: go_memstats_alloc_bytes
    query: go_memstats_alloc_bytes
    unit: bytes

We don't want to handwrite a method for each of these, and if we just loop over the collection, we can't use filters to focus on the failure cases -- making debugging slower and noisier.

What we want is the ability to synthesize a checks class without having to write the code by hand:

class PrometheusConnector(BaseConnector):
    config: PrometheusConfiguration

    @on_event()
    async def check(self, 
        filter_: Optional[Filter] = None, 
        halt_on: HaltOnFailed = HaltOnFailed.requirement
    ) -> List[Check]:
        start, end = datetime.now() - timedelta(minutes=10), datetime.now()        
        async def check_query(metric: PrometheusMetric) -> str:
            result = await self._query_prom(metric, start, end)
            return f"returned {len(result)} TimeSeries readings"

        # wrap all queries into checks and verify that they work
        PrometheusChecks = create_checks_from_iterable(check_query, self.config.metrics)
        return await PrometheusChecks.run(self.config, filter_, halt_on=halt_on)

Here the check_query inner function is "checkified" just like the earlier examples that used the @check decorator, and the self.config.metrics collection is treated like the list of check methods in a BaseChecks subclass.

The call to create_checks_from_iterable returns a new dynamically created subclass of BaseChecks with check_ instance methods attached for every item in the self.config.metrics collection.

The PrometheusChecks class behaves exactly like a manually coded checks subclass and can be filtered, etc.

Checkable Protocol

Protocols are a relatively recent addition to Python that support structural subtyping. This is basically the idea that a class does not have to explicitly inherit from another class in order to be considered its subtype. It is an extension of the concept of duck typing in dynamic languages into the type system (sometimes called "static duck typing", see [PEP 544](https://www.python.org/dev/peps/pep-0544/)).

The servo.checks.Checkable protocol defines a single method called __check__ that returns a Check object. The protocol is used extensively in the internals but can be used as a public API to provide check implementations for arbitrary objects.
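Because the protocol is structural, any object that implements __check__ conforms without inheriting from anything. A minimal sketch (the Check constructor arguments here are assumptions):

from servo.checks import Check

class QueryHealth:
    """Structurally conforms to Checkable -- no inheritance required."""

    def __init__(self, query: str) -> None:
        self.query = query

    def __check__(self) -> Check:
        # package this object's state up as a check result
        return Check(name=f"Query: {self.query}", success=True)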

Safe and productive by default

The checks subsystem works really hard to make the easy thing delightful and the wrong thing impossible. There is extensive enforcement of type hint contracts to avoid typo bugs. The code is extensively documented and covered with tests.

Open Questions

Support async checks in parallel?

The predictable execution path of the revised implementation opens the door to executing more checks in parallel. Right now check events for connectors are parallelized while individual checks are executed serially. Required checks effectively partition the checks into groups that could run in parallel, since you know that members of a group have no interdependencies and their parent requirements have already been met.
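Sketching the idea: each group of independent checks between required-check boundaries could be dispatched concurrently while the groups themselves stay ordered. A rough sketch using asyncio.gather (run_partitioned is a hypothetical helper, not part of this PR):

import asyncio
from typing import Awaitable, Callable, List
from servo.checks import Check

async def run_partitioned(groups: List[List[Callable[[], Awaitable[Check]]]]) -> List[Check]:
    """Run groups sequentially (preserving required-check gating) while
    running the independent checks within each group concurrently."""
    results: List[Check] = []
    for group in groups:
        results.extend(await asyncio.gather(*(check() for check in group)))
    return results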

Does it make sense to promote checks to a top-level concept like configuration?

class TopLevelExample(BaseConnector):
    config: ExampleConfiguration
    checks: ExampleChecks

The benefit is eliminating additional boilerplate, with the trade-off that the design becomes de facto more rigid and magical, as most folks will never realize that checks are built on top of eventing and can be directly customized with an event handler method.

But if I have now covered 80%+ of the cases, then most folks would never even need to know.

Blake Watters added 7 commits August 26, 2020 07:05
* Added new supporting classes for building and running checks.
* BaseChecks provides simple means for building collections of checks.
* Expanded checks metadata with timing, exceptions, etc.
* Check.run provides one-liner ad-hoc checks.
* Check objects now have an id and tags.
* Checks runner supports filtering by name, id, and tags.
* check decorator transforms methods into check runners (handles common behaviors).
@blakewatters blakewatters changed the title Feature/enhanced checks Enhanced checks architecture Aug 29, 2020
@blakewatters blakewatters merged commit b68b65d into master Aug 30, 2020
@blakewatters blakewatters deleted the feature/enhanced-checks branch September 1, 2020 17:54