diff --git a/docs/includes/attributes.adoc b/docs/includes/attributes.adoc index f69a03ef1c1..1684198e8dd 100644 --- a/docs/includes/attributes.adoc +++ b/docs/includes/attributes.adoc @@ -225,6 +225,7 @@ endif::[] :security-provider-abac-base-url: {javadoc-base-url}/io.helidon.security.providers.abac :security-provider-google-login-base-url: {javadoc-base-url}/io.helidon.security.providers.google.login :security-provider-jwt-base-url: {javadoc-base-url}/io.helidon.security.providers.jwt +:nima-faulttolerance-javadoc-base-url: {javadoc-base-url}/io.helidon.nima.faulttolerance // 3rd party versioned URLs :jaeger-doc-base-url: https://www.jaegertracing.io/docs/{version-lib-jaeger} diff --git a/docs/nima/fault-tolerance.adoc b/docs/nima/fault-tolerance.adoc index 876ac8ee7ee..28da1dbc423 100644 --- a/docs/nima/fault-tolerance.adoc +++ b/docs/nima/fault-tolerance.adoc @@ -26,5 +26,280 @@ include::{rootdir}/includes/nima.adoc[] +== Contents + +- <> +- <> +- <> +- <> +- <> + +== Overview + +Helidon Nima Fault Tolerance support is inspired by +link:{microprofile-fault-tolerance-spec-url}[MicroProfile Fault Tolerance]. +The API defines the notion of a _fault handler_ that can be combined with other handlers to +improve application robustness. Handlers are created to manage error conditions (faults) +that may occur in real-world application environments. Examples include service restarts, +network delays, temporal infrastructure instabilities, etc. + +The interaction of multiple microservices bring some new challenges from distributed systems +that require careful planning. Faults in distributed systems should be compartmentalized +to avoid unnecessary service interruptions. For example, if comparable information can +be obtained from multiples sources, a user request _should not_ be denied when a subset +of these sources is unreachable or offline. Similarly, if a non-essential source has been +flagged as unreachable, an application should avoid continuous access to that source +as that would result in much higher response times. + +In order to tackle the most common types of application faults, the Helidon Nima Fault +Tolerance API provides support for circuit breakers, retries, timeouts, bulkheads and fallbacks. +In addition, the API makes it very easy to create and monitor asynchronous tasks that +do not require explicit creation and management of threads or executors. + +For more information, see +link:{nima-faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance Nima API Javadocs]. + +include::{rootdir}/includes/dependencies.adoc[] + +[source,xml] +---- + + io.helidon.nima.fault-tolerance + helidon-nima-fault-tolerance + +---- + +== API + +The Nima Fault Tolerance API is _blocking_ and based on the JDK's virtual thread model. +As a result, methods return _direct_ values instead of promises in the form of +`Single` or `Multi`. + +In the sections that follow, we shall briefly explore each of the constructs provided +by this API. + +=== Retries + +Temporal networking problems can sometimes be mitigated by simply retrying +a certain task. A `Retry` handler is created using a `RetryPolicy` that +indicates the number of retries, delay between retries, etc. + +[source,java] +---- +Retry retry = Retry.builder() + .retryPolicy(Retry.JitterRetryPolicy.builder() + .calls(3) + .delay(Duration.ofMillis(100)) + .build()) + .build(); +T result = retry.invoke(this::retryOnFailure); +---- + +The sample code above will retry calls to the supplier `this::retryOnFailure` +for up to 3 times with a 100 millisecond delay between them. + +NOTE: The return type of method `retryOnFailure` in the example above must +be some `T` and the parameter to the retry handler's `invoke` +method `Supplier`. + +If the call to the supplier provided completes exceptionally, it will be treated as +a failure and retried until the maximum number of attempts is reached; finer control +is possible by creating a retry policy and using methods such as +`applyOn(Class... classes)` and +`skipOn(Class... classes)` to control the exceptions +that must be retried and those that must be ignored. + +=== Timeouts + +A request to a service that is inaccessible or simply unavailable should be bounded +to ensure a certain quality of service and response time. Timeouts can be configured +to avoid excessive waiting times. In addition, a fallback action can be defined +if a timeout expires as we shall cover in the next section. + +The following is an example of using `Timeout`: +[source,java] +---- +T result = Timeout.create(Duration.ofMillis(10)) + .invoke(this::mayTakeVeryLong); +---- + +NOTE: Using a handler's `create` method is an alternative to using a builder that is +more convenient when default settings are acceptable. + +The example above monitors the call to method `mayTakeVeryLong` and reports a +`TimeoutException` if the execution takes more than 10 milliseconds to complete. + +=== Fallbacks + +A fallback to a _known_ result can sometimes be an alternative to +reporting an error. For example, if we are unable to access a service +we may fall back to the last result obtained from that service at an +earlier time. + +A `Fallback` instance is created by providing a function that takes a `Throwable` +and produces some `T` to be used when the intended method failed to return a value: + +[source,java] +---- +T result = Fallback.create(throwable -> lastKnownValue) + .invoke(this::mayFail); +---- + +This example calls the method `mayFail` and if it produces a `Throwable`, it maps +it to the last known value using the fallback handler. + +=== Circuit Breakers + +Failing to execute a certain task or to call another service repeatedly can have a direct +impact on application performance. It is often preferred to avoid calls to non-essential +services by simply preventing that logic to execute altogether. A circuit breaker can be +configured to monitor such calls and block attempts that are likely to fail, thus improving +overall performance. + +Circuit breakers start in a _closed_ state, letting calls to proceed normally; after +detecting a certain number of errors during a pre-defined processing window, they can _open_ to +prevent additional failures. After a circuit has been opened, it can transition +first to a _half-open_ state before finally transitioning back to a closed state. +The use of an intermediate state (half-open) +makes transitions from open to close more progressive, and prevents a circuit breaker +from eagerly transitioning to states without considering sufficient observations. + +NOTE: Any failure while a circuit breaker is in half-open state will immediately +cause it to transition back to an open state. + +Consider the following example in which `this::mayFail` is monitored by a +circuit breaker: +[source,java] +---- +CircuitBreaker breaker = CircuitBreaker.builder() + .volume(10) + .errorRatio(30) + .delay(Duration.ofMillis(200)) + .successThreshold(2) + .build(); +T result = breaker.invoke(this::mayFail); +---- + +The circuit breaker in this example defines a processing window of size 10, an error +ratio of 30%, a duration to transition to half-open state of 200 milliseconds, and +a success threshold to transition from half-open to closed state of 2 observations. +It follows that, + +* After completing the processing window, if at least 3 errors are detected, the +circuit breaker will transition to the open state, thus blocking the execution +of any subsequent calls. + +* After 200 millis, the circuit breaker will transition back to half-open and +allow calls to proceed again. + +* If the next two calls after transitioning to half-open are successful, the +circuit breaker will transition to closed state; otherwise, it will +transition back to open state, waiting for another 200 milliseconds +before attempting to transition to half-open again. + +A circuit breaker will throw a +`io.helidon.nima.faulttolerance.CircuitBreakerOpenException` +if an attempt to make an invocation takes place while it is in open state. + +=== Bulkheads + +Concurrent access to certain components may need to be limited to avoid +excessive use of resources. For example, if an invocation that opens +a network connection is allowed to execute concurrently without +any restriction, and if the service on the other end is slow responding, +it is possible for the rate at which network connections are opened +to exceed the maximum number of connections allowed. Faults of this +type can be prevented by guarding these invocations using a bulkhead. + +NOTE: The origin of the name _bulkhead_ comes from the partitions that +comprise a ship's hull. If some partition is somehow compromised +(e.g., filled with water) it can be isolated in a manner not to +affect the rest of the hull. + +A waiting queue can be associated with a bulkhead to handle tasks +that are submitted when the bulkhead is already at full capacity. + +[source,java] +---- +Bulkhead bulkhead = Bulkhead.builder() + .limit(3) + .queueLength(5) + .build(); +T result = bulkhead.invoke(this::usesResources); +---- + +This example creates a bulkhead that limits concurrent execution +to `this:usesResources` to at most 3, and with a queue of size 5. The +bulkhead will report a `io.helidon.nima.faulttolerance.BulkheadException` if unable +to proceed with the call: either due to the limit being reached or the queue +being at maximum capacity. + +=== Asynchronous + +Asynchronous tasks can be created or forked by using an `Async` instance. A supplier of type +`T` is provided as the argument when invoking this handler. For example: + +[source,java] +---- +CompletableFuture cf = Async.create().invoke(Thread::currentThread)); +cf.thenAccept(t -> System.out.println("Async task executed in thread " + t)); +---- + +The supplier `() -> Thread.currentThread()` is executed in a new thread and +the value it produces printed by the consumer and passed to `thenAccept`. + +By default, asynchronous tasks are executed using a _new virtual thread per +task_ based on the `ExecutorService` defined in +`io.helidon.nima.faulttolerance.FaultTolerance` and +configurable by an application. Alternatively, an `ExecutorService` can be specified +when building a non-standard `Async` instance. + +=== Handler Composition + +Method invocations can be guarded by any combination of the handlers +presented above. For example, an invocation that +times out can be retried a few times before resorting to a fallback value +—assuming it never succeeds. + +The easiest way to achieve handler composition is by using a builder in the +`FaultTolerance` class as shown in the following example: + +[source,java] +---- +FaultTolerance.TypedBuilder builder = FaultTolerance.typedBuilder(); + +Timeout timeout = Timeout.create(Duration.ofMillis(10)); +builder.addTimeout(timeout); + +Retry retry = Retry.builder() + .retryPolicy(Retry.JitterRetryPolicy.builder() + .calls(3) + .delay(Duration.ofMillis(100)) + .build()) + .build(); +builder.addRetry(retry); + +Fallback fallback = Fallback.create(throwable ->lastKnownValue); +builder.addFallback(fallback); + +T result = builder.build().invoke(this::mayTakeVeryLong); +---- + +The exact order in which handlers are added to a builder depends on the use case, +but generally the order starting from innermost to outermost should be: bulkhead, +timeout, circuit breaker, retry and fallback. That is, fallback is the first +handler in the chain (the last to executed once a value is returned) +and bulkhead is the last one (the first to be executed once a value is returned). + +NOTE: This is the ordering used by the MicroProfile Fault Tolerance implementation +in Helidon when a method is decorated with multiple annotations. + +== Examples + +See <> section for examples. + +== Additional Information + +For additional information, see the +link:{nima-faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance Nima API Javadocs]. -== TO DO FOR NIMA \ No newline at end of file