Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New documenation for FT in Nima #6565

Merged
merged 3 commits into from
Apr 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/includes/attributes.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -225,6 +225,7 @@ endif::[]
:security-provider-abac-base-url: {javadoc-base-url}/io.helidon.security.providers.abac
:security-provider-google-login-base-url: {javadoc-base-url}/io.helidon.security.providers.google.login
:security-provider-jwt-base-url: {javadoc-base-url}/io.helidon.security.providers.jwt
:nima-faulttolerance-javadoc-base-url: {javadoc-base-url}/io.helidon.nima.faulttolerance

// 3rd party versioned URLs
:jaeger-doc-base-url: https://www.jaegertracing.io/docs/{version-lib-jaeger}
Expand Down
277 changes: 276 additions & 1 deletion docs/nima/fault-tolerance.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -26,5 +26,280 @@

include::{rootdir}/includes/nima.adoc[]

== Contents

- <<Overview, Overview>>
- <<Maven Coordinates, Maven Coordinates>>
- <<API, API>>
- <<Examples, Examples>>
- <<Additional Information, Additional Information>>

== Overview

Helidon Nima Fault Tolerance support is inspired by
spericas marked this conversation as resolved.
Show resolved Hide resolved
link:{microprofile-fault-tolerance-spec-url}[MicroProfile Fault Tolerance].
The API defines the notion of a _fault handler_ that can be combined with other handlers to
improve application robustness. Handlers are created to manage error conditions (faults)
that may occur in real-world application environments. Examples include service restarts,
network delays, temporal infrastructure instabilities, etc.

The interaction of multiple microservices bring some new challenges from distributed systems
that require careful planning. Faults in distributed systems should be compartmentalized
to avoid unnecessary service interruptions. For example, if comparable information can
be obtained from multiples sources, a user request _should not_ be denied when a subset
of these sources is unreachable or offline. Similarly, if a non-essential source has been
flagged as unreachable, an application should avoid continuous access to that source
as that would result in much higher response times.

In order to tackle the most common types of application faults, the Helidon Nima Fault
Tolerance API provides support for circuit breakers, retries, timeouts, bulkheads and fallbacks.
In addition, the API makes it very easy to create and monitor asynchronous tasks that
do not require explicit creation and management of threads or executors.

For more information, see
link:{nima-faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance Nima API Javadocs].

include::{rootdir}/includes/dependencies.adoc[]

[source,xml]
----
<dependency>
<groupId>io.helidon.nima.fault-tolerance</groupId>
<artifactId>helidon-nima-fault-tolerance</artifactId>
</dependency>
----

== API

The Nima Fault Tolerance API is _blocking_ and based on the JDK's virtual thread model.
As a result, methods return _direct_ values instead of promises in the form of
`Single<T>` or `Multi<T>`.

In the sections that follow, we shall briefly explore each of the constructs provided
by this API.

=== Retries

Temporal networking problems can sometimes be mitigated by simply retrying
a certain task. A `Retry` handler is created using a `RetryPolicy` that
indicates the number of retries, delay between retries, etc.

[source,java]
----
Retry retry = Retry.builder()
.retryPolicy(Retry.JitterRetryPolicy.builder()
.calls(3)
.delay(Duration.ofMillis(100))
.build())
.build();
T result = retry.invoke(this::retryOnFailure);
----

The sample code above will retry calls to the supplier `this::retryOnFailure`
for up to 3 times with a 100 millisecond delay between them.

NOTE: The return type of method `retryOnFailure` in the example above must
be some `T` and the parameter to the retry handler's `invoke`
method `Supplier<? extends T>`.

If the call to the supplier provided completes exceptionally, it will be treated as
a failure and retried until the maximum number of attempts is reached; finer control
is possible by creating a retry policy and using methods such as
`applyOn(Class<? extends Throwable>... classes)` and
`skipOn(Class<? extends Throwable>... classes)` to control the exceptions
that must be retried and those that must be ignored.

=== Timeouts

A request to a service that is inaccessible or simply unavailable should be bounded
to ensure a certain quality of service and response time. Timeouts can be configured
to avoid excessive waiting times. In addition, a fallback action can be defined
if a timeout expires as we shall cover in the next section.

The following is an example of using `Timeout`:
[source,java]
----
T result = Timeout.create(Duration.ofMillis(10))
.invoke(this::mayTakeVeryLong);
----

NOTE: Using a handler's `create` method is an alternative to using a builder that is
more convenient when default settings are acceptable.

The example above monitors the call to method `mayTakeVeryLong` and reports a
`TimeoutException` if the execution takes more than 10 milliseconds to complete.

=== Fallbacks

A fallback to a _known_ result can sometimes be an alternative to
reporting an error. For example, if we are unable to access a service
we may fall back to the last result obtained from that service at an
earlier time.

A `Fallback` instance is created by providing a function that takes a `Throwable`
and produces some `T` to be used when the intended method failed to return a value:

[source,java]
----
T result = Fallback.create(throwable -> lastKnownValue)
.invoke(this::mayFail);
----

This example calls the method `mayFail` and if it produces a `Throwable`, it maps
it to the last known value using the fallback handler.

=== Circuit Breakers

Failing to execute a certain task or to call another service repeatedly can have a direct
impact on application performance. It is often preferred to avoid calls to non-essential
services by simply preventing that logic to execute altogether. A circuit breaker can be
configured to monitor such calls and block attempts that are likely to fail, thus improving
overall performance.

Circuit breakers start in a _closed_ state, letting calls to proceed normally; after
detecting a certain number of errors during a pre-defined processing window, they can _open_ to
prevent additional failures. After a circuit has been opened, it can transition
first to a _half-open_ state before finally transitioning back to a closed state.
The use of an intermediate state (half-open)
makes transitions from open to close more progressive, and prevents a circuit breaker
from eagerly transitioning to states without considering sufficient observations.

NOTE: Any failure while a circuit breaker is in half-open state will immediately
cause it to transition back to an open state.

Consider the following example in which `this::mayFail` is monitored by a
circuit breaker:
[source,java]
----
CircuitBreaker breaker = CircuitBreaker.builder()
.volume(10)
.errorRatio(30)
.delay(Duration.ofMillis(200))
.successThreshold(2)
.build();
T result = breaker.invoke(this::mayFail);
----

The circuit breaker in this example defines a processing window of size 10, an error
ratio of 30%, a duration to transition to half-open state of 200 milliseconds, and
a success threshold to transition from half-open to closed state of 2 observations.
It follows that,

* After completing the processing window, if at least 3 errors are detected, the
circuit breaker will transition to the open state, thus blocking the execution
of any subsequent calls.

* After 200 millis, the circuit breaker will transition back to half-open and
allow calls to proceed again.

* If the next two calls after transitioning to half-open are successful, the
circuit breaker will transition to closed state; otherwise, it will
transition back to open state, waiting for another 200 milliseconds
before attempting to transition to half-open again.

A circuit breaker will throw a
`io.helidon.nima.faulttolerance.CircuitBreakerOpenException`
if an attempt to make an invocation takes place while it is in open state.

=== Bulkheads

Concurrent access to certain components may need to be limited to avoid
excessive use of resources. For example, if an invocation that opens
a network connection is allowed to execute concurrently without
any restriction, and if the service on the other end is slow responding,
it is possible for the rate at which network connections are opened
to exceed the maximum number of connections allowed. Faults of this
type can be prevented by guarding these invocations using a bulkhead.

NOTE: The origin of the name _bulkhead_ comes from the partitions that
comprise a ship's hull. If some partition is somehow compromised
(e.g., filled with water) it can be isolated in a manner not to
affect the rest of the hull.

A waiting queue can be associated with a bulkhead to handle tasks
that are submitted when the bulkhead is already at full capacity.

[source,java]
----
Bulkhead bulkhead = Bulkhead.builder()
.limit(3)
.queueLength(5)
.build();
T result = bulkhead.invoke(this::usesResources);
----

This example creates a bulkhead that limits concurrent execution
to `this:usesResources` to at most 3, and with a queue of size 5. The
bulkhead will report a `io.helidon.nima.faulttolerance.BulkheadException` if unable
to proceed with the call: either due to the limit being reached or the queue
being at maximum capacity.

=== Asynchronous

Asynchronous tasks can be created or forked by using an `Async` instance. A supplier of type
`T` is provided as the argument when invoking this handler. For example:

[source,java]
----
CompletableFuture<Thread> cf = Async.create().invoke(Thread::currentThread));
cf.thenAccept(t -> System.out.println("Async task executed in thread " + t));
----

The supplier `() -> Thread.currentThread()` is executed in a new thread and
the value it produces printed by the consumer and passed to `thenAccept`.

By default, asynchronous tasks are executed using a _new virtual thread per
task_ based on the `ExecutorService` defined in
`io.helidon.nima.faulttolerance.FaultTolerance` and
configurable by an application. Alternatively, an `ExecutorService` can be specified
when building a non-standard `Async` instance.

=== Handler Composition

Method invocations can be guarded by any combination of the handlers
presented above. For example, an invocation that
times out can be retried a few times before resorting to a fallback value
&mdash;assuming it never succeeds.

The easiest way to achieve handler composition is by using a builder in the
`FaultTolerance` class as shown in the following example:

[source,java]
----
FaultTolerance.TypedBuilder<T> builder = FaultTolerance.typedBuilder();

Timeout timeout = Timeout.create(Duration.ofMillis(10));
builder.addTimeout(timeout);

Retry retry = Retry.builder()
.retryPolicy(Retry.JitterRetryPolicy.builder()
.calls(3)
.delay(Duration.ofMillis(100))
.build())
.build();
builder.addRetry(retry);

Fallback<T> fallback = Fallback.create(throwable ->lastKnownValue);
builder.addFallback(fallback);

T result = builder.build().invoke(this::mayTakeVeryLong);
----

The exact order in which handlers are added to a builder depends on the use case,
but generally the order starting from innermost to outermost should be: bulkhead,
timeout, circuit breaker, retry and fallback. That is, fallback is the first
handler in the chain (the last to executed once a value is returned)
and bulkhead is the last one (the first to be executed once a value is returned).

NOTE: This is the ordering used by the MicroProfile Fault Tolerance implementation
in Helidon when a method is decorated with multiple annotations.

== Examples

See <<API>> section for examples.

== Additional Information

For additional information, see the
link:{nima-faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance Nima API Javadocs].

== TO DO FOR NIMA