Add documentation on remote recovery (#39483)
This is related to #35975. It adds documentation on the remote recovery
process. Additionally, it adds documentation about the various settings
that can impact the process.
Tim-Brooks authored Mar 5, 2019
1 parent fef2153 commit ee41b22
Showing 5 changed files with 89 additions and 1 deletion.
5 changes: 5 additions & 0 deletions docs/reference/ccr/getting-started.asciidoc
@@ -253,6 +253,11 @@ PUT /server-metrics-copy/_ccr/follow?wait_for_active_shards=1
//////////////////////////

The follower index is initialized using the <<remote-recovery, remote recovery>>
process. The remote recovery process transfers the existing Lucene segment files
from the leader to the follower. When the remote recovery process is complete,
index following begins.

Now when you index documents into your leader index, you will see these
documents replicated in the follower index. You can
inspect the status of replication using the
1 change: 1 addition & 0 deletions docs/reference/ccr/index.asciidoc
@@ -29,3 +29,4 @@ include::overview.asciidoc[]
include::requirements.asciidoc[]
include::auto-follow.asciidoc[]
include::getting-started.asciidoc[]
include::remote-recovery.asciidoc[]
29 changes: 29 additions & 0 deletions docs/reference/ccr/remote-recovery.asciidoc
@@ -0,0 +1,29 @@
[role="xpack"]
[testenv="platinum"]
[[remote-recovery]]
== Remote recovery

When you create a follower index, you cannot use it until it is fully initialized.
The _remote recovery_ process builds a new copy of a shard on a follower node by
copying data from the primary shard in the leader cluster. {es} uses this remote
recovery process to bootstrap a follower index using the data from the leader index.
This process provides the follower with a copy of the current state of the leader index,
even if a complete history of changes is not available on the leader due to Lucene
segment merging.

Remote recovery is a network-intensive process that transfers all of the Lucene
segment files from the leader cluster to the follower cluster. The follower
requests that a recovery session be initiated on the primary shard in the leader
cluster. The follower then requests file chunks concurrently from the leader. By
default, the process concurrently requests `5` large `1mb` file chunks. This default
behavior is designed to support leader and follower clusters with high network latency
between them.
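
As a rough illustration of why several concurrent chunk requests help on
high-latency links, the following sketch estimates the data in flight and the
latency-bound throughput ceiling for a single recovery. It assumes a simplified
model in which every chunk request costs one full round trip and nothing else
(bandwidth, rate limiting) constrains the transfer. The function names and the
100 ms round-trip time are hypothetical and are not part of {es}.

```python
MB = 1024 * 1024  # byte size units such as `1mb` are powers of 1024


def in_flight_bytes(concurrent_chunks: int = 5, chunk_size_bytes: int = MB) -> int:
    """Bytes that can be outstanding at once for one recovery (defaults from the text)."""
    return concurrent_chunks * chunk_size_bytes


def latency_bound_throughput(
    concurrent_chunks: int, chunk_size_bytes: int, round_trip_seconds: float
) -> float:
    """Upper bound in bytes per second if each chunk request costs one round trip."""
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return in_flight_bytes(concurrent_chunks, chunk_size_bytes) / round_trip_seconds


if __name__ == "__main__":
    # Hypothetical 100 ms round trip between the leader and follower clusters.
    ceiling = latency_bound_throughput(5, MB, 0.1)
    print(f"in flight: {in_flight_bytes() // MB} MiB, ceiling: {ceiling / MB:.0f} MiB/s")
```

Under this model, a single outstanding `1mb` request over the same link would
cap out at one fifth of that ceiling, which is the motivation for requesting
chunks concurrently.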

There are dynamic settings that you can use to rate-limit the transmitted data
and manage the resources consumed by remote recoveries. See
{ref}/ccr-settings.html[{ccr-cap} settings].

You can obtain information about an in-progress remote recovery by using the
{ref}/cat-recovery.html[recovery API] on the follower cluster. Remote recoveries
are implemented using the {ref}/modules-snapshots.html[snapshot and restore]
infrastructure. This means that ongoing remote recoveries are labelled as type
`snapshot` in the recovery API.
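
The following sketch shows one way to pick out such recoveries from a recovery
API response on the follower cluster. It assumes a response shaped like the JSON
indices recovery API (index name, then `shards`, each with `id`, `type`, and
`stage`). The helper name and the sample data are hypothetical, and the type
comparison is case-insensitive because the exact casing depends on which
recovery API you call.

```python
from typing import Any, Dict, List, Tuple


def snapshot_type_recoveries(
    recovery_response: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (index, shard id, stage) for shards whose recovery type is `snapshot`."""
    matches = []
    for index_name, index_info in recovery_response.items():
        for shard in index_info.get("shards", []):
            if str(shard.get("type", "")).lower() == "snapshot":
                matches.append((index_name, shard.get("id"), shard.get("stage")))
    return matches


# Hypothetical sample data, not captured from a real cluster.
sample_response = {
    "server-metrics-copy": {
        "shards": [
            {"id": 0, "type": "SNAPSHOT", "stage": "INDEX"},
            {"id": 1, "type": "PEER", "stage": "DONE"},
        ]
    }
}

if __name__ == "__main__":
    for index_name, shard_id, stage in snapshot_type_recoveries(sample_response):
        print(f"{index_name}[{shard_id}] stage={stage}")
```

Because ordinary snapshot restores are presumably reported with the same type,
a filter like this cannot by itself tell them apart from remote recoveries.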
52 changes: 52 additions & 0 deletions docs/reference/settings/ccr-settings.asciidoc
@@ -0,0 +1,52 @@
[role="xpack"]
[[ccr-settings]]
=== {ccr-cap} settings

These {ccr} settings can be dynamically updated on a live cluster with the
<<cluster-update-settings,cluster update settings API>>.

[float]
[[ccr-recovery-settings]]
==== Remote recovery settings

The following setting can be used to rate-limit the data transmitted during
{stack-ov}/remote-recovery.html[remote recoveries]:

`ccr.indices.recovery.max_bytes_per_sec` (<<cluster-update-settings,Dynamic>>)::
Limits the total inbound and outbound remote recovery traffic on each node.
Since this limit applies on each node, but there may be many nodes performing
remote recoveries concurrently, the total amount of remote recovery bytes may be
much higher than this limit. If you set this limit too high, there is a risk
that ongoing remote recoveries will consume excess bandwidth (or other
resources), which could destabilize the cluster. This setting is used by both the
leader and follower clusters. For example, if it is set to `20mb` on a leader,
the leader will only send `20mb/s` to the follower even if the follower is
requesting and can accept `60mb/s`. Defaults to `40mb`.
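
The leader and follower example above can be read as "the tightest limit wins".
The sketch below expresses that reading for a single leader node and a single
follower node. It is a simplified model, not the actual rate-limiting
implementation, and the function and parameter names are hypothetical.

```python
def effective_rate_mb_per_sec(
    leader_limit_mb: float, follower_limit_mb: float, follower_demand_mb: float
) -> float:
    """Transfer rate under the assumption that the smallest of the three values applies."""
    return min(leader_limit_mb, follower_limit_mb, follower_demand_mb)


if __name__ == "__main__":
    # Leader limited to 20mb, follower requesting and able to accept 60mb/s.
    rate = effective_rate_mb_per_sec(
        leader_limit_mb=20, follower_limit_mb=60, follower_demand_mb=60
    )
    print(f"effective rate: {rate} MB/s")
```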

[float]
[[ccr-advanced-recovery-settings]]
==== Advanced remote recovery settings

The following _expert_ settings can be set to manage the resources consumed by
remote recoveries:

`ccr.indices.recovery.max_concurrent_file_chunks` (<<cluster-update-settings,Dynamic>>)::
Controls the number of file chunk requests that can be sent in parallel per
recovery. As multiple remote recoveries might already be running in parallel,
increasing this expert-level setting might only help in situations where remote
recovery of a single shard is not reaching the total inbound and outbound remote
recovery traffic as configured by `ccr.indices.recovery.max_bytes_per_sec`.
Defaults to `5`. The maximum allowed value is `10`.

`ccr.indices.recovery.chunk_size` (<<cluster-update-settings,Dynamic>>)::
Controls the chunk size requested by the follower during file transfer. Defaults to
`1mb`.

`ccr.indices.recovery.recovery_activity_timeout` (<<cluster-update-settings,Dynamic>>)::
Controls the timeout for recovery activity. This timeout primarily applies on
the leader cluster. The leader cluster must open resources in memory to supply
data to the follower during the recovery process. If the leader does not receive
recovery requests from the follower for this period of time, it will close the
resources. Defaults to 60 seconds.

`ccr.indices.recovery.internal_action_timeout` (<<cluster-update-settings,Dynamic>>)::
Controls the timeout for individual network requests during the remote recovery
process. An individual action timing out can fail the recovery. Defaults to
60 seconds.
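
Because these settings are dynamic, they can be changed through the cluster
update settings API. The sketch below only builds a request body for
`PUT /_cluster/settings` and does not send it. The helper name is hypothetical,
the choice of `persistent` over `transient` is an assumption, and the lower
bound of `1` for the chunk count is also an assumption (only the maximum of
`10` is stated above).

```python
import json
from typing import Any, Dict, Optional

MAX_CONCURRENT_FILE_CHUNKS = 10  # maximum allowed value stated in the settings list


def build_recovery_settings_body(
    max_bytes_per_sec: Optional[str] = None,
    max_concurrent_file_chunks: Optional[int] = None,
    chunk_size: Optional[str] = None,
) -> Dict[str, Dict[str, Any]]:
    """Build a cluster update settings body containing only the values supplied."""
    settings: Dict[str, Any] = {}
    if max_bytes_per_sec is not None:
        settings["ccr.indices.recovery.max_bytes_per_sec"] = max_bytes_per_sec
    if max_concurrent_file_chunks is not None:
        # Lower bound of 1 is an assumption; the upper bound comes from the text.
        if not 1 <= max_concurrent_file_chunks <= MAX_CONCURRENT_FILE_CHUNKS:
            raise ValueError("max_concurrent_file_chunks must be between 1 and 10")
        settings["ccr.indices.recovery.max_concurrent_file_chunks"] = max_concurrent_file_chunks
    if chunk_size is not None:
        settings["ccr.indices.recovery.chunk_size"] = chunk_size
    return {"persistent": settings}


if __name__ == "__main__":
    body = build_recovery_settings_body(
        max_bytes_per_sec="20mb", max_concurrent_file_chunks=8
    )
    print(json.dumps(body, indent=2))
```

Validating the chunk count client-side is a convenience in this sketch; the
cluster remains the authority on which values it accepts.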
3 changes: 2 additions & 1 deletion docs/reference/settings/configuring-xes.asciidoc
@@ -6,7 +6,8 @@
++++

include::{asciidoc-dir}/../../shared/settings.asciidoc[]
include::ccr-settings.asciidoc[]
include::license-settings.asciidoc[]
include::ml-settings.asciidoc[]
include::notification-settings.asciidoc[]
include::sql-settings.asciidoc[]
include::notification-settings.asciidoc[]
