skip_unavailable changes from true to false when remote connection fails #107125

asmith-elastic · 2024-04-04T20:11:07Z

Elasticsearch Version

8.12.2

Installed Plugins

No response

Java Version

bundled

OS Version

linux

Problem Description

When a remote cluster is configured with an incorrect address and the connection fails, Elasticsearch automatically resets that remote's skip_unavailable setting to false. After the connection details are corrected, skip_unavailable does not revert to true, even though it was previously set to that value. Setting it back to true requires an explicit reconfiguration.

Steps to Reproduce

Configure a remote cluster with skip_unavailable set to true:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.mode": "proxy",
    "cluster.remote.ccs.proxy_address": "ccs.es.us-central1.gcp.cloud.es.io:9400",
    "cluster.remote.ccs.proxy_socket_connections": "18",
    "cluster.remote.ccs.server_name": "ccs.es.us-central1.gcp.cloud.es.ioo",
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}

Verify the configuration and note that skip_unavailable is true.
Introduce an error by setting an incorrect remote cluster address:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.proxy_address": "ccs-broken.es.us-central1.gcp.cloud.es.io:9400"
  }
}

Observe that the remote connection fails and skip_unavailable is automatically set to false.

{
  "ccs": {
    "connected": false,
    "mode": "proxy",
    "proxy_address": "ccs-broken.es.us-central1.gcp.cloud.es.io:9400",
    "server_name": "ccs-broken.es.us-central1.gcp.cloud.es.ioo",
    "num_proxy_sockets_connected": 0,
    "max_proxy_socket_connections": 18,
    "initial_connect_timeout": "30s",
    "skip_unavailable": false
  }
}

Correct the server address back to the initial correct value.
Notice that skip_unavailable remains false and does not revert back to true.

{
  "ccs": {
    "connected": true,
    "mode": "proxy",
    "proxy_address": "ccs.es.us-central1.gcp.cloud.es.io:9400",
    "server_name": "ccs.es.us-central1.gcp.cloud.es.ioo",
    "num_proxy_sockets_connected": 18,
    "max_proxy_socket_connections": 18,
    "initial_connect_timeout": "30s",
    "skip_unavailable": false
  }
}

Manually attempt to set skip_unavailable to true again:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}

Observe that skip_unavailable does not change to true and remains set to false.
Set skip_unavailable to false, even though it already reports false:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "false"
  }
}

Manually attempt to set skip_unavailable to true once more:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}

The setting now updates successfully. Verify that the remote connection works and that skip_unavailable is set back to true.
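The mismatch in the steps above (a persistent setting of true while the remote info response reports false) can be detected programmatically. The following is a minimal sketch, assuming the persistent settings were fetched in flat form (for example with GET _cluster/settings?flat_settings=true) and that the remote info has the shape of the responses shown above (as returned by GET _remote/info). The function name find_skip_unavailable_drift is hypothetical and not part of any Elasticsearch client.

```python
def find_skip_unavailable_drift(persistent_settings, remote_info):
    """Return remote aliases whose configured skip_unavailable differs
    from the value reported in the remote info response.

    persistent_settings: flat dict of persistent cluster settings, e.g.
        {"cluster.remote.ccs.skip_unavailable": "true"}
    remote_info: dict keyed by remote alias, e.g.
        {"ccs": {"connected": True, "skip_unavailable": False}}
    """
    prefix = "cluster.remote."
    suffix = ".skip_unavailable"
    drift = {}
    for key, value in persistent_settings.items():
        if not (key.startswith(prefix) and key.endswith(suffix)):
            continue
        alias = key[len(prefix):-len(suffix)]
        if not alias:
            continue
        # Settings values arrive as strings ("true"/"false") in the issue.
        configured = str(value).lower() == "true"
        reported = remote_info.get(alias, {}).get("skip_unavailable")
        if reported is not None and reported != configured:
            drift[alias] = {"configured": configured, "reported": reported}
    return drift


# Values taken from the reproduction: configured true, reported false.
settings = {
    "cluster.remote.ccs.mode": "proxy",
    "cluster.remote.ccs.skip_unavailable": "true",
}
info = {"ccs": {"connected": True, "mode": "proxy", "skip_unavailable": False}}
print(find_skip_unavailable_drift(settings, info))
```

An empty result means every explicitly configured remote reports the value that was set. Remotes that are missing from the info response are ignored rather than flagged.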

Logs (if relevant)

No response

elasticsearchmachine · 2024-04-05T16:24:17Z

Pinging @elastic/es-distributed (Team:Distributed)

mhl-b · 2024-07-04T18:57:51Z

I believe the problem described above should be fixed by #105792, which changes the default value of skip_unavailable to true. It does not address steps 10 to 14, where skip_unavailable has to be set to false and then to true, which seems to be a different issue.
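To illustrate how a default interacts with an explicit value, here is a rough sketch. It assumes that the default was false before 8.15 and is true from 8.15 onward, as the comment above suggests, and that an explicitly configured value always takes precedence over the default. The function effective_skip_unavailable is hypothetical and does not mirror the actual Elasticsearch implementation.

```python
def effective_skip_unavailable(version, explicit=None):
    """Resolve skip_unavailable for a remote cluster.

    version: Elasticsearch version string such as "8.12.2"
    explicit: the configured value ("true"/"false") or None when unset
    """
    if explicit is not None:
        # An explicit setting overrides any default.
        return str(explicit).lower() == "true"
    major, minor = (int(part) for part in version.split(".")[:2])
    # Assumption: the default flips to true starting with 8.15.
    return (major, minor) >= (8, 15)


print(effective_skip_unavailable("8.12.2"))
print(effective_skip_unavailable("8.15.0"))
print(effective_skip_unavailable("8.15.0", "false"))
```

Under these assumptions a changed default only helps remotes that have no explicit value. A remote whose explicit value was flipped to false would still resolve to false.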

The original problem should now be resolved in 8.15. Can you confirm, please, @asmith-elastic?

naj-h · 2024-07-10T09:37:52Z

@mhl-b thanks for checking! While the mentioned PR will change the default value to true, we want to be sure that the issue described here won't change the value back to false if the remote connection fails. If that happens and goes unnoticed, users will end up with skip_unavailable set to false on the remote clusters that failed, which is not the right default experience and is the reason we are introducing the changes in 8.15.

nicktindall · 2024-08-28T05:59:53Z

@naj-h the PR I attached reproduces the steps in the description and demonstrates that the problem no longer exists in the current codebase. Are you satisfied that we can close this issue?

naj-h · 2024-08-28T09:52:12Z

@nicktindall Thanks very much for your tests! If this issue cannot be reproduced on main, then I think we can close this out.

asmith-elastic added >bug needs:triage Requires assignment of a team area label labels Apr 4, 2024

demjened added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Apr 5, 2024

elasticsearchmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. and removed needs:triage Requires assignment of a team area label labels Apr 5, 2024

DaveCTurner added :Distributed Coordination/Network Http and internode communication implementations and removed :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Apr 8, 2024

nicktindall added a commit to nicktindall/elasticsearch that referenced this issue Aug 22, 2024

Demonstrate that elastic#107125 is fixed

b7102b0

nicktindall mentioned this issue Aug 22, 2024

Demonstrate that #107125 is fixed #112085

Closed

nicktindall closed this as completed Aug 29, 2024
