Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change skip_unavailable default value to true #105792

Merged
merged 20 commits into from
Apr 29, 2024

Conversation

quux00
Copy link
Contributor

@quux00 quux00 commented Feb 23, 2024

In order to improve the experience of cross-cluster search, we propose changing
the default value of the remote cluster skip_unavailable setting from false to true.

This setting causes any cross-cluster search to entirely fail when any remote cluster
with skip_unavailable=false is either unavailable (connection to it fails) or if the
search on it fails on all shards.

Setting skip_unavailable=true allows partial results from other clusters to be
returned. In that case, the search response cluster metadata will show a skipped
status, so the user can see that no data came in from that cluster. Kibana also
now leverages this metadata in the cross-cluster search responses to allow users
to see how many clusters returned data and drill down into which clusters did not
(including failure messages).

Currently, the user/admin has to specifically set the value to true in the configs, like so:

cluster:
    remote:
        remote1:
            seeds: 10.10.10.10:9300
            skip_unavailable: true

even though that is probably what search admins want in the vast majority of cases.

Setting skip_unavailable=false should be a conscious (and probably rare) choice
by an Elasticsearch admin that a particular cluster's results are so essential to a
search (or visualization in dashboard or Discover panel) that no results at all should
be shown if it cannot return any results.

@quux00 quux00 added >breaking :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.14.0 labels Feb 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @quux00, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@quux00 quux00 force-pushed the ccs/skip_unavailable-defaults-to-true branch 2 times, most recently from e634836 to 67c4f7c Compare March 6, 2024 16:12
@quux00 quux00 marked this pull request as ready for review March 7, 2024 16:36
@quux00 quux00 requested a review from a team as a code owner March 7, 2024 16:36
@quux00
Copy link
Contributor Author

quux00 commented Mar 7, 2024

In the process of doing this work (mostly fixing tests), I found some potential issues, both of which are pre-existing but now perhaps become more acute, so they may need to be addressed.

(1) For _search (and _async_search), when skip_unavailable=false and a cluster has an issue (such as a Security/permissions exception or IndexNotFound), that error is explicitly returned, along with an error status code such as 404 or 403. As expected, when skip_unavailable=true and the same issue occurs, an HTTP status of 200 is returned, but the error reported in the SearchResponse is NoSuchRemoteClusterException, not the IndexNotFoundException. This feels like a bug in the cluster resolution logic (es-distributed side?).

(2) The reindex API is also affected by the skip_unavailable when doing a remote reindexing operation. When skip_unavailable=false and a reindex targets an index that does not exist, then IndexNotFoundException is returned with an HTTP status error code.

However, when skip_unavailable=true it does not return an error status code, and worse it does not report any error in the response object:

{
    "took": 36,
    "timed_out": false,
    "total": 0,
    "updated": 0,
    "created": 0,
    "deleted": 0,
    "batches": 0,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
        "bulk": 0,
        "search": 0
    },
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until_millis": 0,
    "failures": []
}

Should the reindex API report an issue in the failures section of this response in this case? (It likely needs to parse the search response against the remote cluster.) Or should it return an error status code? Should the reindex API have a dependency on the skip_unavailable setting?

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@@ -87,7 +87,7 @@ public final class RemoteClusterService extends RemoteClusterAware implements Cl
public static final Setting.AffixSetting<Boolean> REMOTE_CLUSTER_SKIP_UNAVAILABLE = Setting.affixKeySetting(
"cluster.remote.",
"skip_unavailable",
(ns, key) -> boolSetting(key, false, new RemoteConnectionEnabled<>(ns, key), Setting.Property.Dynamic, Setting.Property.NodeScope)
(ns, key) -> boolSetting(key, true, new RemoteConnectionEnabled<>(ns, key), Setting.Property.Dynamic, Setting.Property.NodeScope)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only production code change. Everything else is tests and docs.

@quux00
Copy link
Contributor Author

quux00 commented Apr 24, 2024

@elasticmachine run elasticsearch-ci/part-1

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a question about yaml tests and the upgrade path, LGTM otherwise

changed from `false` to `true`. Before Elasticsearch 8.15, if you want a cluster
to be treated as optional for a {ccs}, then you need to set that configuration.
From Elasticsearch 8.15 forward, you need to set the configuration in order to
make a cluster required for the {ccs}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe obvious, but should we be explicit about the upgrade path: once you upgrade the coordinating node / cluster, where the remotes are registered, you get the new behavior? Say that the coord cluster has multiple nodes that are upgraded one by one (rolling upgrade) what would be the behavior there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but that is very specific to a one-time upgrade, so should that go in the "permanent" API docs? Or should it go into some one-time doc like the changelog or deprecation announcement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally we have migration guides for major upgrades and related changes. This is one of those changes but it goes out in a minor, I don't know if we a place for this scenario in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a blurb here about that and we'll also include in the 8.15 release notes. Thanks for flagging this.

@quux00
Copy link
Contributor Author

quux00 commented Apr 25, 2024

@elasticmachine test this please

@quux00
Copy link
Contributor Author

quux00 commented Apr 29, 2024

@elasticmachine run elasticsearch-ci/part-2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants