
Better error-handling and cancellation of ON CLUSTER backups and restores #70027

Merged: 7 commits merged into ClickHouse:master from fix-restore-on-cluster-sync on Nov 1, 2024

Conversation

vitlibar (Member) commented Sep 26, 2024

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Better error-handling and cancellation of ON CLUSTER backups and restores:

  • If a backup or restore fails on one host, it is cancelled on the other hosts automatically
  • No spurious errors are produced when some hosts fail while other hosts continue their work
  • If a backup or restore is cancelled on one host, it is cancelled on the other hosts automatically
  • Fix issues with test_disallow_concurrency: disabling concurrency now works more reliably
  • Backups and restores are now much more resistant to ZooKeeper disconnects

@vitlibar vitlibar added the `do not test` label (disable testing on pull request) Sep 26, 2024
@kssenii kssenii self-assigned this Sep 26, 2024
@vitlibar vitlibar force-pushed the fix-restore-on-cluster-sync branch from f47f8e6 to ca9154f on October 2, 2024 07:57
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the `pr-improvement` label (Pull request with some product improvements) Oct 2, 2024
@vitlibar vitlibar removed the `do not test` label (disable testing on pull request) Oct 2, 2024
robot-ch-test-poll1 (Contributor) commented Oct 2, 2024

This is an automated comment for commit b16a18e with a description of existing statuses. It is updated for the latest CI run.

❌ Failed checks

Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ❌ error
Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure

✅ Successful checks

Check name | Description | Status
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests failed, then reproduce the failure locally as described here | ✅ success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test failed at least once, or ran for too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success
Install packages | Checks that the built packages are installable in a clear environment | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ✅ success

@vitlibar vitlibar force-pushed the fix-restore-on-cluster-sync branch 10 times, most recently from 47e8726 to b535d2f on October 3, 2024 11:30
@vitlibar vitlibar marked this pull request as ready for review October 3, 2024 11:30
@vitlibar vitlibar force-pushed the fix-restore-on-cluster-sync branch 7 times, most recently from 14a0fc6 to 0ac89a2 on October 5, 2024 10:45
const String & current_host_, /// the current host, or an empty string if it's the initiator of a BACKUP/RESTORE ON CLUSTER command
bool allow_concurrency_, /// whether it's allowed to have concurrent backups or restores.
const WithRetries & with_retries_,
ThreadPoolCallbackRunnerUnsafe<void> schedule_,
vitlibar (Member Author) commented on this diff, Oct 5, 2024:
The main change of this PR is that BackupCoordinationStageSync was completely rewritten. This class creates some nodes in ZooKeeper and also tracks the nodes created by other hosts executing the same BACKUP ON CLUSTER or RESTORE ON CLUSTER command. There are multiple kinds of nodes - started*, current*, alive*, and finished*.

The alive* nodes are ephemeral, and they're used to tell other hosts that the current host is still working on the same BACKUP/RESTORE operation. Previously those alive* nodes were recreated and checked from time to time, quite randomly, which caused weird errors. This PR changes that: now a separate thread both recreates those alive* nodes promptly and reacts quickly to an error that happened on another host (see the sketch below). A cancellation via KILL QUERY is handled as the QUERY_WAS_CANCELLED error, so as another important result this PR also makes it possible to cancel ON CLUSTER backups and restores.

@@ -516,14 +516,15 @@ namespace ErrorCodes
     M(UInt64, max_temporary_data_on_disk_size_for_user, 0, "The maximum amount of data consumed by temporary files on disk in bytes for all concurrently running user queries. Zero means unlimited.", 0)\
     M(UInt64, max_temporary_data_on_disk_size_for_query, 0, "The maximum amount of data consumed by temporary files on disk in bytes for all concurrently running queries. Zero means unlimited.", 0)\
     \
-    M(UInt64, backup_restore_keeper_max_retries, 20, "Max retries for keeper operations during backup or restore", 0) \
+    M(UInt64, backup_restore_keeper_max_retries, 1000, "Max retries for [Zoo]Keeper operations in the middle of a BACKUP or RESTORE operation. Should be big enough so the whole operation won't fail because of a temporary [Zoo]Keeper failure", 0) \
vitlibar (Member Author) commented on this diff, Oct 5, 2024:
20 retries (about 20 * backup_restore_keeper_retry_max_backoff_ms = 100 seconds) may be enough for a quick operation, but we don't want a long backup to fail in the middle because the ZooKeeper server was restarting for a couple of minutes.

1000 retries with the default backup_restore_keeper_retry_max_backoff_ms (5 seconds) gives more than 1 hour, which seems enough to cover any temporary ZooKeeper failure (see the rough arithmetic below).
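A back-of-the-envelope check of those numbers (a sketch only; the actual backoff in WithRetries grows from an initial value up to the maximum, so these are upper bounds):

```cpp
#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono;
    constexpr seconds max_backoff{5};   // backup_restore_keeper_retry_max_backoff_ms = 5000

    // Old setting: 20 retries at up to 5 s each, roughly 100 seconds total.
    std::cout << "old retry budget <= " << (20 * max_backoff).count() << " s\n";

    // New setting: 1000 retries at up to 5 s each, about 5000 s, i.e. more than an hour.
    std::cout << "new retry budget <= " << duration_cast<minutes>(1000 * max_backoff).count() << " min\n";
}
```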

-    CancellationCode cancelQuery(bool kill);
+    /// Cancels the current query.
+    /// Optional argument `exception` allows to set an exception which checkTimeLimit() will throw instead of "QUERY_WAS_CANCELLED".
+    CancellationCode cancelQuery(bool kill, std::exception_ptr exception = nullptr);
vitlibar (Member Author) commented on this diff:
This change was made to keep the same error across the cluster when a backup fails. If an error happens on one host, we want to cancel the operation on the other hosts too - but it's kinder (for anyone who will look at the logs) to keep the same error instead of introducing other error codes (see the model below).
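A self-contained model of the mechanism (QueryStatusSketch is illustrative, not ClickHouse's actual QueryStatus class): cancellation stores an exception_ptr, and the query's periodic checkTimeLimit() rethrows the original remote error instead of a generic QUERY_WAS_CANCELLED error:

```cpp
#include <atomic>
#include <exception>
#include <iostream>
#include <stdexcept>

class QueryStatusSketch
{
public:
    /// Mirrors the signature shown in the diff above.
    void cancelQuery(bool /*kill*/, std::exception_ptr exception = nullptr)
    {
        cancel_exception = exception;
        cancelled.store(true);
    }

    /// Called periodically by the query's worker threads.
    void checkTimeLimit() const
    {
        if (!cancelled.load())
            return;
        if (cancel_exception)
            std::rethrow_exception(cancel_exception);     // the original remote error
        throw std::runtime_error("QUERY_WAS_CANCELLED");  // generic fallback
    }

private:
    std::exception_ptr cancel_exception;
    std::atomic<bool> cancelled{false};
};

int main()
{
    QueryStatusSketch query;
    // Imagine host B failed with this error; host A cancels its local query with it:
    query.cancelQuery(false, std::make_exception_ptr(std::runtime_error("Disk is full on host B")));
    try
    {
        query.checkTimeLimit();
    }
    catch (const std::exception & e)
    {
        std::cout << "query failed with: " << e.what() << '\n';   // "Disk is full on host B"
    }
}
```

This way every host in the cluster fails with the same root-cause error, which is exactly what the comment above argues for.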

{
return fmt::formatter<std::string>::format(::to_string(duration), ctx);
}
};
vitlibar (Member Author) commented on this diff:
Here is a small change that allows fmt::format() and LOG_INFO to print values like std::chrono::seconds (a self-contained analogue is sketched below).
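A minimal, self-contained analogue of such a formatter specialization. The PR's version (visible in the diff fragment above) delegates to ClickHouse's own ::to_string(duration); here the count is formatted directly so the example stands alone:

```cpp
#include <chrono>
#include <string>
#include <fmt/format.h>

/// Teach fmt how to print std::chrono::seconds by reusing the std::string formatter.
template <>
struct fmt::formatter<std::chrono::seconds> : fmt::formatter<std::string>
{
    template <typename FormatContext>
    auto format(std::chrono::seconds duration, FormatContext & ctx) const
    {
        return fmt::formatter<std::string>::format(std::to_string(duration.count()) + "s", ctx);
    }
};

int main()
{
    fmt::print("timeout = {}\n", std::chrono::seconds{180});   // prints "timeout = 180s"
}
```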


struct KeeperSettings
vitlibar (Member Author) commented on this diff:
This structure WithRetries::KeeperSettings was moved to BackupKeeperSettings.

@vitlibar vitlibar force-pushed the fix-restore-on-cluster-sync branch from d40e338 to 7c3ba93 on October 30, 2024 21:19
@vitlibar vitlibar force-pushed the fix-restore-on-cluster-sync branch from 43153d0 to b16a18e on October 31, 2024 16:39
@vitlibar vitlibar added this pull request to the merge queue Oct 31, 2024
@vitlibar vitlibar removed this pull request from the merge queue due to a manual request Oct 31, 2024
vitlibar (Member Author) commented Nov 1, 2024

@kssenii I've changed this PR a lot since your last review, do you want to look at it again before I merge it?

@vitlibar vitlibar added this pull request to the merge queue Nov 1, 2024
Merged via the queue into ClickHouse:master with commit ae2eeb4 Nov 1, 2024
210 of 216 checks passed
@vitlibar vitlibar deleted the fix-restore-on-cluster-sync branch November 1, 2024 18:00
@robot-clickhouse robot-clickhouse added the `pr-synced-to-cloud` label (The PR is synced to the cloud repo) Nov 1, 2024
@@ -0,0 +1,125 @@
import random
A Member commented on this diff:
I'm having a problem with this test right now: I extended the backup meta file for lightweight backups and updated the file version to V2. The old_node will not recognize a V2 meta file, which means that as soon as we upgrade the file version, this test has to fail...

Any suggestions?

Labels

  • pr-backports-created - Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-backports-created-cloud
  • pr-improvement - Pull request with some product improvements
  • pr-must-backport - Pull request should be backported intentionally. Use this label with great care!
  • pr-synced-to-cloud - The PR is synced to the cloud repo
8 participants