Better error-handling and cancellation of ON CLUSTER backups and restores #70027
Conversation
const String & current_host_, /// the current host, or an empty string if it's the initiator of a BACKUP/RESTORE ON CLUSTER command
bool allow_concurrency_, /// whether it's allowed to have concurrent backups or restores.
const WithRetries & with_retries_,
ThreadPoolCallbackRunnerUnsafe<void> schedule_,
The main change of this PR is that the class `BackupCoordinationStageSync` was completely rewritten. This class creates some nodes in ZooKeeper and also tracks such nodes created by the other hosts executing the same `BACKUP ON CLUSTER` or `RESTORE ON CLUSTER` command. There are multiple kinds of nodes: `started*`, `current*`, `alive*`, and `finished*`.

The `alive*` nodes are ephemeral; they are used to tell the other hosts that the current host is still working on the same BACKUP/RESTORE operation. Previously those `alive*` nodes were recreated and checked from time to time, quite randomly, which caused weird errors. This PR changes that: now there is a separate thread which both recreates those `alive*` nodes quickly and reacts quickly to an error that happened on another host. A cancellation with `KILL QUERY` is handled as the `QUERY_WAS_CANCELLED` error, so as another important result this PR also gives us the ability to cancel `ON CLUSTER` backups and restores.
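To illustrate the mechanism, here is a minimal self-contained sketch of the watchdog-thread idea, not the actual `BackupCoordinationStageSync` code: the `ZooKeeperClient` interface, the node paths, and the 500 ms period are all assumptions made for this example; the real implementation uses ClickHouse's zkutil wrappers and the `WithRetries` helper.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

/// Hypothetical minimal ZooKeeper client interface (illustrative only).
struct ZooKeeperClient
{
    virtual ~ZooKeeperClient() = default;
    virtual void createEphemeral(const std::string & path) = 0;
    virtual bool exists(const std::string & path) = 0;
};

/// Sketch of the watchdog idea: one dedicated thread per host that
/// (a) recreates this host's ephemeral "alive" node if the session was lost, and
/// (b) notices quickly when another host stopped publishing its "alive" node.
class StageSyncWatchdog
{
public:
    StageSyncWatchdog(ZooKeeperClient & zk_, std::string self_, std::vector<std::string> peers_)
        : zk(zk_), self_alive_node(std::move(self_)), peer_alive_nodes(std::move(peers_))
        , thread([this] { run(); })
    {
    }

    ~StageSyncWatchdog()
    {
        {
            std::lock_guard lock(mutex);
            stopped = true;
        }
        cv.notify_all();
        thread.join();
    }

    bool peerFailed() const { return peer_failed.load(); }

private:
    void run()
    {
        std::unique_lock lock(mutex);
        while (!stopped)
        {
            /// Recreate our own ephemeral node; it disappears if the ZooKeeper session expires.
            if (!zk.exists(self_alive_node))
                zk.createEphemeral(self_alive_node);

            /// A missing peer node means that peer crashed or lost its session.
            for (const auto & node : peer_alive_nodes)
                if (!zk.exists(node))
                    peer_failed.store(true);

            /// Short fixed period instead of the old "from time to time quite randomly".
            cv.wait_for(lock, std::chrono::milliseconds(500), [this] { return stopped; });
        }
    }

    ZooKeeperClient & zk;
    std::string self_alive_node;
    std::vector<std::string> peer_alive_nodes;
    std::mutex mutex;
    std::condition_variable cv;
    bool stopped = false;
    std::atomic<bool> peer_failed{false};
    std::thread thread; /// declared last so it starts after the other members are initialized
};
```

The design point is that a single dedicated thread gives both responsibilities a bounded reaction time, instead of piggybacking the checks on unrelated, irregularly timed operations.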
src/Core/Settings.cpp
@@ -516,14 +516,15 @@ namespace ErrorCodes
     M(UInt64, max_temporary_data_on_disk_size_for_user, 0, "The maximum amount of data consumed by temporary files on disk in bytes for all concurrently running user queries. Zero means unlimited.", 0) \
     M(UInt64, max_temporary_data_on_disk_size_for_query, 0, "The maximum amount of data consumed by temporary files on disk in bytes for all concurrently running queries. Zero means unlimited.", 0) \
     \
-    M(UInt64, backup_restore_keeper_max_retries, 20, "Max retries for keeper operations during backup or restore", 0) \
+    M(UInt64, backup_restore_keeper_max_retries, 1000, "Max retries for [Zoo]Keeper operations in the middle of a BACKUP or RESTORE operation. Should be big enough so the whole operation won't fail because of a temporary [Zoo]Keeper failure", 0) \
`20` retries (about 20 × `backup_restore_keeper_retry_max_backoff_ms` = 100 seconds) may be enough for a quick operation, but we don't want a long backup to fail in the middle because the ZooKeeper server was restarting for a couple of minutes. `1000` retries with the default `backup_restore_keeper_retry_max_backoff_ms` (5 seconds) gives 1000 × 5 s = 5000 s, i.e. more than an hour, which seems to be enough to cover any temporary ZooKeeper failure.
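As a hedged illustration of why these two settings bound the total retry window, here is a small sketch of a capped exponential backoff loop. `retryWithBackoff`, the initial 100 ms backoff, and the demo operation are all made up for this example and are not the real `WithRetries` implementation:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>

/// Illustrative only: a capped exponential backoff loop showing how
/// max_retries * max_backoff bounds the total retry window.
template <typename Op>
bool retryWithBackoff(Op && op, unsigned max_retries, std::chrono::milliseconds max_backoff)
{
    auto backoff = std::chrono::milliseconds{100}; /// assumed initial backoff
    for (unsigned attempt = 0; attempt < max_retries; ++attempt)
    {
        if (op())
            return true;
        std::this_thread::sleep_for(backoff);
        backoff = std::min(backoff * 2, max_backoff); /// exponential growth, capped
    }
    return false;
}

int main()
{
    /// With max_retries = 1000 and max_backoff = 5000 ms the worst case is
    /// roughly 1000 * 5 s = 5000 s, i.e. well over an hour of retrying.
    int failures_left = 3;
    bool ok = retryWithBackoff(
        [&] { return failures_left-- <= 0; },   /// fails three times, then succeeds
        /*max_retries=*/1000,
        /*max_backoff=*/std::chrono::milliseconds{5000});
    std::printf("succeeded: %s\n", ok ? "yes" : "no");
}
```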
-    CancellationCode cancelQuery(bool kill);
+    /// Cancels the current query.
+    /// Optional argument `exception` allows to set an exception which checkTimeLimit() will throw instead of "QUERY_WAS_CANCELLED".
+    CancellationCode cancelQuery(bool kill, std::exception_ptr exception = nullptr);
This change was made to keep the same error across the cluster when a backup fails. If an error happens on one host, then we want to cancel the operation on the other hosts too, but it seems kinder (for anyone who will look at the logs) to keep the same error instead of introducing any other error codes.
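A minimal sketch of how the new argument can be used, assuming a simplified stand-in for ClickHouse's real `QueryStatus` class: only the `cancelQuery()` signature is taken from the diff above; the stored-exception body and the `checkTimeLimit()` behavior follow the comment in the diff.

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

/// Hypothetical stand-in for ClickHouse's QueryStatus (illustrative only).
struct QueryStatusSketch
{
    std::exception_ptr cancel_exception;

    /// Cancels the query; if `exception` is set, checkTimeLimit() will throw it
    /// instead of a generic QUERY_WAS_CANCELLED error.
    void cancelQuery(bool /*kill*/, std::exception_ptr exception = nullptr)
    {
        cancel_exception = exception;
    }

    void checkTimeLimit()
    {
        if (cancel_exception)
            std::rethrow_exception(cancel_exception);
    }
};

int main()
{
    QueryStatusSketch query;

    /// Pretend another host reported this error; forward it on cancellation
    /// so every host in the cluster fails with the same error.
    query.cancelQuery(false, std::make_exception_ptr(std::runtime_error("disk full on host B")));

    try { query.checkTimeLimit(); }
    catch (const std::exception & e) { std::printf("cancelled with: %s\n", e.what()); }
}
```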
    {
        return fmt::formatter<std::string>::format(::to_string(duration), ctx);
    }
};
Here is a small change to allow `fmt::format()` and `LOG_INFO` to print values like `std::chrono::seconds`.
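For context, here is a self-contained sketch of such a formatter specialization, for `std::chrono::seconds` only. The `to_string()` helper below is a stand-in for the project's helper that the fragment above calls; note that {fmt} also ships its own chrono support in `<fmt/chrono.h>`, which the PR apparently does not rely on.

```cpp
#include <chrono>
#include <string>
#include <fmt/format.h>

/// Stand-in for the project's to_string() helper that the diff calls.
std::string to_string(std::chrono::seconds duration)
{
    return std::to_string(duration.count()) + "s";
}

/// Formatter specialization in the spirit of the fragment above:
/// convert the duration to a string and delegate to the std::string formatter.
template <>
struct fmt::formatter<std::chrono::seconds> : fmt::formatter<std::string>
{
    auto format(std::chrono::seconds duration, format_context & ctx) const
    {
        return fmt::formatter<std::string>::format(::to_string(duration), ctx);
    }
};

int main()
{
    fmt::print("{}\n", std::chrono::seconds(42)); /// prints "42s"
}
```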
struct KeeperSettings
This structure `WithRetries::KeeperSettings` was moved to `BackupKeeperSettings`.
…USTER. Fix concurrency check, implement cancelling of distributed backups/restores.
…est "test_stop_other_host_during_backup[True]" because it was replaced by new test "test_long_disconnection_stops_backup".
Test failures are unrelated.
@kssenii I've changed this PR a lot since your last review, do you want to look at it again before I merge it?
@@ -0,0 +1,125 @@
+import random
I have problems with this test right now: I extended the backup meta file for lightweight backups and updated the file version to V2. The `old_node` will not recognize a V2 meta file, which means that as long as we upgrade the file version, this test has to fail ... Any suggestions?
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Better error-handling and cancellation of `ON CLUSTER` backups and restores.
`test_disallow_concurrency`: disabling of concurrency should now work better.