Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrections after reworking backup/restore synchronization #3 #72018

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 16 additions & 5 deletions src/Backups/BackupCoordinationStageSync.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -685,13 +685,13 @@ void BackupCoordinationStageSync::cancelQueryIfError()

{
std::lock_guard lock{mutex};
if (!state.host_with_error)
return;

exception = state.hosts.at(*state.host_with_error).exception;
if (state.host_with_error)
exception = state.hosts.at(*state.host_with_error).exception;
}

chassert(exception);
if (!exception)
return;

process_list_element->cancelQuery(false, exception);
state_changed.notify_all();
}
Expand Down Expand Up @@ -741,6 +741,11 @@ void BackupCoordinationStageSync::cancelQueryIfDisconnectedTooLong()
if (!exception)
return;

/// In this function we only pass the new `exception` (about that the connection was lost) to `process_list_element`.
/// We don't try to create the 'error' node here (because this function is called from watchingThread() and
/// we don't want the watching thread to try waiting here for retries or a reconnection).
/// Also we don't set the `state.host_with_error` field here because `state.host_with_error` can only be set
/// AFTER creating the 'error' node (see the comment for `State`).
process_list_element->cancelQuery(false, exception);
state_changed.notify_all();
}
Expand Down Expand Up @@ -870,6 +875,9 @@ bool BackupCoordinationStageSync::checkIfHostsReachStage(const Strings & hosts,
continue;
}

if (state.host_with_error)
std::rethrow_exception(state.hosts.at(*state.host_with_error).exception);

Comment on lines +878 to +880
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to fix the failure of test test_backup_restore_on_cluster/test_cancel_backup.py::test_cancel_restore.

if (host_info.finished)
throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
"{} finished without coming to stage {}", getHostDesc(host), stage_to_wait);
Expand Down Expand Up @@ -1150,6 +1158,9 @@ bool BackupCoordinationStageSync::checkIfOtherHostsFinish(
if ((host == current_host) || host_info.finished)
continue;

if (throw_if_error && state.host_with_error)
std::rethrow_exception(state.hosts.at(*state.host_with_error).exception);

String reason_text = reason.empty() ? "" : (" " + reason);

String host_status;
Expand Down
3 changes: 3 additions & 0 deletions src/Backups/BackupCoordinationStageSync.h
Original file line number Diff line number Diff line change
Expand Up @@ -197,6 +197,9 @@ class BackupCoordinationStageSync
};

/// Information about all the host participating in the current BACKUP or RESTORE operation.
/// This information is read from ZooKeeper.
/// To simplify the programming logic `state` can only be updated AFTER changing corresponding nodes in ZooKeeper
/// (for example, first we create the 'error' node, and only after that we set or read from ZK the `state.host_with_error` field).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added more comments.

struct State
{
std::map<String /* host */, HostInfo> hosts; /// std::map because we need to compare states
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -667,7 +667,7 @@ def test_long_disconnection_stops_backup():
# A backup is expected to fail, but it isn't expected to fail too soon.
print(f"Backup failed after {time_to_fail} seconds disconnection")
assert time_to_fail > 3
assert time_to_fail < 35
assert time_to_fail < 45
Copy link
Member Author

@vitlibar vitlibar Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to fix the failure of test_backup_restore_on_cluster/test_cancel_backup.py::test_long_disconnection_stops_backup.



# A backup must NOT be stopped if Zookeeper is disconnected shorter than `failure_after_host_disconnected_for_seconds`.
Expand Down
Loading