implement Error mechanism for error reporting and handling. #30544
Currently, we have a minimal error handling mechanism: a `report` method added to the `Error` type. Maybe we need further design for pausing the task and reporting more detailed error messages to the user.
A draft design doc for this:

## Backup Stream Error Handling

### Background

backup-stream is controlled by the BR CLI. After BR has started some 'Tasks', the remaining work is done by TiKV, and the whole procedure from then on is asynchronous. That means that when some unrecoverable error occurs (e.g. the external storage becomes unavailable), the user won't be notified at that time.

Currently, a minimal implementation for reporting errors is done, with a simple method `report` in the `Error` type:

```rust
pub fn report(&self, context: impl Display) {
    error!("backup stream meet error"; "context" => %context, "err" => %self);
    metrics::STREAM_ERROR
        .with_label_values(&[self.kind()])
        .inc()
}
```

It simply prints a log and increases a counter in Prometheus. This is not good enough for our scenario: the BR CLI cannot query further information about the error, and the task cannot be paused until we recover from this error.

### Design

#### Error Reporting

There will be some new keys added to the metadata of the task.
The message `LastError`:

```protobuf
message LastError {
    // the unix epoch time (in millisecs?) at which the error was reported.
    uint64 happen_at = 1;
    // the unified error code of the error.
    string error_code = 2;
    // the user-friendly error message.
    string error_message = 3;
}
```

Some new methods will be added to the `Endpoint` and `MetaClient` types, which support reporting an error and pausing the corresponding task:
```rust
fn report_and_pause(&self, err: Error, task: String, ctx: impl Display) {
    err.report(ctx);
    let r = self.meta_cli.report_last_error(LastError {
        happen_at: Instant::now().unix_milli(),
        error_code: err.err_code(),
        error_message: err.to_string(),
    });
    if let Err(e) = r {
        error!("failed to report err", ...)
    }
    self.on_pause(task);
}
```
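The `MetaClient::report_last_error` side is not spelled out above. Below is a minimal, self-contained sketch of what it could look like; the `MetaStore` trait, the key layout, and all names here are assumptions for illustration, not the actual TiKV / BR API. The real implementation would serialize the protobuf `LastError` and write it to the metadata store.

```rust
// Illustrative sketch only: `MetaStore` stands in for the real metadata
// store, and the key layout below is a hypothetical example.
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct LastError {
    happen_at: u64,
    error_code: String,
    error_message: String,
}

trait MetaStore {
    fn set(&mut self, key: String, value: Vec<u8>) -> Result<(), String>;
}

// In-memory stand-in so the sketch is runnable.
#[derive(Default)]
struct MemStore(HashMap<String, Vec<u8>>);

impl MetaStore for MemStore {
    fn set(&mut self, key: String, value: Vec<u8>) -> Result<(), String> {
        self.0.insert(key, value);
        Ok(())
    }
}

struct MetaClient<S: MetaStore> {
    store_id: u64,
    store: S,
}

impl<S: MetaStore> MetaClient<S> {
    // Record the last error of `task` under a per-store key, so the BR CLI
    // can read it back later (e.g. when the user runs `stream status`).
    fn report_last_error(&mut self, task: &str, err: &LastError) -> Result<(), String> {
        let key = format!("/tidb/br-stream/last-error/{}/{}", task, self.store_id);
        // The real implementation would serialize the protobuf message;
        // a debug string keeps this sketch dependency-free.
        self.store.set(key, format!("{:?}", err).into_bytes())
    }
}

fn main() {
    let mut cli = MetaClient { store_id: 1, store: MemStore::default() };
    let err = LastError {
        happen_at: 1_650_000_000_000, // unix epoch millis, sample value
        error_code: "BR:Stream:SampleError".to_owned(), // placeholder code
        error_message: "external storage unavailable".to_owned(),
    };
    cli.report_last_error("task-1", &err).expect("report should succeed");
}
```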
#### Pausing

When the task status is `Paused`, the `Endpoint` should deregister all regions belonging to this task. (For now, we only support running a single task per cluster, so just removing all listening regions and resolvers would be fine? A problem here is that they are distributed across the `Observer` and the `Endpoint`. Maybe we should make them more unified?)

Once the task gets `Running` again, we should do the same things as when it first gets registered: scan the regions with the Leader role in the current store, register the observer over them, and do the initial scanning.
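A rough sketch of that pause/resume flow, with deliberately simplified, hypothetical types (`observed` stands in for whatever region tracking the `Observer`/`Endpoint` pair actually keeps, and `initial_scan` for the initial scanning step):

```rust
// Sketch of pause/resume only; not the real Endpoint.
use std::collections::HashSet;

#[derive(Default)]
struct Endpoint {
    // Region IDs currently observed for the (single) running task.
    observed: HashSet<u64>,
}

impl Endpoint {
    fn on_pause(&mut self, task: &str) {
        // Deregister everything belonging to the task; with only one task
        // per cluster, this is just "drop all listening regions".
        println!("pausing task {task}, dropping {} regions", self.observed.len());
        self.observed.clear();
    }

    fn on_resume(&mut self, task: &str, leader_regions: &[u64]) {
        // Same as first registration: observe the leaders on this store
        // and redo the initial scan for each of them.
        for &region in leader_regions {
            self.observed.insert(region);
            self.initial_scan(region);
        }
        println!("resumed task {task} over {} regions", self.observed.len());
    }

    fn initial_scan(&self, region: u64) {
        println!("initial scanning of region {region}");
    }
}

fn main() {
    let mut ep = Endpoint::default();
    ep.on_resume("task-1", &[1, 2, 3]);
    ep.on_pause("task-1");
}
```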
#### The BR CLI

As per the spec, the current status of the task should be added to the `stream status` command. (Shall we make `PAUSED` and `PAUSED_DUE_TO_ERROR` distinct statuses?)

The user may clear the error and retry starting the task via something like `stream resume`; maybe provide a flag `--clear-error` to remove the last error from the task. (Should / can the resume procedure be synchronous?)
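A sketch of what resuming with error clearing might do on the metadata side, reusing the hypothetical key layout from the earlier sketch; the `--clear-error` behavior and these keys are proposals in this doc, not an existing BR interface:

```rust
// Illustrative only: `meta` stands in for the task metadata store, and the
// key prefixes below are hypothetical.
use std::collections::BTreeMap;

fn resume_task(meta: &mut BTreeMap<String, Vec<u8>>, task: &str, clear_error: bool) {
    if clear_error {
        // Drop every per-store last-error entry recorded for this task.
        let prefix = format!("/tidb/br-stream/last-error/{}/", task);
        meta.retain(|key, _| !key.starts_with(&prefix));
    }
    // Mark the task as running again so the TiKV nodes re-register regions
    // and redo the initial scan, as described in the Pausing section.
    meta.insert(format!("/tidb/br-stream/status/{}", task), b"running".to_vec());
}

fn main() {
    let mut meta = BTreeMap::new();
    meta.insert("/tidb/br-stream/last-error/task-1/1".to_owned(), b"some error".to_vec());
    resume_task(&mut meta, "task-1", true);
    assert!(meta.keys().all(|key| !key.contains("last-error")));
}
```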