
implement Error mechanism for errors reporting and handling. #30544

Open
Tracked by #29501
3pointer opened this issue Dec 8, 2021 · 2 comments
Comments

@3pointer
Contributor

3pointer commented Dec 8, 2021

  1. Report errors to Prometheus.
  2. Build a mechanism to handle all errors.
3pointer changed the title from "implement Error storage for reporting errors." to "implement Error mechanism for errors reporting and handling." on Dec 8, 2021
@YuJuncen
Contributor

Currently, we have built a minimal error handling mechanism by adding a report method to the Error type.

https://github.com/3pointer/tikv/blob/d17d3db9a355965b0bf89314bbfa2028707a7ba7/components/backup-stream/src/errors.rs#L54-L60

Maybe we need further design for pausing the task and reporting a more detailed error message to the user.

@YuJuncen
Contributor

A draft design doc for this:

Backup Stream Error Handling

Background

backup-stream is controlled by the BR CLI. After BR starts some 'tasks', the remaining work is done by TiKV, and from then on the whole procedure is asynchronous. That means that when some unrecoverable error occurs (e.g. the external storage becomes unavailable), the user won't be notified at that time.

Currently, a minimal implementation for reporting errors is in place: a simple report method on the Error type.

pub fn report(&self, context: impl Display) {
    error!("backup stream meet error"; "context" => %context, "err" => %self);
    metrics::STREAM_ERROR
        .with_label_values(&[self.kind()])
        .inc()
}
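
For example, a call site that hits an error today can only do something like the following (do_flush and task are purely illustrative names, not existing code):

// Illustrative call site: this logs the error and bumps the Prometheus
// counter, then carries on; nothing is persisted and the task keeps running.
if let Err(err) = do_flush(&task) {
    err.report("flushing backup stream data");
}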

It simply prints a log and increases a counter in Prometheus. This is not good enough for our scenario: the BR CLI cannot query further information about the error, and the task cannot be paused until we recover from this error.

Design

Error Reporting

There will be some new keys added to the metadata of the task:

{prefix}/last_error/{task_name:string}/{store_id:int} → LastError
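
For illustration, the key could be composed on the TiKV side by a small helper like the one below (the helper name and the example prefix are assumptions, not existing code):

// Hypothetical helper: build the metadata key holding the last error a given
// store reported for a given task. `prefix` is assumed to be the task metadata
// prefix that backup-stream already uses for its other keys.
fn last_error_key(prefix: &str, task_name: &str, store_id: u64) -> Vec<u8> {
    format!("{}/last_error/{}/{}", prefix, task_name, store_id).into_bytes()
}

// e.g. last_error_key("/tidb/br-stream", "little_schema", 4)
//      => b"/tidb/br-stream/last_error/little_schema/4"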

The LastError type is defined by the following protocol buffer code:

message LastError {
    // the unix epoch time (in millisecs?) at which the error was reported.
    uint64 happen_at = 1;
    // the unified error code of the error.
    string error_code = 2;
    // the user-friendly error message.
    string error_message = 3;
}

Some new methods will be added to the Endpoint type and the MetaClient type, which support reporting an error and pausing the corresponding task.

fn report_and_pause(&self, err: Error, task: String, ctx: impl Display) {
    err.report(ctx);
    let r = self.meta_cli.report_last_error(&task, LastError {
        // Instant cannot produce a unix timestamp; SystemTime is needed here.
        happen_at: SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default()
            .as_millis() as u64,
        error_code: err.err_code(),
        error_message: err.to_string(),
    });
    if let Err(e) = r {
        error!("failed to report err", ...)
    }
    self.on_pause(task);
}
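
A possible shape for MetaClient::report_last_error, assuming rust-protobuf style codegen for LastError and a meta store exposing an async set of raw key/value bytes (PREFIX, meta_store, store_id, the set signature, and the error conversion are all assumptions, not the existing API):

impl MetaClient {
    /// Sketch: encode the LastError and write it under the per-store
    /// last_error key of the task, following the key layout above.
    async fn report_last_error(&self, task_name: &str, last_error: LastError) -> Result<()> {
        use protobuf::Message as _;
        let key = format!("{}/last_error/{}/{}", PREFIX, task_name, self.store_id);
        // Conversion from the protobuf error type is assumed to exist.
        let value = last_error.write_to_bytes()?;
        self.meta_store.set(key.into_bytes(), value).await
    }
}

If the meta store is asynchronous like this, the report_and_pause call above would need to await the result (or block on it) before pausing the task.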

Pausing

When the task status is Paused, the Endpoint should deregister all regions belonging to this task. (For now, we only support running a single task per cluster, so just removing all observed regions and resolvers would be fine? A problem here is that they are distributed between the Observer and the Endpoint. Maybe we should make them more unified?)

Once it gets Running again, we should do the same things as when the task is first registered: scan the regions whose Leader peer is on the current store, register the observer over them, and do the initial scan.
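
A rough sketch of what those two transitions could look like on the Endpoint (all field and method names below are illustrative, not the existing backup-stream API):

impl Endpoint {
    /// Hypothetical: stop observing everything that belongs to the paused task.
    /// With only one task per cluster supported for now, dropping all observed
    /// regions and resolvers is assumed to be sufficient.
    fn on_pause(&mut self, task: &str) {
        self.observed_regions.clear();
        self.resolvers.clear();
        info!("backup stream task paused, stopped observing"; "task" => task);
    }

    /// Hypothetical: resuming behaves like the first registration of the task:
    /// scan leader regions on this store, register the observer over them,
    /// and run the initial scan.
    async fn on_resume(&mut self, task: StreamTask) {
        self.register_task(task).await;
    }
}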

The BR CLI

As specified, the current status of the task should be added to the output of the stream status command.

name: little_schema
status: PAUSED
# If there is an error.
error_code: ErrPiTRFailToFlush
error_message: "the storage has reached its quota"
start: 2022-03-23 14:42:27.529 +0800 CST
...

(Shall we make PAUSED and PAUSED_DUE_TO_ERROR distinct statuses?)

The user may clear the error and restart the task via something like stream resume; maybe provide a --clear-error flag to remove the last error from the task.

(Should / can the resume procedure be synchronous?)
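
If a --clear-error flag is added, resuming could wipe every per-store last-error entry before flipping the task back to Running. A hypothetical metadata-side helper (delete_prefix, PREFIX, and meta_store are assumptions, not the existing API):

impl MetaClient {
    /// Sketch: remove all `{prefix}/last_error/{task_name}/...` entries so a
    /// resumed task starts without stale errors.
    async fn clear_last_errors(&self, task_name: &str) -> Result<()> {
        let key_prefix = format!("{}/last_error/{}/", PREFIX, task_name);
        self.meta_store.delete_prefix(key_prefix.into_bytes()).await
    }
}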
