
implement Error mechanism for errors reporting and handling. #30544

Open
Tracked by #29501
3pointer opened this issue Dec 8, 2021 · 2 comments
Comments

@3pointer
Contributor

3pointer commented Dec 8, 2021

  1. Report errors to Prometheus.
  2. Build a mechanism to handle all errors.
3pointer changed the title from "implement Error storage for reporting errors." to "implement Error mechanism for errors reporting and handling." on Dec 8, 2021
@YuJuncen
Contributor

Currently, we have built a minimal error handling mechanism by adding a report method to the Error type.

https://github.com/3pointer/tikv/blob/d17d3db9a355965b0bf89314bbfa2028707a7ba7/components/backup-stream/src/errors.rs#L54-L60

Maybe we need further design for pausing the task and reporting a more detailed error message to the user.

@YuJuncen
Contributor

A draft design doc for this:

Backup Stream Error Handling

Background

backup-stream is controlled by the BR CLI. After BR starts some 'tasks', the remaining work is done by TiKV, and from then on the whole procedure is asynchronous. That means that when some unrecoverable error occurs (e.g. the external storage becomes unavailable), the user won't be notified at that time.

Currently, a minimal implementation for reporting errors is in place: a simple report method on the Error type.

pub fn report(&self, context: impl Display) {
    error!("backup stream meet error"; "context" => %context, "err" => %self);
    metrics::STREAM_ERROR
        .with_label_values(&[self.kind()])
        .inc()
}
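
For example, a call site that hits an error today can only do something like the following (do_flush and task are purely illustrative names, not existing code):

// Illustrative call site: this logs the error and bumps the Prometheus
// counter, then carries on; nothing is persisted and the task keeps running.
if let Err(err) = do_flush(&task) {
    err.report("flushing backup stream data");
}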

It simply prints a log and increases a counter in Prometheus. This is not good enough for our scenario: the BR CLI cannot query further information about the error, and the task cannot be paused until we recover from this error.

Design

Error Reporting

There will be some new keys added to the metadata of the task:

{prefix}/last_error/{task_name:string}/{store_id:int} → LastError
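
For illustration, the key could be composed on the TiKV side by a small helper like the one below (the helper name and the example prefix are assumptions, not existing code):

// Hypothetical helper: build the metadata key holding the last error a given
// store reported for a given task. `prefix` is assumed to be the task metadata
// prefix that backup-stream already uses for its other keys.
fn last_error_key(prefix: &str, task_name: &str, store_id: u64) -> Vec<u8> {
    format!("{}/last_error/{}/{}", prefix, task_name, store_id).into_bytes()
}

// e.g. last_error_key("/tidb/br-stream", "little_schema", 4)
//      => b"/tidb/br-stream/last_error/little_schema/4"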

The LastError type is defined by the following protocol buffer code:

message LastError {
    // the unix epoch time (in millisecs?) at which the error was reported.
    uint64 happen_at = 1;
    // the unified error code of the error.
    string error_code = 2;
    // the user-friendly error message.
    string error_message = 3;
}

Some new methods will be added to the Endpoint type and the MetaClient type, which support reporting an error and pausing the corresponding task.

fn report_and_pause(&self, err: Error, task: String, ctx: impl Display) {
    err.report(ctx);
    let r = self.meta_cli.report_last_error(&task, LastError {
        // Instant cannot produce a unix timestamp; SystemTime is needed here.
        happen_at: SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default()
            .as_millis() as u64,
        error_code: err.err_code(),
        error_message: err.to_string(),
    });
    if let Err(e) = r {
        error!("failed to report err", ...)
    }
    self.on_pause(task);
}
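
A possible shape for MetaClient::report_last_error, assuming rust-protobuf style codegen for LastError and a meta store exposing an async set of raw key/value bytes (PREFIX, meta_store, store_id, the set signature, and the error conversion are all assumptions, not the existing API):

impl MetaClient {
    /// Sketch: encode the LastError and write it under the per-store
    /// last_error key of the task, following the key layout above.
    async fn report_last_error(&self, task_name: &str, last_error: LastError) -> Result<()> {
        use protobuf::Message as _;
        let key = format!("{}/last_error/{}/{}", PREFIX, task_name, self.store_id);
        // Conversion from the protobuf error type is assumed to exist.
        let value = last_error.write_to_bytes()?;
        self.meta_store.set(key.into_bytes(), value).await
    }
}

If the meta store is asynchronous like this, the report_and_pause call above would need to await the result (or block on it) before pausing the task.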

Pausing

When the task status is Paused, the Endpoint should deregister all regions belonging to this task. (For now, we only support running a single task per cluster, so just removing all observed regions and resolvers would be fine? A problem here is that they are distributed between the Observer and the Endpoint. Maybe we should make them more unified?)

Once it gets Running again, we should do the same things as when the task is first registered: scan the regions whose Leader peer is on the current store, register the observer over them, and do the initial scan.
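
A rough sketch of what those two transitions could look like on the Endpoint (all field and method names below are illustrative, not the existing backup-stream API):

impl Endpoint {
    /// Hypothetical: stop observing everything that belongs to the paused task.
    /// With only one task per cluster supported for now, dropping all observed
    /// regions and resolvers is assumed to be sufficient.
    fn on_pause(&mut self, task: &str) {
        self.observed_regions.clear();
        self.resolvers.clear();
        info!("backup stream task paused, stopped observing"; "task" => task);
    }

    /// Hypothetical: resuming behaves like the first registration of the task:
    /// scan leader regions on this store, register the observer over them,
    /// and run the initial scan.
    async fn on_resume(&mut self, task: StreamTask) {
        self.register_task(task).await;
    }
}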

The BR CLI

As specified, the current status of the task should be added to the output of the stream status command.

name: little_schema
status: PAUSED
# If there is an error.
error_code: ErrPiTRFailToFlush
error_message: "the storage has reached its quota"
start: 2022-03-23 14:42:27.529 +0800 CST
...

(Shall we make PAUSED and PAUSED_DUE_TO_ERROR distinct statuses?)

The user may clear the error and restart the task via something like stream resume; maybe provide a --clear-error flag to remove the last error from the task.

(Should / can the resume procedure be synchronous?)
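
If a --clear-error flag is added, resuming could wipe every per-store last-error entry before flipping the task back to Running. A hypothetical metadata-side helper (delete_prefix, PREFIX, and meta_store are assumptions, not the existing API):

impl MetaClient {
    /// Sketch: remove all `{prefix}/last_error/{task_name}/...` entries so a
    /// resumed task starts without stale errors.
    async fn clear_last_errors(&self, task_name: &str) -> Result<()> {
        let key_prefix = format!("{}/last_error/{}/", PREFIX, task_name);
        self.meta_store.delete_prefix(key_prefix.into_bytes()).await
    }
}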
