Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for handling SAI failures in orchagent #762

Merged
merged 11 commits into from
Jul 22, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added doc/SAI_failure_handling/Framework.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
123 changes: 123 additions & 0 deletions doc/SAI_failure_handling/SAI_failure_handling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# HLD for handling SAI failures

## 1. Scope

This document describes the high-level design for orchagent in handling SAI failures

## 2. Failure Handling Framework

### 2.1 Requirements for failure handling functions in orchagent

1. Allow different handling for Create/Set/Remove/Get operations.
1. Allow each Orch to have its specific handling logic.
1. Adapt handling logic based on SAI API type and SAI status.
1. Escalate the failure to upper layers when the failure cannot be handled in orchagent.

### 2.2 An overview of the failure handling framework

An illustrative figure of the failure handling framework is shown below.
The orchagent generates SAI calls according to the information in APPL_DB given by upper layers.
In the case of SAI failures, the orchagent gets the failure status via the feedback mechanism in synchronous mode.
Based on the failure information, the failure handling functions in the orchagent make the first attempt to address the failure.
An ERROR_DB is also introduced to support escalation to upper layers.
In the scenario where the orchagent is unable to resolve the problem, the failure handling functions would escalate the failure to the upper layers by pushing the failure into the ERROR_DB.

<img src="Framework.png">

### 2.3 Failure handling functions in Orchagent

#### 2.3.1 Failure handling functions

To support a failure handling logic in general while also allow each orch to have its specific logic, we include the following virtual functions in Orch

1. `virtual task_process_status handleSaiCreateStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
2. `virtual task_process_status handleSaiSetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
3. `virtual task_process_status handleSaiRemoveStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`
4. `virtual task_process_status handleSaiGetStatus(sai_api_t api, sai_status_t status, void *context = nullptr)`

The four functions handle SAI failures in create, set, remove, and get operations, respectively.
With the type of SAI API and SAI status as an input, the function could handle the failure according to the two pieces of information.

In the scenario where a specific logic is required in one of the Orchs, this design allows the Orch to inherit the function and include the specific login in the inherited function.

The function also allows an optional input `context`, which allows passing context (e.g., object entry, attribute, etc.) into the function so that it could escalate the information to the ERROR_DB and upper layers.

#### 2.3.2 Possible execution results

1. Return `task_success` -- No crash, no retry, handled successfully.

The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).

2. Return `task_failed` -- No crash, no retry, although the failure is not handled successfully, orchagent will keep running without exit.

The failure handling function should return `task_failed` when the failure is unable to be handled in orchagent and another attempt is not likely to resolve the failure. In such a scenario, the function should prevent orchagent from retrying and escalate the failure to upper layers. Example scenarios where the functions should return this status include invalid user input (e.g., a wrong ACL, conflicting IP addresses), hardware permanent error, non-critical internal logic error, etc.

3. Return `task_need_retry` -- No crash, retry

The failure handling function should return `task_need_retry` when the failed SAI call may be resolved in a subsequent attempt.

4. exit(EXIT_FAILURE) -- Crash and trigger SwSS auto-restart

Some of the failures can be resolved by restarting SwSS.
In the scenario where such failures happen, the failure handling function in the orchagent should exit with `EXIT_FAILURE` and trigger SwSS auto restart.

### 2.4 DB changes

An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd.

The schema of ERROR_DB is designed as follows:
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema

If you accumulate them into a table, who will delete them? otherwise it will keep growing in memory. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a discussion in the design that it is necessary to avoid accumulating failures in ERROR_DB and consuming memory. To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. Also included a possible implementation of this behavior.


```
ERROR_{{DB_TYPE}}_{{TABLE_TYPE}}_TABLE|entry
"failed_orch": {{failed orch type}}
"failed_SAI": {{failed SAI type}}
"opcode": {{failed SAI operation type}}
"status": {{sai_status}}
"attributes": "attr_type0,attr_type1,..." (Optional)
"attr_values": "attr_value0,attr_value1,..." (Optional)
"counter": {{count}}
```

The table and key in ERROR_DB correspond to the DB, table, and key where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_APPL_ROUTE_TABLE:0.0.0.0/0`).

The field `failed_orch` indicates the type of orch where the SAI failure happens.

The field `failed_SAI` indicates the type of SAI in which the SAI failure happens.

The field `opcode` indicates the method that failed.
Possible values include `CREATE/SET/DELETE`.

The field `status` stores the status of the SAI operation (e.g., SAI_STATUS_NOT_SUPPORTED, SAI_STATUS_FAILURE).

The ERROR_DB entry also includes a list of attributes (comma separated) and the corresponding values (comma separated) that the failed operation tries to set.

The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.

The upstream processes are expected to consume the ERROR_DB entries and remove the handled failures from the ERROR_DB.
Assuming the upstream processes have the proper consumption of ERROR_DB entries and failure handling logic (these are not currently available for upstreams processes and need to be added), the ERROR_DB should not keep accumulating failures in ERROR_DB and consuming memory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an entry is deleted from APPL_DB, any ERROR_DB entries with the same key could be cleared?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be true in most scenarios since the ERROR_DB entries are designed to track the failure in applying APPL_DB entries or basically mismatch between APPL_DB and the hardware. When an entry is removed from ERROR_DB, the mismatch between APPL_DB and the hardware should get cleared, and most likely the corresponding ERROR_DB entry is no longer needed. Yet if the upstream process believes that extra operation is needed for the failure, I think the upstream process could have the freedom to hold the entry until the operations are done.


## 3. Failure handling logic in orchagent

### 3.1 Failure status that could be handled in orchagent

| SAI status | Create | Set | Remove | Get
|-----|-----|-----|-----|-----|
| ITEM ALREADY EXISTS | Set the corresponding attribute instead. | Should not happen. No retry. | Should not happen. No retry. |Should not happen. No retry. |
| ITEM NOT FOUND  | Should not happen. No retry. | Create the item and set attribute. | Return success. No retry. | No retry. |
| OBJECT IN USE | Should not happen. No retry. | Should not happen. Retry after a while. | Retry after a while. | Should not happen. No retry. |Should not happen. No retry. |
| NOT_SUPPORTED | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. | Should not happen. Crash orchagent. |

TODO: Add handling logic for other SAI statuses.

### 3.2 SAI API specific handling logic

TODO: Add SAI API specific handling logic

### 3.3 Orch specific handling logic

TODO: Add Orch specific handling logic

## 4. Warm boot support

A warm reboot should not be issued in the scenario with unhandled SAI failures.
A check or ERROR_DB should be added to pre-warm-reboot check functions to prevent doing warm reboots with unhandled SAI failures.