Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a mechanism that to enable a NOS running SAI and SDK, to detect SDK and FW failures #1615

Closed

Conversation

amandyuk
Copy link
Contributor

@amandyuk amandyuk commented Oct 3, 2022

Add new health event, for SAI to notify severity, timestamp and description of a SDK/FW health event

Signed-off-by: Andrii Mandiuk [email protected]

@mikeberesford
Copy link
Contributor

@JaiOCP this looks to be partially overlapping with #1307

@rlhui
Copy link
Collaborator

rlhui commented Oct 6, 2022

Can we please add example usage?

@rlhui
Copy link
Collaborator

rlhui commented Oct 6, 2022

Intel/ashutosh: no way for nos to subscribe to specific ones they're interested in

@rlhui
Copy link
Collaborator

rlhui commented Oct 6, 2022

mike: this seems to take away control from NOS.

@rlhui
Copy link
Collaborator

rlhui commented Oct 6, 2022

better if nos doesn't need to parse string but the info is in a defined format. Please also checkout https://github.com/sonic-net/SONiC/blob/5677242770e3774a5eab0f951dc14642d81422ec/doc/event-alarm-framework/events-producer.md

@itaibaz
Copy link
Collaborator

itaibaz commented Oct 6, 2022

Community discussion comments :

  1. Add category, to allow the NOS to act based on category and not have to parse the free text description
    Review
    https://github.com/sonic-net/SONiC/blob/5677242770e3774a5eab0f951dc14642d81422ec/doc/event-alarm-framework/events-producer.md

  2. Add a knob, to control the minimum severity of event NOS wants to receive, for example receive just warning or above

  3. Consider rate limitation, so NOS isn't flooded with events

Copy link
Contributor

@reshmaintel reshmaintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itaibaz I like this proposal that this SDK failure detection and reporting is generic and gives the control to the vendor to push specific failure logs with related severity that help user to debug the given vendor SDK/HW. I think that if this feature is only severity level based it may resort to being an alternative to syslog. If there is additional categorization for the failures it may be more valuable to the user to monitor certain categories. Certain broad categories can be suggested. Given that using enums for categories seem restrictive and not having to add new enums for additional categories is one of your goals, please explore if we can make it extensible by adding the category as a field just like the description field.

@amandyuk amandyuk force-pushed the health_event_feature branch 2 times, most recently from d109ac8 to 256d47b Compare November 2, 2022 06:48
@amandyuk amandyuk force-pushed the health_event_feature branch from 256d47b to 79a75b0 Compare March 9, 2023 11:28
@amandyuk amandyuk closed this Mar 9, 2023
@amandyuk amandyuk force-pushed the health_event_feature branch from 79a75b0 to 2966cb3 Compare March 9, 2023 11:29
@itaibaz
Copy link
Collaborator

itaibaz commented Mar 9, 2023

opened #1777 instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants