Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mellanox] set select timeout to no more than 1 sec to make sure fast shutdown #13611

Merged
merged 1 commit into from
Feb 14, 2023

Conversation

Junchao-Mellanox
Copy link
Collaborator

Why I did it

Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue #13591.

How I did it

  1. The select call in get_change_event should use no more than 1 second as timeout parameter.
  2. Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

How to verify it

Manual test

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liat-grozovik
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Copy link
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor
Copy link
Contributor

prgeor commented Feb 7, 2023

@Junchao-Mellanox do you still want to keep in draft?

@Junchao-Mellanox Junchao-Mellanox marked this pull request as ready for review February 8, 2023 03:34
@liat-grozovik
Copy link
Collaborator

@keboliu please review

Copy link
Contributor

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with comments

@@ -387,26 +387,30 @@ def get_change_event(self, timeout=0):
self.sfp_event.initialize()

wait_for_ever = (timeout == 0)
# select timeout should be no more than 1000ms to ensure fast shutdown flow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox may be add more context in the comment why max timeout is kept 1sec?

@@ -387,26 +387,30 @@ def get_change_event(self, timeout=0):
self.sfp_event.initialize()

wait_for_ever = (timeout == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if you still want to keep this timeout=0 for future use

@liat-grozovik liat-grozovik merged commit 389b279 into sonic-net:master Feb 14, 2023
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this pull request Mar 28, 2023
Related work items: sonic-net#276, sonic-net#305, sonic-net#332, sonic-net#338, sonic-net#339, sonic-net#1188, sonic-net#1192, sonic-net#1197, sonic-net#1206, sonic-net#1685, sonic-net#1690, sonic-net#1696, sonic-net#1699, sonic-net#1709, sonic-net#1727, sonic-net#1737, sonic-net#1741, sonic-net#1742, sonic-net#2511, sonic-net#2512, sonic-net#2532, sonic-net#2559, sonic-net#2626, sonic-net#2638, sonic-net#2645, sonic-net#2649, sonic-net#2660, sonic-net#2669, sonic-net#2670, sonic-net#2678, sonic-net#10084, sonic-net#11442, sonic-net#11873, sonic-net#12047, sonic-net#12110, sonic-net#12207, sonic-net#12529, sonic-net#12678, sonic-net#13235, sonic-net#13287, sonic-net#13372, sonic-net#13395, sonic-net#13456, sonic-net#13497, sonic-net#13522, sonic-net#13545, sonic-net#13547, sonic-net#13552, sonic-net#13569, sonic-net#13572, sonic-net#13578, sonic-net#13591, sonic-net#13611, sonic-net#13647, sonic-net#13649, sonic-net#13660, sonic-net#13710, sonic-net#13716, sonic-net#13724, sonic-net#13726, sonic-net#13732, sonic-net#13735, sonic-net#13739, sonic-net#13757, sonic-net#13786, sonic-net#13792, sonic-net#13800, sonic-net#13801, sonic-net#13802, sonic-net#13805, sonic-net#13806, sonic-net#13812, sonic-net#13814, sonic-net#13822, sonic-net#13831, sonic-net#13834, sonic-net#13847, sonic-net#13870, sonic-net#13882, sonic-net#13884, sonic-net#13885, sonic-net#13894, sonic-net#13895, sonic-net#13926, sonic-net#13932, sonic-net#13935, sonic-net#13942, sonic-net#13951, sonic-net#13953, sonic-net#13964
@Junchao-Mellanox Junchao-Mellanox deleted the get_change_event branch June 9, 2023 07:40
@Junchao-Mellanox
Copy link
Collaborator Author

Junchao-Mellanox commented Sep 5, 2023

This is a fix of issue #13591. The PR sonic-net/sonic-platform-daemons#326 which introduced the issue has been cherry-picked to 202205 and 202211, so we need to cherry-pick this to 202211 and 202205 as well.

Hi @StormLiangMS , could you please help cherry-pick to 202211?
Hi @yxieca , could you please help cherry-pick to 202205?

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 5, 2023
… shutdown (sonic-net#13611)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue sonic-net#13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 5, 2023
… shutdown (sonic-net#13611)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue sonic-net#13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #16448

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202205: #16449

yxieca pushed a commit that referenced this pull request Sep 6, 2023
… shutdown (#13611) (#16449)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue #13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test

Co-authored-by: Junchao-Mellanox <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants