[chassis][syncd][sai] Adjusting response timeout during syncd init #2159

Merged 1 commit on Mar 1, 2022

vganesan-nokia

What I did

Fixed the syncd response timeout for the switch create request sent by orchagent.

Why I did it

In a VOQ-based chassis where syncd uses the VOQ SAI, SAI takes more than 1 minute to complete switch create initialization when there is a large number of front panel ports. As a result, the switch create request sent by orchagent does not receive a response within the default response wait time of 1 minute, so orchagent declares a switch create failure and crashes.

The number of ports that SAI needs to initialize depends on the number of ports per ASIC and the total number of system ports configured in the system. The total number of system ports in turn depends on the number of line cards supported, the number of ASICs per line card, and the number of ports supported by each ASIC. In a fully populated system, which is a commonly expected scenario, this crash will therefore occur.
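As a rough illustration of how quickly the system port count grows, the sketch below multiplies the three factors named above. Treating the total as a simple product assumes a uniformly populated chassis, and the function name and sample numbers are hypothetical, not taken from any real platform.

```python
def total_system_ports(line_cards: int, asics_per_card: int, ports_per_asic: int) -> int:
    """Estimate system ports for a uniformly populated chassis.

    Assumption: every line card carries the same number of ASICs and every
    ASIC exposes the same number of ports, so the total is a plain product.
    """
    return line_cards * asics_per_card * ports_per_asic


# Hypothetical chassis sizes, chosen only to show the scaling.
single_card = total_system_ports(line_cards=1, asics_per_card=2, ports_per_asic=64)
full_chassis = total_system_ports(line_cards=8, asics_per_card=2, ports_per_asic=64)
print(single_card, full_chassis)
```

Under these assumed numbers, filling all slots multiplies the number of system ports each SAI instance has to account for by the number of line cards, which is why the fully populated case is the one that exceeds the default wait time.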

To fix this, orchagent sets the syncd response timeout to 5 minutes for a line (voq) card and 10 minutes for a supervisor (fabric) card before sending the switch create request, and sets it back to the default wait time after the switch create completes.
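The raise-then-restore pattern described here can be sketched as follows. This is not the orchagent code (which is C++); `FakeSyncdClient`, `switch_create_timeout`, and `extended_timeout` are hypothetical names used only for illustration, and only the three durations come from the description above.

```python
from contextlib import contextmanager

DEFAULT_TIMEOUT_S = 60           # default response wait time of 1 minute
LINE_CARD_TIMEOUT_S = 5 * 60     # line (voq) card
SUPERVISOR_TIMEOUT_S = 10 * 60   # supervisor (fabric) card


class FakeSyncdClient:
    """Hypothetical stand-in for the orchagent-to-syncd request channel."""

    def __init__(self):
        self.response_timeout_s = DEFAULT_TIMEOUT_S
        self.timeout_seen_by_create = None

    def create_switch(self):
        # Record which timeout was in effect while the request was pending.
        self.timeout_seen_by_create = self.response_timeout_s


def switch_create_timeout(card_type: str) -> int:
    """Pick the response timeout to use while switch create is in flight."""
    if card_type == "voq":
        return LINE_CARD_TIMEOUT_S
    if card_type == "fabric":
        return SUPERVISOR_TIMEOUT_S
    return DEFAULT_TIMEOUT_S


@contextmanager
def extended_timeout(client: FakeSyncdClient, card_type: str):
    """Raise the timeout for the enclosed request, then reset it to the default."""
    client.response_timeout_s = switch_create_timeout(card_type)
    try:
        yield client
    finally:
        # Assumption: reset even if switch create raises.
        client.response_timeout_s = DEFAULT_TIMEOUT_S


client = FakeSyncdClient()
with extended_timeout(client, "voq"):
    client.create_switch()
```

Scoping the longer timeout to the switch create call keeps every later request on the 1-minute default, so an unresponsive syncd is still detected promptly after initialization.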

How I verified it

  • In a VOQ-based chassis that uses BCM SAI, populate the chassis with a line card whose ASICs have more than 62 ports.
  • Configure 192+ system ports.
  • Reboot the line card.
  • Observe that orchagent does not crash with a switch create failure.
  • Observe that syncd does not show a response timeout error.
  • Observe that the swss and syncd dockers are up and running and all the interfaces come up.

Details if related

Signed-off-by: vedganes <[email protected]>
@judyjoseph

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@judyjoseph

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny merged commit 6e5ed1c into sonic-net:master on Mar 1, 2022
judyjoseph pushed a commit that referenced this pull request Mar 1, 2022
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request Aug 6, 2022