[202405]: featured enable/start failed for pmon, snmp, lldp services with SIGPIPE #20662

Closed
anamehra opened this issue Oct 31, 2024 · 1 comment · Fixed by sonic-net/sonic-host-services#177
Labels
Cisco, Triaged

Comments

@anamehra
Contributor

Description

During some reboots of the supervisor on a chassis, featured did not start the pmon, snmp, lldp, gnmi, and mgmt-framework services. This caused all SONiC service dockers on the supervisor to remain down.

2024 Oct 29 01:31:26.191236 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'unmask', 'pmon.service']'
2024 Oct 29 01:31:26.211167 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:27.212381 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'pmon.service']'
2024 Oct 29 01:31:27.232428 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:28.135667 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'pmon.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:31:28.135746 aaa14-rp ERR featured: Feature 'pmon.service' failed to be enabled and started

2024 Oct 29 01:34:08.661711 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'gnmi.service']'
2024 Oct 29 01:34:08.677242 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:34:09.316554 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'gnmi.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:34:09.316791 aaa14-rp ERR featured: Feature 'gnmi.service' failed to be enabled and started

Though the issue looks generic, it has been observed only on the Supervisor so far.

Steps to reproduce the issue:

Describe the results you received:

Describe the results you expected:

Output of show version:

sonic-buildimage 202405, SHA bf0d9fafd13af26e6500a04c0b031254adf8fc73

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@anamehra
Contributor Author

Hi @abdosi , I am working on a PR for this.

@arlakshm added the Triaged and Cisco labels on Nov 6, 2024
qiluo-msft pushed a commit to sonic-net/sonic-host-services that referenced this issue Nov 18, 2024
Fixes: sonic-net/sonic-buildimage#20662

During some reboots, the featured.service script sometimes fails to start services such as pmon, snmp, and lldp.

From the logs, the 'sudo systemctl enable' command failed with return code -13, meaning the process was terminated by signal 13 (SIGPIPE).
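
For reading the return code in these logs: on POSIX, Python's subprocess module reports a child terminated by signal N as return code -N. A minimal sketch of that decoding follows. The helper name `describe_returncode` is hypothetical and is not part of featured.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code (hypothetical helper, not from featured)."""
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


# The value seen in the logs above.
print(describe_returncode(-13))  # prints: killed by signal 13 (SIGPIPE)
```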

2024 Oct 29 01:31:26.191236 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'unmask', 'pmon.service']'
2024 Oct 29 01:31:26.211167 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:27.212381 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'pmon.service']'
2024 Oct 29 01:31:27.232428 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:28.135667 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'pmon.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:31:28.135746 aaa14-rp ERR featured: Feature 'pmon.service' failed to be enabled and started

2024 Oct 29 01:34:08.661711 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'gnmi.service']'
2024 Oct 29 01:34:08.677242 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:34:09.316554 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'gnmi.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:34:09.316791 aaa14-rp ERR featured: Feature 'gnmi.service' failed to be enabled and started
The issue does not recover, and pmon and the other services never start. On the supervisor, this also leaves swss, syncd, and other related dockers down.

In general, systemctl enable does not work for services such as pmon, snmp, and lldp, because their unit files have no WantedBy= directive.

The command writes the following to stderr:

"The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
"
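
As a rough illustration of the condition the message above describes, the sketch below scans unit file text for an [Install] section with at least one of the listed settings. It is a simplified assumption-laden parser, not how systemd itself decides: it ignores line continuations and the DefaultInstance= case for template units. The function name and both sample unit texts are hypothetical and are not the real pmon.service.

```python
# Settings named in the systemctl message quoted above.
INSTALL_KEYS = {"WantedBy", "RequiredBy", "Also", "Alias"}


def has_install_config(unit_text: str) -> bool:
    """Return True if the text has an [Install] section with a usable setting."""
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Install" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() in INSTALL_KEYS and value.strip():
                return True
    return False


# Hypothetical unit texts, for illustration only.
STATIC_UNIT = "[Unit]\nDescription=example\n\n[Service]\nExecStart=/bin/true\n"
ENABLEABLE_UNIT = STATIC_UNIT + "\n[Install]\nWantedBy=multi-user.target\n"

print(has_install_config(STATIC_UNIT))
print(has_install_config(ENABLEABLE_UNIT))
```

A unit for which this returns False is one that `systemctl enable` would report as having no installation config.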
The featured Python script invokes the command with subprocess.check_call(), which appears not to handle stderr reliably and may lead to SIGPIPE when the output is large.

Modifying the function to use subprocess.run() resolves this issue.

run() is more reliable at handling the returned data.

Validated the change with multiple reboots.
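
The change described in this commit message might look roughly like the sketch below. It is not the actual patch from sonic-host-services#177. The function name `run_cmd`, its signature, and its return shape are assumptions made for illustration. The idea is that `subprocess.run()` with `capture_output=True` reads both stdout and stderr to completion, so a long stderr message is collected rather than left unread.

```python
import subprocess
import sys


def run_cmd(cmd, raise_exception=False):
    """Run cmd and capture its output (illustrative sketch, not the real patch).

    Returns a (returncode, stdout, stderr) tuple.
    """
    # capture_output=True pipes stdout and stderr and drains both until EOF.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if proc.returncode != 0 and raise_exception:
        raise subprocess.CalledProcessError(
            proc.returncode, cmd, output=proc.stdout, stderr=proc.stderr
        )
    return proc.returncode, proc.stdout, proc.stderr


# Demo: a child process that writes a large message to stderr and exits non-zero.
noisy_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('x' * 200000); sys.exit(1)",
]
rc, out, err = run_cmd(noisy_cmd)
print(rc, len(err))
```

With this shape, the caller can log the captured stderr alongside the return code instead of the `output: None` seen in the logs above.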
mssonicbld pushed a commit to mssonicbld/sonic-host-services that referenced this issue Nov 22, 2024
…ic-net#177)

mssonicbld pushed a commit to sonic-net/sonic-host-services that referenced this issue Nov 22, 2024