Skip to content

Commit

Permalink
featured: use run() to run cli commands in place of check_call() (#177)
Browse files Browse the repository at this point in the history
Fixes: sonic-net/sonic-buildimage#20662

During some reboots, it was observed that some times featured.service script command fails to start the services like pmon, snmp, lldp etc.

From logs, it was observed that 'sudo systemctl enable ' command failed with errorcode 13 (SIGPIPE.

2024 Oct 29 01:31:26.191236 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'unmask', 'pmon.service']'
2024 Oct 29 01:31:26.211167 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:27.212381 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'pmon.service']'
2024 Oct 29 01:31:27.232428 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:28.135667 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'pmon.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:31:28.135746 aaa14-rp ERR featured: Feature 'pmon.service' failed to be enabled and started

2024 Oct 29 01:34:08.661711 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'gnmi.service']'
2024 Oct 29 01:34:08.677242 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:34:09.316554 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'gnmi.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:34:09.316791 aaa14-rp ERR featured: Feature 'gnmi.service' failed to be enabled and started
The issue does not recover and the pmon and other services never starts. On supervisor this also leads to swss, syncd and other related docker to stay down.

In general systemctl enable does not work for some services like pmon, snmp, lldp etc as there is no WantBy directive set for these services in unit file.

The command returns stderr :

"The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
”
featured python script uses subprocess.check_call() function to invoke the command which looks like is not very reliable at handling the stderr and may cause SIGPIPE with big buffer data.

Modifying the function to use subprocess.run() resolves this issue.

run() is more reliable at handing the return data.

Validated the change with multiple reboots.
  • Loading branch information
anamehra authored and mssonicbld committed Nov 22, 2024
1 parent dbb063a commit b98abec
Show file tree
Hide file tree
Showing 3 changed files with 192 additions and 187 deletions.
6 changes: 5 additions & 1 deletion scripts/featured
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,11 @@ PORT_INIT_TIMEOUT_SEC = 180

def run_cmd(cmd, log_err=True, raise_exception=False):
try:
subprocess.check_call(cmd)
result = subprocess.run(cmd,
capture_output=True,
check=True, text=True)
syslog.syslog(syslog.LOG_INFO, "Output: {} , Stderr: {}"
.format(result.stdout, result.stderr))
except Exception as err:
if log_err:
syslog.syslog(syslog.LOG_ERR, "{} - failed: return code - {}, output:\n{}"
Expand Down
123 changes: 62 additions & 61 deletions tests/featured/featured_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -175,9 +175,9 @@ def test_sync_state_field(self, test_scenario_name, config_data, fs):
feature_table_state_db_calls = self.get_state_db_set_calls(feature_table)

self.checks_systemd_config_file(device_type, config_data['config_db']['FEATURE'], feature_systemd_name_map)
mocked_subprocess.check_call.assert_has_calls(config_data['enable_feature_subprocess_calls'],
mocked_subprocess.run.assert_has_calls(config_data['enable_feature_subprocess_calls'],
any_order=True)
mocked_subprocess.check_call.assert_has_calls(config_data['daemon_reload_subprocess_call'],
mocked_subprocess.run.assert_has_calls(config_data['daemon_reload_subprocess_call'],
any_order=True)
feature_state_table_mock.set.assert_has_calls(feature_table_state_db_calls)
self.checks_systemd_config_file(device_type, config_data['config_db']['FEATURE'], feature_systemd_name_map)
Expand Down Expand Up @@ -227,9 +227,9 @@ def test_handler(self, test_scenario_name, config_data, fs):
feature_systemd_name_map[feature_name] = feature_names

self.checks_systemd_config_file(device_type, config_data['config_db']['FEATURE'], feature_systemd_name_map)
mocked_subprocess.check_call.assert_has_calls(config_data['enable_feature_subprocess_calls'],
mocked_subprocess.run.assert_has_calls(config_data['enable_feature_subprocess_calls'],
any_order=True)
mocked_subprocess.check_call.assert_has_calls(config_data['daemon_reload_subprocess_call'],
mocked_subprocess.run.assert_has_calls(config_data['daemon_reload_subprocess_call'],
any_order=True)

def test_feature_config_parsing(self):
Expand Down Expand Up @@ -375,15 +375,15 @@ def test_feature_events(self, mock_syslog, get_runtime):
daemon.start(time.time())
except TimeoutError as e:
pass
expected = [call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'mux.service']),
call(['sudo', 'systemctl', 'enable', 'mux.service']),
call(['sudo', 'systemctl', 'start', 'mux.service'])]
mocked_subprocess.check_call.assert_has_calls(expected)
expected = [call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'mux.service'], capture_output=True, check=True, text=True)]
mocked_subprocess.run.assert_has_calls(expected, any_order=True)

# Change the state to disabled
MockSelect.reset_event_queue()
Expand All @@ -393,10 +393,10 @@ def test_feature_events(self, mock_syslog, get_runtime):
daemon.start(time.time())
except TimeoutError:
pass
expected = [call(['sudo', 'systemctl', 'stop', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'disable', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'mask', 'dhcp_relay.service'])]
mocked_subprocess.check_call.assert_has_calls(expected)
expected = [call(['sudo', 'systemctl', 'stop', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'disable', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'mask', 'dhcp_relay.service'], capture_output=True, check=True, text=True)]
mocked_subprocess.run.assert_has_calls(expected, any_order=True)

def test_delayed_service(self, mock_syslog, get_runtime):
MockSelect.set_event_queue([('FEATURE', 'dhcp_relay'),
Expand All @@ -417,20 +417,20 @@ def test_delayed_service(self, mock_syslog, get_runtime):
daemon.start(time.time())
except TimeoutError:
pass
expected = [call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'mux.service']),
call(['sudo', 'systemctl', 'enable', 'mux.service']),
call(['sudo', 'systemctl', 'start', 'mux.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service']),
call(['sudo', 'systemctl', 'enable', 'telemetry.service']),
call(['sudo', 'systemctl', 'start', 'telemetry.service'])]

mocked_subprocess.check_call.assert_has_calls(expected)
expected = [call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'telemetry.service'], capture_output=True, check=True, text=True)]

mocked_subprocess.run.assert_has_calls(expected, any_order=True)

def test_advanced_reboot(self, mock_syslog, get_runtime):
MockRestartWaiter.advancedReboot = True
Expand All @@ -442,25 +442,26 @@ def test_advanced_reboot(self, mock_syslog, get_runtime):
daemon = featured.FeatureDaemon()
daemon.render_all_feature_states()
daemon.register_callbacks()
try:
try:
daemon.start(time.time())
except TimeoutError:
pass
expected = [call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'mux.service']),
call(['sudo', 'systemctl', 'enable', 'mux.service']),
call(['sudo', 'systemctl', 'start', 'mux.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service']),
call(['sudo', 'systemctl', 'enable', 'telemetry.service']),
call(['sudo', 'systemctl', 'start', 'telemetry.service'])]

mocked_subprocess.check_call.assert_has_calls(expected)

pass
expected = [
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'telemetry.service'], capture_output=True, check=True, text=True)]

mocked_subprocess.run.assert_has_calls(expected, any_order=True)

def test_portinit_timeout(self, mock_syslog, get_runtime):
print(MockConfigDb.CONFIG_DB)
MockSelect.NUM_TIMEOUT_TRIES = 1
Expand All @@ -479,16 +480,16 @@ def test_portinit_timeout(self, mock_syslog, get_runtime):
daemon.start(0.0)
except TimeoutError:
pass
expected = [call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'mux.service']),
call(['sudo', 'systemctl', 'enable', 'mux.service']),
call(['sudo', 'systemctl', 'start', 'mux.service']),
call(['sudo', 'systemctl', 'daemon-reload']),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service']),
call(['sudo', 'systemctl', 'enable', 'telemetry.service']),
call(['sudo', 'systemctl', 'start', 'telemetry.service'])]
mocked_subprocess.check_call.assert_has_calls(expected)
expected = [call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'mux.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'unmask', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'enable', 'telemetry.service'], capture_output=True, check=True, text=True),
call(['sudo', 'systemctl', 'start', 'telemetry.service'], capture_output=True, check=True, text=True)]
mocked_subprocess.run.assert_has_calls(expected, any_order=True)
Loading

0 comments on commit b98abec

Please sign in to comment.