[202012] Fix race condition between networking service and interface-config service #142

Closed
wants to merge 1 commit into from

Junchao-Mellanox

Why I did it

This PR fixes a bug in which the mgmt port eth0 may lose its IP address even though the user has configured a static IP for eth0. The issue does not reproduce every time; the flow that triggers it is:

  1. Systemd starts the networking service, which applies a DHCP-based configuration, so eth0 is assigned an IP address by DHCP.
  2. Systemd starts the interface-config service, which depends on the networking service.
  3. The interface-config service runs “ifdown --force eth0” (see the linked line), but the networking service is still running, so the command fails with “error: Another instance of this program is already running.”. This error is printed by ifupdown2, which is the main process of the networking service. ifdown therefore has no effect here, and the IP address of eth0 is not removed.
  4. The interface-config service rewrites /etc/network/interfaces with the static configuration.
  5. The interface-config service runs “systemctl restart networking”. This kills the previous networking processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address from the static configuration. However, it detects that the configured IP and the existing IP are the same, so it does not actually program the address into the kernel. The address on eth0 therefore remains the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP came from DHCP and the new IP is static, yet it treats them as the same instead of reconfiguring the IP.)
  6. When the DHCP lease expires, the kernel removes the IP address from eth0 and the issue reproduces.

The issue is not always reproducible because the networking service usually finishes quickly enough that step 3 is not hit.
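
The state left behind by steps 5 and 6 can, in principle, be spotted before the lease expires: the address is present on eth0 but still carries a finite lifetime. The sketch below is only an illustration and is not part of this PR. The helper name `addr_is_dynamic` is hypothetical, and it assumes the DHCP client installed the address with a finite lifetime, so that `ip` reports it with the `dynamic` flag.

```shell
# Hypothetical helper, not taken from the PR.
# Takes the text printed by `ip -o -4 addr show dev <ifname>` as its only
# argument and returns 0 if any address in it carries the "dynamic" flag
# (an address with a finite lifetime), 1 otherwise.
addr_is_dynamic() {
    case " $1 " in
        *" dynamic "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It could be used as `addr_is_dynamic "$(ip -o -4 addr show dev eth0)" && echo "eth0 address is still dynamic"`. On a device with a static configuration, a match would suggest that the static address was never actually programmed.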

How I did it

Check the state of the networking service before running "ifdown --force eth0"; if it is still activating, wait for it to finish.
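
A minimal sketch of such a wait is shown below. It is not the code from this PR: the function name `wait_for_networking`, the default of 30 tries, and the 1-second poll interval are all assumptions made for illustration. It polls `systemctl is-active networking` and returns once the unit is no longer in the `activating` state.

```shell
# Sketch only: the function name, the retry limit, and the poll interval
# are illustrative assumptions, not values taken from the PR.
# Usage: wait_for_networking [max_tries] [interval_seconds]
# Returns 0 once the networking unit has left the "activating" state,
# or 1 if it is still activating after max_tries polls.
wait_for_networking() {
    local max_tries="${1:-30}"
    local interval="${2:-1}"
    local tries=0
    local state

    while [ "$tries" -lt "$max_tries" ]; do
        # `systemctl is-active` prints the unit state (active, activating,
        # inactive, failed, ...). Only the printed state is used here;
        # the exit status is ignored.
        state="$(systemctl is-active networking 2>/dev/null)"
        if [ "$state" != "activating" ]; then
            return 0
        fi
        sleep "$interval"
        tries=$((tries + 1))
    done
    return 1
}
```

The script would call `wait_for_networking` just before "ifdown --force eth0". What to do when the wait times out (for example, log a message and continue) is left open in this sketch.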

How to verify it

Manual test.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog


…rvice (sonic-net#10573)

Conflicts:
	files/image_config/interfaces/interfaces-config.sh
Junchao-Mellanox requested a review from keboliu May 6, 2022 01:42
Junchao-Mellanox pushed a commit that referenced this pull request Oct 25, 2022
79edf66 Longxiang Lyu Wed Aug 17 08:12:37 2022 +0800 Fix azure pipeline (#118)
8e0f2c6 Longxiang Lyu Wed Aug 17 08:36:07 2022 +0800 Update linkmgr health after getting default route update (#117)
b14ffb8 Jing Zhang Wed Aug 17 15:44:37 2022 -0700 [active-active] post mux metrics events (#123)
a30dbb3 Jing Zhang Thu Aug 18 18:16:04 2022 -0700 Update handleMuxConfigNotification logic (#125)
e14aaba Jing Zhang Tue Aug 23 10:02:17 2022 -0700 [active-active] Remove unnecessary mux wait timeout logs (#122)
cc83717 Longxiang Lyu Fri Sep 2 02:17:53 2022 +0800 Fix mux config (#128)
5429281 Mai Bui Thu Sep 1 17:44:04 2022 -0400 [linkmgrd] Replace memset function in link_prober (#126)
b5aaec1 Jing Zhang Fri Sep 9 14:01:03 2022 -0700 [active-active] shutdown link prober when starting as isolated (#130)
75f02cf Jing Zhang Tue Sep 13 10:34:32 2022 -0700 [active-standby] update warmboot reconciliation logic (#129)
a5a9f90 Hua Liu Fri Sep 16 09:54:32 2022 +0800 Install libyang to azure pipeline (#132)
6fe4f0f Jing Zhang Tue Sep 20 10:10:16 2022 -0700 [Active-Active] flaky LinkmgrdBootupSequence unit tests (#134)
ea68e8c Jing Zhang Wed Sep 21 10:52:18 2022 -0700 Post switchover reasons to STATE DB (#131)
60c35b5 Jing Zhang Thu Sep 22 13:00:41 2022 -0700 [Active-Active] server side admin forwarding state sync up (#133)
08e1be5 Jing Zhang Mon Sep 26 10:59:27 2022 -0700 [Active-Active] avoid being stuck in unknown after process init (#136)
2579988 Jing Zhang Mon Oct 3 09:40:55 2022 -0700 [Active-Standby] fix syslog flood caused by unkown -> standby switchovers (#137)
7e9f670 Jing Zhang Wed Oct 5 10:03:45 2022 -0700 [Active-Active] Retry config mux mode standby (#139)
23feb3b Jing Zhang Wed Oct 5 15:22:58 2022 -0700 [Active-Active] Post link prober stats to state db (#140)
e650098 Jing Zhang Fri Oct 7 15:27:17 2022 -0700 [Active-Active] Update default route shutdown heartbeat logic (#141)
d0653e7 Jing Zhang Tue Oct 11 10:22:02 2022 -0700 [Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification (#142)

dcf6460 Longxiang Lyu Fri Oct 21 12:15:42 2022 +0800 [active-active] Add support to send/handle mux probe request (#147)
fdf42ed Longxiang Lyu Fri Oct 21 10:34:47 2022 +0800 Fix link prober state event report twice issue (#149)
5fd19a3 Longxiang Lyu Mon Oct 17 09:20:27 2022 +0800 [active-active] Fix config reload (#145)

sign-off: Jing Zhang [email protected]
Junchao-Mellanox pushed a commit that referenced this pull request Nov 4, 2022
…rm-common] advance submodule head (sonic-net#12492)

linkmgrd:
* d7d6635 2022-10-21 | Fix link prober state event report twice issue (#149) (HEAD -> 202205) [Longxiang Lyu]
* 0ef3296 2022-10-21 | [active-active] Add support to send/handle mux probe request (#147) [Longxiang Lyu]
* a66fa34 2022-10-17 | [active-active] Fix config reload (#145) [Longxiang Lyu]
* 7e1c820 2022-10-11 | [Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification  (#142) [Jing Zhang]
* 237cfd2 2022-10-07 | [Active-Active] Update default route shutdown heartbeat logic (#141) [Jing Zhang]

utilities:
* 415d30e 2022-10-23 | [techsupport] Adding FRR EVPN dumps (sonic-net#2442) (HEAD -> 202205) [Sudharsan Dhamal Gopalarathnam]
* b3ffe45 2022-10-21 | [show][muxcable] add support for show mux firmware version all (sonic-net#2441) [vdahiya12]
* 7d68534 2022-10-19 | [app_ext] [auto-ts] Add available_mem_threshold option (sonic-net#2423) [Vivek]
* 52b9c16 2022-10-07 | [muxcable][config] add CLI support for mux mode detach (sonic-net#2425) [Jing Zhang]
* 14646ff 2022-10-10 | [show priority-group drop counters] Remove backup with cached PG drop counters after 'config reload' (sonic-net#2386) [Andriy Yurkiv]
* dffcc53 2022-10-11 | Add a subcommand to display a hexdump of transceiver EEPROM page (sonic-net#2379) [mihirpat1]
* 86175c2 2022-10-17 | [chassis]Add fabric counter cli commands (sonic-net#1860) [Maxime Lorrillere]

swss:
* 6fe0afd 2022-10-25 | [portsorch] remove port OID from saiOidToAlias map on port deletion (sonic-net#2483) (HEAD -> 202205, github/202205) [Stepan Blyshchak]
* 7290d66 2022-10-07 | [vlanmgr] Disable `arp_evict_nocarrier` for vlan host intf (sonic-net#2469) [Longxiang Lyu]
* d074001 2022-10-05 | [chassis][voq]Collect counters for fabric links (sonic-net#1944) [Maxime Lorrillere]
* 3a0353a 2022-10-18 | [counters][202205] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2474) [Vadym Hlushko]
* 2feb39d 2022-10-14 | [202205] [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (sonic-net#2482) [Volodymyr Samotiy]

sairedis:
* 326b630 2022-10-21 | [gbsyncd] Add asic db prefix for channel NOTIFICATIONS (sonic-net#1129) (HEAD -> 202205) [Junhua Zhai]

platform-daemon:
* 6dbda9b 2022-10-25 | [ycabled] fix no port/state returned by grpc server (sonic-net#308) (HEAD -> 202205) [vdahiya12]
* 3d1228a 2022-10-20 | Fix xcvrd to support 400G ZR optic (sonic-net#293) [Bohan Yang]

platform-common:
* c04d710 2022-09-29 | Read CMIS data path state duration (sonic-net#312) (HEAD -> 202205) [Bohan Yang]

Signed-off-by: Ying Xie <[email protected]>
Junchao-Mellanox pushed a commit that referenced this pull request Jan 12, 2023
commit aa8fe6deff466909909430f00598d2dba9490904 (HEAD -> 202012, origin/202012)
Author: Jing Zhang [email protected]
Date: Tue Oct 11 10:22:02 2022 -0700

[Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification  (#142)

Description of PR
Summary:
Fixes # (issue)

This PR is to fix incorrect mux metrics timestamps caused by unsolicited mux state notification.

Sign-off: Jing Zhang [email protected]
sign-off: Jing Zhang [email protected]
Junchao-Mellanox pushed a commit that referenced this pull request Jul 25, 2023
…ically (sonic-net#15886)

src/sonic-restapi

* a69ba06 - (HEAD -> 202205, origin/master, origin/HEAD, origin/202205, master) [actions] Support Semgrep by Github Actions (#144) (3 weeks ago) [Mai Bui]
* 6b242a3 - [Ci] Upgrade python 2 to python 3 (#145) (3 weeks ago) [xumia]
* 1c50caa - prevent downcasting of 64-bit integer (#142) (2 months ago) [Mai Bui]
* de26989 - Use -race detector when building and testing (#141) (3 months ago) [Lawrence Lee]
* 9fe2eff - [go] Update Go to version 1.15 (#140) (3 months ago) [Lawrence Lee]
Junchao-Mellanox pushed a commit that referenced this pull request Aug 20, 2024
…utomatically (sonic-net#19897)

#### Why I did it
src/sonic-host-services
```
* 39e31a9 - (HEAD -> master, origin/master, origin/HEAD) Fix modify_single_file generate empty file issue (#145) (26 hours ago) [Hua Liu]
* 1891b0a - Add dbus service to read file stat (#142) (2 days ago) [isabelmsft]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Dec 11, 2024
…omatically (sonic-net#20409)

#### Why I did it
src/sonic-mgmt-common
```
* b91a4df - (HEAD -> master, origin/master, origin/HEAD) PortChannel Interface Static Support - OpenConfig Yang (#142) (9 hours ago) [Satoru Shinohara]
```
#### How I did it
#### How to verify it
#### Description for the changelog