interfaces-config.service may hang at sonic-cfggen -d #1873

Closed
jeromesun14 opened this issue Jul 25, 2018 · 11 comments
@jeromesun14
Contributor

Description

interfaces-config.service may hang at sonic-cfggen -d -t /usr/share/sonic/templates/interfaces.j2 > /etc/network/interfaces

Steps to reproduce the issue:

This issue is hard to reproduce by rebooting the system over and over.
However, the hang in sonic-cfggen -d -t /usr/share/sonic/templates/interfaces.j2 > /etc/network/interfaces can be reproduced directly:

  1. redis-cli -n 4 FLUSHDB
  2. sonic-cfggen -d -t /usr/share/sonic/templates/interfaces.j2

We know the dependency chain: interfaces-config.service -> database.service -> updategraph.service.
database.service loads the config DB inside the docker container with configdb-load.sh.

# cat /etc/supervisor/conf.d/supervisord.conf 
[supervisord]
logfile_maxbytes=1MB
logfile_backups=2
nodaemon=true

[program:rsyslogd]
command=/bin/bash -c "rm -f /var/run/rsyslogd.pid && /usr/sbin/rsyslogd -n"
priority=1
autostart=true
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog

[program:redis-server]
command=/usr/bin/redis-server /etc/redis/redis.conf
priority=2
autostart=true
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog

[program:configdb-load.sh]
command=/usr/bin/configdb-load.sh
priority=3
autostart=true
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog

database.service does not wait for configdb-load.sh to finish loading all config DB data into redis db 4; its start-up completes as soon as redis-server responds to a ping.

function postStartAction()
{
    until [[ $(/usr/bin/docker exec database redis-cli ping | grep -c PONG) -gt 0 ]]; do
      sleep 1;
    done
}

So when interfaces-config.service runs, there may be no entries in redis db 4 yet. This causes interfaces-config.sh to hang at sonic-cfggen -d -t /usr/share/sonic/templates/interfaces.j2 > /etc/network/interfaces, which keeps interfaces-config.service in the running state.

xxx@switch:~$ ps aux | grep inter
root       806  0.0  0.0  20044  2780 ?        Ss   04:59   0:00 /bin/bash /usr/bin/interfaces-config.sh
root       816  0.0  0.6  87020 25080 ?        S    04:59   0:00 /usr/bin/python /usr/local/bin/sonic-cfggen -d -t /usr/share/sonic/templates/interfaces.j2

xxx@switch:~$ sudo systemctl list-jobs 
JOB UNIT                                 TYPE  STATE  
  1 graphical.target                     start waiting
  2 multi-user.target                    start waiting
 53 systemd-update-utmp-runlevel.service start waiting
 68 dhcp_relay.service                   start waiting
 69 swss.service                         start waiting
 73 interfaces-config.service            start running
 78 radv.service                         start waiting
 85 snmp.service                         start waiting

8 jobs listed.
xxx@switch:~$ docker ps
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS               NAMES
812df0938d44        docker-lldp-sv2:latest           "/usr/bin/supervisord"   3 days ago          Up 6 hours                              lldp
bd37c386fde5        docker-platform-monitor:latest   "/usr/bin/supervisord"   3 days ago          Up 6 hours                              pmon
0dd9a559a4ed        docker-teamd:latest              "/usr/bin/supervisord"   3 days ago          Up 6 hours                              teamd
fc28bf86eabf        docker-fpm-quagga:latest         "/usr/bin/supervisord"   3 days ago          Up 6 hours                              bgp
9d23ca7b50e9        docker-database:latest           "/usr/bin/supervisord"   3 days ago          Up 6 hours                              database
xxx@switch:~$ 
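
The gap described above (postStartAction returns on PONG, before configdb-load.sh has finished) could be narrowed by a stricter readiness check. The following is a minimal Python sketch, not the project's code: `database_ready` and `StubClient` are hypothetical names, the client is assumed to expose `ping()` and `get()` like a redis-py client connected to db 4, and the key name CONFIG_DB_INITIALIZED is taken from the discussion in this thread.

```python
def database_ready(client, init_key="CONFIG_DB_INITIALIZED"):
    """Return True only if Redis answers PING and the init flag is present.

    `client` is any object with ping() and get(key); in practice this is
    assumed to be a Redis client bound to the config DB (db 4).
    """
    try:
        if not client.ping():
            return False
        return bool(client.get(init_key))
    except Exception:
        # Broad on purpose for this sketch; a real probe would narrow this
        # to the client library's connection errors.
        return False


class StubClient:
    """Hypothetical stand-in for a Redis client, used only for illustration."""

    def __init__(self, alive, data):
        self.alive = alive
        self.data = data

    def ping(self):
        return self.alive

    def get(self, key):
        return self.data.get(key)


if __name__ == "__main__":
    # Redis is up but the config has not been loaded yet: not ready.
    print(database_ready(StubClient(True, {})))
    # Redis is up and the flag is present: ready.
    print(database_ready(StubClient(True, {"CONFIG_DB_INITIALIZED": "1"})))
```

A check like this only reports readiness once the loader has set the flag, which is a stronger condition than the PONG test in postStartAction.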

Describe the results you received:

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

@jeromesun14
Contributor Author

I think this is a critical issue. I have fixed it by reading the configuration from config_db.json or minigraph.xml instead, e.g. sonic-cfggen -j /etc/sonic/config_db.json -t /usr/share/sonic/templates/interfaces.j2 > /etc/network/interfaces

@taoyl-ms
Contributor

sonic-cfggen -d is designed to wait until configDB is initialized correctly (indicated by the CONFIG_DB_INITIALIZED field). In your reproduction steps, the DB is flushed without anyone re-creating that field, so sonic-cfggen waits forever by design.

In the real scenario, though, configdb-load.sh sets that field, and interfaces-config should be able to proceed after that. So the case described in the latter part of your report sounds interesting. Do you have more information on how that situation could be triggered?

@jeromesun14
Contributor Author

@taoyl-ms, I cannot reproduce this issue in the current version. I remember that in the previous environment interfaces-config could not recover even when CONFIG_DB had data. I need to go back to the old version and reproduce the issue.

@jeromesun14
Contributor Author

Closing the issue as I cannot reproduce it.

@richard28530
Contributor

[image]

This issue can be reproduced; I have hit it several times (at least twice).

I checked configDB, and the CONFIG_DB_INITIALIZED flag is set:
[image]

I can also re-run the same command successfully, but the original command never returns.
[image]

@taoyl-ms
Contributor

taoyl-ms commented Nov 7, 2018

Interesting. Do you have the syslog file? Can I have a copy at [email protected]?

@taoyl-ms taoyl-ms reopened this Nov 7, 2018
@richard28530
Contributor

syslog and tech-support dump sent by mail

@richard28530
Contributor

admin@S9130-32X:/proc/1221$ sudo cat stack
[] kfree_skb_partial+0x13/0x40
[] tcp_rcv_established+0x40e/0x6e0
[] tcp_v4_do_rcv+0x1af/0x4e0
[] sk_wait_data+0xc9/0xd0
[] autoremove_wake_function+0x0/0x30
[] tcp_recvmsg+0x65d/0xb70
[] inet_recvmsg+0x70/0x90
[] sock_recvmsg+0xa2/0xe0
[] do_wp_page+0x3f4/0x900
[] SYSC_recvfrom+0xcb/0x140
[] __do_page_fault+0x1ab/0x470
[] system_call_fast_compare_end+0x1c/0x21
[] 0xffffffffffffffff

This process is waiting to receive a message from a socket. Maybe it is waiting for the publish notification?

Maybe CONFIG_DB_INITIALIZED is set just after we check it, but before we have successfully subscribed?
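
The suspected ordering problem can be illustrated with a small in-memory simulation. This is only a sketch of the hypothesis, not swsssdk code: `FakeRedis`, `check_then_subscribe`, and `subscribe_then_check` are hypothetical names, and a short timeout stands in for the real `listen()` call, which would block indefinitely.

```python
import queue

INIT_INDICATOR = "CONFIG_DB_INITIALIZED"


class FakeRedis:
    """Tiny in-memory stand-in for Redis with keyspace-style notifications."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        # As with Redis pub/sub, only current subscribers see the event.
        for events in self._subscribers:
            events.put(key)

    def subscribe(self):
        events = queue.Queue()
        self._subscribers.append(events)
        return events


def check_then_subscribe(db, loader, timeout=0.1):
    """Read the flag first, subscribe afterwards (ordering under suspicion)."""
    if db.get(INIT_INDICATOR):
        return True
    loader()  # the loader sets the flag inside the race window
    events = db.subscribe()
    try:
        events.get(timeout=timeout)
    except queue.Empty:
        return False  # the event was published before we subscribed
    return bool(db.get(INIT_INDICATOR))


def subscribe_then_check(db, loader, timeout=0.1):
    """Subscribe first, then read the flag, so no event can be missed."""
    events = db.subscribe()
    if db.get(INIT_INDICATOR):
        return True
    loader()  # same interleaving: flag set right after the check
    try:
        events.get(timeout=timeout)
    except queue.Empty:
        return False
    return bool(db.get(INIT_INDICATOR))


if __name__ == "__main__":
    db1 = FakeRedis()
    print(check_then_subscribe(db1, lambda: db1.set(INIT_INDICATOR, "1")))
    db2 = FakeRedis()
    print(subscribe_then_check(db2, lambda: db2.set(INIT_INDICATOR, "1")))
```

In the first ordering the flag ends up set in the database while the waiter never receives a notification, which would be consistent with a process blocked in recv even though CONFIG_DB_INITIALIZED is present.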

@richard28530
Contributor

richard28530 commented Dec 5, 2018

diff --git a/src/swsssdk/configdb.py b/src/swsssdk/configdb.py
index 6bffad9..e0b9411 100644
--- a/src/swsssdk/configdb.py
+++ b/src/swsssdk/configdb.py
@@ -39,18 +39,24 @@ class ConfigDBConnector(SonicV2Connector):
     def __wait_for_db_init(self):
         client = self.redis_clients[self.CONFIG_DB]
         pubsub = client.pubsub()
+
         initialized = client.get(self.INIT_INDICATOR)
-        if not initialized:
-            pattern = "__keyspace@{}__:{}".format(self.db_map[self.CONFIG_DB]['db'], self.INIT_INDICATOR)
-            pubsub.psubscribe(pattern)
-            for item in pubsub.listen():
-                if item['type'] == 'pmessage':
-                    key = item['channel'].split(':', 1)[1]
-                    if key == self.INIT_INDICATOR:
-                        initialized = client.get(self.INIT_INDICATOR)
-                        if initialized:
-                            break
-            pubsub.punsubscribe(pattern)
+
+        while (not initialized):
+            initialized = client.get(self.INIT_INDICATOR)
+            time.sleep(1)
+
+        #if not initialized:
+        #    pattern = "__keyspace@{}__:{}".format(self.db_map[self.CONFIG_DB]['db'], self.INIT_INDICATOR)
+        #    pubsub.psubscribe(pattern)
+        #    for item in pubsub.listen():
+        #        if item['type'] == 'pmessage':
+        #            key = item['channel'].split(':', 1)[1]
+        #            if key == self.INIT_INDICATOR:
+        #                initialized = client.get(self.INIT_INDICATOR)
+        #                if initialized:
+        #                    break
+        #    pubsub.punsubscribe(pattern)


     def connect(self, wait_for_init=True, retry_on=False):

I changed this file and ran over 3000 reboot tests with no hang.
Before the modification, the hang occurred about once every several hundred reboots.
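
The polling approach in the patch above waits indefinitely. A variant with an upper bound on the wait might look like the following sketch. The name `wait_for_flag`, the default timeout, and the injectable `sleep` and `clock` parameters are assumptions made for illustration and are not part of swsssdk.

```python
import time


def wait_for_flag(get_flag, timeout=60.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_flag() until it is truthy or `timeout` seconds have passed.

    Returns True if the flag was seen, False on timeout. The flag is checked
    before sleeping, so an already-set flag returns immediately.
    """
    deadline = clock() + timeout
    while True:
        if get_flag():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated flag that becomes truthy on the third poll.
    polls = []

    def fake_get():
        polls.append(1)
        return len(polls) >= 3

    print(wait_for_flag(fake_get, timeout=5, interval=0))
```

With a real connection, the getter could be something like `lambda: client.get("CONFIG_DB_INITIALIZED")`, where `client` is the config DB Redis client. Returning False on timeout lets the caller log an error or fall back to another configuration source instead of blocking the boot sequence.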

@yxieca yxieca assigned jleveque and unassigned taoyl-ms Sep 19, 2019
@jleveque
Contributor

jleveque commented Sep 20, 2019

@richard28530: Is this still an issue? If so, is your suggested code above a potential solution? Please feel free to submit a PR if this is the case. If this is no longer an issue, please close this.

@jleveque jleveque reopened this Sep 20, 2019
@jleveque
Contributor

Closing as no response for 1 year.

yxieca added a commit to yxieca/sonic-buildimage that referenced this issue Oct 15, 2021
snmpagent
* 187aa10 2021-09-16 | [201811][RFC1213]: Initialize lag oid map in reinit_data (sonic-net#233) (github/201811) [SuvarnaMeenakshi]

swss:
* 3503705 2021-09-05 | [201811][Cherry-pick] [acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net#1898) (HEAD -> 201811, github/201811) [bingwang-ms]

utilities:
* f3f8667 2021-10-15 | [201811] disk_check.py: Allow remote user access when disk is read-only (sonic-net#1873) (HEAD -> 201811, github/201811) [Renuka Manavalan]
* 6b351c9 2021-10-14 | [201811]  Remove exec from platform_reboot_plugin call to handle any hang issue. (sonic-net#1880) [Sujin Kang]
* d8d0461 2021-07-29 | [minigraph][port_config] Consume port_config.json while reloading minigraph (sonic-net#1726) [Blueve]

Signed-off-by: Ying Xie <[email protected]>
yxieca added a commit that referenced this issue Oct 16, 2021
yxieca pushed a commit to yxieca/sonic-buildimage that referenced this issue Oct 25, 2021
* Add DHCPv6 minigraph parsing support

Co-authored-by: shlomibitton <[email protected]>

Logrotate for wtmp and btmp files to fix size getting too large. (sonic-net#8744)

Signed-off-by: Abhishek Dosi <[email protected]>

[201811][utilities][swss][snmpagent] advance sub module head

snmpagent
* 187aa10 2021-09-16 | [201811][RFC1213]: Initialize lag oid map in reinit_data (sonic-net#233) (github/201811) [SuvarnaMeenakshi]

swss:
* 3503705 2021-09-05 | [201811][Cherry-pick] [acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net#1898) (HEAD -> 201811, github/201811) [bingwang-ms]

utilities:
* f3f8667 2021-10-15 | [201811] disk_check.py: Allow remote user access when disk is read-only (sonic-net#1873) (HEAD -> 201811, github/201811) [Renuka Manavalan]
* 6b351c9 2021-10-14 | [201811]  Remove exec from platform_reboot_plugin call to handle any hang issue. (sonic-net#1880) [Sujin Kang]
* d8d0461 2021-07-29 | [minigraph][port_config] Consume port_config.json while reloading minigraph (sonic-net#1726) [Blueve]

Signed-off-by: Ying Xie <[email protected]>

[201811] Invoke disk check periodically (sonic-net#8951)

* Invoke disk check periodically. (sonic-net#7374)

Why I did it
Helps with periodic scan of disk for RO state.
If found, this script makes transient fix and raise error message.

Save DB dump after warm/fast reboot (sonic-net#8913)

Back porting the master branch change - sonic-net#8803

Save the redis DB dump after warm reboot.

[201811][swss] advance swss submodule head (sonic-net#9049)

* e0b115a 2021-10-22 | [copp] add dhcpv6 copp rules (sonic-net#1979) (HEAD -> 201811, github/201811) [Ying Xie]

Signed-off-by: Ying Xie <[email protected]>

[swssconfig] load dhcpv6 copp rules by default (sonic-net#9047)

Why I did it
Need to enable DHCPv6 copp rule

How I did it
Add a separate DHCPv6 copp rule config file and load it during cold reboot.

How to verify it
cold reboot, and verify config being loaded and dhcpv6 rules got installed.

Signed-off-by: Ying Xie [email protected]

[warmboot finalizer] load dhcpv6 copp rules when missing (sonic-net#9048)

Why I did it
Need to enable DHCPv6 COPP rules.

How I did it
Load the separate DHCPv6 COPP rules after warm reboot if the rules are missing.

How to verify it
Warm reboot from an image doesn't have DHCPv6 COPP rules installed.
Warm reboot from an image have DHCPv6 COPP rules already installed.
In either case, the script did the right thing and only install the COPP rules if it is missing.

Signed-off-by: Ying Xie [email protected]
yxieca pushed a commit that referenced this issue Oct 26, 2021