[Auto Techsupport] Event driven Techsupport Changes #15

Closed
wants to merge 29 commits into from


@vivekrnv vivekrnv commented Aug 13, 2021

Why I did it

Changes required for feature "Event Driven TechSupport Invocation & CoreDump Mgmt" HLD.

How I did it

How to verify it

#### No core files or techsupport dumps exist yet
admin@sonic:~$ ls /var/core/
admin@sonic:~$ ls /var/dump
ls: cannot access '/var/dump': No such file or directory

#### verify auto-techsupport status
admin@sonic:~$ show auto-techsupport global
STATE      RATE LIMIT INTERVAL    MAX TECHSUPPORT SIZE    MAX CORE SIZE  SINCE
-------  ---------------------  ----------------------  ---------------  ----------
enabled                    180                      10                5  2 days ago


#### Kill a critical process to trigger a coredump
admin@sonic:~$ docker exec -it snmp ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          23  1.6  0.5 138316 42296 pts/0    Sl   19:09   0:15 python3 -m sonic_ax_impl
...............

admin@sonic:~$ docker exec -it snmp kill -11 23

#### Coredump is created
admin@sonic:~$ ls /var/core/
python3.1629401152.23.core.gz 

#### Techsupport Dump creation in progress
admin@sonic:~$ ls /var/dump/
sonic_dump_sonic_20210819_192558  sonic_dump_sonic_20210819_192558.tar

admin@sonic:~$ ps -aux | grep coredump_gen
root       17823  0.1  0.2  30960 16736 ?        S    19:25   0:00 python3 /usr/local/bin/coredump_gen_handler.py python3.1629401152.23.core.gz

#### Wait until the techsupport dump execution has finished
admin@sonic:~$ ls /var/dump/
sonic_dump_sonic_20210819_192558.tar.gz


admin@r-lionfish-16:~$ show auto-techsupport history
Techsupport Dump                                 Triggered By                   Critical Process
-----------------------------------------------  -----------------------------  ------------------
sonic_dump_sonic_20210819_192558.tar.gz  python3.1629401152.23.core.gz  snmp-subagent
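
The core file name in the transcript above appears to follow a `<comm>.<timestamp>.<pid>.core.gz` layout (the PID 23 matches the killed process). A minimal sketch of splitting such a name, assuming that layout holds; `CoreInfo` and `parse_core_name` are hypothetical names used only for illustration and are not taken from coredump_gen_handler.py:

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CoreInfo(NamedTuple):
    comm: str       # executable name recorded in the core file name
    timestamp: int  # seconds since the epoch
    pid: int        # PID of the crashed process inside its container


def parse_core_name(name: str) -> CoreInfo:
    """Split '<comm>.<timestamp>.<pid>.core.gz' (assumed layout) into fields."""
    suffix = ".core.gz"
    if not name.endswith(suffix):
        raise ValueError(f"unexpected core file name: {name}")
    stem = name[: -len(suffix)]
    # Split from the right so a comm containing dots stays intact.
    comm, timestamp, pid = stem.rsplit(".", 2)
    return CoreInfo(comm, int(timestamp), int(pid))


info = parse_core_name("python3.1629401152.23.core.gz")
when = datetime.fromtimestamp(info.timestamp, tz=timezone.utc)
print(info.comm, info.pid, when.isoformat())
```

Splitting from the right is a deliberate choice here: the timestamp and PID are always numeric, while the executable name may itself contain dots.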

Note on changes made to the supervisor-proc-exit-listener script:
The changes are backward compatible with python2. Tested on dockers running python3 and python2.

admin@sonic:~$ docker exec -it snmp ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.5  0.3  35676 24316 pts/0    Ss+  21:43   0:00 /usr/bin/pyth
root           9  0.1  0.3  43608 24888 pts/0    S    21:43   0:00 python3 /usr/
root          17  0.0  0.0 225856  5616 pts/0    Sl   21:43   0:00 /usr/sbin/rsy
Debian-+      21  0.9  0.1  32932 12524 pts/0    S    21:43   0:00 /usr/sbin/snm
root          23  8.4  0.4 133116 36764 pts/0    Sl   21:43   0:06 python3 -m so
root          26  0.0  0.0  11248  3036 pts/1    Rs+  21:44   0:00 ps -aux
admin@r-lionfish-16:~$ docker exec -it restapi ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.5  0.2  59112 21552 pts/0    Ss+  21:43   0:00 /usr/bin/pyth
root           9  0.1  0.2  84932 22932 pts/0    S    21:43   0:00 python2 /usr/
root          14  0.0  0.0 262988  3596 pts/0    Sl   21:43   0:00 /usr/sbin/rsy
root          20  0.0  0.0  17964  2948 pts/0    S    21:43   0:00 bash /usr/bin
root          31  0.0  0.0   4188   660 pts/0    S    21:44   0:00 sleep 60
root          32  0.0  0.0  36636  2840 pts/1    Rs+  21:44   0:00 ps -aux
admin@sonic:~$ docker exec -it restapi kill -11 20
admin@sonic:~$ docker exec -it snmp kill -11 23
admin@sonic:~$ redis-cli -n 6 hgetall "AUTO_TECHSUPPORT|FEATURE_PROC_INFO"
1) "restapi;restapi"
2) "20;bash"
3) "snmp;snmp-subagent"
4) "23;python3"
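
Reading the output above, each field looks like `<feature>;<critical process>` and each value like `<pid>;<comm>`. That reading is an assumption drawn from this transcript only. Under it, mapping a core file back to its critical process could be sketched as follows; `find_critical_process` is a hypothetical helper, and the dictionary simply mirrors the `hgetall` result rather than querying Redis:

```python
from typing import Dict, Optional, Tuple

# Mirrors the hgetall output shown above (field -> value).
PROC_INFO: Dict[str, str] = {
    "restapi;restapi": "20;bash",
    "snmp;snmp-subagent": "23;python3",
}


def find_critical_process(
    proc_info: Dict[str, str], comm: str, pid: int
) -> Optional[Tuple[str, str]]:
    """Return (feature, critical process) for a crashed comm/pid, if recorded."""
    for field, value in proc_info.items():
        feature, process = field.split(";", 1)
        entry_pid, entry_comm = value.split(";", 1)
        if entry_comm == comm and int(entry_pid) == pid:
            return feature, process
    return None


# The core python3.1629401152.23.core.gz carries comm "python3" and PID 23.
print(find_critical_process(PROC_INFO, "python3", 23))
```

Matching on both PID and comm matters because PIDs are container-local, so the same number can occur in more than one docker.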

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Vivek Reddy Karri <[email protected]>
@vivekrnv vivekrnv changed the title Event driven Techsupport Changes [Auto Techsupport] Event driven Techsupport Changes Aug 19, 2021
}
}

leaf coredump_cleanup {

@vivekreddynv suggest adding a mandatory constraint to those leaves which don't have default values

@vivekrnv vivekrnv closed this Sep 3, 2021
vivekrnv pushed a commit that referenced this pull request Jan 18, 2022
* [BFN] Updated platform APIs impl

Signed-off-by: Andriy Kokhan <[email protected]>

* Extended BFN platform SFP APIs implementation

* Update sfp.py

* [BFN] Extended SFP platform plugin implementation

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN] Extended Fans platform plugin implementation

* [BFN] divided classes Fan and  FanDrawer into 2 files

* Signed-off-by: Vadym Yashchenko <[email protected]>

What I did
	Add get_model() function
	Add get_low_critical_threshold() function
	Change __get(...) function.
How I did it
	Difference from the previous implementation: __get(...) now returns the real value, or -9999.9 if the value is not provided by the thrift API

* Add get_presence() function and revised __get() function

Signed-off-by: Vadym Yashchenko <[email protected]>

* [BFN] Updated PSU platform APIs impl

Signed-off-by: Dmytro Lytvynenko <[email protected]>

* Added BFN PSU cache (#9)

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN]  Fans and Fantray platform APIs update (#7)

* [BFN] Updated SFP platform APIs (#10)

Signed-off-by: Volodymyr Boyko <[email protected]>

* [BFN] Updated platform API for thermal (#8)

* Signed-off-by: Vadym Yashchenko <[email protected]>

* Revert "[BFN]  Fans and Fantray platform APIs update (#7)" (#11)

This reverts commit c62a733.

* Add support health monitor system (#15)

Signed-off-by: Petro Bratash <[email protected]>

* Update chassis.py

* [BFN] Updated FANs and FAN Tray platform API (#14)

* Fix fix_alignment (#17)

Signed-off-by: Petro Bratash <[email protected]>

* [BFN] Improvement show environment (#16)

* Added PSU temperature skip into platform.json (#18)

Signed-off-by: Andriy Kokhan <[email protected]>

* Do not skip psud on Newport

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN] fix fan status from Not OK to Ok (#19)

* [BFN] Updated SFP platform plugin (#13)

Signed-off-by: Volodymyr Boyko <[email protected]>

* [DPB] Fix typo for Ethernet0 2x200G[100G,40G] breakout mode (#21)

Signed-off-by: Mykola Gerasymenko <[email protected]>

* [barefoot] Tmp fix vendor_rev (#22)

Signed-off-by: Volodymyr Boyko <[email protected]>

* Fixed python issues in sonic_platform/fan_drawer.py

Signed-off-by: Andriy Kokhan <[email protected]>

* Updated fan_drawer.py

* Fixing trailing white spaces in fan_drawer.py

* [BFN] Fix thrift for SFPs API

Signed-off-by: Volodymyr Boyko <[email protected]>

* In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue

Signed-off-by: Andriy Kokhan <[email protected]>

* [Newport] Thermal manager  (#23)

* Signed-off-by: Vadym Yashchenko <[email protected]>

* Revert "In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue"

This reverts commit 1e73127.

* Removed 'controllable' options from platform.json to fix factory default config generation

Signed-off-by: Andriy Kokhan <[email protected]>

* Update thermal_manager.py

* Migrated SFP plugin to sonic_xcvr API (#30)

Signed-off-by: Andriy Kokhan <[email protected]>

Co-authored-by: KostiantynYarovyiBf <[email protected]>
Co-authored-by: Vadym Yashchenko <[email protected]>
Co-authored-by: Dmytro Lytvynenko <[email protected]>
Co-authored-by: Volodymyr Boiko <[email protected]>
Co-authored-by: Petro Bratash <[email protected]>
Co-authored-by: Mykola Gerasymenko <[email protected]>
vivekrnv pushed a commit that referenced this pull request Apr 2, 2022
[sonic-linkmgrd][master] submodule update

Commits added:
0c23756 Jing Zhang      2022-01-19      Linkmgrd subscribing State DB route event  (#13)
12b9951 Longxiang Lyu   2021-12-13      Add TLV support to ICMP payload (#11)
3eedda3 Longxiang Lyu   2022-01-06      Add missing intermediate states (#16)
8da4982 Ying Xie        2022-01-04      [linkmgrd] update README, set coding style guidance (#15)
a897cf8 Longxiang Lyu   2021-12-13      Improve PR template (#16)
6fec701 Jing Zhang      2021-12-06      Add pull request template for linkmgrd repo (#9)


signed-off-by: Jing Zhang [email protected]
vivekrnv pushed a commit that referenced this pull request Apr 4, 2022
[sonic-linkmgrd][master] submodule update

Commits added:
0c23756 Jing Zhang      2022-01-19      Linkmgrd subscribing State DB route event  (#13)
12b9951 Longxiang Lyu   2021-12-13      Add TLV support to ICMP payload (#11)
3eedda3 Longxiang Lyu   2022-01-06      Add missing intermediate states (#16)
8da4982 Ying Xie        2022-01-04      [linkmgrd] update README, set coding style guidance (#15)
a897cf8 Longxiang Lyu   2021-12-13      Improve PR template (#16)
6fec701 Jing Zhang      2021-12-06      Add pull request template for linkmgrd repo (#9)


signed-off-by: Jing Zhang [email protected]
vivekrnv pushed a commit that referenced this pull request Aug 26, 2022
**What I did**
Queue counters can be populated in FLEX COUNTER DB before COUNTER_QUEUE_INDEX_MAP is populated, which is consistent with the PortsOrch implementation. Swapping the order (COUNTER_QUEUE_INDEX_MAP first, then FLEX COUNTER DB) might cause other issues if readers start reading the map and then look for the corresponding entry in COUNTERS DB. The existing order also matches how other maps and counters are handled, so fixing this in the reader seems the more suitable fix.

Errors are observed only for a short period at system start when PFC WD is enabled:
```
Nov 10 15:12:19.001472 tgs-sonic-n2-s1 ERR syncd#SDK: :- guard: RedisReply catches system_error: command: *135#015#012$7#015#012EVALSHA#015#012$40#015#012b62081cc93943a4cbfec30de42638b435d31197c#015#012$3#015#012128#015#012$20#015#012oid:0x15000000000290#015#012$20#015#012oid:0x15000000000291#015#012$20#015#012oid:0x150000000002b1#015#012$20#015#012oid:0x150000000002b2#015#012$20#015#012oid:0x150000000002d2#015#012$20#015#012oid:0x150000000002d3#015#012$20#015#012oid:0x150000000002f3#015#012$20#015#012oid:0x150000000002f4#015#012$20#015#012oid:0x15000000000314#015#012$20#015#012oid:0x15000000000315#015#012$20#015#012oid:0x15000000000335#015#012$20#015#012oid:0x15000000000336#015#012$20#015#012oid:0x15000000000356#015#012$20#015#012oid:0x15000000000357#015#012$20#015#012oid:0x15000000000377#015#012$20#015#012oid:0x15000000000378#015#012$20#015#012oid:0x15000000000398#015#012$20#015#012oid:0x15000000000399#015#012$20#015#012oid:0x150000000003b9#015#012$20#015#012oid:0x150000000003ba#015#012$20#015#012oid:0x150000000003da#015#012$20#015#012oid:0x150000000003db#015#012$20#015#012oid:0x150000000003fb#015#012$20#015#012oid:0x150000000003fc#015#012$20#015#012oid:0x1500000000041c#015#012$20#015#012oid:0x1500000000041d#015#012$20#015#012oid:0x1500000000043d#015#012$20#015#012oid:0x1500000000043e#015#012$20#015#012oid:0x1500000000045e#015#012$20#015#012oid:0x1500000000045f#015#012$20#015#012oid:0x1500000000047f#015#012$20#015#012oid:0x15000000000480#015#012$20#015#012oid:0x150000000004a0#015#012$20#015#012oid:0x150000000004a1#015#012$20#015#012oid:0x150000000004c1#015#012$20#015#012oid:0x150000000004c2#015#012$20#015#012oid:0x150000000004e2#015#012$20#015#012oid:0x150000000004e3#015#012$20#015#012oid:0x15000000000503#015#012$20#015#012oid:0x15000000000504#015#012$20#015#012oid:0x15000000000524#015#012$20#015#012oid:0x15000000000525#015#012$20#015#012oid:0x15000000000545#015#012$20#015#012oid:0x15000000000546#015#012$20#015#012oid:0x15000000000566#015#012$20#015#012oid:0x15000
000000567#015#012$20#015#012oid:0x15000000000587#015#012$20#015#012oid:0x15000000000588#015#012$20#015#012oid:0x150000000005a8#015#012$20#015#012oid:0x150000000005a9#015#012$20#015#012oid:0x150000000005c9#015#012$20#015#012oid:0x150000000005ca#015#012$20#015#012oid:0x150000000005ea#015#012$20#015#012oid:0x150000000005eb#015#012$20#015#012oid:0x1500000000060b#015#012$20#015#012oid:0x1500000000060c#015#012$20#015#012oid:0x1500000000062c#015#012$20#015#012oid:0x1500000000062d#015#012$20#015#012oid:0x1500000000064d#015#012$20#015#012oid:0x1500000000064e#015#012$20#015#012oid:0x1500000000066e#015#012$20#015#012oid:0x1500000000066f#015#012$20#015#012oid:0x1500000000068f#015#012$20#015#012oid:0x15000000000690#015#012$20#015#012oid:0x150000000006b0#015#012$20#015#012oid:0x150000000006b1#015#012$20#015#012oid:0x150000000006d1#015#012$20#015#012oid:0x150000000006d2#015#012$20#015#012oid:0x150000000006f2#015#012$20#015#012oid:0x150000000006f3#015#012$20#015#012oid:0x15000000000713#015#012$20#015#012oid:0x15000000000714#015#012$20#015#012oid:0x15000000000734#015#012$20#015#012oid:0x15000000000735#015#012$20#015#012oid:0x15000000000755#015#012$20#015#012oid:0x15000000000756#015#012$20#015#012oid:0x15000000000776#015#012$20#015#012oid:0x15000000000777#015#012$20#015#012oid:0x15000000000797#015#012$20#015#012oid:0x15000000000798#015#012$20#015#012oid:0x150000000007b8#015#012$20#015#012oid:0x150000000007b9#015#012$20#015#012oid:0x150000000007d9#015#012$20#015#012oid:0x150000000007da#015#012$20#015#012oid:0x150000000007fa#015#012$20#015#012oid:0x150000000007fb#015#012$20#015#012oid:0x1500000000081b#015#012$20#015#012oid:0x1500000000081c#015#012$20#015#012oid:0x1500000000083c#015#012$20#015#012oid:0x1500000000083d#015#012$20#015#012oid:0x1500000000085d#015#012$20#015#012oid:0x1500000000085e#015#012$20#015#012oid:0x1500000000087e#015#012$20#015#012oid:0x1500000000087f#015#012$20#015#012oid:0x1500000000089f#015#012$20#015#012oid:0x150000000008a0#015#012$20#015#012oid:0x150000000008c0#0
15#012$20#015#012oid:0x150000000008c1#015#012$20#015#012oid:0x150000000008e1#015#012$20#015#012oid:0x150000000008e2#015#012$20#015#012oid:0x15000000000902#015#012$20#015#012oid:0x15000000000903#015#012$20#015#012oid:0x15000000000923#015#012$20#015#012oid:0x15000000000924#015#012$20#015#012oid:0x15000000000944#015#012$20#015#012oid:0x15000000000945#015#012$20#015#012oid:0x15000000000965#015#012$20#015#012oid:0x15000000000966#015#012$20#015#012oid:0x15000000000986#015#012$20#015#012oid:0x15000000000987#015#012$20#015#012oid:0x150000000009a7#015#012$20#015#012oid:0x150000000009a8#015#012$20#015#012oid:0x150000000009c8#015#012$20#015#012oid:0x150000000009c9#015#012$20#015#012oid:0x150000000009e9#015#012$20#015#012oid:0x150000000009ea#015#012$20#015#012oid:0x15000000000a0a#015#012$20#015#012oid:0x15000000000a0b#015#012$20#015#012oid:0x15000000000a2b#015#012$20#015#012oid:0x15000000000a2c#015#012$20#015#012oid:0x15000000000a4c#015#012$20#015#012oid:0x15000000000a4d#015#012$20#015#012oid:0x15000000000a6d#015#012$20#015#012oid:0x15000000000a6e#015#012$20#015#012oid:0x15000000000a8e#015#012$20#015#012oid:0x15000000000a8f#015#012$20#015#012oid:0x15000000000aaf#015#012$20#015#012oid:0x15000000000ab0#015#012$1#015#0122#015#012$8#015#012COUNTERS#015#012$3#015#012100#015#012$2#015#012''#15#012, reason: ERR Error running script (call to f_b62081cc93943a4cbfec30de42638b435d31197c): @user_script:39: user_script:39: attempt to concatenate local 'queue_index' (a boolean value) : Input/output error

```

**Why I did it**

If queue_index or port_id is not defined, skip the rest of the Lua script logic.
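
The actual fix lives in the Lua script, which is not shown here. As a rough illustration of the reader-side guard only, here is a Python sketch in which plain dictionaries stand in for the Redis maps; `collect_ready_queues`, `queue_port_map`, and the sample OIDs and port name are all hypothetical:

```python
from typing import Dict, List, Tuple


def collect_ready_queues(
    queue_oids: List[str],
    queue_index_map: Dict[str, str],
    queue_port_map: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (oid, port_id, queue_index) only for fully mapped queues."""
    ready = []
    for oid in queue_oids:
        queue_index = queue_index_map.get(oid)
        port_id = queue_port_map.get(oid)
        # Guard: a map entry may not exist yet at system start, so skip
        # this queue instead of building a key from a missing value.
        if queue_index is None or port_id is None:
            continue
        ready.append((oid, port_id, queue_index))
    return ready


# Second queue has no index entry yet, as can happen early in startup.
oids = ["oid:0x15000000000290", "oid:0x15000000000291"]
index_map = {"oid:0x15000000000290": "0"}
port_map = {
    "oid:0x15000000000290": "oid:0x1000000000001",
    "oid:0x15000000000291": "oid:0x1000000000001",
}
print(collect_ready_queues(oids, index_map, port_map))
```

Skipped queues are simply picked up on a later polling pass once both maps contain them, which is why the errors were confined to system start.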

**How I verified it**

Ran it on the switch and verified that no errors appear.
vivekrnv pushed a commit that referenced this pull request Nov 22, 2023
SAI 9.x requires SYNCD_SHM_SIZE to be specified; otherwise it defaults to 64 MB, which is insufficient for syncd.

Examples of failures seen when insufficient shared memory was set:

ha_init:  The file: warmboot_data_0 is of size=762[MB] and is beyond the directory: /dev/shm available storage of size=64[MB]#15
syncd.sh[26074]: Cannot get SYNCD_SHM_SIZE for chip: [869] in /usr/share/sonic/device/x86_64-broadcom_common/syncd_shm.ini. Skip set SYNCD_SHM_SIZE.

Syncd hangs here:

syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_shr_ha_section_resize:536 start=0x7f6e641b4000, end=0x7f6e645b4000, len=302276608, free=0x7f6e641b4000
Broadcom recommended using 1 GB for DNX devices.

Since we currently don't use SAI 9.x on master and 202305, this change has no effect until the SAI is upgraded on those branches.
vivekrnv pushed a commit that referenced this pull request Jul 25, 2024
…-net#19637)

Broadcom requires that programmability_ucode_relative_path is set in SAI11.
This soc property replaces the legacy custom_feature_ucode_path

Without this we get the following error:

syncd#supervisord: syncd 0:dbx_file_get_db_location: DB Resource not defined#015
syncd#supervisord: syncd #15#015
syncd#supervisord: syncd 0:dnx_init_pemla_get_ucode_filepath:  Error 'Invalid parameter' indicated ; #15#015
vivekrnv pushed a commit that referenced this pull request Aug 21, 2024
…-net#19637)

Broadcom requires that programmability_ucode_relative_path is set in SAI11.
This soc property replaces the legacy custom_feature_ucode_path

Without this we get the following error:

syncd#supervisord: syncd 0:dbx_file_get_db_location: DB Resource not defined#015
syncd#supervisord: syncd #15#015
syncd#supervisord: syncd 0:dnx_init_pemla_get_ucode_filepath:  Error 'Invalid parameter' indicated ; #15#015
vivekrnv pushed a commit that referenced this pull request Nov 12, 2024
…7250E platform (sonic-net#20367)

Update sonic-platform submodule for Nokia-IXR7250E:
Fixes Nokia-ION/ndk#57

cdfbbe2 [H4-32D]Update platform modules after OC tests (Update README.md #17)
f28eff0 [H4-64D]Fix SFP+ port, eeprom, reboot-cause, thermal algorithm, add PSU input voltage check (Fix rules in Makefiles #15)
178e15a Minor watchdog change for better retention of last kick stamp
c479392 Remove rogue platform_reboot file
331abe0 Enhance watchdog script to detect fsde device hung signature
4c6b7c1 Fixed update temperature issue
5002fb7 Remove average and maximum
c620130 No PSU Master status led in IMM. No need to set it

Signed-off-by: mlok <[email protected]>
vivekrnv pushed a commit that referenced this pull request Nov 25, 2024
…ly (sonic-net#20847)

#### Why I did it
src/sonic-bmp
```
* bfbd47b - (HEAD -> master, origin/master, origin/HEAD) Merge pull request #15 from FengPan-Frank/makefile (13 hours ago) [Feng-msft]
* ad31f5b - Create makefile for build image flow (17 hours ago) [Feng Pan]
```
#### How I did it
#### How to verify it
#### Description for the changelog