- Revision
- About this Manual
- 1. Overview
- 2. High Level Requirements
- 3. Core Dump Generation in SONiC
- 4. Schema Additions
- 5. CLI Enhancements
- 6. Design
- 7. Test Plan
Rev | Date | Author | Change Description |
---|---|---|---|
1.0 | 06/22/2021 | Vivek Reddy Karri | Auto Invocation of Techsupport, triggered by a core dump |
2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps |
This document describes the design of automatic techsupport invocation in SONiC. The invocation is triggered when a critical process inside a docker crashes and a core dump is generated.
Currently, techsupport is run by invoking show techsupport, either manually or through orchestration tools like Jenkins. The techsupport dump also collects any core dump files available in the /var/core/ directory.
However, if techsupport invocation were made event-driven, based on core dump generation, debuggability would improve. That is the overall idea behind this HLD. The high-level requirements are summarized in the next section.
- Techsupport invocation should also be made event-driven based on core dump generation.
- This is only applicable for the critical processes running inside the dockers. Does not apply for other processes.
- init_cfg.json will be enhanced to include the "global CONFIG" required for this feature (described in section 4) and is enabled by default.
- To provide flexibility, a compile time flag "ENABLE_AUTO_TECH_SUPPORT" should be provided to enable/disable the "CONFIG" for this feature.
- Users should have the ability to globally enable/disable this capability through CLI.
- A configurable "cooloff" should be introduced to limit the number of techsupport invocations.
- The existing "--since" option in techsupport should be leveraged and this should be a configurable parameter for this feature
- Should provide a per-docker configurable granularity for this feature.
- Per-docker enable/disable capability should be achieved through FEATURE table.
- Per-docker cooloff capability should be achieved through FEATURE table.
- Changes to per-docker configs will apply to all the critical processes inside the corresponding docker.
- Existing FEATURE CLI & Table should be used to apply the Configuration
- Auto techsupport invocation should only happen when both the global cooloff and the per-docker cooloff periods have passed.
- Feature should be enabled globally and also per-docker, for this to apply on any of the critical processes running inside that docker.
- Core dump & techsupport dump cleanup mechanism should also be introduced.
- Size-based cleanup should be performed for both of these.
- Individual configurable options should be provided for each of these.
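The enable and cooloff requirements above combine into a single decision. The following is a minimal sketch of that decision, not the actual implementation; the `Settings` class, the function names, and the convention that `None` means "no previous invocation" are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    """Hypothetical holder for one scope (global or per-docker)."""
    enabled: bool
    cooloff: int  # seconds


def cooloff_passed(now: float, last: Optional[float], cooloff: int) -> bool:
    # No earlier invocation recorded means there is nothing to wait for.
    return last is None or (now - last) >= cooloff


def should_invoke_techsupport(now: float,
                              global_cfg: Settings,
                              docker_cfg: Settings,
                              last_any: Optional[float],
                              last_for_docker: Optional[float]) -> bool:
    """Return True only if both scopes are enabled and both cooloffs passed.

    last_any: time of the most recent techsupport invocation of any kind.
    last_for_docker: time of the most recent invocation caused by this docker.
    """
    if not (global_cfg.enabled and docker_cfg.enabled):
        return False
    return (cooloff_passed(now, last_any, global_cfg.cooloff)
            and cooloff_passed(now, last_for_docker, docker_cfg.cooloff))
```

With a global cooloff of 300 and a per-docker cooloff of 600, a crash 400 seconds after the previous invocation for the same docker passes the global check but is still blocked by the per-docker one.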
In SONiC, the core dumps generated from any process crashes are directed to the location /var/core and will have the naming format /var/core/*.core.gz.
The naming format and compression are governed by the script /usr/local/bin/coredump-compress.
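The sample names used in this document (for example orchagent.1626916631.117644.core.gz) suggest a layout of process name, epoch timestamp, and PID before the .core suffix. The sketch below parses a name under that assumption; the exact layout is decided by coredump-compress, so the pattern and the `CoreInfo` and `parse_core_name` names are illustrative guesses rather than a specification.

```python
import os
import re
from typing import NamedTuple, Optional


class CoreInfo(NamedTuple):
    process: str
    epoch: int
    pid: int


# Assumed layout: <process>.<epoch>.<pid>.core, optionally followed by .gz
_CORE_NAME = re.compile(r"^(?P<process>.+)\.(?P<epoch>\d+)\.(?P<pid>\d+)\.core(?:\.gz)?$")


def parse_core_name(path: str) -> Optional[CoreInfo]:
    """Split a core file name into its parts, or return None if it does not match."""
    match = _CORE_NAME.match(os.path.basename(path))
    if match is None:
        return None
    return CoreInfo(match["process"], int(match["epoch"]), int(match["pid"]))
```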
key = "AUTO_TECHSUPPORT|global"
state = enabled|disabled; # Enable/Disable the feature globally
cooloff = 300; # Minimum Time in seconds, between two successive techsupport invocations.
Manual Invocations will be considered as well in the cooloff calculation
max_techsupport_size = 10; # A percentage value should be specified.
This signifies the maximum size to which the /var/dump/ directory can grow.
The actual value in bytes is calculated based on the available space in the filesystem hosting /var/dump
When the limit is crossed, the older techsupport dumps are incrementally deleted
core_usage = 5; # A percentage value should be specified.
This signifies the maximum size to which the /var/core directory can grow.
The actual value in bytes is calculated based on the available space in the filesystem hosting /var/core
When the limit is crossed, the older core files are incrementally deleted
since = "2 days ago"; # This limits the auto-invoked techsupport to only collect the logs & core-dumps generated since the time provided.
Any valid date string of the formats specified here (https://www.gnu.org/software/coreutils/manual/html_node/Date-input-formats.html)
can be used.
If this value is not explicitly configured or a non-valid string is provided, a default value of "2 days ago" is used.
.............
.............
cooloff = 600; # Minimum Time in seconds, between two successive techsupport invocations because of the same process
The idea here is to prevent a periodically crashing process from invoking the techsupport until the cooloff has passed
auto_techsupport = enabled|disabled; # Enable/Disable this feature per-docker
module sonic-auto_techsupport {
yang-version 1.1;
namespace "http://github.com/Azure/sonic-auto_techsupport";
prefix auto_techsupport;
description "Auto Techsupport Capability in SONiC OS";
revision 2021-06-17 {
description "First Revision";
}
container sonic-auto_techsupport {
container AUTO_TECHSUPPORT {
description "AUTO_TECHSUPPORT part of config_db.json";
container global {
leaf status {
description "AUTO_TECHSUPPORT status";
type enumeration {
enum disable;
enum enable;
}
default disable;
}
leaf cooloff {
description "Minimum Time in seconds, between two successive techsupport invocations by the script.";
type uint16;
default "300";
}
leaf max_techsupport_size {
description "A percentage value should be specified.
This signifies the maximum size to which the /var/dump directory can grow.
The actual value in bytes is calculated based on the available space in the filesystem hosting /var/dump.
When the limit is crossed, the older techsupport dumps are incrementally deleted";
type uint8 {
range "0..100" {
error-message "Can only be between 0 and 100";
}
}
default "10";
}
leaf core_usage {
description "A percentage value should be specified.
This signifies the maximum size to which the /var/core directory can grow.
The actual value in bytes is calculated based on the available space in the filesystem hosting /var/core.
When the limit is crossed, the older core files are deleted.
Configure '0' to explicitly disable";
type uint8 {
range "0..100" {
error-message "Can only be between 0 and 100";
}
}
default "5";
}
leaf since {
description "This limits the auto-invoked techsupport to only collect the logs & core-dumps generated since the time provided.
Any valid date string of the formats specified here (https://www.gnu.org/software/coreutils/manual/html_node/Date-input-formats.html)
can be used.
If this value is not explicitly configured or an invalid string is provided, a default value of '2 days ago' is used";
type string {
length 1..255;
}
default "2 days ago";
}
}
/* end of container global */
}
/* end of container AUTO_TECHSUPPORT */
}
/* end of top level container */
}
Note: The "cooloff" & "auto_techsupport" will be added to the YANG Model for FEATURE Table
key = "AUTO_TECHSUPPORT|TS_CORE_MAP"
<dump_name> = <core_dump_name;timestamp_as_epoch>
Eg:
hgetall "AUTO_TECHSUPPORT|TS_CORE_MAP"
sonic_dump_sonic_20210412_223645 = orchagent.1599047232.39.core;1599047233
sonic_dump_sonic_20210405_202756 = syncd.1617684247.17.core;1617684249
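Each value in this table packs the core file name and the epoch timestamp into one string separated by a semicolon. A minimal sketch of unpacking the result of the hgetall above is shown below; it takes a plain dictionary rather than a live database connection, and the `parse_ts_core_map` name is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_ts_core_map(entries: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Turn {dump_name: "core_name;epoch"} into (dump, core, epoch) rows.

    Rows are returned newest first. Malformed values are skipped.
    """
    history = []
    for dump_name, value in entries.items():
        # Split on the last ';' so the core name is kept intact.
        core_name, sep, epoch = value.rpartition(";")
        if not sep or not epoch.isdigit():
            continue
        history.append((dump_name, core_name, int(epoch)))
    return sorted(history, key=lambda row: row[2], reverse=True)
```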
config auto-techsupport state <enabled/disabled>
config auto-techsupport cooloff <uint16>
config auto-techsupport max-techsupport <uint8>
config auto-techsupport core-usage <0..100>
config auto-techsupport since <string>
config feature auto-techsupport <name> <enabled|disabled>
config feature cooloff <name> <uint16>
admin@sonic:~$ show auto-techsupport global
STATUS   COOLOFF  MAX_TECHSUPPORT_DUMPS  MAX_CORE_DUMP_USAGE_SIZE  SINCE
-------  -------  ---------------------  ------------------------  ----------
Enabled  300 sec  3                      200000 KB / 2%            2 days ago
admin@sonic:~$ show auto-techsupport history
TECHSUPPORT DUMP                  INVOCATION REASON
--------------------------------  ---------------------------
sonic_dump_sonic_20210412_223645  orchagent.1599047232.39.core
sonic_dump_sonic_20210405_202756  syncd.1617684247.17.core
sonic_dump_sonic_20210329_183626  Unknown
sonic_dump_sonic_20210412_223645  snmpd.1617916877.41.core
admin@sonic:~$ show feature status
Feature         State     AutoRestart  SetOwner  cooloff  Auto-techsupport
--------------  --------  -----------  --------  -------  ----------------
swss            enabled   enabled                600      enabled
.....
A script named coredump_gen_handler is added to the /usr/local/bin/ directory and is invoked after a core dump is generated. The script first checks whether this feature is enabled by the user. It then verifies whether a core dump file was created within the last 20 seconds and, if so, proceeds.
The script invokes the show techsupport command if both the global cooloff and the per-docker cooloff periods have passed. The script also independently checks whether the maximum size configured by the user has been exceeded and, if so, deletes core files incrementally.
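The 20-second recency check can be sketched as a scan of the core directory by modification time. This is only an illustration under stated assumptions: the `recent_core_files` name is hypothetical, and using file mtime as the creation time is a guess about how the handler decides that a core file is new.

```python
import os
import time
from typing import List, Optional


def recent_core_files(core_dir: str,
                      now: Optional[float] = None,
                      window: int = 20) -> List[str]:
    """Return names of *.core.gz files modified within the last `window` seconds."""
    now = time.time() if now is None else now
    recent = []
    with os.scandir(core_dir) as entries:
        for entry in entries:
            if not (entry.is_file() and entry.name.endswith(".core.gz")):
                continue
            # Age of the file relative to the reference time, in seconds.
            if now - entry.stat().st_mtime <= window:
                recent.append(entry.name)
    return sorted(recent)
```

An empty result would mean the handler has nothing to act on and can exit early.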
Potential Syslog messages which can be logged are:
DATE sonic INFO coredump_gen_handler[pid]: Cooloff period has not yet passed. No Techsupport Invocation is performed
DATE sonic NOTICE coredump_gen_handler[pid]: Techsupport Invocation is successful, sonic_dump_sonic_20210721_235228.tar.gz is created in response to the coredump orchagent.1626916631.117644.core.gz
DATE sonic ERR coredump_gen_handler[pid]: Techsupport Invocation failed, No techsupport dump was created in the /var/dump directory
DATE sonic INFO coredump_gen_handler[pid]: No Cleanup process is initiated since the core-usage param is not configured
DATE sonic NOTICE coredump_gen_handler[pid]: /var/core cleanup performed. 12456 bytes are cleared.
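The incremental cleanup described above amounts to removing the oldest files first until the directory fits under the configured limit. Below is a minimal sketch of that selection step only; it works on (path, size, mtime) tuples so that it stays free of filesystem side effects, and the `select_for_deletion` name is hypothetical. The caller is assumed to skip cleanup entirely when the configured percentage is 0.

```python
from typing import List, Tuple

FileEntry = Tuple[str, int, float]  # (path, size in bytes, modification time)


def select_for_deletion(files: List[FileEntry], limit_bytes: int) -> List[str]:
    """Pick the oldest files to delete until the total size is within the limit."""
    total = sum(size for _, size, _ in files)
    to_delete = []
    # Walk from oldest to newest, stopping as soon as the limit is met.
    for path, size, _ in sorted(files, key=lambda entry: entry[2]):
        if total <= limit_bytes:
            break
        to_delete.append(path)
        total -= size
    return to_delete
```

The same selection could serve both /var/core and /var/dump, since the document describes size-based cleanup for each.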
A script named techsupport_cleanup is added to the /usr/local/bin/ directory and is invoked after a techsupport dump is created. The script first checks whether the feature is enabled by the user. It then checks whether the limit configured by the user has been crossed and, if so, deletes the old techsupport files.
Potential Syslog messages which can be logged are:
DATE sonic NOTICE techsupport_cleanup[pid]: /var/dump/ cleanup is performed. current number of dumps: 4
The coredump-compress script is updated to invoke the coredump_gen_handler script once it is done writing the core file to /var/core.
The generate_dump script is updated to invoke the techsupport_cleanup script to handle the cleanup of techsupport files.
The AUTO_TECHSUPPORT|TS_CORE_MAP table in the State DB will be preserved across Warmboot.
The size-based cleanup design was inspired by the MaxUse= option in systemd's coredump.conf: https://www.freedesktop.org/software/systemd/man/coredump.conf.html
admin@sonic-nvda-spc:/var/core$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
root-overlay 14928328 3106572 11040396 22% /
admin@sonic-nvda-spc2:/var/core$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
root-overlay 28589288 2922160 24191796 11% /
admin@sonic-nvda-spc3:/var/core$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
root-overlay 32896880 5460768 25742008 18% /
The /var/core & /var/dump directories are hosted on the root-overlay filesystem, whose size usually ranges from 10G to 25G+. A default value of 5% would amount to a minimum of 500 MB, which is already a decent amount of space for core dumps. For techsupport, a default value of 10% would amount to a minimum of 1G, which might accommodate 5-10 techsupport dumps.
These values are configurable if the admin feels otherwise.
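The percentage-to-bytes conversion behind these estimates can be sketched as follows. The helper names are hypothetical, and the choice of the filesystem's total size as the reference is an assumption made to match the arithmetic above; the schema text speaks of available space, so the real script may use a different figure.

```python
import shutil


def limit_bytes(percent: int, reference_bytes: int) -> int:
    """Convert a percentage setting into a byte limit (rounded down)."""
    return reference_bytes * percent // 100


def limit_for_path(path: str, percent: int) -> int:
    # Assumption: the total size of the hosting filesystem is the reference.
    return limit_bytes(percent, shutil.disk_usage(path).total)


if __name__ == "__main__":
    # First df sample above: 14928328 1K-blocks on root-overlay.
    total = 14928328 * 1024
    print("core_usage 5%:", limit_bytes(5, total), "bytes")
    print("max_techsupport_size 10%:", limit_bytes(10, total), "bytes")
```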
Enhance the existing techsupport sonic-mgmt test with the following cases.
S.No | Test case synopsis |
---|---|
1 | Check if the coredump_gen_handler script is in fact invoking the techsupport cmd, when configured |
2 | Check if the techsupport cleanup is working as expected |
3 | Check if the global cooloff & per-process cooloff is honoured |
4 | Check if the core-dump cleanup & techsupport-cleanup mechanisms are working as expected |