
Releases: mej/nhc

LBNL NHC version 1.4.3

09 Mar 19:00
@mej mej

Please note: As with future releases in the 1.4.x series, this release has largely been limited to bugfixes and incremental improvements of existing code. Active feature development has already started (and has been going on for a while now, truthfully) on what will eventually be the development branch for a future 1.5 release.

There will, however, be a 1.4.4 release coming up in the very near future (hopefully) to address some of the still-outstanding bugs/issues in the 1.4.x tree. 1.4.3 is being released largely "as-is" due to the massive volume of real-world production testing the current codebase has received, making it as "rock-solid" reliable as one could hope!

New Features:

  • Toggle BASH tracing or NHC debugging via SIGUSR1/SIGUSR2, respectively
  • check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via nvidia-smi
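
For readers unfamiliar with signal-driven toggles in shell scripts, the following is a minimal sketch of the general technique, not NHC's actual handler code. The function names `toggle_trace` and `toggle_debug` and the variables `TRACE_ON` and `DEBUG` are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names below are assumptions, not NHC internals.

TRACE_ON=0   # hypothetical flag: 1 while shell tracing (set -x) is active
DEBUG=0      # hypothetical flag: 1 while debug messages are enabled

# Flip shell tracing on or off each time the handler runs.
toggle_trace() {
    if [ "$TRACE_ON" -eq 0 ]; then
        TRACE_ON=1
        set -x
    else
        set +x
        TRACE_ON=0
    fi
}

# Flip the debug flag each time the handler runs.
toggle_debug() {
    if [ "$DEBUG" -eq 0 ]; then
        DEBUG=1
    else
        DEBUG=0
    fi
}

# SIGUSR1 toggles tracing, SIGUSR2 toggles debugging.
trap toggle_trace USR1
trap toggle_debug USR2
```

With handlers like these installed in a long-running script, `kill -USR1 <pid>` would switch tracing on, and a second `kill -USR1 <pid>` would switch it off again. The shell runs a trap handler once the currently executing command has finished, so the change takes effect between commands.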

Fixes/Improvements:

  • Corrections/cleanups of SGE integration support
  • Provide added detail to tracing info (-x mode)
  • Based on feedback from Moe Jette of SchedMD, pull node job data directly from Slurm via squeue instead of the previous method that only worked for single-node jobs.
  • Support for recent additions to the Slurm node states (e.g., "planned")
  • Pathname expansion has been disabled on startup, and re-enabled only when being actively used, to avoid "unintended" expansions of wildcards at random points throughout the code.
  • Correct clobbering of BASH built-in variables and add tests to prevent future recurrence
  • Switch "system UID" boundary handling to a more accurate source of truth, and ensure that the code matches the math, naming, and intent.
  • Reorder resource manager detection to improve its accuracy, especially with respect to Slurm vs. PBS (all variants)
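
The pathname expansion change above can be illustrated with a short sketch. This shows the general pattern of disabling globbing globally (`set -f`) and re-enabling it only around a deliberate expansion. It is an assumption about the approach, not a copy of NHC's code, and the helper name `expand_glob` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; expand_glob is a hypothetical helper.

set -f   # disable pathname expansion for the whole script

pattern='*.conf'

# With globbing off, an unquoted wildcard stays literal instead of
# silently turning into a list of matching file names.
literal=$(echo $pattern)

# Enable globbing only for the one expansion that needs it, then
# switch it back off before returning.
expand_glob() {
    set +f
    set -- $1      # intentionally unquoted so the pattern is expanded
    set -f
    printf '%s\n' "$@"
}
```

Calling `expand_glob '/some/dir/*.conf'` would print one matching path per line, while every other unquoted variable in the script remains safe from accidental wildcard expansion. If nothing matches, the pattern is printed unchanged, which is the shell's default behaviour.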

Full Changelog: 1.4.2...1.4.3

LBNL NHC version 1.4.2

11 Nov 22:35
@mej mej

The LBNL Node Health Check (NHC) Development Team is pleased to announce the release of LBNL Node Health Check (formerly Warewulf Node Health Check) version 1.4.2. This release is mostly a maintenance release, other than the name change, but it does offer the following new features:

  • New check: check_cmd_dmesg for watching the dmesg output for particular strings
  • New check: check_net_ping for sending PING requests to remote hosts to verify functional networking
  • Enhanced check: check_ps_service now offers the -v and -V flags to add service action verification (e.g., making sure a service restart actually worked)
  • External match string checking: User-defined delimiter characters and commands allow for customized handling of match strings. Enable/disable checks based on node groups, tags, subclusters, labels, or any other categorization feature offered by Warewulf, Clustershell, PDSH, and more! Example: <login-nodes> || check_ps_service -u root -v -S sshd
  • More efficient range checking, plus a bugfix to prevent node range endpoints from being interpreted as octal numbers
  • New CLI options -x for bash-level debugging output and -e <check> for single-check eval-and-exit behavior
  • Plus the usual bugfixes and performance enhancements!
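
The octal bugfix mentioned above addresses a common shell pitfall: in arithmetic contexts, a number with a leading zero (such as the `010` in a node name like `n010`) is read as octal. The sketch below shows one way to avoid this by stripping leading zeros before doing arithmetic. It is an assumption about the general technique, not NHC's actual fix, and the helper name `to_decimal` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; to_decimal is a hypothetical helper.

# Strip leading zeros so the shell treats the value as decimal.
# A lone "0" (or a run of zeros) collapses to "0".
to_decimal() {
    local n=$1
    while [ "${n#0}" != "$n" ] && [ "${#n}" -gt 1 ]; do
        n=${n#0}
    done
    printf '%s\n' "$n"
}

# The pitfall: 010 is parsed as octal 8, so this yields 9.
naive=$(( 010 + 1 ))

# The workaround: normalise first, so this yields 11.
fixed=$(( $(to_decimal 010) + 1 ))
```

Values such as `08` and `09` are worse than merely wrong, because they are not valid octal and make the arithmetic expansion fail outright. Normalising the endpoints first avoids both problems. In bash specifically, an explicit base prefix such as `$(( 10#$n ))` is another common way to force decimal interpretation.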

See https://github.com/mej/nhc/commits/master for the complete commit log!