AWS ParallelCluster v2.4.0

@lukeseawalker released this 11 Jun 15:26
1c53ad5

We're excited to announce the release of AWS ParallelCluster 2.4.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

Docs

New docs are available here: https://docs.aws.amazon.com/parallelcluster/latest/ug/

Enhancements

  • Add support for EFA on CentOS 7, Amazon Linux and Ubuntu 16.04
  • Add support for Ubuntu in China region cn-northwest-1
  • SGE:
    • process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • add support for jobs in hold/suspended state (this includes job dependencies)
    • automatically terminate and replace faulty or unresponsive compute nodes
    • add retries in case of failures when adding or removing nodes
    • configure scheduler to handle rescheduling and cancellation of jobs running on failing or terminated nodes
  • Slurm:
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • automatically terminate and replace faulty or unresponsive compute nodes
    • decrease SlurmdTimeout to 120 seconds to speed up replacement of faulty nodes
  • Automatically replace compute instances that fail initialization and dump logs to shared home directory.
  • Dynamically fetch compute instance type and cluster size in order to support updates in scaling daemons
  • Always use the full master FQDN when mounting NFS on compute nodes. This resolves issues seen with certain networking setups and custom DNS configurations
  • Show the cluster version and status in the output of pcluster list
  • Remove double quoting of the post_install args
  • awsbsub: use override option to set the number of nodes rather than creating multiple JobDefinitions
  • Add support for AWS_PCLUSTER_CONFIG_FILE env variable to specify pcluster config file
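
The AWS_PCLUSTER_CONFIG_FILE item above can be pictured with a small Python sketch of how a tool might choose its config file. This is not the pcluster implementation: the function name resolve_config_path, the precedence order (explicit argument, then environment variable, then default) and the default path are assumptions made for illustration.

```python
import os

# Assumed default location, used here only for illustration.
DEFAULT_CONFIG = os.path.join("~", ".parallelcluster", "config")


def resolve_config_path(cli_config=None, environ=None):
    """Pick a config path (hypothetical helper, not part of pcluster).

    Assumed precedence: explicit CLI argument, then the
    AWS_PCLUSTER_CONFIG_FILE environment variable, then the default.
    """
    environ = os.environ if environ is None else environ
    if cli_config:
        return os.path.expanduser(cli_config)
    env_config = environ.get("AWS_PCLUSTER_CONFIG_FILE")
    if env_config:
        return os.path.expanduser(env_config)
    return os.path.expanduser(DEFAULT_CONFIG)
```

Passing the environment as a parameter keeps the lookup easy to exercise without modifying the real process environment.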

Changes

  • Update Open MPI library to version 3.1.4 on CentOS 7, Amazon Linux and Ubuntu 16.04. This also changes the default openmpi path to /opt/amazon/efa/bin/ and the openmpi module name to openmpi/3.1.4
  • Set soft and hard ulimit on open files to 10000 for all supported OSs
  • For a better security posture, we're removing AWS credentials from the parallelcluster config file. Credentials can now be set up following the standard procedure used for the AWS CLI
  • When using FSx or EFS, the sanity check no longer enforces that the compute security group is open to 0.0.0.0/0
  • When updating an existing cluster, the same template version is now used, regardless of the pcluster CLI version
  • SQS messages that fail to be processed in sqswatcher are now re-queued at most 3 times instead of indefinitely
  • Reset nodewatcher idletime to 0 when the host becomes essential for the cluster (because of min size of ASG or because there are pending jobs in the scheduler queue)
  • SGE: a node is considered busy when it is in one of the following states: "u", "C", "s", "d", "D", "E", "P", "o". This allows a quick replacement of the node without waiting for the nodewatcher to terminate it.
  • Do not update DynamoDB table on cluster updates in order to avoid hitting strict API limits (1 update per day).
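
The bounded re-queue behavior described for sqswatcher above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the sqswatcher code: an in-memory deque stands in for the SQS queue, and the names drain_queue and handler are hypothetical.

```python
from collections import deque

MAX_REQUEUES = 3  # from the release note: failed messages are re-queued only 3 times


def drain_queue(messages, handler, max_requeues=MAX_REQUEUES):
    """Process messages, re-queuing failures a bounded number of times.

    Hypothetical helper: returns (processed, dropped) lists of message bodies.
    """
    # Each entry carries the message body and how often it has been re-queued.
    queue = deque((body, 0) for body in messages)
    processed, dropped = [], []
    while queue:
        body, requeues = queue.popleft()
        try:
            handler(body)
            processed.append(body)
        except Exception:
            if requeues < max_requeues:
                # Put the message back at the end of the queue for another try.
                queue.append((body, requeues + 1))
            else:
                # Give up instead of retrying forever.
                dropped.append(body)
    return processed, dropped
```

With a limit of 3 re-queues, a message that always fails is handled four times in total (the first attempt plus three retries) before it is dropped.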

Bug Fixes

  • Fix issue that was preventing Torque from being used on CentOS 7
  • Start node daemons at the end of instance initialization. Time spent on the post-install script and node initialization is no longer counted as part of node idletime.
  • Fix issue that caused an additional, invalid EBS mount point to be added when multiple EBS volumes were configured
  • Install Slurm libpmpi/libpmpi2 that is distributed in a separate package since Slurm 17
  • pcluster ssh command now works for clusters with use_public_ips = false
  • Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger cluster scaling
  • Add a timeout on remote command execution so that the daemons do not get stuck if a compute node is unresponsive
  • Fix an edge case that caused the nodewatcher to hang forever when the node became essential to the cluster during a call to self_terminate.
  • Fix pcluster start/stop commands when used with an awsbatch cluster
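
The timeout fix for remote commands listed above follows a common pattern, sketched below in Python. This is only an illustration: a local subprocess stands in for the remote command, and the helper name run_with_timeout is hypothetical rather than taken from the ParallelCluster daemons.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run a command, returning its exit code or None if it timed out.

    Hypothetical helper; a local process stands in for a remote command.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising,
        # so the caller is never blocked by an unresponsive command.
        return None
    return completed.returncode
```

A caller can treat a None result as "node unresponsive" and move on to the next node instead of waiting indefinitely.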

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192