Commit

Deployed 58f70e7 with MkDocs version: 1.3.0
cartalla committed Mar 22, 2024
1 parent 39986a9 commit f97f764
Showing 19 changed files with 178 additions and 521 deletions.
16 changes: 5 additions & 11 deletions 404.html
@@ -50,6 +50,9 @@
<li class="navitem">
<a href="/soca_integration/" class="nav-link">SOCA Integration</a>
</li>
<li class="navitem">
<a href="/custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="/run_jobs/" class="nav-link">Run Jobs</a>
</li>
@@ -60,19 +63,10 @@
<a href="/rest_api/" class="nav-link">Slurm REST API</a>
</li>
<li class="navitem">
<a href="/onprem/" class="nav-link">On-Premises Integration (legacy)</a>
</li>
<li class="navitem">
<a href="/custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="/federation/" class="nav-link">Federation (legacy)</a>
</li>
<li class="navitem">
<a href="/delete-cluster/" class="nav-link">Delete Cluster (legacy)</a>
<a href="/onprem/" class="nav-link">On-Premises Integration</a>
</li>
<li class="navitem">
<a href="/implementation/" class="nav-link">Implementation Details (legacy)</a>
<a href="/delete-cluster/" class="nav-link">Delete Cluster</a>
</li>
<li class="navitem">
<a href="/debug/" class="nav-link">Debug</a>
16 changes: 5 additions & 11 deletions CONTRIBUTING/index.html
@@ -50,6 +50,9 @@
<li class="navitem">
<a href="../soca_integration/" class="nav-link">SOCA Integration</a>
</li>
<li class="navitem">
<a href="../custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../run_jobs/" class="nav-link">Run Jobs</a>
</li>
@@ -60,19 +63,10 @@
<a href="../rest_api/" class="nav-link">Slurm REST API</a>
</li>
<li class="navitem">
<a href="../onprem/" class="nav-link">On-Premises Integration (legacy)</a>
</li>
<li class="navitem">
<a href="../custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../federation/" class="nav-link">Federation (legacy)</a>
</li>
<li class="navitem">
<a href="../delete-cluster/" class="nav-link">Delete Cluster (legacy)</a>
<a href="../onprem/" class="nav-link">On-Premises Integration</a>
</li>
<li class="navitem">
<a href="../implementation/" class="nav-link">Implementation Details (legacy)</a>
<a href="../delete-cluster/" class="nav-link">Delete Cluster</a>
</li>
<li class="navitem">
<a href="../debug/" class="nav-link">Debug</a>
20 changes: 7 additions & 13 deletions custom-amis/index.html
@@ -50,6 +50,9 @@
<li class="navitem">
<a href="../soca_integration/" class="nav-link">SOCA Integration</a>
</li>
<li class="navitem active">
<a href="./" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../run_jobs/" class="nav-link">Run Jobs</a>
</li>
@@ -60,19 +63,10 @@
<a href="../rest_api/" class="nav-link">Slurm REST API</a>
</li>
<li class="navitem">
<a href="../onprem/" class="nav-link">On-Premises Integration (legacy)</a>
</li>
<li class="navitem active">
<a href="./" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../federation/" class="nav-link">Federation (legacy)</a>
</li>
<li class="navitem">
<a href="../delete-cluster/" class="nav-link">Delete Cluster (legacy)</a>
<a href="../onprem/" class="nav-link">On-Premises Integration</a>
</li>
<li class="navitem">
<a href="../implementation/" class="nav-link">Implementation Details (legacy)</a>
<a href="../delete-cluster/" class="nav-link">Delete Cluster</a>
</li>
<li class="navitem">
<a href="../debug/" class="nav-link">Debug</a>
@@ -86,12 +80,12 @@
</a>
</li>
<li class="nav-item">
<a rel="prev" href="../onprem/" class="nav-link">
<a rel="prev" href="../soca_integration/" class="nav-link">
<i class="fa fa-arrow-left"></i> Previous
</a>
</li>
<li class="nav-item">
<a rel="next" href="../federation/" class="nav-link">
<a rel="next" href="../run_jobs/" class="nav-link">
Next <i class="fa fa-arrow-right"></i>
</a>
</li>
175 changes: 16 additions & 159 deletions debug/index.html
@@ -50,6 +50,9 @@
<li class="navitem">
<a href="../soca_integration/" class="nav-link">SOCA Integration</a>
</li>
<li class="navitem">
<a href="../custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../run_jobs/" class="nav-link">Run Jobs</a>
</li>
@@ -60,19 +63,10 @@
<a href="../rest_api/" class="nav-link">Slurm REST API</a>
</li>
<li class="navitem">
<a href="../onprem/" class="nav-link">On-Premises Integration (legacy)</a>
</li>
<li class="navitem">
<a href="../custom-amis/" class="nav-link">Custom AMIs for ParallelCluster</a>
</li>
<li class="navitem">
<a href="../federation/" class="nav-link">Federation (legacy)</a>
</li>
<li class="navitem">
<a href="../delete-cluster/" class="nav-link">Delete Cluster (legacy)</a>
<a href="../onprem/" class="nav-link">On-Premises Integration</a>
</li>
<li class="navitem">
<a href="../implementation/" class="nav-link">Implementation Details (legacy)</a>
<a href="../delete-cluster/" class="nav-link">Delete Cluster</a>
</li>
<li class="navitem active">
<a href="./" class="nav-link">Debug</a>
@@ -86,7 +80,7 @@
</a>
</li>
<li class="nav-item">
<a rel="prev" href="../implementation/" class="nav-link">
<a rel="prev" href="../delete-cluster/" class="nav-link">
<i class="fa fa-arrow-left"></i> Previous
</a>
</li>
@@ -118,33 +112,13 @@

<li class="nav-item" data-level="1"><a href="#debug" class="nav-link">Debug</a>
<ul class="nav flex-column">
<li class="nav-item" data-level="2"><a href="#log-files-on-file-system" class="nav-link">Log Files on File System</a>
<ul class="nav flex-column">
</ul>
</li>
<li class="nav-item" data-level="2"><a href="#slurm-ami-nodes" class="nav-link">Slurm AMI Nodes</a>
<ul class="nav flex-column">
</ul>
</li>
<li class="nav-item" data-level="2"><a href="#slurm-controller" class="nav-link">Slurm Controller</a>
<ul class="nav flex-column">
<li class="nav-item" data-level="3"><a href="#slurm-controller-log-files" class="nav-link">Slurm Controller Log Files</a>
<ul class="nav flex-column">
</ul>
</li>
</ul>
</li>
<li class="nav-item" data-level="2"><a href="#slurm-accounting-database-slurmdbd" class="nav-link">Slurm Accounting Database (slurmdbd)</a>
<ul class="nav flex-column">
<li class="nav-item" data-level="3"><a href="#log-files" class="nav-link">Log Files</a>
<li class="nav-item" data-level="2"><a href="#slurm-head-node" class="nav-link">Slurm Head Node</a>
<ul class="nav flex-column">
</ul>
</li>
</ul>
</li>
<li class="nav-item" data-level="2"><a href="#compute-nodes" class="nav-link">Compute Nodes</a>
<ul class="nav flex-column">
<li class="nav-item" data-level="3"><a href="#log-files_1" class="nav-link">Log Files</a>
<li class="nav-item" data-level="3"><a href="#log-files" class="nav-link">Log Files</a>
<ul class="nav flex-column">
</ul>
</li>
@@ -166,151 +140,34 @@
<div class="col-md-9" role="main">

<h1 id="debug">Debug</h1>
<h2 id="log-files-on-file-system">Log Files on File System</h2>
<p>Most of the key log files are stored on the Slurm file system so that they can be accessed from any instance with the file system mounted.</p>
<table>
<thead>
<tr>
<th>Logfile</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/slurmd.log</code></td>
<td>Slurm daemon (slurmd) logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/nodes/{{node-name}}/spot_monitor.log</code></td>
<td>Spot monitor logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/cloudwatch.log</code></td>
<td>Cloudwatch cron (slurm_ec2_publish_cw.py) logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/power_save.log</code></td>
<td>Power saving API logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/slurmctld.log</code></td>
<td>Slurm controller daemon (slurmctld) logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/slurmctl[1-2]/terminate_old_instances.log</code></td>
<td>Terminate old instances cron (terminate_old_instances.py) logfile</td>
</tr>
<tr>
<td><code>/opt/slurm/{{ClusterName}}/logs/slurmdbd/slurmdbd.log</code></td>
<td>Slurm database daemon (slurmdbd) logfile</td>
</tr>
</tbody>
</table>
<h2 id="slurm-ami-nodes">Slurm AMI Nodes</h2>
<p>The Slurm AMI nodes build the Slurm binaries for all of the configured operating system (OS) variants.
The Amazon Linux 2 build is a prerequisite for the Slurm controllers and slurmdbd instances.
The other builds are prerequisites for compute nodes and submitters.</p>
<p>First check for errors in the user data script. The following command will show the output:</p>
<p><code>grep cloud-init /var/log/messages | less</code></p>
<p>The most common problem is that the ansible playbook failed.
Check the ansible log file to see what failed.</p>
<p><code>less /var/log/ansible.log</code></p>
<p>The following command will rerun the user data.
It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.</p>
<p><code>/var/lib/cloud/instance/scripts/part-001</code></p>
<p>If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
your modified playbook by running the following command.</p>
<p><code>/root/slurm_node_ami_config.sh</code></p>
<h2 id="slurm-controller">Slurm Controller</h2>
<p>For ParallelCluster and Slurm issues, refer to the official <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html">AWS ParallelCluster Troubleshooting documentation</a>.</p>
<h2 id="slurm-head-node">Slurm Head Node</h2>
<p>If slurm commands hang, then it's likely a problem with the Slurm controller.</p>
<p>The first thing to check is the controller's logfile which is stored on the Slurm file system.</p>
<p><code>/opt/slurm/{{ClusterName}}/logs/nodes/slurmctl[1-2]/slurmctld.log</code></p>
<p>If the logfile doesn't exist or is empty then you will need to connect to the slurmctl instance using SSM Manager or ssh and switch to the root user.</p>
<p>Connect to the head node from the EC2 console using SSM Manager or ssh and switch to the root user.</p>
<p><code>sudo su</code></p>
<p>The first thing to do is to ensure that the Slurm controller daemon is running:</p>
<p><code>systemctl status slurmctld</code></p>
<p>If it isn't then first check for errors in the user data script. The following command will show the output:</p>
<p><code>grep cloud-init /var/log/messages | less</code></p>
<p>The most common problem is that the ansible playbook failed.
Check the ansible log file to see what failed.</p>
<p><code>less /var/log/ansible.log</code></p>
<p>The following command will rerun the user data.
It will download the playbooks from the S3 deployment bucket and then run it to configure the instance.</p>
<p>Then check the controller's logfile.</p>
<p><code>/var/log/slurmctld.log</code></p>
<p>The following command will rerun the user data.</p>
<p><code>/var/lib/cloud/instance/scripts/part-001</code></p>
<p>If the problem is with the ansible playbook, then you can edit it in /root/playbooks and then run
your modified playbook by running the following command.</p>
<p><code>/root/slurmctl_config.sh</code></p>
<p>The daemon may also be failing because of some other error.
Check the <code>slurmctld.log</code> for errors.</p>
<p>Another way to debug the <code>slurmctld</code> daemon is to launch it interactively with debug set high.
The first thing to do is get the path to the slurmctld binary.</p>
<pre><code>slurmctld=$(cat /etc/systemd/system/slurmctld.service | awk -F '=' '/ExecStart/ {print $2}')
</code></pre>
<p>Then you can run slurmctld:</p>
<pre><code>$slurmctld -D -vvvvv
</code></pre>
<h3 id="slurm-controller-log-files">Slurm Controller Log Files</h3>
<table>
<thead>
<tr>
<th>Logfile</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/var/log/ansible.log</code></td>
<td>Ansible logfile</td>
</tr>
<tr>
<td><code>/var/log/slurm/cloudwatch.log</code></td>
<td>Logfile for the script that uploads CloudWatch events.</td>
</tr>
<tr>
<td><code>/var/log/slurm/slurmctld.log</code></td>
<td>slurmctld logfile</td>
</tr>
<tr>
<td><code>/var/log/slurm/power_save.log</code></td>
<td>Slurm plugin logfile with power saving scripts that start, stop, and terminated instances.</td>
</tr>
<tr>
<td><code>/var/log/slurm/terminate_old_instances.log</code></td>
<td>Logfile for the script that terminates stopped instances.</td>
</tr>
</tbody>
</table>
<h2 id="slurm-accounting-database-slurmdbd">Slurm Accounting Database (slurmdbd)</h2>
<p>If you are having problems with the slurm accounting database connect to the slurmdbd instance using SSM Manager.</p>
<p>Check for cloud-init and ansible errors the same way as for the slurmctl instance.</p>
<p>Also check the <code>slurmdbd.log</code> for errors.</p>
<h3 id="log-files">Log Files</h3>
<table>
<thead>
<tr>
<th>Logfile</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/var/log/ansible.log</code></td>
<td>Ansible logfile</td>
</tr>
<tr>
<td><code>/var/log/slurm/slurmdbd.log</code></td>
<td>slurmctld logfile</td>
</tr>
</tbody>
</table>
<h2 id="compute-nodes">Compute Nodes</h2>
<p>If there are problems with the compute nodes, connect to them using SSM Manager.</p>
<p>Check for cloud-init errors the same way as for the slurmctl instance.
The compute nodes do not run ansible; their AMIs are configured using ansible.</p>
<p>Also check the <code>slurmd.log</code>.</p>
<p>Check that the slurm daemon is running.</p>
<p><code>systemctl status slurmd</code></p>
<h3 id="log-files_1">Log Files</h3>
<h3 id="log-files">Log Files</h3>
<table>
<thead>
<tr>
@@ -320,7 +177,7 @@ <h3 id="log-files_1">Log Files</h3>
</thead>
<tbody>
<tr>
<td><code>/var/log/slurm/slurmd.log</code></td>
<td><code>/var/log/slurmd.log</code></td>
<td>slurmctld logfile</td>
</tr>
</tbody>