Add multi-AZ and multi-region support #36

Merged: 2 commits on Jul 9, 2022

3 changes: 3 additions & 0 deletions .gitignore
@@ -6,4 +6,7 @@ site/
# Jekyll
Gemfile.lock
.jekyll-cache
.mkdocs_venv/
_site
site/
.vscode/
21 changes: 15 additions & 6 deletions Makefile
@@ -1,15 +1,24 @@

.PHONY: help local-docs test clean

help:
@echo "Usage: make [ help | clean ]"
@echo "Usage: make [ help | local-docs | github-docs | clean ]"

.mkdocs_venv/bin/activate:
rm -rf .mkdocs_venv
python3 -m venv .mkdocs_venv
source .mkdocs_venv/bin/activate; pip install mkdocs

local-docs: .mkdocs_venv/bin/activate
source .mkdocs_venv/bin/activate; mkdocs serve&
firefox http://127.0.0.1:8000/

github-docs: .mkdocs_venv/bin/activate
source .mkdocs_venv/bin/activate; mkdocs gh-deploy --strict

test:
pytest -x -v tests

jekyll:
gem install jekyll bundler
bundler install
bundle exec jekyll serve

clean:
git clean -d -f -x
# -d: Recurse into directories
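For reference, a quick sketch of how these targets are typically invoked locally (assuming GNU Make, Python 3 with venv, and Firefox are available):

```
make local-docs    # create .mkdocs_venv, install mkdocs, serve the docs, and open http://127.0.0.1:8000/
make github-docs   # publish the docs to GitHub Pages via mkdocs gh-deploy --strict
make test          # run the pytest suite in tests/
make clean         # remove untracked files (git clean -d -f -x)
```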
22 changes: 14 additions & 8 deletions README.md
@@ -1,8 +1,6 @@
# AWS EDA Slurm Cluster

[View on GitHub Pages](https://aws-samples.github.io/aws-eda-slurm-cluster/)

This repository contains an AWS Cloud Development Kit (CDK) application that creates a SLURM cluster that is suitable for running production EDA workloads on AWS.
This repository contains an AWS Cloud Development Kit (CDK) application that creates a Slurm cluster that is suitable for running production EDA workloads on AWS.
Key features are:

* Automatic scaling of AWS EC2 instances based on demand
@@ -11,7 +9,7 @@ Key features are:
* Batch and interactive partitions (queues)
* Managed tool licenses as a consumable resource
* User and group fair share scheduling
* SLURM accounting database
* Slurm accounting database
* CloudWatch dashboard
* Job preemption
* Multi-cluster federation
@@ -21,7 +19,7 @@ Key features are:

## Operating System and Processor Architecture Support

This SLURM cluster supports the following OSes:
This Slurm cluster supports the following OSes:

* Alma Linux 8
* Amazon Linux 2
@@ -32,7 +30,7 @@ This SLURM cluster supports the following OSes:
RedHat stopped supporting CentOS 8, so for a similar RedHat 8 binary compatible distribution we support Alma Linux and
Rocky Linux as replacements for CentOS.

This SLURM cluster supports both Intel/AMD (x86_64) based instances and ARM Graviton2 (arm64/aarch64) based instances.
This Slurm cluster supports both Intel/AMD (x86_64) based instances and ARM Graviton2 (arm64/aarch64) based instances.

[Graviton 2 instances require](https://github.com/aws/aws-graviton-getting-started/blob/main/os.md) Amazon Linux 2, RedHat 8, AlmaLinux 8, or RockyLinux 8 operating systems.
RedHat 7 and CentOS 7 do not support Graviton 2.
@@ -52,7 +50,9 @@ This provides the following different combinations of OS and processor architect

## Documentation

To view the docs, clone the repository and run mkdocs:
[View on GitHub Pages](https://aws-samples.github.io/aws-eda-slurm-cluster/)

To view the docs locally, clone the repository and run mkdocs:

The docs are in the docs directory. You can view them in an editor or using the mkdocs tool.

@@ -74,10 +74,16 @@ firefox http://127.0.0.1:8000/ &

Open a browser to: http://127.0.0.1:8000/

Or you can simply let make do this for you.

```
make local-docs
```

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.
This library is licensed under the MIT-0 License. See the [LICENSE](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/LICENSE) file.
4 changes: 0 additions & 4 deletions _config.yml

This file was deleted.

1 change: 0 additions & 1 deletion docs/_config.yml

This file was deleted.

40 changes: 19 additions & 21 deletions docs/deploy.md
@@ -75,17 +75,15 @@ Add the nodejs bin directory to your path.
Note that the version of aws-cdk changes frequently.
The version that has been tested is in the CDK_VERSION variable in the install script.

```
The install script will try to install the prerequisites if they aren't already installed.
```
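A minimal sketch of doing this by hand, in case the automatic install is not wanted; the nodejs install path is an assumption, and the exact CDK version should be taken from the CDK_VERSION variable in the install script:

```
# Put the nodejs bin directory on the path (install location is an assumption)
export PATH=$PATH:$HOME/node/bin

# Install the tested aws-cdk version; replace <CDK_VERSION> with the value from the install script
npm install -g aws-cdk@<CDK_VERSION>
cdk --version
```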

## Configuration File

The first step in deploying your cluster is to create a configuration file.
A default configuration file is found in [source/resources/config/default_config.yml](source/config/default_config.yml).
A default configuration file is found in [source/resources/config/default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml).
You should create a new config file and update the parameters for your cluster.

The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](source/cdk/config_schema.py).
The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py).
The schema is defined in python, but the actual config file should be in yaml format.
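As a rough sketch, a new config can be started from the default and passed to the deployment; the `--config-file` flag shown here is an assumption, so check the install script's help for the actual option name:

```
# Start from the default config and edit the parameters for your cluster
cp source/resources/config/default_config.yml my_cluster.yml

# Deploy with the new config (flag name is an assumption)
./install.sh --config-file my_cluster.yml
```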

The following are key parameters that you will need to update.
@@ -115,7 +113,7 @@ The defaults for the following parameters are generally acceptable, but may be m
## Configure the Compute Instances

The InstanceConfig configuration parameter configures the base operating systems, CPU architectures, instance families,
and instance types that the SLURM cluster should support.
and instance types that the Slurm cluster should support.
The supported OSes and CPU architectures are:

| Base OS | CPU Architectures
@@ -204,7 +202,7 @@ If you want to use the latest base OS AMIs, then configure your AWS cli credenti
the tested version.

```
source/create-ami-map.py > source/resources/config/ami_map.yml
./source/create-ami-map.py > source/resources/config/ami_map.yml
```

## Use Your Own AMIs (Optional)
@@ -240,13 +238,13 @@ This is useful if the root volume needs additional space to install additional p

## Configure Fair Share Scheduling (Optional)

SLURM supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
Slurm supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
By default, all users will be put into a default group that has a low fair share.
The configuration file is at **source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/accounts.yml.example**
The configuration file is at [source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/accounts.yml.example](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/accounts.yml.example)
in the repository and is deployed to **/opt/slurm/{{ClusterName}}/conf/accounts.yml**.

The file is a simple yaml file that allows you to configure groups, the users that belong to the group, and a fair share weight for the group.
Refer to the SLURM documentation for details on how the fair share weight is calculated.
Refer to the Slurm documentation for details on how the fair share weight is calculated.
The scheduler can be configured so that users who aren't getting their fair share of resources get
higher priority.
The following shows 3 top level groups.
@@ -322,13 +320,13 @@ These weights can be adjusted based on your needs to control job priorities.
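Once the accounts and weights have been loaded, fair-share standing can be inspected with standard Slurm tooling, for example:

```
# Show fair-share factors and usage for all associations (standard Slurm command)
sshare -l -a
```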

## Configure Licenses

SLURM supports [configuring licenses as a consumable resource](https://slurm.schedmd.com/licenses.html).
Slurm supports [configuring licenses as a consumable resource](https://slurm.schedmd.com/licenses.html).
It keeps track of how many running jobs are using a license; when no more licenses are available,
jobs stay pending in the queue until a job completes and frees up a license.
Combined with the fairshare algorithm, this can prevent users from monopolizing licenses and preventing others from
being able to run their jobs.

The configuration file is at **source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/accounts.yml.example**
The configuration file is at [source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/slurm_licenses.conf.example](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/slurm_licenses.conf.example)
in the repository and is deployed to **/opt/slurm/{{ClusterName}}/conf/accounts.yml**.

The example configuration shows how the number of licenses can be configured as just a comma separated list.
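Once the license counts are loaded, they can be checked and consumed with standard Slurm commands, for example:

```
# List configured license counts and current usage (standard Slurm command)
scontrol show licenses

# Request one license for a job; the license name and job script are placeholders
sbatch -L <license_name>:1 job.sh
```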
@@ -351,11 +349,11 @@ with command line arguments, however it is better to specify all of the paramete
## Use the Cluster

Configuring your environment for users requires root privileges.
The configuration commands are found in the outputs of the SLURM cloudformation stack.
The configuration commands are found in the outputs of the Slurm cloudformation stack.

### Configure SLURM Users and Groups
### Configure Slurm Users and Groups

The SLURM cluster needs to configure the users and groups of your environment.
The Slurm cluster needs to configure the users and groups of your environment.
For efficiency, it does this by capturing the users and groups from your environment
and saving them in a json file.
When the compute nodes start they create local unix users and groups using this json file.
@@ -364,18 +362,18 @@ Choose a single instance in your VPC that will always be running and that is joi
so that it can list all users and groups.
For SOCA this would be the Scheduler instance.
Connect to that instance and run the commands in the **MountCommand** and **ConfigureSyncSlurmUsersGroups** outputs
of the SLURM stack.
These commands will mount the SLURM file system at **/opt/slurm/{{ClusterName}}** and then create
of the Slurm stack.
These commands will mount the Slurm file system at **/opt/slurm/{{ClusterName}}** and then create
a cron job that runs every 5 minutes and updates **/opt/slurm/{{ClusterName}}/config/users_groups.json**.
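For illustration only, the effect of those outputs looks roughly like this; the real commands, file system address, and cluster name come from the stack outputs:

```
# Mount the Slurm file system (actual file system DNS name comes from the MountCommand output)
sudo mount -t nfs <file-system-dns>:/ /opt/slurm/<ClusterName>

# After the 5-minute cron job has run, the captured users and groups are available here
cat /opt/slurm/<ClusterName>/config/users_groups.json
```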

### Configure SLURM Submitter Instances
### Configure Slurm Submitter Instances

Instances that need to submit to SLURM need to have their security group IDs in the **SubmitterSecurityGroupIds** configuration parameter
so that the security groups allow communication between the submitter instances and the SLURM cluster.
They also need to be configured by mounting the file system with the SLURM tools and
Instances that need to submit to Slurm need to have their security group IDs in the **SubmitterSecurityGroupIds** configuration parameter
so that the security groups allow communication between the submitter instances and the Slurm cluster.
They also need to be configured by mounting the file system with the Slurm tools and
configuring their environment.
Connect to the submitter instance and run the commands in the **MountCommand** and **ConfigureSubmitterCommand** outputs
of the SLURM stack.
of the Slurm stack.
If all users need to use the cluster then it is probably best to create a custom AMI that is configured with the configuration
commands.
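Once a submitter instance is configured, a quick smoke test with generic Slurm commands (not specific to this stack) might look like:

```
sinfo                   # list partitions and node states
srun -n 1 hostname      # run a trivial job; a compute node should spin up and report its hostname
squeue -u $USER         # watch your pending and running jobs
```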

6 changes: 3 additions & 3 deletions docs/federation.md
@@ -5,9 +5,9 @@ If you need to run jobs in more than one AZ then you can use the [federation fea

The config directory has example configuration files that demonstrate how to deploy a federated cluster into 3 AZs.

* [source/config/slurm_eda_az1.yml](source/config/slurm_eda_az1.yml)
* [source/config/slurm_eda_az2.yml](source/config/slurm_eda_az2.yml)
* [source/config/slurm_eda_az3.yml](source/config/slurm_eda_az3.yml)
* [source/config/slurm_eda_az1.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az1.yml)
* [source/config/slurm_eda_az2.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az2.yml)
* [source/config/slurm_eda_az3.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az3.yml)

These clusters should be deployed sequentially.
The first cluster creates a cluster and a slurmdbd instance.
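A rough sketch of the sequential deployment, assuming the install script takes a config file path (the `--config-file` flag is an assumption):

```
# Deploy one cluster at a time; the first deployment creates the slurmdbd instance that the others reuse
./install.sh --config-file source/config/slurm_eda_az1.yml
./install.sh --config-file source/config/slurm_eda_az2.yml
./install.sh --config-file source/config/slurm_eda_az3.yml
```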
17 changes: 0 additions & 17 deletions docs/mkdocs.md

This file was deleted.
