This document explains how to use Magic Castle with Terraform Cloud.
Terraform Cloud is HashiCorp's managed service that allows you to provision infrastructure using a web browser or a REST API instead of the command line. It also means that the parameters of the provisioned infrastructure can be modified by a team, and that the state is stored in the cloud instead of on a local machine.
When provisioning in a commercial cloud, Terraform Cloud can also provide a cost estimate of the resources.
- Create a Terraform Cloud account
- Create an organization, join one, or choose one available to you
- Create a git repository in GitHub, GitLab, or any other version control provider supported by Terraform Cloud
- In this git repository, add a copy of the Magic Castle example `main.tf` available for the cloud of your choice
- Log in to your Terraform Cloud account
- Create a new workspace
  - Choose Type: "Version control workflow"
  - Connect to VCS: choose the version control provider that hosts your repository
  - Choose the repository that contains your `main.tf`
  - Configure settings: tweak the name and description to your liking
  - Click on "Create workspace"
You will be redirected automatically to your new workspace.
Terraform Cloud invokes the Terraform command line in a remote virtual environment. For the CLI to be able to communicate with your cloud provider's API, you need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each cloud provider and how to retrieve their values from the provider.
If you plan on using these environment variables with multiple workspaces, it is recommended to create a credential variable set in Terraform Cloud.
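A credential variable set can also be managed as code with the HashiCorp `tfe` Terraform provider instead of the web UI. The sketch below is only an illustration; the organization name, variable set name, and the example AWS variable are placeholders, and it assumes the `tfe` provider is configured with an API token:

```hcl
# Minimal sketch (assumption: the hashicorp/tfe provider is configured with an API token).
# Organization, names, and values below are placeholders.
resource "tfe_variable_set" "cloud_creds" {
  name         = "cloud-credentials"
  description  = "Credentials shared by Magic Castle workspaces"
  organization = "my-org"
}

resource "tfe_variable" "aws_access_key_id" {
  key             = "AWS_ACCESS_KEY_ID"
  value           = "AKIA..."
  category        = "env"
  sensitive       = true
  variable_set_id = tfe_variable_set.cloud_creds.id
}
```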
For AWS, you need to define these environment variables:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY` (sensitive)

The values of these variables can either correspond to an access key created on the AWS Security Credentials - Access keys page, or you can add a user dedicated to Terraform Cloud in AWS IAM Users and use its access key.
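If you opt for a dedicated IAM user, a minimal sketch of creating it and generating its access key with the AWS CLI could look like this (the user name is an example, and the user still needs appropriate permissions attached):

```shell
# Sketch: create a dedicated IAM user and generate its access key
# (user name is an example; attach the permissions your cluster requires)
aws iam create-user --user-name terraform-cloud
aws iam create-access-key --user-name terraform-cloud
```

The returned `AccessKeyId` and `SecretAccessKey` map to `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` respectively.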
For Azure, you need to define these environment variables:
- `ARM_CLIENT_ID`
- `ARM_CLIENT_SECRET` (sensitive)
- `ARM_SUBSCRIPTION_ID`
- `ARM_TENANT_ID`

Refer to Terraform Azure Provider - Creating a Service Principal to learn how to create a Service Principal and retrieve the values for these environment variables.
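As a reference point, the Terraform Azure provider documentation creates the Service Principal with the Azure CLI along these lines (the subscription id below is a placeholder):

```shell
# Sketch: create a Service Principal with Contributor access to the subscription
# (the subscription id is a placeholder)
az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/00000000-0000-0000-0000-000000000000"
```

In the command output, `appId`, `password`, and `tenant` correspond to `ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`, and `ARM_TENANT_ID`.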
For Google Cloud, you need to define this environment variable:
- `GOOGLE_CLOUD_KEYFILE_JSON` (sensitive)

The value of the variable is the content of a Google Cloud service account JSON key file expressed as a single-line string. Example:
```json
{"type": "service_account","project_id": "project-id-1234","private_key_id": "abcd1234",...}
```
You can use `jq` to produce the single-line string from the JSON file provided by Google:
```shell
jq . -c project-name-123456-abcdefjg.json
```
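If you do not have a key file yet, one way to generate it for an existing service account is with the gcloud CLI (the service account email and file name below are placeholders):

```shell
# Sketch: generate a JSON key for an existing service account
# (service account email and output file name are placeholders)
gcloud iam service-accounts keys create project-name-123456-abcdefjg.json \
  --iam-account=terraform@project-name-123456.iam.gserviceaccount.com
```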
For OpenStack, you need to define these environment variables:
- `OS_AUTH_URL`
- `OS_PROJECT_ID`
- `OS_REGION_NAME`
- `OS_INTERFACE`
- `OS_IDENTITY_API_VERSION`
- `OS_USER_DOMAIN_NAME`
- `OS_USERNAME`
- `OS_PASSWORD` (sensitive)

Apart from `OS_PASSWORD`, the values for these variables are available in the OpenStack RC file provided for your project.
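For reference, the RC file is a shell script that exports these variables; its content typically resembles the following sketch (all values are placeholders):

```shell
# Sketch of a typical OpenStack RC file (values are placeholders)
export OS_AUTH_URL=https://cloud.example.org:5000/v3
export OS_PROJECT_ID=0123456789abcdef0123456789abcdef
export OS_REGION_NAME=RegionOne
export OS_INTERFACE=public
export OS_IDENTITY_API_VERSION=3
export OS_USER_DOMAIN_NAME="Default"
export OS_USERNAME="alice"
# OS_PASSWORD is usually prompted for interactively rather than stored in the file
```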
If you prefer to use OpenStack application credentials, you need to define at least these variables:
- `OS_AUTH_TYPE`
- `OS_AUTH_URL`
- `OS_APPLICATION_CREDENTIAL_ID`
- `OS_APPLICATION_CREDENTIAL_SECRET` (sensitive)

and potentially these too:
- `OS_IDENTITY_API_VERSION`
- `OS_REGION_NAME`
- `OS_INTERFACE`

The values for these variables are available in the OpenStack RC file provided when creating the application credentials.
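Application credentials can be created from the Horizon dashboard or, as sketched below, with the OpenStack CLI (the credential name is an example):

```shell
# Sketch: create an application credential dedicated to Terraform Cloud
# (the credential name is an example); the command prints the id and secret
openstack application credential create terraform-cloud
```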
Terraform Cloud invokes the Terraform command line in a remote virtual environment. For the CLI to be able to communicate with your DNS provider's API, you need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each DNS provider and how to retrieve their values from the provider.
Refer to the DNS - CloudFlare section of the Magic Castle main documentation to determine which environment variables need to be set.
Refer to the DNS - Google Cloud section of the Magic Castle main documentation to determine which environment variables need to be set.
It is possible to use the Terraform Cloud web interface to define values of variables in your `main.tf`. For example, you may want to define a guest password without writing it directly in `main.tf` to avoid displaying it publicly.
To manage a variable with Terraform Cloud:

- Edit your `main.tf` to define the variables you want to manage. In the following example, we want to manage the number of nodes and the guest password. Add the variables at the beginning of the `main.tf`:
  ```hcl
  variable "nb_nodes" {}
  variable "password" {}
  ```
  Then replace the static values by the variables in your `main.tf`:
  ```hcl
  # compute node count
  node = { type = "p2-3gb", tags = ["node"], count = var.nb_nodes }

  # guest password
  guest_passwd = var.password
  ```
- Commit and push these changes to your git repository.
- In the Terraform Cloud workspace associated with that repository, go to "Variables".
- Under "Terraform Variables", click the "Add variable" button and create a variable for each one defined previously in the `main.tf`. Check "Sensitive" if the variable content should never be shown in the UI or the API.
You may edit the variables at any point in your cluster's lifetime.
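Putting the pieces together, the relevant parts of a `main.tf` could look like the sketch below. It is based on the OpenStack example; the instance types, counts, and surrounding module arguments are placeholders:

```hcl
# Sketch: main.tf excerpt with variables managed from Terraform Cloud
# (instance types and counts are placeholders)
variable "nb_nodes" {}
variable "password" {}

module "openstack" {
  # ... other Magic Castle parameters ...
  instances = {
    mgmt  = { type = "p4-6gb", tags = ["puppet", "mgmt", "nfs"], count = 1 }
    login = { type = "p2-3gb", tags = ["login", "public", "proxy"], count = 1 }
    node  = { type = "p2-3gb", tags = ["node"], count = var.nb_nodes }
  }
  guest_passwd = var.password
}
```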
To create your cluster or to apply changes made to your `main.tf` or the variables, you will need to queue a plan. When you push to the default branch of the linked git repository, a plan is created automatically. You can also create a plan manually: click on the "Queue plan manually" button inside your workspace, then on "Queue plan".
Once the plan has been successfully created, you can apply it using the "Runs" section. Click on the latest queued plan, then on the "Apply plan" button at the bottom of the plan page.
It is possible to automatically apply successful plans. Go to the "Settings" section, and under "Apply method" select "Auto apply". Any subsequent successful plan will then be applied automatically.
Terraform Cloud only allows you to apply or destroy the plan as stated in the `main.tf`, but sometimes it can be useful to run other Terraform commands that are only available through the command-line interface, for example `terraform taint`.
It is possible to import the Terraform state of a cluster to your local computer and then use the CLI on it.
- Log in to Terraform Cloud:
  ```shell
  terraform login
  ```
- Create a folder where the Terraform state will be stored:
  ```shell
  mkdir my-cluster-1
  ```
- Create a file named `cloud.tf` with the following content in your cluster folder:
  ```hcl
  terraform {
    cloud {
      organization = "REPLACE-BY-YOUR-TF-CLOUD-ORG"
      workspaces {
        name = "REPLACE-BY-THE-NAME-OF-YOUR-WORKSPACE"
      }
    }
  }
  ```
  Replace the values of `organization` and `name` with the appropriate values for your cluster.
- Initialize the folder and retrieve the state:
  ```shell
  terraform init
  ```
To confirm the workspace has been properly imported locally, you can list the resources using:
```shell
terraform state list
```
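With the state available locally, CLI-only commands can now be run against it. The resource address below is purely hypothetical; use `terraform state list` to find the actual address in your cluster:

```shell
# Sketch: force recreation of a misbehaving instance on the next apply
# (the resource address is hypothetical; pick a real one from `terraform state list`)
terraform taint 'module.openstack.openstack_compute_instance_v2.instances["node1"]'
```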
Magic Castle in combination with Terraform Cloud (TFE) can be configured to give Slurm the ability to create and destroy instances based on the job queue content.
To enable this feature:
1. Create a TFE API Token and save it somewhere safe.

   1.1. If you subscribe to the Terraform Cloud Team & Governance plan, you can generate a Team API Token. The team associated with this token requires no access to the organization and can be secret. It does not have to include any member. A Team API Token is preferable, as its permissions can be restricted to the minimum required for autoscaling purposes.
2. Create a git repository and an associated Terraform Cloud workspace for your cluster, as described at the beginning of this document.

   2.1. Make sure the repo is private, as it will contain the API token.

   2.2. If you generated a Team API Token in step 1, give the team access to the workspace:
   - Workspace Settings -> Team Access -> Add team and permissions
   - Select the team
   - Click on "Customize permissions for this team"
   - Under "Runs" select "Apply"
   - Under "Variables" select "Read and write"
   - Leave the rest as is and click on "Assign custom permissions"

   2.3. In Configure settings, under Advanced options, for Apply method, select "Auto apply".
3. Create the environment variables of the cloud provider credentials in TFE.
4. Create a variable named `pool` in TFE. Set its value to `[]` and check "HCL".
5. Add a file named `data.yaml` in your git repo with the following content:
   ```yaml
   ---
   profile::slurm::controller::tfe_token: <TFE API token>
   profile::slurm::controller::tfe_workspace: <TFE workspace id>
   ```
6. Complete the file by replacing `<TFE API token>` with the token generated at step 1 and `<TFE workspace id>` (i.e.: `ws-...`) with the id of the workspace created at step 2. It is recommended to encrypt the TFE API token before committing `data.yaml` in git. Refer to section 4.15 of README.md to know how to encrypt the token.
7. Add `data.yaml` in git and push.
8. Modify `main.tf` (see the consolidated sketch after this list):
   - If not already present, add the following definition of the `pool` variable at the beginning of your `main.tf`:
     ```hcl
     variable "pool" {
       description = "Slurm pool of compute nodes"
     }
     ```
   - Add instances to `instances` with the tags `pool` and `node`. These are the nodes that Slurm will be able to create and destroy.
   - If not already present, add the following line after the instances definition to pass the list of compute nodes from the Terraform Cloud workspace variable to the provider module:
     ```hcl
     pool = var.pool
     ```
   - On the right-hand side of `public_keys =`, replace `[file("~/.ssh/id_rsa.pub")]` by a list of SSH public keys that will have admin access to the cluster.
   - After the line `public_keys = ...`, add `hieradata = file("data.yaml")`.
   - Stage the changes, commit, and push to the git repo.
9. Go to your workspace in TFE, click on Actions -> Start a new run -> Plan and apply -> Start run. Then, click on "Confirm & Apply" and "Confirm Plan".
10. Compute nodes defined in step 8 can be modified at any point in the cluster lifetime, and more pool compute nodes can be added or removed if needed.
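As a consolidated view of step 8, a minimal sketch of the autoscaling-related parts of `main.tf` could look like this. It is based on the OpenStack example; the instance types, counts, and SSH key are placeholders:

```hcl
# Sketch: autoscaling-related parts of main.tf (types, counts, and key are placeholders)
variable "pool" {
  description = "Slurm pool of compute nodes"
}

module "openstack" {
  # ... other Magic Castle parameters ...
  instances = {
    mgmt  = { type = "p4-6gb", tags = ["puppet", "mgmt", "nfs"], count = 1 }
    login = { type = "p2-3gb", tags = ["login", "public", "proxy"], count = 1 }
    node  = { type = "p2-3gb", tags = ["node", "pool"], count = 10 }
  }
  pool = var.pool

  public_keys = ["ssh-ed25519 AAAA... user@example"]
  hieradata   = file("data.yaml")
}
```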
To reduce the time required for compute nodes to become available in Slurm, consider creating a compute node image.
JupyterHub will time out by default after 300 seconds if a node is not spawned yet. Since it may take longer than this to spawn a node, even with an image created, consider increasing the timeout by adding the following to your YAML configuration file:
```yaml
jupyterhub::jupyterhub_config_hash:
  SlurmFormSpawner:
    start_timeout: 900
```
Slurm 23 adds the possibility for `sinfo` to report nodes that are not yet spawned. This is useful if you want JupyterHub to be aware of those nodes, for example if you want to allow the use of GPU nodes without keeping them online at all times. To use that version of Slurm, add the following to your YAML configuration file:
```yaml
profile::slurm::base::slurm_version: '23.02'
```
If, after enabling autoscaling with Terraform Cloud for your Magic Castle cluster, the number of nodes does not increase when submitting jobs, verify the following points:
1. Go to the Terraform Cloud workspace webpage and look for errors in the runs. If the runs were only triggered by changes to the git repo, it means scaling signals from the cluster do not reach the Terraform Cloud workspace, or no signals were sent at all.
2. Make sure the Terraform Cloud workspace id matches the value of `profile::slurm::controller::tfe_workspace` in `data.yaml` (see the example after this list for one way to look up the workspace id).
3. Execute `squeue` on the cluster, and verify the reasons why jobs are still in the queue. If under the column `(Reason)` there is the keyword `ReqNodeNotAvail`, it implies Slurm tried to boot the listed nodes, but they did not show up before the timeout, therefore Slurm marked them as down. This can happen if your cloud provider is slow to build the instances, or following a configuration problem like in point 2. When Slurm marks a node as down, a trace is left in slurmctld's log; you can find it using zgrep on the Slurm controller node (typically `mgmt1`):
   ```shell
   sudo zgrep "marking down" /var/log/slurm/slurmctld.log*
   ```
   To tell Slurm these nodes are available again, enter the following command:
   ```shell
   sudo /opt/software/slurm/bin/scontrol update nodename=node[Y-Z] state=IDLE
   ```
   Replace `node[Y-Z]` by the hostname range listed next to `ReqNodeNotAvail` in `squeue`.
4. Under `mgmt1:/var/log/slurm`, look for errors in the file `slurm_resume.log`.
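If you need to double-check the workspace id mentioned in point 2, one way is to query the Terraform Cloud API. In the sketch below, the organization and workspace names are placeholders and `$TFE_TOKEN` is assumed to hold a valid API token:

```shell
# Sketch: look up the workspace id ("ws-...") via the Terraform Cloud API
# (organization and workspace names are placeholders)
curl -s -H "Authorization: Bearer $TFE_TOKEN" \
  https://app.terraform.io/api/v2/organizations/my-org/workspaces/my-workspace \
  | jq .data.id
```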