Add controller state save disk
alyssa-sm committed Feb 14, 2025
1 parent 0107923 commit 7e8e622
Showing 14 changed files with 137 additions and 12 deletions.
Original file line number Diff line number Diff line change
@@ -34,6 +34,7 @@
| <a name="input_advanced_machine_features"></a> [advanced\_machine\_features](#input\_advanced\_machine\_features) | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#nested_advanced_machine_features | <pre>object({<br/> enable_nested_virtualization = optional(bool)<br/> threads_per_core = optional(number)<br/> turbo_mode = optional(string)<br/> visible_core_count = optional(number)<br/> performance_monitoring_unit = optional(string)<br/> enable_uefi_networking = optional(bool)<br/> })</pre> | n/a | yes |
| <a name="input_bandwidth_tier"></a> [bandwidth\_tier](#input\_bandwidth\_tier) | Tier 1 bandwidth increases the maximum egress bandwidth for VMs.<br/>Using the `virtio_enabled` setting will only enable VirtioNet and will not enable TIER\_1.<br/>Using the `tier_1_enabled` setting will enable both gVNIC and TIER\_1 higher bandwidth networking.<br/>Using the `gvnic_enabled` setting will only enable gVNIC and will not enable TIER\_1.<br/>Note that TIER\_1 only works with specific machine families & shapes and must be using an image that supports gVNIC. See [official docs](https://cloud.google.com/compute/docs/networking/configure-vm-with-high-bandwidth-configuration) for more details. | `string` | `"platform_default"` | no |
| <a name="input_can_ip_forward"></a> [can\_ip\_forward](#input\_can\_ip\_forward) | Enable IP forwarding, for NAT instances for example. | `bool` | `false` | no |
| <a name="input_controller_save_disk_self_link"></a> [controller\_save\_disk\_self\_link](#input\_controller\_save\_disk\_self\_link) | Reference to an existing persistent disk to attach to the controller for saving Slurm state. If null, no state disk is attached. | `string` | `null` | no |
| <a name="input_disk_auto_delete"></a> [disk\_auto\_delete](#input\_disk\_auto\_delete) | Whether or not the boot disk should be auto-deleted. | `bool` | `true` | no |
| <a name="input_disk_labels"></a> [disk\_labels](#input\_disk\_labels) | Labels to be assigned to boot disk, provided as a map. | `map(string)` | `{}` | no |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB. | `number` | `100` | no |
@@ -88,6 +88,8 @@ module "instance_template" {

project_id = var.project_id

controller_save_disk_self_link = var.controller_save_disk_self_link

# Network
can_ip_forward = var.can_ip_forward
network_ip = var.network_ip
@@ -319,6 +319,12 @@ variable "disk_auto_delete" {
default = true
}

variable "controller_save_disk_self_link" {
  description = "Reference to an existing persistent disk to attach to the controller for saving Slurm state. If null, no state disk is attached."
  type        = string
  default     = null
}

variable "additional_disks" {
type = list(object({
disk_name = string
@@ -37,6 +37,7 @@ No modules.
| <a name="input_auto_delete"></a> [auto\_delete](#input\_auto\_delete) | Whether or not the boot disk should be auto-deleted | `string` | `"true"` | no |
| <a name="input_automatic_restart"></a> [automatic\_restart](#input\_automatic\_restart) | (Optional) Specifies whether the instance should be automatically restarted if it is terminated by Compute Engine (not terminated by a user). | `bool` | `true` | no |
| <a name="input_can_ip_forward"></a> [can\_ip\_forward](#input\_can\_ip\_forward) | Enable IP forwarding, for NAT instances for example | `string` | `"false"` | no |
| <a name="input_controller_save_disk_self_link"></a> [controller\_save\_disk\_self\_link](#input\_controller\_save\_disk\_self\_link) | Reference to an existing persistent disk to attach to the controller for saving Slurm state. If null, no state disk is attached. | `string` | `null` | no |
| <a name="input_disk_encryption_key"></a> [disk\_encryption\_key](#input\_disk\_encryption\_key) | The id of the encryption key that is stored in Google Cloud KMS to use to encrypt all the disks on this instance | `string` | `null` | no |
| <a name="input_disk_labels"></a> [disk\_labels](#input\_disk\_labels) | Labels to be assigned to boot disk, provided as a map | `map(string)` | `{}` | no |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `string` | `"100"` | no |
@@ -204,4 +204,14 @@ resource "google_compute_instance_template" "tpl" {
      count = guest_accelerator.value.count
    }
  }

  dynamic "disk" {
    for_each = var.controller_save_disk_self_link != null ? ["unit"] : []
    content {
      source      = var.controller_save_disk_self_link
      device_name = "controller-state-save"
      auto_delete = false
    }
  }

}
@@ -158,6 +158,12 @@ variable "auto_delete" {
default = "true"
}

variable "controller_save_disk_self_link" {
  description = "Reference to an existing persistent disk to attach to the controller for saving Slurm state. If null, no state disk is attached."
  type        = string
  default     = null
}

variable "additional_disks" {
description = "List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name"
type = list(object({
@@ -264,6 +264,7 @@ limitations under the License.

| Name | Type |
|------|------|
| [google_compute_disk.controller_disk](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
| [google_compute_instance_from_template.controller](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_from_template) | resource |
| [google_secret_manager_secret.cloudsql](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret) | resource |
| [google_secret_manager_secret_iam_member.cloudsql_secret_accessor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret_iam_member) | resource |
@@ -291,6 +292,7 @@ limitations under the License.
| <a name="input_compute_startup_scripts_timeout"></a> [compute\_startup\_scripts\_timeout](#input\_compute\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in compute\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_startup_script"></a> [controller\_startup\_script](#input\_controller\_startup\_script) | Startup script used by the controller VM. | `string` | `"# no-op"` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> type = string<br/> size = number<br/> })</pre> | <pre>{<br/> "size": 50,<br/> "type": "pd-ssd"<br/>}</pre> | no |
| <a name="input_create_bucket"></a> [create\_bucket](#input\_create\_bucket) | Create GCS bucket instead of using an existing one. | `bool` | `true` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | Name of the deployment. | `string` | n/a | yes |
| <a name="input_disable_controller_public_ips"></a> [disable\_controller\_public\_ips](#input\_disable\_controller\_public\_ips) | DEPRECATED: Use `enable_controller_public_ips` instead. | `bool` | `null` | no |
@@ -48,6 +48,15 @@ locals {
)
}

resource "google_compute_disk" "controller_disk" {
  count = var.controller_state_disk != null ? 1 : 0

  name = "controller-state-save"
  type = var.controller_state_disk.type
  size = var.controller_state_disk.size
  zone = var.zone
}

# INSTANCE TEMPLATE
module "slurm_controller_template" {
source = "../../internal/slurm-gcp/instance_template"
@@ -64,10 +73,11 @@ module "slurm_controller_template" {
disk_type = var.disk_type
additional_disks = local.additional_disks

-  bandwidth_tier            = var.bandwidth_tier
-  slurm_bucket_path         = module.slurm_files.slurm_bucket_path
-  can_ip_forward            = var.can_ip_forward
-  advanced_machine_features = var.advanced_machine_features
+  controller_save_disk_self_link = var.controller_state_disk != null ? google_compute_disk.controller_disk[0].name : null
+  bandwidth_tier                 = var.bandwidth_tier
+  slurm_bucket_path              = module.slurm_files.slurm_bucket_path
+  can_ip_forward                 = var.can_ip_forward
+  advanced_machine_features      = var.advanced_machine_features

enable_confidential_vm = var.enable_confidential_vm
enable_oslogin = var.enable_oslogin
@@ -72,6 +72,7 @@ No modules.
| <a name="input_compute_startup_scripts_timeout"></a> [compute\_startup\_scripts\_timeout](#input\_compute\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in compute\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_startup_scripts"></a> [controller\_startup\_scripts](#input\_controller\_startup\_scripts) | List of scripts to be ran on controller VM startup. | <pre>list(object({<br/> filename = string<br/> content = string<br/> }))</pre> | `[]` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> type = string<br/> size = number<br/> })</pre> | <pre>{<br/> "size": 50,<br/> "type": "pd-ssd"<br/>}</pre> | no |
| <a name="input_disable_default_mounts"></a> [disable\_default\_mounts](#input\_disable\_default\_mounts) | Disable default global network storage from the controller<br/>- /usr/local/etc/slurm<br/>- /etc/munge<br/>- /home<br/>- /apps<br/>If these are disabled, the slurm etc and munge dirs must be added manually,<br/>or some other mechanism must be used to synchronize the slurm conf files<br/>and the munge key across the cluster. | `bool` | `false` | no |
| <a name="input_enable_bigquery_load"></a> [enable\_bigquery\_load](#input\_enable\_bigquery\_load) | Enables loading of cluster job usage into big query.<br/><br/>NOTE: Requires Google Bigquery API. | `bool` | `false` | no |
| <a name="input_enable_debug_logging"></a> [enable\_debug\_logging](#input\_enable\_debug\_logging) | Enables debug logging mode. Not for production use. | `bool` | `false` | no |
@@ -43,14 +43,15 @@ locals {
tp = "${local.bucket_dir}/" # prefix to trim from the bucket path to get a "file name"

config = {
-    enable_bigquery_load = var.enable_bigquery_load
-    cloudsql_secret      = var.cloudsql_secret
-    cluster_id           = random_uuid.cluster_id.result
-    project              = var.project_id
-    slurm_cluster_name   = var.slurm_cluster_name
-    bucket_path          = local.bucket_path
-    enable_debug_logging = var.enable_debug_logging
-    extra_logging_flags  = var.extra_logging_flags
+    enable_bigquery_load  = var.enable_bigquery_load
+    cloudsql_secret       = var.cloudsql_secret
+    cluster_id            = random_uuid.cluster_id.result
+    project               = var.project_id
+    slurm_cluster_name    = var.slurm_cluster_name
+    bucket_path           = local.bucket_path
+    enable_debug_logging  = var.enable_debug_logging
+    extra_logging_flags   = var.extra_logging_flags
+    controller_state_disk = var.controller_state_disk

# storage
disable_default_mounts = var.disable_default_mounts
@@ -176,6 +176,49 @@ def run_custom_scripts():
log.exception(f"script {script} encountered an exception")
raise e

def mount_save_state_disk():
    # Device path of the disk attached with device_name "controller-state-save".
    disk_name = "/dev/disk/by-id/google-controller-state-save"
    mount_point = "/var/spool/slurm"
    fs_type = "xfs"

    # Format the disk only if it does not already hold a filesystem.
    rdevice = util.run(f"realpath {disk_name}").stdout.strip()
    file_output = util.run(f"file -s {rdevice}").stdout.strip()
    if "filesystem" not in file_output:
        util.run(f"mkfs -t {fs_type} -q {rdevice}")

    # Add the mount to /etc/fstab once so that it persists across reboots.
    fstab_entry = f"{disk_name}\t{mount_point}\t{fs_type}\tdefaults\t0 0\n"
    with open("/etc/fstab", "r") as f:
        fstab = f.readlines()
    if fstab_entry not in fstab:
        with open("/etc/fstab", "a") as f:
            f.write(fstab_entry)
        util.run("systemctl daemon-reload")

    os.makedirs(mount_point, exist_ok=True)
    util.run(f"mount {mount_point}")

    # Give the mount point to the slurm user if it is owned by someone else.
    current_user = util.run(f"stat -c %U {mount_point}").stdout.strip()
    if current_user != "slurm":
        util.run(f"chown -R slurm:slurm {mount_point}")
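The format-or-skip decision in `mount_save_state_disk` rests on a substring check against the output of `file -s`. A minimal sketch of that check as a standalone function is shown here; `needs_format` is a hypothetical helper that is not part of this commit, and the sample strings are hand-written approximations of `file -s` output, not captured from a real controller.

```python
def needs_format(file_output: str) -> bool:
    """Return True when `file -s` output contains no filesystem signature.

    Hypothetical helper mirroring the substring check in the commit.
    """
    return "filesystem" not in file_output


# Hand-written sample strings (assumptions, not real captures).
blank_device = "/dev/sdb: data"
xfs_device = "/dev/sdb: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)"

print(needs_format(blank_device))
print(needs_format(xfs_device))
```

Keeping the check in a pure function like this would make it possible to unit test the decision without touching a block device.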

def mount_munge_key_disk():
    # Keep the munge key on the state disk and bind-mount it at /etc/munge.
    state_disk_dir = "/var/spool/slurm/munge"
    mount_point = "/etc/munge"

    os.makedirs(state_disk_dir, exist_ok=True)

    util.run(f"mount --bind {state_disk_dir} {mount_point}")

    # Add the bind mount to /etc/fstab once so that it persists across reboots.
    fstab_entry = f"{state_disk_dir} {mount_point} none bind 0 0\n"
    with open("/etc/fstab", "r") as f:
        fstab = f.readlines()

    if fstab_entry not in fstab:
        with open("/etc/fstab", "a") as f:
            f.write(fstab_entry)

    util.run("systemctl daemon-reload")
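Both mount functions repeat the same read-then-append logic for `/etc/fstab`. One possible way to factor it out is sketched below. `ensure_fstab_entry` and its `fstab_path` parameter are hypothetical names introduced for illustration and do not exist in this commit; the sketch also adds a newline guard that the original code does not have.

```python
from pathlib import Path


def ensure_fstab_entry(entry: str, fstab_path: str = "/etc/fstab") -> bool:
    """Append `entry` (which must end with a newline) unless an identical line exists.

    Returns True when the file was modified, so the caller can decide
    whether `systemctl daemon-reload` is needed.
    """
    path = Path(fstab_path)
    lines = path.read_text().splitlines(keepends=True)
    if entry in lines:
        return False
    with path.open("a") as f:
        # Guard (not in the original code): avoid gluing the new entry
        # onto a last line that lacks a trailing newline.
        if lines and not lines[-1].endswith("\n"):
            f.write("\n")
        f.write(entry)
    return True
```

The `fstab_path` parameter exists only so the helper can be exercised against a temporary file instead of the real `/etc/fstab`.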

def setup_jwt_key():
jwt_key = Path(slurmdirs.state / "jwt_hs256.key")

@@ -329,6 +372,11 @@ def setup_controller():
util.chown_slurm(dirs.scripts / "config.yaml", mode=0o600)
install_custom_scripts()
conf.gen_controller_configs(lookup())

    if lookup().cfg.controller_state_disk is not None:
        mount_save_state_disk()
        mount_munge_key_disk()
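The two mount helpers call `mount` unconditionally. If the setup path were ever run on a controller where the state disk is already mounted (an assumption; the commit does not say whether that happens), a guard along the following lines could be used to skip the call. `is_mounted` is a hypothetical helper, and it takes the mount table as text so it can be tested without reading `/proc/mounts`.

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if `mount_point` is a mount target in /proc/mounts-style text.

    Each line is expected to look like: "<device> <mount point> <fstype> <options> 0 0".
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # The second whitespace-separated field is the mount point.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Hand-written sample of a mount table (assumption, not a real capture).
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb /var/spool/slurm xfs rw,relatime 0 0\n"
)
print(is_mounted("/var/spool/slurm", sample))
```

On a real host the second argument would come from reading `/proc/mounts`.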

setup_jwt_key()
setup_munge_key()
setup_sudoers()
@@ -58,6 +58,24 @@ variable "slurm_cluster_name" {
}
}

variable "controller_state_disk" {
  description = <<EOD
A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.
To disable this feature, set this variable to null.
NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally.
EOD
  type = object({
    type = string
    size = number
  })

  default = {
    type = "pd-ssd"
    size = 50
  }
}

variable "enable_bigquery_load" {
description = <<EOD
Enables loading of cluster job usage into big query.
@@ -154,6 +154,7 @@ module "slurm_files" {
compute_startup_scripts_timeout = var.compute_startup_scripts_timeout
login_startup_scripts = local.login_startup_scripts
login_startup_scripts_timeout = var.login_startup_scripts_timeout
controller_state_disk = var.controller_state_disk

enable_debug_logging = var.enable_debug_logging
extra_logging_flags = var.extra_logging_flags
@@ -387,6 +387,24 @@ EOD
# SLURM #
#########

variable "controller_state_disk" {
  description = <<EOD
A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.
To disable this feature, set this variable to null.
NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally.
EOD
  type = object({
    type = string
    size = number
  })

  default = {
    type = "pd-ssd"
    size = 50
  }
}

variable "enable_debug_logging" {
type = bool
description = "Enables debug logging mode."