Description

This module provisions a highly available HTCondor access point using a Managed Instance Group (MIG) with auto-healing.

Usage

This module provisions an HTCondor access point with a standard configuration. For a functioning node, you must also supply Toolkit runners through var.access_point_runner and var.autoscaler_runner.

Reference implementations for each are included in the Toolkit modules htcondor-pool-secrets and htcondor-execute-point. You may substitute implementations (e.g. alternative secret management) so long as they duplicate the functionality in these references. Their usage is demonstrated in the HTCondor example.

Behavior of Managed Instance Group (MIG)

A regional MIG is used to provision the Access Point, although only 1 node will ever be active at a time. By default, the node will be provisioned in any of the zones available in that region. It can be constrained to run in fewer zones (or a single zone) using var.zones.
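
As a minimal sketch, a zone constraint might look like the following blueprint fragment. The module ID and `use` list mirror the example further down in this README. The zone names are hypothetical and assume that the deployment region is us-central1.

```yaml
  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    settings:
      # Hypothetical zone names; each must belong to var.region
      zones:
      - us-central1-a
      - us-central1-c
```

Leaving zones unset (the default, an empty list) lets the MIG place the node in any zone of the region.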

When the configuration of the Access Point is changed, the MIG can be configured to replace the VM using a "proactive" or "opportunistic" policy. By default, the Access Point replacement policy is opportunistic. In practice, this means that the Access Point will NOT be automatically replaced by Terraform when changes to the instance template / HTCondor configuration are made. The Access Point is NOT safe to replace automatically because its local storage contains the state of the job queue. By default, the Access Point will be replaced only when:

  • an update is issued intentionally via Cloud Console or using gcloud (below)
  • the VM becomes unhealthy or is otherwise automatically replaced (e.g. regular Google Cloud maintenance)

For example, to manually update all instances in a MIG:

gcloud compute instance-groups managed update-instances \
   <<NAME-OF-MIG>> --all-instances --region <<REGION>> \
   --project <<PROJECT_ID>> --minimal-action replace

This mode can be switched to proactive (automatic) replacement by setting var.update_policy to "PROACTIVE". In this case, we recommend using Filestore to store the job queue state ("spool") and setting var.spool_parent_dir to its mount point:

  - id: spoolfs
    source: modules/file-system/filestore
    use:
    - network1
    settings:
      filestore_tier: ENTERPRISE
      local_mount: /shared

...

  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - spoolfs
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    settings:
      spool_parent_dir: /shared
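
The fragment above sets only the spool location. A sketch of the access point settings with proactive replacement enabled as well might look like this, assuming the same spoolfs module and /shared mount point:

```yaml
  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - spoolfs
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    settings:
      # Replace the VM automatically when the instance template changes
      update_policy: PROACTIVE
      # Keep the job queue state on Filestore so it survives replacement
      spool_parent_dir: /shared
```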

Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.1 |
| google | >= 3.83 |
| time | ~> 0.9 |

Providers

| Name | Version |
|------|---------|
| google | >= 3.83 |
| time | ~> 0.9 |

Modules

| Name | Source | Version |
|------|--------|---------|
| access_point_instance_template | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 84d7959 |
| htcondor_ap | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
| startup_script | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.28.1&depth=1 |

Resources

| Name | Type |
|------|------|
| google_storage_bucket_object.ap_config | resource |
| time_sleep.mig_warmup | resource |
| google_compute_image.htcondor | data source |
| google_compute_instance.ap | data source |
| google_compute_region_instance_group.ap | data source |
| google_compute_zones.available | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| access_point_runner | A list of Toolkit runners for configuring an HTCondor access point | list(map(string)) | [] | no |
| access_point_service_account_email | Service account for access point (e-mail format) | string | n/a | yes |
| autoscaler_runner | A list of Toolkit runners for configuring autoscaling daemons | list(map(string)) | [] | no |
| central_manager_ips | List of IP addresses of HTCondor Central Managers | list(string) | n/a | yes |
| default_mig_id | Default MIG ID for HTCondor jobs; if unset, jobs must specify MIG ID | string | "" | no |
| deployment_name | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | string | n/a | yes |
| disk_size_gb | Boot disk size in GB | number | null | no |
| distribution_policy_target_shape | Target shape across zones for instance group managing high availability of access point | string | "BALANCED" | no |
| enable_high_availability | Provision HTCondor access point in high availability mode | bool | false | no |
| enable_oslogin | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | string | "ENABLE" | no |
| enable_public_ips | Enable Public IPs on the access points | bool | false | no |
| enable_shielded_vm | Enable the Shielded VM configuration (var.shielded_instance_config). | bool | false | no |
| htcondor_bucket_name | Name of HTCondor configuration bucket | string | n/a | yes |
| instance_image | Custom VM image with HTCondor and Toolkit support installed. Expected fields: name (the name of the image; mutually exclusive with family), family (the image family to use; mutually exclusive with name), project (the project where the image is hosted). | map(string) | n/a | yes |
| labels | Labels to add to resources. List key, value pairs. | map(string) | n/a | yes |
| machine_type | Machine type to use for the HTCondor access point | string | "c2-standard-4" | no |
| metadata | Metadata to add to the HTCondor access point | map(string) | {} | no |
| mig_id | List of Managed Instance Group IDs containing execute points in this pool (supplied by htcondor-execute-point module) | list(string) | [] | no |
| network_self_link | The self link of the network in which the HTCondor access point will be created. | string | null | no |
| network_storage | An array of network attached storage mounts to be configured | list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string, client_install_runner = map(string), mount_runner = map(string) })) | [] | no |
| project_id | Project in which HTCondor pool will be created | string | n/a | yes |
| region | Default region for creating resources | string | n/a | yes |
| service_account_scopes | Scopes by which to limit service account attached to the access point. | set(string) | [ "https://www.googleapis.com/auth/cloud-platform" ] | no |
| shielded_instance_config | Shielded VM configuration for the instance (must set var.enable_shielded_vm) | object({ enable_secure_boot = bool, enable_vtpm = bool, enable_integrity_monitoring = bool }) | { "enable_integrity_monitoring": true, "enable_secure_boot": true, "enable_vtpm": true } | no |
| spool_parent_dir | HTCondor access point configuration SPOOL will be set to subdirectory named "spool" | string | "/var/lib/condor" | no |
| subnetwork_self_link | The self link of the subnetwork in which the HTCondor access point will be created. | string | null | no |
| update_policy | Replacement policy for Access Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | string | "OPPORTUNISTIC" | no |
| zones | Zone(s) in which access point may be created. If not supplied, will default to all zones in var.region. | list(string) | [] | no |
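
To illustrate how the structured inputs above map onto blueprint settings, here is a hedged sketch that sets instance_image and the Shielded VM options. The image family name is hypothetical, and `$(vars.project_id)` assumes that the blueprint defines a project_id deployment variable.

```yaml
  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    settings:
      # Hypothetical image family; supply either family or name, not both
      instance_image:
        project: $(vars.project_id)
        family: my-htcondor-family
      # shielded_instance_config requires enable_shielded_vm to be set
      enable_shielded_vm: true
      shielded_instance_config:
        enable_secure_boot: true
        enable_vtpm: true
        enable_integrity_monitoring: true
```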

Outputs

| Name | Description |
|------|-------------|
| access_point_ips | IP addresses of the access points provisioned by this module |
| access_point_name | Name of the access point provisioned by this module |
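
As a sketch of how these outputs might be surfaced, a Toolkit blueprint can list module outputs under an `outputs` key so that they are reported at the deployment group level. This assumes the blueprint `outputs` field behaves as described in the Toolkit documentation; the module ID matches the earlier example.

```yaml
  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    # Expose the access point address and name after deployment
    outputs:
    - access_point_ips
    - access_point_name
```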