NO MERGE: WIP dev branch for non-openhpc slurm update #16

Open
wants to merge 570 commits into
base: update/july2022
Commits
b929c2b
make ssh cleanup depend on ansible_user
sjpb Apr 12, 2023
a4ebf5a
Fix eessi install
Apr 12, 2023
059dab8
Make eessi optional but installed by default
Apr 12, 2023
ef888d5
Fix fatimage workflow
m-bull Apr 12, 2023
4ad8f66
ensure latest cuda binaries are on the path
sjpb Apr 12, 2023
c464fee
use cuda samples to query GPUs and test bandwidth
sjpb Apr 12, 2023
44c4a59
fix prom install
sjpb Apr 12, 2023
e3769af
fix path to fat image packer manifest
sjpb Apr 12, 2023
a4f4e25
bump CI fat image
sjpb Apr 12, 2023
c0e35de
Add EESSI test job playbook
Apr 13, 2023
a95a6fb
Add EESSI test to CI workflow
Apr 13, 2023
f6311b6
Use testuser to run job
Apr 13, 2023
47aa9e9
Relax sacct state assertion
Apr 13, 2023
dcf2d1d
Merge pull request #258 from stackhpc/fix/resolv
sjpb Apr 13, 2023
be0fa41
Merge branch 'main' into feat/proxy-nameservers
sjpb Apr 13, 2023
d125cc7
feat: automatic update of community files main
stackhpc-ci Apr 13, 2023
c30b846
clarify NM disable file source
sjpb Apr 13, 2023
c4f9ade
clarify cleanup of NM resolv.conf overrides
sjpb Apr 13, 2023
4623af3
Merge branch 'main' into feature/hpctests-failure-msgs
sjpb Apr 13, 2023
43df3b5
Merge pull request #242 from stackhpc/feature/hpctests-failure-msgs
sjpb Apr 14, 2023
7f16201
fix yum/dnf config file
sjpb Apr 14, 2023
062c809
Merge branch 'main' into feat/proxy-nameservers
sjpb Apr 14, 2023
8cf8ab0
Merge pull request #260 from stackhpc/main-community-files
sjpb Apr 14, 2023
56dff7a
Merge branch 'main' into feat/proxy-nameservers
sjpb Apr 14, 2023
0e6ef7e
Merge pull request #247 from stackhpc/feat/proxy-nameservers
sjpb Apr 14, 2023
f1f6aba
Create home dir instead of using become_user
Apr 25, 2023
d60fbf8
Don't gather facts
Apr 25, 2023
a5e5674
Import GPG key & allow config overrides
Apr 25, 2023
034f460
Fix typo
Apr 25, 2023
9512a28
Add readme
Apr 25, 2023
a92c58d
Fix typos
sd109 Apr 25, 2023
fcba19e
Merge branch 'main' into eessi
sd109 Apr 26, 2023
d38244f
summarise cuda b/w test output
sjpb Apr 27, 2023
f2f9a1d
bump ohpc role to get gres support
sjpb Apr 27, 2023
605373d
add cuda playbooks
sjpb Apr 28, 2023
63f98b7
make fatimage work for cuda
sjpb Apr 28, 2023
96ad27f
enable nvidia persistence daemon to allow GRES to work after reimage
sjpb Apr 28, 2023
21a3639
support changing the podman uid
sjpb Dec 19, 2022
3767d76
Allow source images to be specified by UUID or name
m-bull May 2, 2023
bf81308
Attach a FIP directly to the fatimage builder
m-bull May 2, 2023
bf8c3f6
Use get_url instead of wget
May 3, 2023
fa07ec5
bump fat image in stackhpc env
sjpb May 3, 2023
bbd869e
Merge pull request #252 from stackhpc/eessi
sjpb May 3, 2023
05d587a
Merge branch 'main' into feat/packer-build-from-source-image-id
sjpb May 3, 2023
943f23b
Merge branch 'main' into feat/podman_uid
sjpb May 3, 2023
fe719d9
Merge pull request #266 from stackhpc/feat/packer-build-from-source-i…
sjpb May 3, 2023
aca4f0a
Merge pull request #264 from stackhpc/feat/podman_uid
sjpb May 3, 2023
16623d5
Merge branch 'main' into feat/fatimage-fip
m-bull May 4, 2023
af30b3c
Merge pull request #267 from stackhpc/feat/fatimage-fip
sjpb May 4, 2023
aff37d3
Allow specifying the packer manifest output path
m-bull May 4, 2023
683539c
Support using volume-backed instances for building
m-bull May 5, 2023
9d4f89f
default to no_proxy localhost in proxy role
sjpb May 5, 2023
7375f0d
add debug options for opensearch & filebeat
sjpb May 5, 2023
6680549
Merge pull request #269 from stackhpc/feat/packer-blockstorage
sjpb May 5, 2023
6d1908a
Merge branch 'main' into fix/no_proxy_localhost
sjpb May 5, 2023
9ba1c33
pause before checking TF to see if that exposes not enough hosts mess…
sjpb May 5, 2023
f393922
Revert "pause before checking TF to see if that exposes not enough ho…
sjpb May 5, 2023
422f6a4
Merge branch 'main' into feat/packer-manifest-path
m-bull May 5, 2023
f3872d5
always delete resources on deploy failure in CI
sjpb May 5, 2023
997aef8
Merge pull request #268 from stackhpc/feat/packer-manifest-path
sjpb May 9, 2023
68896e5
Merge branch 'main' into fix/no_proxy_localhost
sjpb May 9, 2023
ece4722
Merge pull request #270 from stackhpc/fix/no_proxy_localhost
sjpb May 10, 2023
4de3a2c
Merge pull request #271 from stackhpc/feat/debug-monitoring
sjpb May 10, 2023
aa63d7e
Merge pull request #272 from stackhpc/ci/cleanup
sjpb May 10, 2023
cca26dd
Use of ephemeral SSH keys when building Packer images
m-bull May 10, 2023
11d9268
Merge branch 'main' into feat/ephemeral-ssh-keys
m-bull May 10, 2023
6b702e1
Merge pull request #274 from stackhpc/feat/ephemeral-ssh-keys
sjpb May 11, 2023
881265c
allow defining ucx device per partition for hpctests
sjpb May 11, 2023
944e05d
make CI cleanup actually work
sjpb May 11, 2023
f26cfe2
tidy CI failure cleanup
sjpb May 11, 2023
999cfc8
Merge pull request #275 from stackhpc/feat/partition-ucx-dev
sjpb May 12, 2023
3adc8e4
Merge branch 'main' into cuda
sjpb May 12, 2023
0fb1469
bump openhpc role to allow empty partitions
sjpb May 12, 2023
29fd81b
add extras playbook for cuda
sjpb May 12, 2023
79ba34d
improve cuda role README
sjpb May 12, 2023
79b4e5b
Merge branch 'cuda' of github.com:stackhpc/ansible-slurm-appliance in…
sjpb May 12, 2023
32325fd
make etc_hosts role more flexible and support adding external hosts
sjpb May 17, 2023
9f4ef8e
Merge pull request #253 from stackhpc/cuda
sjpb May 24, 2023
a8329b8
define etc_hosts_extra_hosts
sjpb May 24, 2023
2439d80
add missing etc_hosts default changes
sjpb May 24, 2023
bd6526f
Merge branch 'main' into feat/etc_hosts
sjpb May 24, 2023
04a4188
Merge pull request #277 from stackhpc/feat/etc_hosts
sjpb May 25, 2023
d4b4e56
Update prometheus-slurm-exporter version
m-bull May 31, 2023
584f4b2
Merge pull request #280 from stackhpc/update/prom-slurm-exporter-0.21
sjpb Jun 7, 2023
a465c96
don't start CUDA persistence daemon in image build
sjpb Jun 13, 2023
b9c474f
Merge pull request #283 from stackhpc/fix/cuda-persistenced
sjpb Jun 16, 2023
e5cf3af
Install out of tree openstack builder plugin
m-bull Jun 21, 2023
5682902
Merge pull request #285 from stackhpc/fix/openstack-packer-build
m-bull Jun 22, 2023
64817d2
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Jun 28, 2023
029ef11
move testuser setup to correct location
sjpb Jun 29, 2023
e034413
fix definition of control node for freeipa
sjpb Jun 29, 2023
6314134
still use clustername-each.key with FQDN hosts (for freeipa)
sjpb Jul 4, 2023
ca45ee2
fix etc_hosts template when using cluster_domain_suffix
sjpb Jul 4, 2023
c086c9e
revert rebuild adhoc now inventory_hostname isn't changed with fqdn h…
sjpb Jul 4, 2023
78b069c
Revert "move freeipa before filesystems so nfs clients can find server"
sjpb Jul 4, 2023
78a213e
move basic_users to after filesystems.yml so templating is done onto …
sjpb Jul 4, 2023
eba82fd
move EEESI to extras
sjpb Jul 4, 2023
0880610
remove duplicate resolv_conf role file
sjpb Jul 4, 2023
7aa6230
Revert "Revert "move freeipa before filesystems so nfs clients can fi…
sjpb Jul 4, 2023
dc926f2
update freeipa readme
sjpb Jul 4, 2023
8e511e6
fix freeipa client task inventory names
sjpb Jul 4, 2023
e3d8a60
disable freeipa in CI, just provide examples
sjpb Jul 4, 2023
52f062f
move node_fqdn to terraform inventory
sjpb Jul 4, 2023
d6fe264
document stackhpc env freeipa setup
sjpb Jul 4, 2023
cfc57db
Remove warn parameter for ansible>=2.14 (#286)
mkjpryor Jul 10, 2023
40cb57c
modify stackhpc workflow to use CI_CLOUD actions variable
sjpb Jul 18, 2023
96bf0c1
add vglabs-bastion fingerprint
sjpb Jul 18, 2023
fefd4bb
provide multi-cloud terraform varsfiles for CI environment
sjpb Jul 18, 2023
2e80689
fix secrets usage
sjpb Jul 18, 2023
01881e2
fixup reversed CI cloud definitions
sjpb Jul 18, 2023
ed34c06
fix SMS networking
sjpb Jul 18, 2023
5c227f0
allow for volume-backed instances and use for CI on SMS
sjpb Jul 18, 2023
a512822
remove (outdated, wrong) example terraform vars
sjpb Jul 19, 2023
3f66a18
set volume device paths in CI
sjpb Jul 19, 2023
6f328c5
fix CI VNIC types
sjpb Jul 19, 2023
8686850
provide bastion details for both CI clouds
sjpb Jul 19, 2023
793e52f
provide & record CI_CLOUD in workflow environment
sjpb Jul 19, 2023
72aec95
fix comment; CI actually only rebuilds login + control
sjpb Jul 20, 2023
4236f08
automatically define rebuild image
sjpb Jul 21, 2023
3a53585
Merge pull request #288 from stackhpc/ci/smslabs
sjpb Jul 21, 2023
69c34c0
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Jul 21, 2023
cca7389
enable NFS PRC security service for freeipa clients
sjpb May 19, 2023
21503aa
fix opensearch grafana plugin at last working version
sjpb Aug 1, 2023
a9f5d33
Fix query type in the Slurm jobs Grafana dashboard
Aug 1, 2023
048e222
bump grafana opensearch plugin prior to fixing query type (i.e. this …
sjpb Aug 1, 2023
d9509ce
Merge pull request #293 from mkarpiarz/fix/slurm-jobs-grafana-dashboard
sjpb Aug 1, 2023
7f40e32
Increment iteration in slurm-jobs.json
m-bull Aug 1, 2023
eb10a25
fix slurmstats/opensearch datasource version configuration
sjpb Aug 2, 2023
b4c0279
use python3.9 for jupyter OOD app
sjpb Aug 2, 2023
f0db292
make eessi test async to avoid ansible timeouts
sjpb Aug 2, 2023
7d1f87c
Merge branch 'fix/jupyter' into fix/eessi-tests
sjpb Aug 2, 2023
854e491
extend eessi timeout
sjpb Aug 2, 2023
307f5ba
capture tensorflow EESSI test output
sjpb Aug 2, 2023
8cea013
disable EESSI tests in CI for now
sjpb Aug 3, 2023
c91c8ed
disable ssh session sharing for stackhpc
sjpb Aug 3, 2023
054d287
Revert "disable EESSI tests in CI for now"
sjpb Aug 3, 2023
aa6fb9c
Re-disable EESSI tests in CI for now
sjpb Aug 3, 2023
8c035d7
Merge pull request #295 from stackhpc/fix/eessi-tests
sjpb Aug 4, 2023
bae7255
reenable ControlMaster, adding ControlPath
sjpb Aug 4, 2023
6df61ae
Merge pull request #296 from stackhpc/fix/sms-ssh
sjpb Aug 4, 2023
ddcf14b
fix ssh ControlPath in skeleton
sjpb Aug 4, 2023
af1c633
Merge pull request #294 from stackhpc/fix/jupyter
sjpb Aug 4, 2023
e6645fd
Merge pull request #292 from stackhpc/fix/opensearch-plugin-ver
sjpb Aug 4, 2023
b69d2a9
Merge branch 'main' into fix/skelton-ssh-persist
sjpb Aug 4, 2023
80b8d71
modify fatimage workflow and .stackhpc packer config to use CI_CLOUD
sjpb Aug 4, 2023
53bfa0c
bump CI image
sjpb Aug 4, 2023
4099d1f
change SMS back to non-volume backed instances now SMS built smaller …
sjpb Aug 8, 2023
96bec2a
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Aug 8, 2023
419c4af
bump opensearch version to 2.9.0
sjpb Aug 9, 2023
3dfe924
pre-pull opensearch container
sjpb Aug 9, 2023
2aa0706
break opensearch into install and runtime task books
sjpb Aug 9, 2023
2783bda
bump image for CI
sjpb Aug 9, 2023
8c65ec7
don't enable/start opensearch during build
sjpb Aug 9, 2023
884df2a
bump CI image
sjpb Aug 9, 2023
2937725
use slurm jobid for opensearch index and archive old data
sjpb Aug 10, 2023
0e32696
modify fatimage workflow and .stackhpc packer config to use CI_CLOUD
sjpb Aug 4, 2023
a49164d
remove rebuild_image from stackhpc env (left over from volume-backed …
sjpb Aug 11, 2023
181bc6c
bump CI image
sjpb Aug 11, 2023
ac3ca9c
Merge pull request #301 from stackhpc/ci/SMS-fatimage-v2
sjpb Aug 16, 2023
44d87c8
pin TF in CI to MPL licenced version
sjpb Aug 16, 2023
8ba90e8
Merge pull request #302 from stackhpc/fix/hashicorp-licencing
sjpb Aug 16, 2023
1549620
use portal-internal network for Arcus CI
sjpb Sep 5, 2023
b13b98d
Merge pull request #306 from stackhpc/ci/arcus-normal
sjpb Sep 5, 2023
d78a913
Merge branch 'main' into fix/skelton-ssh-persist
sjpb Sep 8, 2023
1036e4f
Merge pull request #297 from stackhpc/fix/skelton-ssh-persist
sjpb Sep 8, 2023
b304a47
Update fatimage.yml
thomasbergernz Sep 20, 2023
c5e27d4
oodv3 changes
thomasbergernz Sep 20, 2023
60123ee
swap to no-ohpc version of openhpc role
sjpb Sep 20, 2023
1d47689
use specific openhpc install play files
sjpb Sep 20, 2023
cee6770
bugfix slurm user not existing on non-control nodes
sjpb Sep 20, 2023
251389f
add default openhpc_install_type
sjpb Sep 20, 2023
a49f480
openhpc_ role config for custom binaries
sjpb Sep 20, 2023
fbaaba2
NFS export localhost directory to cluster for /slurm
sjpb Sep 20, 2023
9f888c1
modify hpctests to support non-OpenHPC slurm
sjpb Sep 20, 2023
446ec91
add stackhpc config for hpctests with non-openhpc slurm
sjpb Sep 20, 2023
9dea950
use GenericCloud image, i.e. w/o OpenHPC
sjpb Sep 20, 2023
8cdc4a6
move slurm build to .stackhpc environment
sjpb Sep 20, 2023
2a81f63
add containerised Slurm build
sjpb Sep 20, 2023
79a2cc5
simplify localhosts' NFS definition
sjpb Sep 20, 2023
b686f33
update OOD desktop websockify venv to python3.9
sjpb Sep 21, 2023
e3eef51
bugfix slurm user not existing on non-control nodes
sjpb Sep 20, 2023
434e190
create 10GB fat images on Arcus using volume-backed Packer build
sjpb Sep 22, 2023
424299d
bump image
sjpb Sep 22, 2023
7eb855e
Merge pull request #313 from stackhpc/fix/websockify
sjpb Sep 22, 2023
aec14fd
Merge branch 'main' into feat/no-ohpc
sjpb Sep 26, 2023
600ae8e
Merge branch 'main' into oodv3
sjpb Sep 26, 2023
e3d3e30
Merge pull request #310 from thomasbergernz/oodv3
sjpb Sep 26, 2023
49038fb
fix tags for openhpc role (need to run entire playbook due to changes…
sjpb Sep 26, 2023
29c8018
use /nopt/slurm/... directories, with prefix/sysconfdir set in build too
sjpb Sep 26, 2023
06baa59
bump image
sjpb Sep 26, 2023
e137a6b
Merge branch 'feat/oodv3' of github.com:stackhpc/ansible-slurm-applia…
sjpb Sep 26, 2023
3c70674
Merge pull request #314 from stackhpc/feat/oodv3
sjpb Sep 27, 2023
2a832f5
make dnf module install of nvidia-driver idempotent
sjpb Oct 4, 2023
fc2de71
make it clear _cuda_version_tuple is a private var
sjpb Oct 6, 2023
1472bd6
add cuda_driver_stream variable
sjpb Oct 6, 2023
8881657
use setup-env script to update galaxy installs
sjpb Oct 12, 2023
47208bb
add no_log override
sjpb Oct 12, 2023
7f8c39c
add delete-cluster script
sjpb Oct 12, 2023
089d85c
Merge pull request #315 from stackhpc/fix/nvidia-driver-install
sjpb Oct 12, 2023
0260f4d
update inventory retrieval for multiple CI environents
sjpb Oct 12, 2023
bc3b996
Merge branch 'main' into feat/dev-qol
sjpb Oct 12, 2023
32ff5c1
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Oct 12, 2023
46ef0c2
fix basic_users not working in CI
sjpb Oct 12, 2023
c2bdb57
install freeipa client packages in fatimage build
sjpb Oct 12, 2023
63522fe
use local container image registry for CI to avoid docker.io ratelimits
sjpb Oct 17, 2023
58ed1fd
revise to use Arcus staging pull-through cache
sjpb Oct 18, 2023
888c2ac
use deployhost container registry again
sjpb Oct 18, 2023
18342ef
use podman_registry_address only on ARCUS
sjpb Oct 19, 2023
cb80c3f
add role cve-2023-41914.yml
sjpb Oct 19, 2023
8c65a43
fix cve-2023-41914 in fatimage build
sjpb Oct 19, 2023
c147d9c
report image name early in build
sjpb Oct 20, 2023
13c6965
make packer error behaviour controllable
sjpb Oct 20, 2023
7803e74
add ansible profiling in stackhpc environment
sjpb Oct 20, 2023
c89fa83
fix stackhpc cve-2023-41914 build
sjpb Oct 20, 2023
28bdf13
bump image
sjpb Oct 20, 2023
d828652
delete accidental file
sjpb Oct 20, 2023
d35ef7f
address PR comments
sjpb Oct 20, 2023
76be3d8
Merge pull request #318 from stackhpc/fix/dockerio-ratelimits
sjpb Oct 20, 2023
3fd087e
Merge branch 'main' into fix/CVE-2023-41914-v2
sjpb Oct 20, 2023
fe6ebaa
simplify cve_2023_41914_rpms
sjpb Oct 20, 2023
b5d8b05
run validate automatically when running install-rpms task
sjpb Oct 20, 2023
4c6aa82
bump image
sjpb Oct 20, 2023
f03c89f
Merge pull request #320 from stackhpc/fix/CVE-2023-41914-v2
sjpb Oct 20, 2023
86fae88
Merge branch 'main' into feat/dev-qol
sjpb Oct 20, 2023
ddcff02
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Oct 25, 2023
dd0c48b
remove cve-2023-41914 hook for fatimage build now OpenHPC packages re…
sjpb Oct 27, 2023
3a094ba
bump fatimage source to Rocky8.8 to speedup build
sjpb Oct 27, 2023
893570d
WIP: test new ohpc version
sjpb Oct 27, 2023
46ca110
bump image
sjpb Oct 27, 2023
6e561c5
move EESSI to extras
sjpb Oct 27, 2023
85448b8
fix stackhpc env when running from genericcloud image
sjpb Oct 27, 2023
b3b6f44
bump openhpc role after merge
sjpb Oct 27, 2023
439076e
Merge pull request #324 from stackhpc/ci/bump-img
sjpb Oct 27, 2023
42b77eb
Merge branch 'main' into feat/dev-qol
sjpb Oct 27, 2023
ae1c4d9
Merge pull request #316 from stackhpc/feat/dev-qol
sjpb Oct 27, 2023
87af20c
Merge branch 'main' into feat/freeipa-nocontainer
sjpb Oct 27, 2023
89d282d
move freeipa_server validation so it runs
sjpb Nov 1, 2023
014c70d
add check for virtual servers in freeipa_server
sjpb Nov 1, 2023
40b6cff
move freeipa validation back to validate task
sjpb Nov 1, 2023
676de7c
don't log freeipa server passwords
sjpb Nov 1, 2023
5618d4e
fix freeipa testuser password to match normal CI/basic_users_users usage
sjpb Nov 1, 2023
3708a5c
add note re freeipa server incompatibility with other virtual servers
sjpb Nov 1, 2023
4febcc6
tweak freeipa names
sjpb Nov 2, 2023
6c07eca
freeipa README nits
sjpb Nov 2, 2023
a35ea26
freeipa editorial comments
sjpb Nov 2, 2023
966d350
remove argsplat from FreeIPA users task
sjpb Nov 2, 2023
6f31af4
Merge pull request #241 from stackhpc/feat/freeipa-nocontainer
sjpb Nov 8, 2023
f4b02ce
Merge branch 'main' into feat/no-ohpc
sjpb Nov 10, 2023
2253fb1
remove stackhpc demo config for openhpc-less slurm
sjpb Nov 10, 2023
8762d81
add example config in nrel environment for non-OpenHPC slurm
sjpb Nov 10, 2023
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
* @stackhpc/batch
62 changes: 62 additions & 0 deletions .github/workflows/fatimage.yml
@@ -0,0 +1,62 @@

name: Build fat image
on:
  workflow_dispatch:
jobs:
  openstack:
    name: openstack-imagebuild
    concurrency: ${{ github.ref }} # to branch/PR
    runs-on: ubuntu-20.04
    env:
      ANSIBLE_FORCE_COLOR: True
      OS_CLOUD: openstack
      CI_CLOUD: ${{ vars.CI_CLOUD }}
    steps:
      - uses: actions/checkout@v2

      - name: Setup ssh
        run: |
          set -x
          mkdir ~/.ssh
          echo "${{ secrets[format('{0}_SSH_KEY', vars.CI_CLOUD)] }}" > ~/.ssh/id_rsa
          chmod 0600 ~/.ssh/id_rsa
        shell: bash

      - name: Add bastion's ssh key to known_hosts
        run: cat environments/.stackhpc/bastion_fingerprints >> ~/.ssh/known_hosts
        shell: bash

      - name: Install ansible etc
        run: dev/setup-env.sh

      - name: Write clouds.yaml
        run: |
          mkdir -p ~/.config/openstack/
          echo "${{ secrets[format('{0}_CLOUDS_YAML', vars.CI_CLOUD)] }}" > ~/.config/openstack/clouds.yaml
        shell: bash

      - name: Setup environment
        run: |
          . venv/bin/activate
          . environments/.stackhpc/activate

      - name: Build fat image with packer
        id: packer_build
        run: |
          . venv/bin/activate
          . environments/.stackhpc/activate
          cd packer/
          packer init .
          PACKER_LOG=1 packer build -only openstack.openhpc -on-error=${{ vars.PACKER_ON_ERROR }} -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl

      - name: Get created image name from manifest
        id: manifest
        run: |
          . venv/bin/activate
          IMAGE_ID=$(jq --raw-output '.builds[-1].artifact_id' packer/packer-manifest.json)
          while ! openstack image show -f value -c name $IMAGE_ID; do
            sleep 30
          done
          IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
          echo "::set-output name=IMAGE_ID::$IMAGE_ID"
          echo "::set-output name=IMAGE_NAME::$IMAGE_NAME"
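Two details of the final step are worth a note. The `while ! openstack image show` loop retries indefinitely if the image never becomes visible, and `::set-output` is a workflow command that GitHub has deprecated in favour of appending `name=value` lines to the file named by `$GITHUB_OUTPUT`. A minimal sketch of a bounded alternative follows. `retry` and `set_output` are hypothetical helper names that do not appear in this PR, and the fallback path used outside a workflow run is an assumption made only so the sketch can run anywhere.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and do not appear in the PR.

# Run a command until it succeeds, giving up after max_attempts tries and
# sleeping delay seconds between tries.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "gave up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Record a step output by appending "name=value" to $GITHUB_OUTPUT.
# Outside a workflow run the variable is unset, so fall back to a scratch
# file (an assumption for illustration).
set_output() {
    local name="$1" value="$2"
    local target="${GITHUB_OUTPUT:-${TMPDIR:-/tmp}/github_output.txt}"
    printf '%s=%s\n' "$name" "$value" >> "$target"
}
```

In the step above this might read `retry 20 30 openstack image show -f value -c name "$IMAGE_ID"` followed by `set_output IMAGE_ID "$IMAGE_ID"`; the attempt count and delay are illustrative values, not ones taken from the PR.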
190 changes: 90 additions & 100 deletions .github/workflows/stackhpc.yml
@@ -1,5 +1,5 @@

name: Test deployment and image build on OpenStack
name: Test deployment and reimage on OpenStack
on:
  workflow_dispatch:
  push:
@@ -8,103 +8,100 @@ on:
  pull_request:
jobs:
  openstack:
    name: openstack-ci-${{ matrix.cloud }}
    strategy:
      matrix:
        cloud:
          - "arcus" # Arcus OpenStack in rcp-cloud-portal-demo project, with RoCE
      fail-fast: false # as want clouds to continue independently
    concurrency: ${{ matrix.cloud }}
    name: openstack-ci
    concurrency: ${{ github.ref }} # to branch/PR
    runs-on: ubuntu-20.04
    env:
      ANSIBLE_FORCE_COLOR: True
      OS_CLOUD: openstack
      TF_VAR_cluster_name: ci${{ github.run_id }}
      CI_CLOUD: ${{ vars.CI_CLOUD }}
    steps:
      - uses: actions/checkout@v2

      - name: Record which cloud CI is running on
        run: |
          echo CI_CLOUD: ${{ vars.CI_CLOUD }}

      - name: Setup ssh
        run: |
          set -x
          mkdir ~/.ssh
          echo "${${{ matrix.cloud }}_SSH_KEY}" > ~/.ssh/id_rsa
          echo "${{ secrets[format('{0}_SSH_KEY', vars.CI_CLOUD)] }}" > ~/.ssh/id_rsa
          chmod 0600 ~/.ssh/id_rsa
        env:
          smslabs_SSH_KEY: ${{ secrets.SSH_KEY }}
          arcus_SSH_KEY: ${{ secrets.ARCUS_SSH_KEY }}

        shell: bash

      - name: Add bastion's ssh key to known_hosts
        run: cat environments/${{ matrix.cloud }}/bastion_fingerprint >> ~/.ssh/known_hosts
        run: cat environments/.stackhpc/bastion_fingerprints >> ~/.ssh/known_hosts
        shell: bash

      - name: Install ansible etc
        run: dev/setup-env.sh

      - name: Install terraform
        uses: hashicorp/setup-terraform@v1
        with:
          terraform: v1.5.5

      - name: Initialise terraform
        run: terraform init
        working-directory: ${{ github.workspace }}/environments/${{ matrix.cloud }}/terraform
        working-directory: ${{ github.workspace }}/environments/.stackhpc/terraform

      - name: Write clouds.yaml
        run: |
          mkdir -p ~/.config/openstack/
          echo "${${{ matrix.cloud }}_CLOUDS_YAML}" > ~/.config/openstack/clouds.yaml
          echo "${{ secrets[format('{0}_CLOUDS_YAML', vars.CI_CLOUD)] }}" > ~/.config/openstack/clouds.yaml
        shell: bash
        env:
          smslabs_CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}
          arcus_CLOUDS_YAML: ${{ secrets.ARCUS_CLOUDS_YAML }}

      - name: Provision infrastructure
        id: provision

      - name: Setup environment-specific inventory/terraform inputs
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform apply -auto-approve
          . environments/.stackhpc/activate
          ansible-playbook ansible/adhoc/generate-passwords.yml
          echo vault_testuser_password: "$TESTUSER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/all/test_user.yml
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}

      - name: Get server provisioning failure messages
        id: provision_failure
          TESTUSER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Provision nodes using fat image
        id: provision_servers
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          TF_FAIL_MSGS="$(../../skeleton/\{\{cookiecutter.environment\}\}/terraform/getfaults.py $PWD)"
          echo $TF_FAIL_MSGS
          echo "::set-output name=messages::${TF_FAIL_MSGS}"
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
        if: always() && steps.provision.outcome == 'failure'

      - name: Delete infrastructure if failed due to lack of hosts
          terraform apply -auto-approve -var-file="${{ vars.CI_CLOUD }}.tfvars"

      - name: Delete infrastructure if provisioning failed
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform destroy -auto-approve
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
        if: ${{ always() && steps.provision.outcome == 'failure' && contains('not enough hosts available', steps.provision_failure.messages) }}
          terraform destroy -auto-approve -var-file="${{ vars.CI_CLOUD }}.tfvars"
        if: failure() && steps.provision_servers.outcome == 'failure'

      - name: Directly configure cluster
      - name: Configure cluster
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate
          ansible all -m wait_for_connection
          ansible-playbook ansible/adhoc/generate-passwords.yml
          echo test_user_password: "$TEST_USER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/basic_users/defaults.yml
          ansible-playbook -vv ansible/site.yml
        env:
          OS_CLOUD: openstack
          ANSIBLE_FORCE_COLOR: True
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

          ansible-playbook -v ansible/site.yml
          ansible-playbook -v ansible/ci/check_slurm.yml

      - name: Run MPI-based tests
        run: |
          . venv/bin/activate
          . environments/.stackhpc/activate
          ansible-playbook -vv ansible/adhoc/hpctests.yml

      # - name: Run EESSI tests
      # run: |
      # . venv/bin/activate
      # . environments/.stackhpc/activate
      # ansible-playbook -vv ansible/ci/check_eessi.yml

      - name: Confirm Open Ondemand is up (via SOCKS proxy)
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate

          # load ansible variables into shell:
          ansible-playbook ansible/ci/output_vars.yml \
@@ -126,68 +123,61 @@ jobs:
            --server-response \
            --no-check-certificate \
            --http-user=testuser \
            --http-password=${TEST_USER_PASSWORD} https://${openondemand_servername} \
            --http-password=${TESTUSER_PASSWORD} https://${openondemand_servername} \
            2>&1)
          (echo $statuscode | grep "200 OK") || (echo $statuscode && exit 1)
        env:
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}
          TESTUSER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Build packer images
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          echo test_user_password: "$TEST_USER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/basic_users/defaults.yml
          cd packer/
          PACKER_LOG=1 packer build -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
        env:
          OS_CLOUD: openstack
          ANSIBLE_FORCE_COLOR: True
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}
      # - name: Build environment-specific compute image
      # id: packer_build
      # run: |
      # . venv/bin/activate
      # . environments/.stackhpc/activate
      # cd packer/
      # packer init
      # PACKER_LOG=1 packer build -except openstack.fatimage -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
      # ../dev/output_manifest.py packer-manifest.json # Sets NEW_COMPUTE_IMAGE_ID outputs

      - name: Test reimage of nodes
      # - name: Test reimage of compute nodes to new environment-specific image (via slurm)
      # run: |
      # . venv/bin/activate
      # . environments/.stackhpc/activate
      # ansible login -v -a "sudo scontrol reboot ASAP nextstate=RESUME reason='rebuild image:${{ steps.packer_build.outputs.NEW_COMPUTE_IMAGE_ID }}' ${TF_VAR_cluster_name}-compute-[0-3]"
      # ansible compute -m wait_for_connection -a 'delay=60 timeout=600' # delay allows node to go down
      # ansible-playbook -v ansible/ci/check_slurm.yml

      - name: Test reimage of login and control nodes (via rebuild adhoc)
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          ansible all -m wait_for_connection
          ansible-playbook -vv ansible/ci/test_reimage.yml
        env:
          OS_CLOUD: openstack
          ANSIBLE_FORCE_COLOR: True
          . environments/.stackhpc/activate
          ansible-playbook -v --limit control,login ansible/adhoc/rebuild.yml
          ansible all -m wait_for_connection -a 'delay=60 timeout=600' # delay allows node to go down
          ansible-playbook -v ansible/site.yml
          ansible-playbook -v ansible/ci/check_slurm.yml

      - name: Run MPI-based tests
      - name: Check sacct state survived reimage
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          ansible-playbook -vv ansible/adhoc/hpctests.yml
        env:
          ANSIBLE_FORCE_COLOR: True
          OS_CLOUD: openstack
          . environments/.stackhpc/activate
          ansible-playbook -vv ansible/ci/check_sacct_hpctests.yml

      - name: Check MPI-based tests are shown in Grafana
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate
          ansible-playbook -vv ansible/ci/check_grafana.yml
        env:
          ANSIBLE_FORCE_COLOR: True
          OS_CLOUD: openstack

      - name: Delete infrastructure
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          . environments/.stackhpc/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform destroy -auto-approve
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
          terraform destroy -auto-approve -var-file="${{ vars.CI_CLOUD }}.tfvars"
        if: ${{ success() || cancelled() }}

      - name: Delete images
        run: |
          . venv/bin/activate
          . environments/${{ matrix.cloud }}/activate
          ansible-playbook -vv ansible/ci/delete_images.yml
        env:
          OS_CLOUD: openstack
          ANSIBLE_FORCE_COLOR: True
      # - name: Delete images
      # run: |
      # . venv/bin/activate
      # . environments/.stackhpc/activate
      # ansible-playbook -vv ansible/ci/delete_images.yml
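The "Confirm Open Ondemand is up" step in this workflow captures the output of `wget --server-response` and passes only if that output contains `200 OK`; otherwise it echoes the response and exits non-zero. The same check can be sketched as a small function. `assert_http_ok` is a hypothetical name that is not defined anywhere in the repository, and the sketch quotes the captured response so its whitespace is preserved, which the workflow's unquoted `echo $statuscode` does not do.

```shell
#!/usr/bin/env bash
# Sketch only: assert_http_ok is a hypothetical helper, not part of the PR.

# Succeed if the captured `wget --server-response` output contains an
# HTTP 200 status line; otherwise print the response to stderr and fail.
assert_http_ok() {
    local response="$1"
    if printf '%s\n' "$response" | grep -q "200 OK"; then
        return 0
    fi
    printf '%s\n' "$response" >&2
    return 1
}
```

A step could then call `assert_http_ok "$statuscode"` after the `wget` capture, so that a failing response is still printed to the job log before the step fails.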
22 changes: 20 additions & 2 deletions ansible/.gitignore
@@ -4,8 +4,8 @@ roles/*
# Whitelist roles that are checked into this repository.
!roles/filebeat/
!roles/filebeat/**
!roles/opendistro/
!roles/opendistro/**
!roles/opensearch/
!roles/opensearch/**
!roles/podman/
!roles/podman/**
!roles/grafana-dashboards/
@@ -26,3 +26,21 @@ roles/*
!roles/slurm_exporter/**
!roles/firewalld/
!roles/firewalld/**
!roles/etc_hosts/
!roles/etc_hosts/**
!roles/cloud_init/
!roles/cloud_init/**
!roles/mysql/
!roles/mysql/**
!roles/systemd/
!roles/systemd/**
!roles/cuda/
!roles/cuda/**
!roles/freeipa/
!roles/freeipa/**
!roles/proxy/
!roles/proxy/**
!roles/resolv_conf/
!roles/resolv_conf/**
!roles/cve-2023-41914
!roles/cve-2023-41914/**
11 changes: 11 additions & 0 deletions ansible/adhoc/backup-keytabs.yml
@@ -0,0 +1,11 @@
# Use ONE of the following tags on this playbook:
# - retrieve: copies keytabs out of the state volume to the environment
# - deploy: copies keytabs from the environment to the state volume

- hosts: freeipa_client
  become: yes
  gather_facts: no
  tasks:
    - import_role:
        name: freeipa
        tasks_from: backup-keytabs.yml
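The header comment of this playbook asks for exactly one of the two tags. As a hedged sketch, a wrapper can enforce that before the playbook is run. `keytabs_cmd` is a hypothetical helper that is not part of the PR; it only prints the command line it would use, so nothing here touches a cluster. `--tags` is the standard `ansible-playbook` option for selecting tagged tasks.

```shell
#!/usr/bin/env bash
# Sketch only: keytabs_cmd is a hypothetical helper, not part of the PR.

# Print the ansible-playbook command for the keytab backup playbook,
# accepting exactly one of the two tags documented in its header comment.
keytabs_cmd() {
    if [ "$#" -ne 1 ]; then
        echo "usage: keytabs_cmd retrieve|deploy" >&2
        return 2
    fi
    case "$1" in
        retrieve|deploy)
            echo "ansible-playbook ansible/adhoc/backup-keytabs.yml --tags $1"
            ;;
        *)
            echo "unknown tag '$1': expected retrieve or deploy" >&2
            return 2
            ;;
    esac
}
```

Printing rather than executing keeps the sketch safe to try; in an activated environment the printed line could be run directly.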
8 changes: 8 additions & 0 deletions ansible/adhoc/cudatests.yml
@@ -0,0 +1,8 @@
- hosts: cuda
  become: yes
  gather_facts: no
  tags: cuda_samples
  tasks:
    - import_role:
        name: cuda
        tasks_from: samples.yml
6 changes: 6 additions & 0 deletions ansible/adhoc/cve-2023-41914.yml
@@ -0,0 +1,6 @@
- hosts: openhpc
  gather_facts: no
  become: yes
  tasks:
    - import_role:
        name: cve-2023-41914