Releases: nebius/soperator

1.18.0

13 Feb 20:34
b580718

Changes made since version 1.17.0 prior to version 1.18.0:

🚀 Features

  • add downscaleAndOverwritePopulateJail
  • add priority class
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • MSP-3516: settings of accounting to scrape jobs stats
  • Print actual command before executing it in bash scripts
  • Move gpubench to worker image and bind mount it
  • Move chroot plugin inside containers and bind mount it
  • Move enroot inside images and bind mount it
  • NOTASK: add debug logs
  • Move Pyxis from jail to images and bind-mount it
  • MSP-4080: add simple rebooter
  • MSP-4080: add CheckNodeCondition to rebooter
  • MSP-4080: add rebooting node check
  • MSP-4080: add reboot node and build image
  • MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
  • Preinstall Nvidia mock packages issues/384
  • Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
  • Preinstall dcgmi tools to the jail
  • MSP-4080: add render, reconcile rebooter and rbac
  • Remove Nvidia CUDA from worker image and apt clean
  • Build jail image based on own CUDA packages installation
  • Add Epilog and Prolog options
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

🐛 Fixes

  • MSP-3918: Fix reconciliation logic bug for scenarios with maintenance=true and accounting=false
  • Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
  • NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
  • MSP-3515: fix mistake in slurmdbdConfig and slurmConfig values
  • [Fix] Install libpmix into nccl-benchmark image
  • Remove openmpi from controller
  • MSP-3992: fix bug with empty version of annotation
  • [FIX] Add patching for service annotations [MSP-3801]
  • fix: update AppArmor profile to allow creation of library links
  • NOTASK: fix invalid memory address or nil pointer error when getting role
  • Enable leader election for controller manager by default
  • Change watching ns mechanism
  • MSP-4080: fix bugs with stuck draining condition
  • Temporarily remove expose_enroot_logs flag
  • Fix CI for external contributors
  • Fix non-zero error handling in gpu_healthcheck.sh
  • Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory

📦 Dependencies

  • build(deps): bump alpine from b97e2a8 to 56fa17d
  • bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
  • build(deps): bump golang from 7ea4c9d to a6927f4
  • build(deps): bump golang from a6927f4 to 585103a
  • build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
  • build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
  • build(deps): bump golang from 585103a to 9820aca
  • build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
  • build(deps): bump golang from 9820aca to 51a6466
  • bump golang.org/x/net to v0.33.0
  • build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
  • build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
  • build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
  • build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump golang from 51a6466 to 8c10f21
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
  • build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
  • Bump kube-apiserver v0.32.1 in gpubench
  • Bump go version for gpubench
  • build(deps): bump golang from 8c10f21 to e213430
  • build(deps): bump golang from e213430 to 9271129
  • build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
  • build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0
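
Most dependency entries above follow the Dependabot title convention "bump <package> from <old> to <new> [in <path>]". For readers who want to audit upgrades across releases, the following is a minimal sketch of how such lines could be parsed. The `Bump` type and `parse_bump` function are hypothetical names introduced for illustration; they are not part of soperator or Dependabot. Entries that do not follow the convention (for example "Bump go version for gpubench") are returned as `None` rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class Bump(NamedTuple):
    """One dependency bump taken from a changelog line (hypothetical helper type)."""
    package: str
    old: Optional[str]   # None when the entry omits "from <old>"
    new: Optional[str]   # None when the entry omits "to <new>"
    path: Optional[str]  # None when the entry omits "in <path>"


# Matches "bump <package> [from <old>] [to <new>] [in /<path>]" at the end of a line.
_BUMP = re.compile(
    r"bump\s+(?P<package>\S+)"
    r"(?:\s+from\s+(?P<old>\S+))?"
    r"(?:\s+to\s+(?P<new>\S+))?"
    r"(?:\s+in\s+(?P<path>/\S+))?\s*$",
    re.IGNORECASE,
)


def parse_bump(line: str) -> Optional[Bump]:
    """Return the parsed bump, or None if the line does not follow the convention."""
    match = _BUMP.search(line)
    if match is None:
        return None
    return Bump(match["package"], match["old"], match["new"], match["path"])


if __name__ == "__main__":
    for entry in (
        "build(deps): bump k8s.io/api from 0.32.0 to 0.32.1",
        "build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench",
        "bump golang.org/x/net to v0.33.0",
        "Bump go version for gpubench",
    ):
        print(parse_bump(entry))
```

Digest-only bumps such as "bump golang from 7ea4c9d to a6927f4" parse the same way; the sketch treats versions and image digests as opaque strings and does not compare them.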

Other

  • fix docs about GPUs are required #306
  • Revert "Print actual command before executing it in bash scripts"
  • Update pyxis version with container_image_save and expose_enroot_logs enabled

Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
5301 235 196 4604 1434

1.17.0

09 Jan 12:38
9b33f35

Changes made since version 1.16.1 prior to version 1.17.0:

🚀 Features

  • add priority weight QOS parameter to slurm.conf
  • Separate sshd configs for login and worker nodes
  • MSP-3392: add AppArmor support to soperator
  • MSP-3875: add maintenance mode
  • MSP-3569: delete accounting pod, svc and mariadb when accounting is false
  • Many small changes: Various fixes, fancy SSH banner, preinstall IB RDMA packages, unshare enroot runtime on login nodes, colored bash for root, keep more failed gpubench jobs, SSH debug logs

🐛 Fixes

  • MSP-3849: fix bug where soperator does not watch external configmap
  • MSP-3851: remove privileged rights from sleep container sysctl
  • MSP-3652: improve probes for accounting, worker and controller
  • delete StartupProbe
  • MSP-3868: delete ownerReferences for mariadb secrets
  • gpubench: add more information to output
  • MSP-3308: delete unused NCCLTypeH100GPUCluster topology

📦 Dependencies

  • build(deps): bump github.com/onsi/gomega from 1.36.1 to 1.36.2
  • build(deps): bump golang from 7003184 to f06d2bb
  • build(deps): bump golang from f06d2bb to 7ea4c9d
  • build(deps): bump github.com/onsi/ginkgo/v2 from 2.22.1 to 2.22.2
  • build(deps): bump golang.org/x/crypto from 0.31.0 to 0.32.0
  • build(deps): bump alpine from 21dc606 to b97e2a8
  • build(deps): bump sigs.k8s.io/controller-runtime from 0.19.3 to 0.19.4

Other

  • HOTFIX: fix bug apparmor config

Contributors:
@dependabot[bot], @Uburro, @skleymenov, @asteny, @rdjjke

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
1644 48 70 3203 1206

1.16.1

23 Dec 16:34
3d3bfb0

Changes made since version 1.16.0 prior to version 1.16.1:

🚀 Features

  • MSP-3783: Support setting Slurm node 'extra' field

🐛 Fixes

  • MSP-3724: fix Supervisor entrypoint
  • MSP-3645: add task and jobs params into slurm.conf
  • NOTIC: add max startretries
  • Fix daemonset labels
  • Sshd client alive count 10
  • MSP-3782: fix bug slurm.conf

📦 Dependencies

  • build(deps): bump google.golang.org/grpc from 1.69.0 to 1.69.2 in /images/jail/gpubench
  • build(deps): bump github.com/onsi/ginkgo/v2 from 2.22.0 to 2.22.1

Other

  • bump soperator 1.16.1

Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
643 38 22 461 237

1.16.0

17 Dec 09:45
4c06b19

Changes made since version 1.15.3 prior to version 1.16.0:

🚀 Features

  • MSP-3191: add webhook to protect mariadb secret from deletion
  • MSP-3642: add some configmap values
  • MSP-3541: removed unused crd values
  • Set compute instance name as slurm InstanceId
  • MSP-3272: Support GDRCopy + preinstall more tools
  • MSP-3578: add sshd to worker
  • run enroot containers without root privileges
  • MSP-3705: Support starting Docker containers from Slurm jobs

🐛 Fixes

  • MSP-3705: Support starting Docker containers from Slurm jobs
  • NOTIC: fix error var

📦 Dependencies

  • Bump golang from 1.22 to 1.23
  • build(deps): bump k8s.io/apimachinery from 0.30.2 to 0.31.3
  • build(deps): bump k8s.io/api from 0.31.2 to 0.31.3
  • build(deps): bump sigs.k8s.io/controller-runtime from 0.19.1 to 0.19.3
  • build(deps): bump google.golang.org/grpc from 1.68.0 to 1.68.1 in /images/jail/gpubench
  • build(deps): bump github.com/onsi/ginkgo/v2 from 2.21.0 to 2.22.0
  • build(deps): bump k8s.io/client-go from 0.31.2 to 0.31.3
  • build(deps): bump github.com/stretchr/testify from 1.9.0 to 1.10.0
  • build(deps): bump github.com/onsi/gomega from 1.35.1 to 1.36.0
  • build(deps): bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.0 to 0.78.2
  • build(deps): bump alpine from 1e42bbe to 21dc606
  • build(deps): bump golang.org/x/crypto from 0.29.0 to 0.30.0
  • build(deps): bump github.com/onsi/gomega from 1.36.0 to 1.36.1
  • build(deps): bump k8s.io/client-go from 0.31.3 to 0.31.4
  • build(deps): bump actions/setup-go from 5.1.0 to 5.2.0
  • build(deps): bump softprops/action-gh-release from 2.1.0 to 2.2.0
  • build(deps): bump k8s.io/api from 0.31.3 to 0.31.4 in /images/jail/gpubench
  • build(deps): bump golang from 574185e to 7003184
  • build(deps): bump golang.org/x/crypto from 0.30.0 to 0.31.0
  • build(deps): bump go.opentelemetry.io/otel/sdk/metric from 1.32.0 to 1.33.0 in /images/jail/gpubench
  • build(deps): bump google.golang.org/grpc from 1.68.1 to 1.69.0 in /images/jail/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.32.0 to 1.33.0 in /images/jail/gpubench
  • build(deps): bump docker/setup-buildx-action from 3.7.1 to 3.8.0
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.32.0 to 1.33.0 in /images/jail/gpubench

Other

  • MSP-3525: Bump golang from 1.22 to 1.23, k8s.io/api and k8s.io/apimachinery from 0.30.2 to 0.31.2, sigs.k8s.io/controller-runtime from 0.18.4 to 0.19.0, mariadb-operator from v0.0.29 to v0.36.0. Close PRs from dependabot #38, #39, #47, #189, #195
  • MSP-3635: add generate rbac
  • 3578: moving to native sidecar
  • HOTFIX: fix ServiceAccount name for role binding
  • description protectedSecret
  • NOTIC: Add rights to manage daemonsets
  • run build for release

Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
3323 556 55 5592 20294

1.15.3

20 Nov 16:26
145f189

Changes made since version 1.15.2 prior to version 1.15.3:

🐛 Fixes

  • NOTASK: test gpubench
  • Sshd keepalive

📦 Dependencies

  • build(deps): bump softprops/action-gh-release from 2.0.9 to 2.1.0
  • build(deps): bump go.opentelemetry.io/otel/sdk/metric from 1.31.0 to 1.32.0 in /images/jail/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.31.0 to 1.32.0 in /images/jail/gpubench
  • build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.28.0 to 1.32.0 in /images/jail/gpubench
  • build(deps): bump google.golang.org/grpc from 1.67.1 to 1.68.0 in /images/jail/gpubench
  • build(deps): bump golang.org/x/crypto from 0.24.0 to 0.29.0
  • build(deps): bump alpine from beefdbd to 1e42bbe
  • build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2

Other

  • bug: support jail sub-mounting single file
  • docs(README): deploy slurm cluster in a namespace aligned with the cluster name
  • Set UseInfiniband default to true

Contributors:
@dependabot[bot], @CrackedPoly, @Uburro, @asteny

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
1144 203 20 209 93

1.15.2

12 Nov 15:19
ee4743f

Changes made since version 1.15.1 prior to version 1.15.2:

🚀 Features

  • MSP-3281: add codeowners, rm review action

🐛 Fixes

  • MSP-2609: fix secret templating and apparmor populatejail

Contributors:
@Uburro, @asteny

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
164 0 4 120 89

1.15.1

07 Nov 14:16
8766049

Changes made since version 1.14.14 prior to version 1.15.1:

🚀 Features

  • feat: support ImagePullPolicy from manifest
  • Helm (slurm-cluster-storage): Configure scheduling for particular storage targets
  • add: arrange storage attachment scheduling by storage type
  • Slurm REST API

🐛 Fixes

  • perf(images): use multi-stage to reduce populate_jail image size
  • NOTASK: no cpu limits no throttling

📦 Dependencies

  • build(deps): bump k8s.io/api from 0.31.1 to 0.31.2 in /images/jail/gpubench
  • build(deps): bump softprops/action-gh-release from 2.0.8 to 2.0.9
  • build(deps): bump mikepenz/release-changelog-builder-action from 5.0.0.pre.rc02 to 5
  • build(deps): bump docker/login-action from 06895751d15a223ec091bea144ad5c7f50d228d0 to 7ca345011ac4304463197fac0e56eab1bc7e6af0

Other

  • hotfix: fix required Limit must be set for non overcommitable
  • HOTFIX: make worker pod Guaranteed
  • MSP-3261: add apparmor configurable

Contributors:
@CrackedPoly, @dependabot[bot], @dstaroff, @Uburro, @asteny

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
1068 187 27 1767 169

1.14.14

29 Oct 13:27
0574392

Changes made since version 1.14.13 prior to version 1.14.14:

  • no changes

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
5 29 25

1.14.13

25 Oct 13:52
b6064cc

Changes made since version 1.14.12 prior to version 1.14.13:

🐛 Fixes

  • Fix getting unique node for all slurm partitions in nccl test

📦 Dependencies

  • build(deps): bump actions/setup-go from 5.0.2 to 5.1.0
  • build(deps): bump docker/login-action from 1f36f5b7a2d2f7bfd524795fc966e6d88c37baa9 to 5d8785b43a795ee002a17dbf1a2235dc1997224b

Contributors:
@dependabot[bot], @asteny

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
3 0 7 25 25

1.14.12

24 Oct 10:12
8c72c1a
Compare
Choose a tag to compare

Changes made since version 1.14.11 prior to version 1.14.12:

📦 Dependencies

  • Bump go.opentelemetry.io/otel/sdk from 1.30.0 to 1.31.0 in /images/jail/gpubench
  • Bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.28.0 to 1.31.0 in /images/jail/gpubench
  • build(deps): bump actions/checkout from 4.2.1 to 4.2.2

Other

  • other: bump soperator

Contributors:
@dependabot[bot], @asteny

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
3 1 7 68 68