Releases: nebius/soperator
1.18.0
Changes made since version 1.17.0 prior to version 1.18.0:
🚀 Features
- add downscaleAndOverwritePopulateJail
- PR: #311
- add priority class
- PR: #313
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- MSP-3516: settings of accounting to scrape jobs stats
- PR: #321
- Print actual command before executing it in bash scripts
- PR: #329
- Move gpubench to worker image and bind mount it
- PR: #333
- Move chroot plugin inside containers and bind mount it
- PR: #335
- Move enroot inside images and bind mount it
- PR: #339
- NOTASK: add debug logs
- PR: #357
- Move Pyxis from jail to images and bind-mount it
- PR: #361
- MSP-4080: add simple rebooter
- PR: #369
- MSP-4080: add CheckNodeCondition to rebooter
- PR: #372
- MSP-4080: add rebooting node check
- PR: #377
- MSP-4080: add reboot node and build image
- PR: #381
- MSP-4080: add handleNodeReboot, handleNodeDrain, handleNodeUnDrain and fix patch condition
- PR: #383
- Preinstall Nvidia mock packages issues/384
- PR: #387
- Install nvtop as deb package from repo and bind mount it from container to the jail filesystem
- PR: #390
- Preinstall dcgmi tools to the jail
- PR: #394
- MSP-4080: add render, reconcile rebooter and rbac
- PR: #391
- Remove Nvidia CUDA from worker image and apt clean
- PR: #397
- Build jail image based on own CUDA packages installation
- PR: #415
- Add Epilog and Prolog options
- PR: #411
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
🐛 Fixes
- MSP-3918: Fix bug reconciliation logic for scenarios with maintenance=true and accounting=false
- PR: #309
- Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack
- PR: #316
- NOTIC: Keep more failed NCCL benchmark jobs in the history instead of…
- PR: #315
- MSP-3515: fix mistake in values slurmdbdConfig and slurmConfig
- PR: #318
- [Fix] Install libpmix into nccl-benchmark image
- PR: #319
- Remove openmpi from controller
- PR: #320
- MSP-3992: fix bug with empty version of annotation
- PR: #334
- [FIX] Add patching for service annotations [MSP-3801]
- PR: #354
- fix: update AppArmor profile to allow creation of library links
- PR: #356
- NOTASK: fix bug invalid memory address or nil pointer when get role
- PR: #359
- Enable leader election for controller manager by default
- PR: #365
- Change watching ns mechanism
- PR: #366
- MSP-4080: fix bugs with stuck draining condition
- PR: #399
- Temporary remove expose_enroot_logs flag
- PR: #417
- Fix ci for external contributors
- PR: #419
- Fix non-zero error handling in gpu_healthcheck.sh
- PR: #418
- Pre-create enroot credentials, exclude enroot bind-mounts from motd, make Docker mount /dev/infiniband, store /tmp on disk and add tmpfs /mnt/memory
- PR: #389
📦 Dependencies
- build(deps): bump alpine from b97e2a8 to 56fa17d
- PR: #310
- bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.2
- PR: #312
- build(deps): bump golang from 7ea4c9d to a6927f4
- PR: #322
- build(deps): bump golang from a6927f4 to 585103a
- PR: #323
- build(deps): bump k8s.io/apimachinery from 0.32.0 to 0.32.1
- PR: #325
- build(deps): bump k8s.io/api from 0.32.0 to 0.32.1
- PR: #324
- build(deps): bump golang from 585103a to 9820aca
- PR: #328
- build(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1
- PR: #327
- build(deps): bump golang from 9820aca to 51a6466
- PR: #331
- bump golang.org/x/net to v0.33.0
- PR: #340
- build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.4
- PR: #341
- build(deps): bump actions/setup-go from 5.2.0 to 5.3.0
- PR: #342
- build(deps): bump docker/login-action from 7ca345011ac4304463197fac0e56eab1bc7e6af0 to 327cd5a69de6c009b9ce71bce8395f28e651bf99
- PR: #344
- build(deps): bump google.golang.org/grpc from 1.69.2 to 1.69.4 in /images/worker/gpubench
- PR: #345
- build(deps): bump go.opentelemetry.io/otel/sdk from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #346
- build(deps): bump golang from 51a6466 to 8c10f21
- PR: #338
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #349
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.33.0 to 1.34.0 in /images/worker/gpubench
- PR: #353
- build(deps): bump google.golang.org/grpc from 1.69.4 to 1.70.0 in /images/worker/gpubench
- PR: #358
- Bump kube-apiserver v0.32.1 in gpubench
- PR: #367
- Bump go version for gpubench
- PR: #368
- build(deps): bump golang from 8c10f21 to e213430
- PR: #386
- build(deps): bump golang from e213430 to 9271129
- PR: #392
- build(deps): bump docker/setup-buildx-action from 3.8.0 to 3.9.0
- PR: #402
- build(deps): bump golang.org/x/crypto from 0.32.0 to 0.33.0
- PR: #421
Other
- fix docs about GPUs are required #306
- PR: #317
- Revert "Print actual command before executing it in bash scripts"
- PR: #332
- Update pyxis version with container_image_save and expose_enroot_logs enabled
- PR: #376
Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke, @dstaroff, @itechdima, @nandexsp, @angelbejarano
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
5301 | 235 | 196 | 4604 | 1434 |
1.17.0
Changes made since version 1.16.1 prior to version 1.17.0:
🚀 Features
- add priority weight QOS parameter to slurm.conf
- PR: #292
- Separate sshd configs for login and worker nodes
- PR: #287
- MSP-3392: add supporting apparmor to soperator
- PR: #288
- MSP-3875: add maintenance mode
- PR: #298
- MSP-3569: delete accounting pod, svc and mariadb when accounting false
- PR: #301
- Many small changes: Various fixes, fancy SSH banner, preinstall IB RDMA packages, unshare enroot runtime on login nodes, colored bash for root, keep more failed gpubench jobs, SSH debug logs
- PR: #303
🐛 Fixes
- MSP-3849: fix bug soperator does not watch external of configmap
- PR: #289
- MSP-3851: remove privileges rights from sleep container sysctl
- PR: #290
- MSP-3652: improve probes accounting, worker and controller
- PR: #291
- delete StartupProbe
- PR: #294
- MSP-3868: delete ownerReferences for mariadb secrets
- PR: #295
- gpubench add more information in output
- PR: #296
- MSP-3308: delete unused NCCLTypeH100GPUCluster topology
- PR: #297
📦 Dependencies
- build(deps): bump github.com/onsi/gomega from 1.36.1 to 1.36.2
- PR: #282
- build(deps): bump golang from 7003184 to f06d2bb
- PR: #283
- build(deps): bump golang from f06d2bb to 7ea4c9d
- PR: #284
- build(deps): bump github.com/onsi/ginkgo/v2 from 2.22.1 to 2.22.2
- PR: #293
- build(deps): bump golang.org/x/crypto from 0.31.0 to 0.32.0
- PR: #300
- build(deps): bump alpine from 21dc606 to b97e2a8
- PR: #302
- build(deps): bump sigs.k8s.io/controller-runtime from 0.19.3 to 0.19.4
- PR: #304
Other
- HOTFIX: fix bug apparmor config
- PR: #305
Contributors:
@dependabot[bot], @Uburro, @skleymenov, @asteny, @rdjjke
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
1644 | 48 | 70 | 3203 | 1206 |
1.16.1
Changes made since version 1.16.0 prior to version 1.16.1:
🚀 Features
- MSP-3783: Support setting Slurm node 'extra' field
- PR: #277
🐛 Fixes
- MSP-3724: fix Supervisor entrypoint
- PR: #269
- MSP-3645: add task and jobs params into slurm.conf
- PR: #271
- NOTIC: add max startretries
- PR: #272
- Fix daemonset labels
- PR: #275
- Sshd client alive count 10
- PR: #278
- MSP-3782: fix bug slurm.conf
- PR: #279
📦 Dependencies
- build(deps): bump google.golang.org/grpc from 1.69.0 to 1.69.2 in /images/jail/gpubench
- PR: #273
- build(deps): bump github.com/onsi/ginkgo/v2 from 2.22.0 to 2.22.1
- PR: #276
Other
- bump soperator 1.16.1
- PR: #280
Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
643 | 38 | 22 | 461 | 237 |
1.16.0
Changes made since version 1.15.3 prior to version 1.16.0:
🚀 Features
- MSP-3191: add webhook protect delete secret mariadb
- PR: #215
- MSP-3642: add some configmap values
- PR: #223
- MSP-3541: removed unused crd values
- PR: #225
- Set compute instance name as slurm InstanceId
- PR: #228
- MSP-3272: Support GDRCopy + preinstall more tools
- PR: #231
- MSP-3578: add sshd to worker
- PR: #248
- run enroot containers without root privileges
- PR: #249
- MSP-3705: Support starting Docker containers from Slurm jobs
- PR: #256
🐛 Fixes
📦 Dependencies
- Bump golang from 1.22 to 1.23
- PR: #38
- build(deps): bump k8s.io/apimachinery from 0.30.2 to 0.31.3
- PR: #195
- build(deps): bump k8s.io/api from 0.31.2 to 0.31.3
- PR: #213
- build(deps): bump sigs.k8s.io/controller-runtime from 0.19.1 to 0.19.3
- PR: #214
- build(deps): bump google.golang.org/grpc from 1.68.0 to 1.68.1 in /images/jail/gpubench
- PR: #216
- build(deps): bump github.com/onsi/ginkgo/v2 from 2.21.0 to 2.22.0
- PR: #217
- build(deps): bump k8s.io/client-go from 0.31.2 to 0.31.3
- PR: #219
- build(deps): bump github.com/stretchr/testify from 1.9.0 to 1.10.0
- PR: #218
- build(deps): bump github.com/onsi/gomega from 1.35.1 to 1.36.0
- PR: #220
- build(deps): bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.78.0 to 0.78.2
- PR: #221
- build(deps): bump alpine from 1e42bbe to 21dc606
- PR: #224
- build(deps): bump golang.org/x/crypto from 0.29.0 to 0.30.0
- PR: #226
- build(deps): bump github.com/onsi/gomega from 1.36.0 to 1.36.1
- PR: #230
- build(deps): bump k8s.io/client-go from 0.31.3 to 0.31.4
- PR: #233
- build(deps): bump actions/setup-go from 5.1.0 to 5.2.0
- PR: #239
- build(deps): bump softprops/action-gh-release from 2.1.0 to 2.2.0
- PR: #238
- build(deps): bump k8s.io/api from 0.31.3 to 0.31.4 in /images/jail/gpubench
- PR: #236
- build(deps): bump golang from 574185e to 7003184
- PR: #243
- build(deps): bump golang.org/x/crypto from 0.30.0 to 0.31.0
- PR: #247
- build(deps): bump go.opentelemetry.io/otel/sdk/metric from 1.32.0 to 1.33.0 in /images/jail/gpubench
- PR: #253
- build(deps): bump google.golang.org/grpc from 1.68.1 to 1.69.0 in /images/jail/gpubench
- PR: #252
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.32.0 to 1.33.0 in /images/jail/gpubench
- PR: #254
- build(deps): bump docker/setup-buildx-action from 3.7.1 to 3.8.0
- PR: #264
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.32.0 to 1.33.0 in /images/jail/gpubench
- PR: #263
Other
- MSP-3525: Bump golang from 1.22 to 1.23, k8s.io/api and k8s.io/apimachinery from 0.30.2 to 0.31.2, sigs.k8s.io/controller-runtime from 0.18.4 to 0.19.0, mariadb-operator from v0.0.29 to v0.36.0. Close PRs from dependabot #38, #39, #47, #189, #195
- PR: #196
- MSP-3635: add generate rbac
- PR: #222
- 3578: moving to native sidecar
- PR: #227
- HOTFIX: fix ServiceAccount name for role binding
- PR: #229
- description protectedSecret
- PR: #258
- NOTIC: Add rights to manage daemonsets
- PR: #259
- run build for release
- PR: #265
Contributors:
@Uburro, @dependabot[bot], @asteny, @rdjjke
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
3323 | 556 | 55 | 5592 | 20294 |
1.15.3
Changes made since version 1.15.2 prior to version 1.15.3:
🐛 Fixes
📦 Dependencies
- build(deps): bump softprops/action-gh-release from 2.0.9 to 2.1.0
- PR: #181
- build(deps): bump go.opentelemetry.io/otel/sdk/metric from 1.31.0 to 1.32.0 in /images/jail/gpubench
- PR: #174
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.31.0 to 1.32.0 in /images/jail/gpubench
- PR: #172
- build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc from 1.28.0 to 1.32.0 in /images/jail/gpubench
- PR: #186
- build(deps): bump google.golang.org/grpc from 1.67.1 to 1.68.0 in /images/jail/gpubench
- PR: #170
- build(deps): bump golang.org/x/crypto from 0.24.0 to 0.29.0
- PR: #171
- build(deps): bump alpine from beefdbd to 1e42bbe
- PR: #180
- build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2
- PR: #188
Other
- bug: support jail sub-mounting single file
- PR: #182
- docs(README): deploy slurm cluster in a namespace aligned with the cluster name
- PR: #183
- Set UseInfiniband default - true
- PR: #190
Contributors:
@dependabot[bot], @CrackedPoly, @Uburro, @asteny
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
1144 | 203 | 20 | 209 | 93 |
1.15.2
1.15.1
Changes made since version 1.14.14 prior to version 1.15.1:
🚀 Features
- feat: support ImagePullPolicy from manifest
- PR: #152
- Helm (slurm-cluster-storage): Configure scheduling for particular storage targets
- PR: #154
- add: arrange storage attachment scheduling by storage type
- PR: #160
- Slurm REST API
- PR: #167
🐛 Fixes
- perf(images): use multi-stage to reduce populate_jail image size
- PR: #151
- NOTASK: no cpu limits no throttling
- PR: #158
📦 Dependencies
- build(deps): bump k8s.io/api from 0.31.1 to 0.31.2 in /images/jail/gpubench
- PR: #142
- build(deps): bump softprops/action-gh-release from 2.0.8 to 2.0.9
- PR: #159
- build(deps): bump mikepenz/release-changelog-builder-action from 5.0.0.pre.rc02 to 5
- PR: #162
- build(deps): bump docker/login-action from 06895751d15a223ec091bea144ad5c7f50d228d0 to 7ca345011ac4304463197fac0e56eab1bc7e6af0
- PR: #166
Other
- hotfix: fix required Limit must be set for non overcommitable
- PR: #161
- HOTFIX: make worker pod as a Guaranteed
- PR: #163
- MSP-3261: add apparmor configurable
- PR: #165
Contributors:
@CrackedPoly, @dependabot[bot], @dstaroff, @Uburro, @asteny
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
1068 | 187 | 27 | 1767 | 169 |
1.14.14
Changes made since version 1.14.13 prior to version 1.14.14:
- no changes
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
5 | 29 | 25 |
1.14.13
Changes made since version 1.14.12 prior to version 1.14.13:
🐛 Fixes
- Fix getting uniq node for all slurm partitions in nccl test
- PR: #145
📦 Dependencies
- build(deps): bump actions/setup-go from 5.0.2 to 5.1.0
- PR: #144
- build(deps): bump docker/login-action from 1f36f5b7a2d2f7bfd524795fc966e6d88c37baa9 to 5d8785b43a795ee002a17dbf1a2235dc1997224b
- PR: #143
Contributors:
@dependabot[bot], @asteny
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
3 | 0 | 7 | 25 | 25 |
1.14.12
Changes made since version 1.14.11 prior to version 1.14.12:
📦 Dependencies
- Bump go.opentelemetry.io/otel/sdk from 1.30.0 to 1.31.0 in /images/jail/gpubench
- PR: #115
- Bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 1.28.0 to 1.31.0 in /images/jail/gpubench
- PR: #112
- build(deps): bump actions/checkout from 4.2.1 to 4.2.2
- PR: #136
Other
- other: bump soperator
- PR: #138
Contributors:
@dependabot[bot], @asteny
📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
---|---|---|---|---|
3 | 1 | 7 | 68 | 68 |