Release soperator 1.16.0 #266

Merged 50 commits on Dec 16, 2024
Commits
ce03ed3
notask: fix space in ln (#208)
Uburro Nov 29, 2024
ca647bf
NOTASK: real memory convert bytes to mebibytes (#209)
Uburro Nov 29, 2024
acf42d7
bump 1.15.5 (#210)
Uburro Dec 3, 2024
4f647d2
MSP-3525: Bump golang from 1.22 to 1.23, k8s.io/api and k8s.io/apimac…
Uburro Dec 4, 2024
0aa082a
Bump golang from 1.22 to 1.23 (#38)
dependabot[bot] Dec 4, 2024
2972b9c
build(deps): bump k8s.io/apimachinery from 0.30.2 to 0.31.3 (#195)
dependabot[bot] Dec 4, 2024
c6f657a
build(deps): bump k8s.io/api from 0.31.2 to 0.31.3 (#213)
dependabot[bot] Dec 4, 2024
27a97b4
build(deps): bump sigs.k8s.io/controller-runtime from 0.19.1 to 0.19.…
dependabot[bot] Dec 4, 2024
88fb7af
MSP-3191: add webhook protect delete secret mariadb (#215)
Uburro Dec 5, 2024
c565c1d
build(deps): bump google.golang.org/grpc in /images/jail/gpubench (#216)
dependabot[bot] Dec 5, 2024
3f26d30
build(deps): bump github.com/onsi/ginkgo/v2 from 2.21.0 to 2.22.0 (#217)
dependabot[bot] Dec 5, 2024
a0b2977
build(deps): bump k8s.io/client-go from 0.31.2 to 0.31.3 (#219)
dependabot[bot] Dec 5, 2024
fd43743
build(deps): bump github.com/stretchr/testify from 1.9.0 to 1.10.0 (#…
dependabot[bot] Dec 5, 2024
61992a3
build(deps): bump github.com/onsi/gomega from 1.35.1 to 1.36.0 (#220)
dependabot[bot] Dec 5, 2024
f84a81a
build(deps): bump github.com/prometheus-operator/prometheus-operator/…
dependabot[bot] Dec 5, 2024
ec91232
MSP-3635: add generate rbac (#222)
Uburro Dec 6, 2024
343c144
MSP-3642: add some configmap values (#223)
Uburro Dec 6, 2024
0d72539
build(deps): bump alpine from `1e42bbe` to `21dc606` (#224)
dependabot[bot] Dec 6, 2024
584e012
MSP-3541: removed unused crd values (#225)
Uburro Dec 6, 2024
3e82cf9
build(deps): bump golang.org/x/crypto from 0.29.0 to 0.30.0 (#226)
dependabot[bot] Dec 6, 2024
898a94a
3578: moving to native sidecar (#227)
Uburro Dec 9, 2024
7d8c185
HOTFIX: fix ServiceAccount name for role binding (#229)
Uburro Dec 10, 2024
f70c806
build(deps): bump github.com/onsi/gomega from 1.36.0 to 1.36.1 (#230)
dependabot[bot] Dec 11, 2024
f77c200
set compute instance name as slurm InstanceId
asteny Dec 9, 2024
8583808
Merge pull request #228 from nebius/add_node_name
asteny Dec 11, 2024
22f80b2
MSP-3272: Support GDRCopy + preinstall more tools
rdjjke Dec 11, 2024
16d62a0
build(deps): bump k8s.io/client-go from 0.31.3 to 0.31.4 (#233)
dependabot[bot] Dec 11, 2024
e4a647c
build(deps): bump softprops/action-gh-release from 2.1.0 to 2.2.0
dependabot[bot] Dec 11, 2024
a59054c
build(deps): bump actions/setup-go from 5.1.0 to 5.2.0
dependabot[bot] Dec 11, 2024
93af296
Merge pull request #239 from nebius/dependabot/github_actions/dev/act…
asteny Dec 11, 2024
58aeb04
Merge pull request #238 from nebius/dependabot/github_actions/dev/sof…
asteny Dec 11, 2024
7d9c5bf
build(deps): bump k8s.io/api in /images/jail/gpubench (#236)
dependabot[bot] Dec 12, 2024
7d3fd82
build(deps): bump golang from `574185e` to `7003184` (#243)
dependabot[bot] Dec 12, 2024
b6ee7bc
build(deps): bump golang.org/x/crypto from 0.30.0 to 0.31.0 (#247)
dependabot[bot] Dec 12, 2024
da33add
Merge pull request #231 from nebius/gdrcopy-support
rdjjke Dec 12, 2024
960c57b
MSP-3578: add sshd to worker (#248)
Uburro Dec 13, 2024
3afb3fb
run enroot containers without root privileges
asteny Dec 12, 2024
fd6ef6c
Merge pull request #249 from nebius/enroot
asteny Dec 13, 2024
f4cd729
build(deps): bump go.opentelemetry.io/otel/sdk/metric (#253)
dependabot[bot] Dec 16, 2024
f5b8931
build(deps): bump google.golang.org/grpc in /images/jail/gpubench (#252)
dependabot[bot] Dec 16, 2024
6a0b82f
build(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/…
dependabot[bot] Dec 16, 2024
3f4125a
MSP-3705: Support starting Docker containers from Slurm jobs
rdjjke Dec 16, 2024
dc1be9d
Merge pull request #256 from nebius/srun-docker-run/0
rdjjke Dec 16, 2024
34cca07
description protectedSecret (#258)
Uburro Dec 16, 2024
6720e9e
NOTIC: Add rights to manage daemonsets
rdjjke Dec 16, 2024
738ae46
fix error var
Uburro Dec 16, 2024
f7f0b9f
Merge pull request #259 from nebius/soperator-rbac-daemonsets/0
rdjjke Dec 16, 2024
a507c0a
Merge pull request #260 from nebius/notic-accounting
rdjjke Dec 16, 2024
8e639b4
Merge branch 'main' into dev
Uburro Dec 16, 2024
915b01d
run build for release (#265)
asteny Dec 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/github_release.yml
@@ -117,7 +117,7 @@ jobs:
token: ${{ secrets.GITHUB_TOKEN }}

- name: Create GitHub Release with changelog
uses: softprops/action-gh-release@01570a1f39cb168c169c802c3bceb9e93fb10974 # v2.1.0
uses: softprops/action-gh-release@7b4da11513bf3f43f9999e90eabced41ab8bb048 # v2.2.0
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
2 changes: 1 addition & 1 deletion .github/workflows/gpubench_only.yml
@@ -51,7 +51,7 @@ jobs:
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Install GO
uses: actions/setup-go@41dfa10bad2bb2ae585af6ee5bb4d7d973ad74ed # v5.1.0
uses: actions/setup-go@3041bf56c941b39c61721a86cd11f3bb1338122a # v5.2.0
with:
go-version-file: 'go.mod'

5 changes: 4 additions & 1 deletion .github/workflows/one_job.yml
@@ -11,6 +11,9 @@ on:
- 'README.md'
- 'SECURITY.md'
- 'images/jail/gpubench/**'
pull_request:
branches:
- main

permissions:
contents: read
@@ -59,7 +62,7 @@ jobs:
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Install GO
uses: actions/setup-go@41dfa10bad2bb2ae585af6ee5bb4d7d973ad74ed # v5.1.0
uses: actions/setup-go@3041bf56c941b39c61721a86cd11f3bb1338122a # v5.2.0
with:
go-version-file: 'go.mod'

4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,4 +1,4 @@
FROM golang:1.22@sha256:4594271250150c1a322ed749abfd218e1a8c6eb1ade90872e325a664412e2037 AS operator_builder
FROM golang:1.23@sha256:70031844b8c225351d0bb63e2c383f80db85d92ba894e3da7e13bcf80efa9a37 AS operator_builder

ARG GO_LDFLAGS=""
ARG BUILD_TIME
@@ -16,7 +16,7 @@ RUN GOOS=$GOOS GOARCH=$GOARCH CGO_ENABLED=$CGO_ENABLED GO_LDFLAGS=$GO_LDFLAGS \
go build -o slurm_operator ./cmd/

#######################################################################################################################
FROM alpine:latest@sha256:1e42bbe2508154c9126d48c2b8a75420c3544343bf86fd041fb7527e017a4b4a AS slurm-operator
FROM alpine:latest@sha256:21dc6063fd678b478f57c0e13f47560d0ea4eeba26dfc947b2a4f81f686b9f45 AS slurm-operator

COPY --from=operator_builder /operator/slurm_operator /usr/bin/

17 changes: 8 additions & 9 deletions Makefile
@@ -105,13 +105,12 @@ lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes
$(GOLANGCI_LINT) run --fix

.PHONY: helm
helm: kustomize helmify yq ## Update soperator Helm chart
rm -rf $(CHART_OPERATOR_PATH)
$(KUSTOMIZE) build config/default | $(HELMIFY) --crd-dir $(CHART_OPERATOR_PATH)
rm -f $(CHART_PATH)/operatorAppVersion
cp -r $(CHART_OPERATOR_PATH)/crds/* $(CHART_OPERATOR_CRDS_PATH)/templates/
@$(YQ) -i ".name = \"helm-soperator\"" "$(CHART_OPERATOR_PATH)/Chart.yaml"
@$(SED_COMMAND) '/^#/d' "$(CHART_OPERATOR_PATH)/Chart.yaml"
helm: generate manifests ## Update soperator Helm chart
$(KUSTOMIZE) build config/crd > $(CHART_OPERATOR_PATH)/crds/slurmcluster-crd.yaml
$(KUSTOMIZE) build config/crd > $(CHART_OPERATOR_CRDS_PATH)/templates/slurmcluster-crd.yaml
mv $(CHART_OPERATOR_PATH)/values.yaml $(CHART_OPERATOR_PATH)/values.yaml.bak
$(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/rbac/soperator-helm | $(HELMIFY) $(CHART_OPERATOR_PATH)
mv $(CHART_OPERATOR_PATH)/values.yaml.bak $(CHART_OPERATOR_PATH)/values.yaml

.PHONY: get-version
get-version:
@@ -297,11 +296,11 @@ YQ ?= $(LOCALBIN)/yq

## Tool Versions
KUSTOMIZE_VERSION ?= v5.5.0
CONTROLLER_TOOLS_VERSION ?= v0.14.0
CONTROLLER_TOOLS_VERSION ?= v0.16.4
ENVTEST_VERSION ?= release-0.17
GOLANGCI_LINT_VERSION ?= v1.57.2
HELMIFY_VERSION ?= 0.4.13
YQ_VERSION ?= 4.44.1
YQ_VERSION ?= 4.44.3

.PHONY: kustomize
kustomize: $(KUSTOMIZE) ## Download kustomize locally if necessary.
8 changes: 8 additions & 0 deletions PROJECT
@@ -17,4 +17,12 @@ resources:
kind: SlurmCluster
path: nebius.ai/slurm-operator/api/v1
version: v1
- core: true
group: core
kind: Secret
path: k8s.io/api/core/v1
version: v1
webhooks:
validation: true
webhookVersion: v1
version: "3"
4 changes: 2 additions & 2 deletions README.md
@@ -101,6 +101,7 @@ Slurm's accounting system records detailed job information such as:
- User and group identities
- Job start/end times
- Resource requests and allocations
- If `protectedSecret` is set to `true`, the user secret for MariaDB will not be deleted after the MariaDB CR is deleted

This helps cluster administrators and users monitor resource utilization, enforce quotas, and generate usage reports for performance optimization or billing purposes.

@@ -114,11 +115,10 @@ This helps cluster administrators and users monitor resource utilization, enforc
[22.04](https://releases.ubuntu.com/jammy/).
- Slurm: versions `23.11.6` and `24.05.3`.
- CUDA: version [12.2.2](https://developer.nvidia.com/cuda-12-2-2-download-archive).
- Kubernetes: >= [1.28](https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/).
- Kubernetes: >= [1.29](https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/).
- Versions of some preinstalled software packages can't be changed.



## 🚀 Installation
The steps required to deploy Soperator to your Kubernetes cluster depend on whether you are using Kubernetes
on premises or in a cloud.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.15.5
1.16.0
74 changes: 66 additions & 8 deletions api/v1/slurmcluster_types.go
@@ -6,7 +6,7 @@ import (
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

mariadv1alpha1 "github.com/mariadb-operator/mariadb-operator/api/v1alpha1"
mariadbv1alpha1 "github.com/mariadb-operator/mariadb-operator/api/v1alpha1"
prometheusv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

@@ -73,6 +73,40 @@ type SlurmClusterSpec struct {
// https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION
// +kubebuilder:validation:Optional
PartitionConfiguration PartitionConfiguration `json:"partitionConfiguration,omitempty"`

// SlurmConfig represents the Slurm configuration in slurm.conf. Not all options are supported.
//
// +kubebuilder:validation:Optional
SlurmConfig SlurmConfig `json:"slurmConfig,omitempty"`
}

// SlurmConfig represents the Slurm configuration in slurm.conf
type SlurmConfig struct {
// Default real memory size available per allocated node in mebibytes.
//
// +kubebuilder:validation:Optional
// +kubebuilder:default=1228800
DefMemPerNode int32 `json:"defMemPerNode,omitempty"`
// Default count of CPUs allocated per allocated GPU
//
// +kubebuilder:validation:Optional
// +kubebuilder:default=16
DefCpuPerGPU int32 `json:"defCpuPerGPU,omitempty"`
// The time to wait, in seconds, when any job is in the COMPLETING state before any additional jobs are scheduled.
//
// +kubebuilder:validation:Optional
// +kubebuilder:default=5
CompleteWait int32 `json:"completeWait,omitempty"`
// Defines specific subsystems which should provide more detailed event logging.
//
// +kubebuilder:validation:Optional
// +kubebuilder:default="Cgroup,CPU_Bind,Gres,JobComp,Priority,Script,SelectType,Steps,TraceJobs"
// +kubebuilder:validation:Pattern="^((Accrue|Agent|AuditRPCs|Backfill|BackfillMap|BurstBuffer|Cgroup|ConMgr|CPU_Bind|CpuFrequency|Data|DBD_Agent|Dependency|Elasticsearch|Energy|Federation|FrontEnd|Gres|Hetjob|Gang|GLOB_SILENCE|JobAccountGather|JobComp|JobContainer|License|Network|NetworkRaw|NodeFeatures|NO_CONF_HASH|Power|Priority|Profile|Protocol|Reservation|Route|Script|SelectType|Steps|Switch|TLS|TraceJobs|Triggers)(,)?)+$"
DebugFlags string `json:"debugFlags,omitempty"`
// +kubebuilder:validation:Optional
// +kubebuilder:default="Verbose"
// +kubebuilder:validation:Pattern="^((None|Cores|Sockets|Threads|SlurmdOffSpec|OOMKillStep|Verbose|Autobind)(,)?)+$"
TaskPluginParam string `json:"taskPluginParam,omitempty"`
}

type PartitionConfiguration struct {
@@ -277,7 +311,6 @@ type K8sNodeFilter struct {
Name string `json:"name"`

// Affinity defines the desired affinity for the node
//
// NOTE: Affinity could not be set if NodeSelector is specified
//
// +kubebuilder:validation:Optional
@@ -442,13 +475,26 @@ type MariaDbOperator struct {
// +kubebuilder:validation:Optional
Enabled bool `json:"enabled"`

// If enabled, secret cannot be deleted until custom resource slurmcluster is deleted
//
// +kubebuilder:validation:Optional
// +kubebuilder:default=false
// +kubebuilder:validation:Immutable
ProtectedSecret bool `json:"protectedSecret"`

NodeContainer `json:",inline"`
PodSecurityContext *corev1.PodSecurityContext `json:"podSecurityContext,omitempty"`
SecurityContext *corev1.SecurityContext `json:"securityContext,omitempty"`
Replicas int32 `json:"replicas,omitempty"`
Metrics *mariadv1alpha1.MariadbMetrics `json:"metrics,omitempty"`
Replication *mariadv1alpha1.Replication `json:"replication,omitempty"`
Storage mariadv1alpha1.Storage `json:"storage,omitempty"`
PodSecurityContext *mariadbv1alpha1.PodSecurityContext `json:"podSecurityContext,omitempty"`
SecurityContext *mariadbv1alpha1.SecurityContext `json:"securityContext,omitempty"`
Replicas int32 `json:"replicas,omitempty"`
Metrics MariadbMetrics `json:"metrics,omitempty"`
Replication *mariadbv1alpha1.Replication `json:"replication,omitempty"`
Storage mariadbv1alpha1.Storage `json:"storage,omitempty"`
}

type MariadbMetrics struct {
// +kubebuilder:validation:Optional
// +kubebuilder:default=true
Enabled bool `json:"enabled,omitempty"`
}

type SlurmdbdConfig struct {
@@ -577,6 +623,11 @@ type SlurmNodeWorker struct {
// +kubebuilder:validation:Required
Munge NodeContainer `json:"munge"`

// SupervisordConfigMapRefName is the name of the supervisord config, which runs in slurmd container
//
// +kubebuilder:validation:Optional
SupervisordConfigMapRefName string `json:"supervisordConfigMapRefName,omitempty"`

// Volumes represents the volume configurations for the worker node
//
// +kubebuilder:validation:Required
@@ -587,6 +638,13 @@ type SlurmNodeWorker struct {
// +kubebuilder:default="v2"
// +kubebuilder:validation:Enum="v1";"v2"
CgroupVersion string `json:"cgroupVersion,omitempty"`

// EnableGDRCopy driver propagation into containers (this feature must also be enabled in NVIDIA GPU operator)
// https://developer.nvidia.com/gdrcopy
//
// +kubebuilder:validation:Optional
// +kubebuilder:default=false
EnableGDRCopy bool `json:"enableGDRCopy,omitempty"`
}

// SlurmNodeWorkerVolumes defines the volumes for the Slurm worker node
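
Review note on the `DebugFlags` and `TaskPluginParam` patterns above: the comma in `(,)?` is optional, so the pattern appears to accept more than a clean comma-separated list. The sketch below is a minimal standalone check, assuming the API server evaluates the pattern with semantics close to Go's `regexp` package. `matchesTaskPluginParam` is a hypothetical helper name used only for illustration and is not part of the operator.

```go
package main

import (
	"fmt"
	"regexp"
)

// taskPluginParamPattern is copied from the kubebuilder Pattern marker on
// SlurmConfig.TaskPluginParam in this diff.
var taskPluginParamPattern = regexp.MustCompile(
	`^((None|Cores|Sockets|Threads|SlurmdOffSpec|OOMKillStep|Verbose|Autobind)(,)?)+$`)

// matchesTaskPluginParam is a hypothetical helper (not in the operator) that
// reports whether a value would pass the pattern under Go regexp semantics.
func matchesTaskPluginParam(value string) bool {
	return taskPluginParamPattern.MatchString(value)
}

func main() {
	// A few representative inputs: valid lists, a trailing comma,
	// two flags with no separator, wrong case, and the empty string.
	for _, v := range []string{"Verbose", "Verbose,Cores", "Verbose,", "VerboseCores", "verbose", ""} {
		fmt.Printf("%q -> %v\n", v, matchesTaskPluginParam(v))
	}
}
```

Under that assumption, `Verbose,` and `VerboseCores` pass alongside `Verbose,Cores`, while `verbose` and the empty string are rejected. If only strict comma-separated lists are intended, a pattern of the form `^(A|B)(,(A|B))*$` would be one way to tighten it.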
41 changes: 34 additions & 7 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

12 changes: 10 additions & 2 deletions cmd/main.go
@@ -37,14 +37,15 @@ import (
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"

mariadv1alpha1 "github.com/mariadb-operator/mariadb-operator/api/v1alpha1"
mariadbv1alpha1 "github.com/mariadb-operator/mariadb-operator/api/v1alpha1"
otelv1beta1 "github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
prometheusv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"

slurmv1 "nebius.ai/slurm-operator/api/v1"
"nebius.ai/slurm-operator/internal/check"
"nebius.ai/slurm-operator/internal/consts"
"nebius.ai/slurm-operator/internal/controller/clustercontroller"
webhookcorev1 "nebius.ai/slurm-operator/internal/webhook/v1"
//+kubebuilder:scaffold:imports
)

@@ -62,7 +63,7 @@ func init() {
utilruntime.Must(prometheusv1.AddToScheme(scheme))
}
if check.IsMariaDbCRDInstalled() {
utilruntime.Must(mariadv1alpha1.AddToScheme(scheme))
utilruntime.Must(mariadbv1alpha1.AddToScheme(scheme))
}

utilruntime.Must(slurmv1.AddToScheme(scheme))
@@ -193,6 +194,13 @@ func main() {
setupLog.Error(err, "unable to create controller", "controller", reflect.TypeOf(slurmv1.SlurmCluster{}).Name())
os.Exit(1)
}
// nolint:goconst
if os.Getenv("ENABLE_WEBHOOKS") != "false" {
if err = webhookcorev1.SetupSecretWebhookWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create webhook", "webhook", "Secret")
os.Exit(1)
}
}
//+kubebuilder:scaffold:builder

if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
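
Review note on the `ENABLE_WEBHOOKS` check above: the comparison is an exact string match against `"false"`, so the Secret webhook is registered when the variable is unset or holds any other value. A minimal sketch of that condition in isolation follows. `webhooksEnabled` and its injected `getenv` parameter are hypothetical names introduced here for testability and do not appear in the PR.

```go
package main

import (
	"fmt"
	"os"
)

// webhooksEnabled mirrors the condition in cmd/main.go from this diff:
// webhooks are on unless ENABLE_WEBHOOKS is exactly the string "false".
// The getenv parameter is a hypothetical seam so the logic can be tested
// without touching the real process environment.
func webhooksEnabled(getenv func(string) string) bool {
	return getenv("ENABLE_WEBHOOKS") != "false"
}

func main() {
	fmt.Println("webhooks enabled:", webhooksEnabled(os.Getenv))
}
```

Values such as `FALSE` or `0` still leave the webhook enabled under this comparison, which may be worth stating in the deployment docs.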
35 changes: 35 additions & 0 deletions config/certmanager/certificate.yaml
@@ -0,0 +1,35 @@
# The following manifests contain a self-signed issuer CR and a certificate CR.
# More document can be found at https://docs.cert-manager.io
# WARNING: Targets CertManager v1.0. Check https://cert-manager.io/docs/installation/upgrading/ for breaking changes.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
labels:
app.kubernetes.io/name: slurm-operator
app.kubernetes.io/managed-by: kustomize
name: selfsigned-issuer
namespace: system
spec:
selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
labels:
app.kubernetes.io/name: certificate
app.kubernetes.io/instance: serving-cert
app.kubernetes.io/component: certificate
app.kubernetes.io/created-by: slurm-operator
app.kubernetes.io/part-of: slurm-operator
app.kubernetes.io/managed-by: kustomize
name: serving-cert # this name should match the one appeared in kustomizeconfig.yaml
namespace: system
spec:
# SERVICE_NAME and SERVICE_NAMESPACE will be substituted by kustomize
dnsNames:
- SERVICE_NAME.SERVICE_NAMESPACE.svc
- SERVICE_NAME.SERVICE_NAMESPACE.svc.cluster.local
issuerRef:
kind: Issuer
name: selfsigned-issuer
secretName: webhook-server-cert # this secret will not be prefixed, since it's not managed by kustomize
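
For reference, the two `dnsNames` entries above follow the usual in-cluster Service DNS forms once kustomize substitutes the placeholders. Below is a minimal sketch of the resulting names, assuming a hypothetical service `webhook-service` in a hypothetical namespace `soperator-system` (neither value is taken from this PR). `webhookDNSNames` is an illustrative helper only; the real substitution is done by kustomize, not by Go code.

```go
package main

import "fmt"

// webhookDNSNames is a hypothetical helper that builds the two names the
// Certificate lists in dnsNames after SERVICE_NAME and SERVICE_NAMESPACE
// have been replaced.
func webhookDNSNames(service, namespace string) []string {
	return []string{
		fmt.Sprintf("%s.%s.svc", service, namespace),
		fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace),
	}
}

func main() {
	// Both arguments are placeholder values chosen for this example.
	for _, name := range webhookDNSNames("webhook-service", "soperator-system") {
		fmt.Println(name)
	}
}
```

The serving certificate must cover whichever of these names the API server uses to reach the webhook Service, which is why both forms are listed.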
5 changes: 5 additions & 0 deletions config/certmanager/kustomization.yaml
@@ -0,0 +1,5 @@
resources:
- certificate.yaml

configurations:
- kustomizeconfig.yaml