-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 Backport bug fixes for v0.7.1 release #1329
🌱 Backport bug fixes for v0.7.1 release #1329
Conversation
The delegating Logger will cause a high number of memory allocations when no actual Logger is set. This change makes us set a NullLogger after 30 seconds if none was set yet to avoid that.
In admission v1, API server requires that Patch and PatchType are both provided or none are provided. Meanwhile, admission v1beta1 does not have this kind of requirement. In controller-runtime, PatchResponseFromRaw sets PatchType regardless of the existence of patch. If patch is empty, a response contains only PatchType and API server does not like it. Webhook call fails. This change fixes this issue by not setting PatchType if patch is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alvaroaleman, vincepri The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
controller-runtime 0.7.0 was causing API submissions to error out whenever no default values were applied. this was fixed in version 0.7.1 kubernetes-sigs/controller-runtime#1329
* creates raycluster api along with other scaffolded files and types (i.e. controller) * adds RayCluster API * fixes linting error * adds more badges to README don't we need stinking badges??? * runs tests using bash shell make target requires the use of `source' command * sets bash as the default shell in Makefile since this uses "/bin/sh" by default. if that path doesn't point to bash, then the scripting will error out because the "source" command is unavailable * updates RayClusterSpec API doc strings * changes controller-gen crds opts generating CRDs that support the latest and greatest * adds basic validation to RayClusterSpec fields also sets fields with default values as "optional" since the v1 CRD spec requires this distinction. "omitempty" accomplishes the same thing but preference is lent to the former annotation since it clearly conveys our intention. update sample ray CR with a real field value; more will come * adds generic metadata functions for use by all resources * adds RayClusterReconciler loop structure and scaffolds structure for creating all the resources need to build a ray cluster. serviceaccount is implemented. tests will come * adds util funcs used to generate deployments along with tests * adds Volumes and VolumeMounts to RayClusterSpec these will be used to mount additional volumes into cluster pods * updates RayClusterSpec fixes default repo for ray image adds NodeSelector and Affinity fields * changes RayClusterSpec adds realistic ObjectStoreMemoryBytes min validation in accordance with the app's requirements. Points to OCIImageDefinition object * adds naive creation of head/worker deployments naive in the sense the controller creates them w/o check, but the generation of the deployment structs are well tested. also adds a head service and empty stub files for other required resources. * moves util pkg out of resources this is a bit cleaner, more descriptive * oh so many changes adds ClientServerPort to RayClusterSpec exposes proper ports in head service for ray changes default ray image to standard image (w/o extra ML libs) refactors ray resource lib funcs and adds more unit tests * adds more resources to controller reconciliation creates network policy creates and binds pod security policy adds watches to controller for "owned" resources (still naive) changes ObjectStoreMemoryBytes type to account for nil * fixes ObjectStoreMemoryBytes field and processing should be int64 so we can represent more than 2.2Gi fixes bug with cmd args passing value instead of pointer addr * reworks psp support while we can create a PSP for every cluster, the logistics behind managing both cluster and namespace scoped resources are gross. these changes make it possible for users to provide a psp "name". the psp will be verified and then use will be enabled via RBAC resources. * bumps go version to 1.16 * another load of changes enables primitive autoscaling separates head/worker resource requests (different load profiles) adds printer columns for `kubectl get' output uses "recreate" strategy to manage head deployment updates example RayCluster YAML * PR feedback changes exposes ScaleDownStabilizationWindowSeconds for HPA adds ImagePullSecrets to spec for pulling private images makes RayCluster.Spec field required * adds network policy support for clients * changes default docker IMG name default value is not very descriptive * adds pkg/ to dockerfile required to build project since we leverage libraries contained therein * adds docker image build/push steps to CI * logging into registry * adds docker labels and layer caching to ci * fixes docker build in ci corrects labels and uses registry cache instead of local fs * removes useless comment * removes Build from CI step is implicitly performed during `make test' * removes fmt vet targets from Makefile and adds build step back to ci; that should handle the main caching bits * updates golangci-lint and configures more linters we can use this tool for vet, fmt and other linting checks instead of running those checks as separate steps * migrates from using "flag" to "cobra" so we can add subcommands to managing CRDs * adds crd-apply / crd-delete commands * moves reconcileDeployments into a separate func we're going to have to perform some advanced checks and the linting error was also bothering me * fixes Dockerfile adds crd defs to Dockerfile so they can be embedded during build time. * adds beginnings of a helm chart added Namespace to manager config so we can watch resources within the operator deployment namespace. without this, the manager requires cluster-scope watch permissions which does not really make sense. * makes netpol client labels configurable users can provide their own client labels when desired; default stays the same as before * removes helm chart service.yaml and adds an update to the generated crd that was missed in the last commit * decouples head/worker common attributes these changes make it possible to assign labels, annotations, volumes, etc..., to head and worker nodes separately. we can add "global" fields if/when necessary * adds missing "start" cmd to `make run' * removes superfluous "hack" files from generator we're not using this so there's no need to preserve the original dir/args * excludes unit tests from gocyclo linting in the spirit of writing clearer tests instead of making everything modular * adds extra global settings to RayClusterSpec adds pod security context adds extra env vars added to all ray containers adds the ability to use an external service account * enhances hpa creation exposes optional field for MinReplicas. when specified, that value will be applied to the underlying HPA. when absent, the worker replicas value will be set as a minimum. * corrects cluster-owned resource names adding some reference to "ray" makes it a lot easiser to differentiate cluster resources from others in the same NS * tweaks exhaustive linter don't really need to include ComponentNone in switch checks since that value is used elsewhere * adding coverage report * fixes leftover exhaustive linting errors * updates README and adds notes * splits controller processResources into separate funcs the optional creation of the underlying resources and the updates that will need to be perform for only certain resources merit expanding this single func into separate, dedicated, resource-specific funcs * adds service updates to reconciliation loop this will cause the head service ports to change whenever related spec field values change. * fixes printercolumn forgot to update this when separating head/worker fields api * adds silly logo * updates README a little more shine. shine on * fixes deployment creation default service account name never updated to reflect "-ray" change sets both head/worker deployment strategy to "Recreate" If the image tag is updated, the head pod and some worker pods will terminate immediatetly. the remaining worker pods can remain in an error loop for an extended period of time if the new image is large. furthermore, there's no telling how the ray cluster will behave when mixing different versions. a simple solution for now is to recreate the entire cluster from scratch using the new image. * adds advanced resource reconciliation now updates owned resources when parent CR fields change * adds head/worker pod names to CR status * updates .dockerignore avoids uploading 250Mi+ into docker context when building * updates ray controller adds metadata to "owned" event sources for adding visibility uses company-specific patch annotations for tracking updates * updates helm chart rbac permissions provides the extra update/delete rights for managing updates grants pod list rights so we can add their names to the spec status * updates RayClusterReconciler rbac annotations so they're in-line with what's actually required by the operator and generate the appropriate rbac resource for deploy via `kustomize' * adds RayCluster webhooks adding defaulting and validation webhooks to controller ignores long kubebuilder comment annotations when linting adds back boilerplate text (required by `operator-sdk' generators) * makes ginko test suites more usable setting the envtest kubebuilder assets dir so that we can run tests w/o having to invoke `make test`. separates test assets install procedure into a separate make target. * adds validation webhook logic adding `omitempty` tag to optional fields to prevent serializations errs migrates primitive validation from CRD to webhook adds first set of integration tests (validation) * adds webhook support to helm chart * moves helm chart into nested dir * refactors helm chart macro names i'm getting tired of typing "distributed-compute-operator" and since this is NOT a library chart, it seems fine to use whatever prefix i prefer * renames logo file * adds development setup script * fixes mutating webhook / removes operator-sdk the project was revised to use `kubebuilder' instead of `operator-sdk' for a number of reasons: - operator-sdk uses kubebuilder under the hood - kubebuilder improves faster than operator-sdk - we do not need the extra OLM and scorecard nonsense - the code itself is not directly reliant on one or the other * fixes github workflow build target name changed * adds SHELL override back to Makefile * adds desired code formatter so devs can format their code before pushing * build, test, lint * points to local goimports bin in Makefile * adds go tooling cache to ci * removes lint target from test/build/run * run linting before uploading coverage report * adds caches for go build cache in CI and updates ginkgo/gomega * adds a way to disable webhooks to aid in development, an envvar can be used to disable webhooks and execute the project outside k8s * removes "test" from "docker-build" make target i want this, i don't want this. bottom line, we run these steps separately in CI so we'll always test before building a docker image. devs will just need to be diligent...eh? * reworks hpa exposes "scale" subresource on RayCluster uses "scale" subresource to scale worker deployment targeting RayCluster with HPA instead of deployment * fixes linting issue * fixes patch issue with defaulting webhook controller-runtime 0.7.0 was causing API submissions to error out whenever no default values were applied. this was fixed in version 0.7.1 kubernetes-sigs/controller-runtime#1329 * moves all validation/defaulting into webhooks reworks HPA creation behavior adds integration tests for webhook logic * removes RayClusterWorker.Replicas field requirement we already default this field and validate use input. making this optional allows the user to specify other worker fields (e.g. resources) without having to explicitly provide a replica count as well. * modifies notes * bumps controller-runtime to latest version version 0.8 allows us to excluded error stack traces in logs. this does 3 things: - produces less noise in production logs - forces us to add context to errors with wrapping - still allows us to trace errors in development when fixing bugs * exposes all log config opts in helm chart removes "development mode" default * adds TypeMeta to generated k8s resources this helps us log type information when creating owned resources * uses WithName logger func correctly * updates crd after bumping k8s api dependency minor changes moving from 1.19 to 1.20 * adds decent controller logging logging create/update/delete of owned components debug logging for status updates and object patches debug++ logging for reconciliation loop triggers * ignoring magic number in logger V() func ...because * adds script to help with developer workflow tasks * fixes RayCluster status updates when updates to a stale object occur, we just requeue the key so it can be reprocessed with the latest version of the RayCluster object. * adds image tag prefix const to development script * enforces worker cpu resource requests when autoscaling is enabled * refactor development script to split minikube profile name out from common name (which represents image, chart, and deployment names) * bumps controller-runtime version 0.8.3 removes deprecation warnings that were muddling the logs * fixes issue with script and refactors some more change MINIKUBE_PROFILE to evaluate as $MINIKUBE_PROFILE adds extra vars with defaults to allow users override capabilities * extracts context logging from RayClusterReconciler now this can be used by any reconciler or other types that require the need to propagate a contextual logger * removes superfluous TypeMeta from generated objs this was only added to enrich log statements. instead, we can grab the type metadata directly from the scheme and add it to the logs prior to create/update operations. * adds developer comment to reconcileAutoscaler indicates the reason why the shape of this operation is slightly different from the other reconcile functions. * standardizes approach to checking apierrors.IsNotFound using ifs to assert whether we need to take one action or another, or return an err * changes image defaulting behavior default image definition is added when completely missing. if a user provides an image, then we valid that the minimum required fields are present. Co-authored-by: Adam Udi <[email protected]>
This PR backports some bug fixes that have been merged to the mainline branch which are important to be included for the current release (v0.7.x).
/assign @alvaroaleman
/milestone v0.7.x