Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robustness to driver pod taking time to create #2315

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Nov 10, 2024

Purpose of this PR

Improve reliability.
Closes: #2302

Proposed changes:

  • Add a grace period for the driver pod to be created.
  • Grace period is controlled by a new config option driver-pod-creation-grace-period. Default is 10 seconds.
  • Add 3 new tests around reconciling submitted status, including the new grace period functionality. This is the majority of the lines changed.
  • Expose the new config option in the helm chart and add a helm test

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly - I think its sufficient just to add a description on the new argument on the helm chart and regenerate the helm docs.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Some logs of a real example on our prod cluster where the spark application was saved by the grace period added in this PR.
Explore-logs-2024-11-11 12_09_13.txt

@Tom-Newton Tom-Newton force-pushed the tomnewton/robustness_to_driver_pod_taking_time_to_create branch from dd39e70 to 3f3967c Compare November 10, 2024 17:24
@Tom-Newton Tom-Newton force-pushed the tomnewton/robustness_to_driver_pod_taking_time_to_create branch from a3d6607 to bbe7510 Compare November 10, 2024 17:27
@Tom-Newton
Copy link
Contributor Author

Sorry for the direct ping. @ChenYi015 are you the right person to review this?

@Tom-Newton
Copy link
Contributor Author

The CI failure looks unrelated

• [FAILED] [300.030 seconds]
Example SparkApplication spark-pi-python [It] Should complete successfully
/home/runner/work/spark-operator/spark-operator/test/e2e/sparkapplication_test.go:383

  [FAILED] Unexpected error:
      <*fmt.wrapError | 0xc001b63360>: 
      client rate limiter Wait returned an error: context deadline exceeded
      {
          msg: "client rate limiter Wait returned an error: context deadline exceeded",
          err: <context.deadlineExceededError>{},
      }
  occurred
  In [It] at: /home/runner/work/spark-operator/spark-operator/test/e2e/sparkapplication_test.go:386 @ 12/04/24 12:25:47.015

I'll try rebasing on latest main

Tom-Newton and others added 12 commits December 4, 2024 12:41
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
@Tom-Newton Tom-Newton force-pushed the tomnewton/robustness_to_driver_pod_taking_time_to_create branch from 37846be to 6c306e8 Compare December 4, 2024 12:42
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChenYi015

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChenYi015
Copy link
Contributor

@Tom-Newton Thanks for improving the robustness of Spark operator.
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Dec 4, 2024
@google-oss-prow google-oss-prow bot merged commit d815e78 into kubeflow:master Dec 4, 2024
11 checks passed
@Tom-Newton
Copy link
Contributor Author

@Tom-Newton Thanks for improving the robustness of Spark operator.

Thanks for reviewing 🙂

ChenYi015 pushed a commit to ChenYi015/spark-operator that referenced this pull request Dec 10, 2024
* Retry after driver pod now found if recent submission

Signed-off-by: Thomas Newton <[email protected]>

* Add a test

Signed-off-by: Thomas Newton <[email protected]>

* Make grace period configurable

Signed-off-by: Thomas Newton <[email protected]>

* Update test

Signed-off-by: Thomas Newton <[email protected]>

* Add an extra test with the driver pod

Signed-off-by: Thomas Newton <[email protected]>

* Separate context to create and delete the driver pod

Signed-off-by: Thomas Newton <[email protected]>

* Tidy

Signed-off-by: Thomas Newton <[email protected]>

* Autoformat

Signed-off-by: Thomas Newton <[email protected]>

* Update error message

Signed-off-by: Thomas Newton <[email protected]>

* Add helm paramater

Signed-off-by: Thomas Newton <[email protected]>

* Update internal/controller/sparkapplication/controller.go

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>

* Newlines between helm tests

Signed-off-by: Thomas Newton <[email protected]>

---------

Signed-off-by: Thomas Newton <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
(cherry picked from commit d815e78)
@ChenYi015 ChenYi015 mentioned this pull request Dec 10, 2024
google-oss-prow bot pushed a commit that referenced this pull request Dec 11, 2024
* Allow setting automountServiceAccountToken (#2298)

* Allow setting automountServiceAccountToken on workloads and serviceAccounts

Signed-off-by: Aran Shavit <[email protected]>

* update helm docs

Signed-off-by: Aran Shavit <[email protected]>

---------

Signed-off-by: Aran Shavit <[email protected]>
(cherry picked from commit 515d805)

* Fix: executor container security context does not work (#2306)

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 171e429)

* Fix: should not add emptyDir sizeLimit conf if it is nil (#2305)

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 763682d)

* Allow the Controller and Webhook Containers to run with the securityContext: readOnlyRootfilesystem: true (#2282)

* create a tmp dir for the controller to write Spark artifacts to and set the controller to readOnlyRootFilesystem

Signed-off-by: Nick Gretzon <[email protected]>

* mount a dir for the webhook container to generate its certificates in and set readOnlyRootFilesystem: true for the webhook pod

Signed-off-by: Nick Gretzon <[email protected]>

* update the securityContext in the controller deployment test

Signed-off-by: Nick Gretzon <[email protected]>

* update securityContext of the webhook container in the deployment_test

Signed-off-by: Nick Gretzon <[email protected]>

* update README

Signed-off-by: Nick Gretzon <[email protected]>

* remove -- so comments are not rendered in the README.md

Signed-off-by: Nick Gretzon <[email protected]>

* recreate README.md after removal of comments for volumes and volumeMounts

Signed-off-by: Nick Gretzon <[email protected]>

* make indentation for volumes and volumeMounts consistent with rest of values.yaml

Signed-off-by: Nick Gretzon <[email protected]>

* Revert "make indentation for volumes and volumeMounts consistent with rest of values.yaml"

This reverts commit dba97fc.

Signed-off-by: Nick Gretzon <[email protected]>

* fix indentation in webhook and controller deployment templates for volumes and volumeMounts

Signed-off-by: Nick Gretzon <[email protected]>

* Update charts/spark-operator-chart/values.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/values.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/values.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/values.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/templates/controller/deployment.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/templates/controller/deployment.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/templates/webhook/deployment.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* Update charts/spark-operator-chart/templates/webhook/deployment.yaml

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>

* add additional securityContext to the controller deployment_test.yaml

Signed-off-by: Nick Gretzon <[email protected]>

---------

Signed-off-by: Nick Gretzon <[email protected]>
Signed-off-by: Nicholas Gretzon <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
(cherry picked from commit 72107fd)

* Fix: should not add emptyDir sizeLimit conf on executor pods if it is nil (#2316)

Signed-off-by: Cian Gallagher <[email protected]>
(cherry picked from commit 2999546)

* Bump `volcano.sh/apis` to 1.10.0 (#2320)

Signed-off-by: Jacob Salway <[email protected]>
(cherry picked from commit 22e4fb8)

* Truncate UI service name if over 63 characters (#2311)

* Truncate UI service name if over 63 characters

Signed-off-by: Jacob Salway <[email protected]>

* Also truncate ingress name

Signed-off-by: Jacob Salway <[email protected]>

---------

Signed-off-by: Jacob Salway <[email protected]>
(cherry picked from commit 43c1888)

* Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 (#2332)

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.28.0...0.29.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
(cherry picked from commit 270b09e)

* Bump github.com/onsi/ginkgo/v2 from 2.20.2 to 2.22.0 (#2335)

Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.20.2 to 2.22.0.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.20.2...v2.22.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
(cherry picked from commit 40423d5)

* The webhook-key-name command-line param isn't taking effect (#2344)

Signed-off-by: C. H. Afzal <[email protected]>
(cherry picked from commit a261523)

* Robustness to driver pod taking time to create (#2315)

* Retry after driver pod now found if recent submission

Signed-off-by: Thomas Newton <[email protected]>

* Add a test

Signed-off-by: Thomas Newton <[email protected]>

* Make grace period configurable

Signed-off-by: Thomas Newton <[email protected]>

* Update test

Signed-off-by: Thomas Newton <[email protected]>

* Add an extra test with the driver pod

Signed-off-by: Thomas Newton <[email protected]>

* Separate context to create and delete the driver pod

Signed-off-by: Thomas Newton <[email protected]>

* Tidy

Signed-off-by: Thomas Newton <[email protected]>

* Autoformat

Signed-off-by: Thomas Newton <[email protected]>

* Update error message

Signed-off-by: Thomas Newton <[email protected]>

* Add helm paramater

Signed-off-by: Thomas Newton <[email protected]>

* Update internal/controller/sparkapplication/controller.go

Co-authored-by: Yi Chen <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>

* Newlines between helm tests

Signed-off-by: Thomas Newton <[email protected]>

---------

Signed-off-by: Thomas Newton <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
(cherry picked from commit d815e78)

* Use NSS_WRAPPER_PASSWD instead of /etc/passwd as in spark-operator image entrypoint.sh (#2312)

Signed-off-by: Aakcht <[email protected]>
(cherry picked from commit 5dd91c4)

* Move sparkctl to cmd directory (#2347)

* Move spark-operator

Signed-off-by: Yi Chen <[email protected]>

* Move sparkctl to cmd directory

Signed-off-by: Yi Chen <[email protected]>

* Remove unnecessary app package/directory

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 2375a30)

* Spark Operator Official Release v2.1.0

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Co-authored-by: Aran Shavit <[email protected]>
Co-authored-by: Nicholas Gretzon <[email protected]>
Co-authored-by: Cian (Keen) Gallagher <[email protected]>
Co-authored-by: Jacob Salway <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: C. H. Afzal <[email protected]>
Co-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Aakcht <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Robustness to the driver pod taking some time to create
2 participants