Spark Operator Roadmap 2024 #2193

Open
1 of 8 tasks
ChenYi015 opened this issue Sep 26, 2024 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@ChenYi015
Contributor

ChenYi015 commented Sep 26, 2024

Roadmap

Creating this roadmap issue to track work items that we will do in the future. If you have any ideas, please leave a comment.

Features

Chores

  • Doc improvement
  • Increase test coverage, particularly e2e tests, to improve confidence in releases
@ChenYi015 ChenYi015 pinned this issue Sep 26, 2024
@jacobsalway
Member

jacobsalway commented Sep 29, 2024

Some ideas:

  • A new CR to support Spark Connect
  • An HTTP API for job submission
  • A web UI for visibility into currently running applications
  • Deprecate the need for a mutating webhook by moving all functionality into the pod template
  • Controller performance improvements and recommendations for large scale clusters

Chores:

  • Increase test coverage, particularly e2e tests, to improve confidence in releases
  • Doc improvements

@ChenYi015 ChenYi015 added help wanted Extra attention is needed good first issue Good for newcomers enhancement New feature or request labels Sep 29, 2024
@cccsss01

cccsss01 commented Oct 5, 2024

Upgrade default security posture
Remove reliance on user ID 185
(it seems to be connected to the krb5.conf file, which leverages domains and realms of institutions that may not need it).

@josecsotomorales
Contributor

@jacobsalway @ChenYi015 I think that "Deprecate the need for a mutating webhook by moving all functionality into the pod template" should be a top priority, especially with the upcoming release of Spark v4

@gangahiremath

gangahiremath commented Oct 11, 2024

@bnetzi, @vara-bonthu, regarding the point "referring you to the discussion here, I think we just need to provide in general more options to configure the controller runtime, and that my PR is irrelevant":

Does this mean that "one queue per app and one goroutine per app" (#1990) is not a solution to the performance issue we faced?

Is #2186 a solution for the same issue?

Do we see an opportunity for performance improvement with the approach that we have tried? - #1574 (comment)
Summary of changes:
- Port spark-submit to Golang.
- This removes the JVM invocation, so submission is faster.
- No dependency on Apache Spark (the frequency and quantity of changes to the driver pod are expected to be minimal in future Apache Spark releases).
We are happy to contribute this work to open source. Our Golang port of spark-submit is well tested in our setup. Please let us know.

@c-h-afzal
Contributor

c-h-afzal commented Oct 15, 2024

@gangahiremath - I think the two improvements aren't mutually exclusive. Given the testing done by @bnetzi and captured in this document, it seems that the one mutex per queue does have performance benefits. I also think that using Go instead of Java-based submission can help reduce job submission latency. However, as pointed out by @bnetzi, using Go would require corresponding changes to Spark Operator whenever spark-submit changes, and may also introduce functionality gaps. We can probably include both improvements in the roadmap if the performance hit from the JVM is significant enough.

It would be great if other users could share whether JVM spin-up times were indeed a contributor to job submission latency, and whether anyone has tuned the JVM specifically to alleviate this pain point. Thanks.
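
For anyone who wants to contribute numbers to the question above, here is a rough sketch of one way to get a lower bound on JVM spin-up cost: time a trivial `spark-submit --version` invocation from Go. It assumes `spark-submit` is on the `PATH`; the helper `timeCommand` is hypothetical and not part of the operator, and a real submission does considerably more work than printing the version.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// timeCommand (hypothetical helper) runs an external command to completion
// and reports the wall-clock time it took, along with any error from the run.
func timeCommand(name string, args ...string) (time.Duration, error) {
	start := time.Now()
	err := exec.Command(name, args...).Run()
	return time.Since(start), err
}

func main() {
	const runs = 3
	for i := 0; i < runs; i++ {
		// Assumption: spark-submit is installed and on the PATH.
		elapsed, err := timeCommand("spark-submit", "--version")
		if err != nil {
			fmt.Println("spark-submit failed:", err)
			return
		}
		fmt.Printf("run %d: %v\n", i+1, elapsed)
	}
}
```

Repeating the run a few times helps separate cold-start effects (disk cache, class loading) from the steady-state startup cost.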

@gangahiremath

@gangahiremath - I think the two improvements aren't mutually exclusive. Given the testing done by @bnetzi and captured in this document, it seems that the one mutex per queue does have performance benefits. I also think that using Go instead of Java-based submission can help reduce job submission latency. However, as pointed out by @bnetzi, using Go would require corresponding changes to Spark Operator whenever spark-submit changes, and may also introduce functionality gaps. We can probably include both improvements in the roadmap if the performance hit from the JVM is significant enough.

It would be great if other users could share whether JVM spin-up times were indeed a contributor to job submission latency, and whether anyone has tuned the JVM specifically to alleviate this pain point. Thanks.

@c-h-afzal, FYI, see the point made by bnetzi in thread #1990 (comment): "So the way I see it - work queue per app might no longer be the solution".

@bnetzi

bnetzi commented Nov 27, 2024

An update: we are still working on v2-compatible performance enhancements, and we expect to share our results in mid-December.

As for deprecating the webhook, my concern is with changes that cannot be applied via pod templates.
Quoting the Spark docs (https://spark.apache.org/docs/3.5.3/running-on-kubernetes.html#pod-template):

It is important to note that Spark is opinionated about certain pod configurations so there are values in the pod template that will always be overwritten by Spark. Therefore, users of this feature should note that specifying the pod template file only lets Spark start with a template pod instead of an empty pod during the pod-building process. For details, see the full list of pod template values that will be overwritten by spark.

For example, I personally think Spark is making a huge mistake in preventing users from configuring a memory request that differs from the memory limit. It is a valid configuration for many use cases.
With the webhook we are able to override this behavior. I have an example here, which I intended to push as part of the performance PR https://github.com/kubeflow/spark-operator/pull/1990/files; you can see the changes I made in patch.go.

In our environment, where memory peaks are very high but average memory use is low, this avoided a huge over-allocation and the overhead of tuning the memory request efficiently for each Spark app.
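
To make the webhook-based override concrete, here is a minimal sketch of the kind of JSON Patch (RFC 6902) operation a mutating webhook could return to set a container's memory request independently of its limit. This is not the code from patch.go in the PR above; `patchOperation` and `memoryRequestPatch` are hypothetical names, and the sketch assumes the pod already has a `resources.requests` map on the target container.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOperation is one RFC 6902 JSON Patch operation, the format a
// Kubernetes mutating webhook uses to describe changes to a pod.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// memoryRequestPatch (hypothetical helper) sets the memory request of the
// container at containerIndex and leaves the memory limit untouched.
// Assumption: /spec/containers/<i>/resources/requests already exists.
func memoryRequestPatch(containerIndex int, request string) patchOperation {
	return patchOperation{
		Op:    "add", // "add" on an existing key replaces its value
		Path:  fmt.Sprintf("/spec/containers/%d/resources/requests/memory", containerIndex),
		Value: request,
	}
}

func main() {
	// Example: lower the first container's memory request to 4Gi while
	// keeping whatever limit was set on the pod.
	patch := []patchOperation{memoryRequestPatch(0, "4Gi")}
	out, err := json.Marshal(patch)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the patch is applied to the pod at admission time, after Spark has built it, it can change fields that Spark would otherwise overwrite in a pod template.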

Unfortunately, I can't participate in the community meeting because the time does not work for my timezone, so I hope my voice will be heard here.

@vara-bonthu


7 participants