[LIVY-588]: Full support for Spark on Kubernetes #167
base: master
Conversation
@vanzin please take a look.
Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.
Well, then I'll try to prepare as much as I can until you become available.
Codecov Report
@@ Coverage Diff @@
## master #167 +/- ##
============================================
- Coverage 68.6% 65.12% -3.48%
- Complexity 904 940 +36
============================================
Files 100 102 +2
Lines 5666 6291 +625
Branches 850 946 +96
============================================
+ Hits 3887 4097 +210
- Misses 1225 1614 +389
- Partials 554 580 +26
Continue to review full report at Codecov.
I'm going to experiment with this a bit: we're running Spark on Kubernetes widely and are also looking to migrate our notebook usage on top of Kubernetes. The benefits we see from Kubernetes are the elasticity, with the associated cost savings, and the ability to track and analyse the resource usage of individual jobs closely. From my quick glance at the source, what I will probably miss is more extensive support for customizing the created drivers (I assume that Livy creates the drivers as pods in the cluster, which then create the executors). In our current usage of Spark on Kubernetes we supply about 20 different --conf options to the driver, some of which carry job-specific information such as name and owner.
Sounds cool, I will be glad to assist you during the experiments. Maybe you can share the cases you are looking to solve; I'm sure this would be helpful for designing the requirements for the features to implement within this work. By the way, in the near future I'll prepare guidelines for the deployment, customization and usage options of Livy on Kubernetes. I will share the progress on that.
I built Livy on my own machine based on your branch and the Dockerfile in your repository. I got it running so that it created the driver pod, but I was unable to fully start the driver due to using my own spark image, which requires some configuration parameters to be passed in. Here's some feedback:
Unfortunately I don't know Scala really well, so I couldn't easily dig into the code to determine how this works, and so I'm not able to provide you with more detailed recommendations.
@garo Thanks for the review. Here are some explanations on your questions:
Could you provide an example of a job you want to run? I hope I will be able to show you the available solutions using that example.
Thank you very much for the detailed response! I'm just leaving for my Easter holiday, so I am not going to be able to actually try again until after that. However, I created this gist showing how we create the Spark drivers in our current workflow: we run Azkaban (like a glorified cron service), which runs our Spark applications. Each application (i.e. a scheduled cron execution) starts a Spark driver pod in Kubernetes. If you look at this gist https://gist.github.com/garo/90c6e69d2430ef7d93ca9f564ba86059 there is first a build of spark-submit configuration parameters, followed by the YAML for the driver pod. So I naturally tried to think about how I could use Livy to launch the same image with the same kind of settings. I think that with your explanations I can implement most if not all of these settings except the run_id. Let's continue this discussion after Easter. Have a great week!
Just to clarify to be on the same page...
Under the hood Livy just runs spark-submit for you:
Starting from Spark 2.4.0, spark-submit in cluster mode creates the driver pod, whose entrypoint runs spark-submit in client mode, just like you do in the gist. The Pushgateway sidecar may be deployed as a separate pod; just configure the Prometheus sink with the right pushgateway address. Have a good week!
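As an illustration only (not this PR's actual session builder; every value below is a placeholder), the Kubernetes-specific Spark settings that such a spark-submit invocation typically carries look roughly like this:

```scala
// Illustrative sketch: typical Kubernetes-related Spark settings passed through
// to spark-submit when launching in cluster mode. Namespace, image and service
// account values are placeholders, not defaults shipped with this PR.
val kubernetesConf = Map(
  "spark.master" -> "k8s://https://kubernetes.default.svc",
  "spark.submit.deployMode" -> "cluster",
  "spark.kubernetes.namespace" -> "livy",
  "spark.kubernetes.container.image" -> "example/spark:2.4.0",
  "spark.kubernetes.authenticate.driver.serviceAccountName" -> "spark"
)
```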
I'm getting the following error
Spark is running in a different namespace than Livy. The service is also created just before this message appears, so it does not seem to be an ordering error. Am I doing something wrong?
@lukatera At first look I see that you are either using a Livy build that is not from this PR (I've fixed a similar issue in that commit), or your Livy and/or Spark is not configured appropriately. I need to know more about your environment to move further. Could you please additionally provide some of the following:
Currently I run the Livy build from this PR's branch with the provided Helm charts and Docker images, both on Minikube for Windows and on Azure AKS, without issues. I will be happy to help; thanks for the feedback.
Thanks for the help! I was checking out the master branch from your repo instead of this specific one. All good now!
@lukatera
Great PR! One suggestion: maybe add the authenticated Livy user to both driver and executor pod labels. It should be simple enough since Spark already supports arbitrary labels through the submit command.
@igorcalabria
@jahstreet I'm not familiar with Livy's codebase, but I'm guessing that the param we want is
@igorcalabria
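Presumably the options in question are Spark's documented spark.kubernetes.driver.label.[LabelName] and spark.kubernetes.executor.label.[LabelName] settings. A rough server-side sketch (the helper name and label key are illustrative, not part of Livy's API):

```scala
// Hypothetical helper: tag driver and executor pods with the authenticated Livy
// user via Spark's per-pod label configs before the conf reaches spark-submit.
def withUserLabel(sparkConf: Map[String, String], user: String): Map[String, String] =
  sparkConf ++ Map(
    "spark.kubernetes.driver.label.livy-user" -> user,
    "spark.kubernetes.executor.label.livy-user" -> user
  )
```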
Thanks for the patch. I played around and things seem to work, though I faced a few issues with the batch mode.
Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.
Well, that's a good idea. Once this patch is accepted and merged, I would love to take care of that.
@jahstreet There's a minor issue when an interactive session is recovered from the filesystem. After a restart, Livy correctly recovers the session, but it stops displaying the Spark master's URL on the "Sessions" tab. The config used was pretty standard.
Livy impersonation does not seem to be working. I'm trying to use it with Jupyter and sparkmagic, with no luck.
However, I'm not familiar enough with Livy to say how this should work and whether it requires a kerberized HDFS cluster. I did see this in the driver logs, however:
I think the new configs also need to be added to the Livy conf template (https://github.com/apache/incubator-livy/blob/master/conf/livy.conf.template) with sufficient comments.
Separate documentation explaining the setup (like the "Running Spark on Kubernetes" guide) would definitely be helpful.
Actually I'm not familiar with Livy impersonation and do not know how it should behave. Maybe someone can clarify that?
@jahstreet Thanks a lot for your contribution. I'm wondering, do you have a design doc about K8s support in Livy?
Hi, it's in progress; you can track it here. Once finished, I'll update the PR and fix the build.
Hi @jahstreet, first of all, thanks for your efforts on Livy on K8s. We are testing it in our landscape. It works fine in most cases, but occasionally we get the following error message. Any idea why it happens? It creates new driver and executor pods without a session in Livy, even though the original session is alive and operational.
appTagLabel: String = SPARK_APP_TAG_LABEL,
appIdLabel: String = SPARK_APP_ID_LABEL
): Seq[KubernetesApplication] = {
client.pods.inAnyNamespace |
Hi!
In my opinion it would be good to have the ability to optionally configure which namespace Livy should check here. That way it would be possible to bind one Livy Server to one namespace, and the Livy Server Deployment wouldn't need permissions to list pods cluster-wide (there are no other places where Livy needs cluster-wide permissions at all).
Does this make sense?
Hey, I fully agree. This feature is enabled here: https://github.com/apache/incubator-livy/pull/249/files#diff-486ae8357b2836f1addaafe102d02287R169-R219 . I'm going to port the changes from #249 to this one a bit later.
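Roughly, the idea sketched against the fabric8 client (livyNamespace stands in for an assumed, not-yet-existing configuration option):

```scala
// Sketch: restrict the pod lookup to a configured namespace when one is set,
// falling back to the existing cluster-wide listing otherwise.
val podList =
  if (livyNamespace.nonEmpty) client.pods.inNamespace(livyNamespace).withLabel(appTagLabel).list
  else client.pods.inAnyNamespace.withLabel(appTagLabel).list
```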
@jahstreet Does this mean that the Spark on Kubernetes Helm chart only supports Kubernetes API <= 1.15.x? I am getting @JagadeeshNagella's first error now and I can't tell what's changed other than the Kubernetes version.
I was able to set up Livy using the Helm chart, but when I create a session it fails. I am using the default configuration with Minikube. Create session payload:
@kyprifog Indeed. I'm working on Spark 3 support in https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade, where this limit is extended to ">=1.11.0 <= 1.17.0".
@cekicbaris It would be nice to see the Spark driver logs during this failure; I believe it might be related to Livy <-> Spark driver communication. It might also be that the networking is not very stable in your environment; I'm not sure whether Livy does retries.
@jahstreet One more thing: if the session times out, the interactive session is deleted from Livy but the driver and executor pods are still running. However, if I delete the session with a DELETE request to the REST API, then it also deletes the pods.
I would leave that as a last resort; it's better to start by checking the driver logs at the moment of the failure and/or more logs from Livy to get an idea of what was happening before and after the failure. Otherwise it would be good to ask your network SREs for advice; they definitely know more about your setup.
I remember this bug in this PR, though I believe it has been fixed in #249. This PR is out of sync and not supported anymore, but once I get some free time from work I plan to backport the fix to this PR, and the issue should be gone then. Also, once I finish Spark 3 support in the Helm chart, I will update the images, which will solve that as well. Please refer to: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade
@jahstreet I'm pretty sure I was using something in the 1.17 range before and it still worked; can you confirm? I tried three different (what I thought were stable) versions of 1.15.x yesterday and couldn't get the cluster up and running with any of them. I think something has drifted in the AWS API since then that doesn't work with older Kubernetes versions, so I feel I would have better luck with 1.17.x unless you have a version that has been working for you with AWS. By the way, does your comment mean that Spark 2.4.5 support is working with 1.17? Maybe 2.4.5 just doesn't work with what I'm using now (1.18.8); 2.4.5 is good enough for my purposes. This compatibility matrix is kind of challenging because downgrading the K8s API on a cluster is somewhat involved. Luckily I have almost everything in Terraform.
It shouldn't work with K8s APIs > 1.15.3 (according to the fabric8 compatibility matrix). If it works for some commands, you are lucky, but you can then hit the moment when it won't work. The alternative is to upgrade the Spark fabric8 client dependency version and build a custom image; please refer to this comment to get an idea of what should be changed: https://stackoverflow.com/a/60052900/7947644
@jahstreet You said that you are working on Spark 3.0.0 support in that branch, but does that mean 2.4.5 already works with K8s 1.17+ on that branch? I'm trying to avoid downgrading to 1.15, and upgrading fabric8 doesn't seem like a savory option either.
@jahstreet Thanks for the quick reply. Please let me know how I can help/contribute.
Nope, Spark 2.4.x doesn't work with K8s > 1.15 w/o fabric8 upgrade. I'm not going to support that either.
The images are not yet finished and tested. I'll leave the announcement here and in #249 once released.
@jahstreet I probably won't use this until #249 is done because downgrading my K8s clusters proved to be a PITA. Is there anything aside from testing holding up that PR? Anywhere you are getting stuck that I can try to push along?
@kyprifog Good news for you: yesterday I upgraded the Helm charts to Spark 3.0.1, which unlocks K8s API 1.18.0 usage. Feel free to try it out with this guide.
Will do! I'll let you know if I hit any snags. Thanks for all the work you've put into this.
Scratch that, this actually works as I would expect using Spark conf. I will look into env-variable AWS config later, but it's not high priority and out of scope for incubator-livy.
I have a requirement to use the pod templates feature: https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
How can I add these configurations in a Livy request? spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile need to point to local files accessible to the spark-submit process. For a single tenant I can build a Livy image with hard-coded templates, but with multiple tenants we need these files to come from the client. How can we pass these configs? Could you check this?
Hi @JagadeeshNagella, unfortunately it is not possible with the setup you've described. The files need to be available to the Livy container and be
This is what I can currently come up with. I would also suggest reaching out to the Spark on Kubernetes community for advice.
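For reference, if the template files are already baked into the Livy image or mounted into the Livy container, the confs can in principle be passed per session via the conf field of the create-session request. A hypothetical example (host and paths are illustrative, and pod templates require a Spark 3.x image):

```bash
curl -X POST http://livy:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{
        "kind": "pyspark",
        "conf": {
          "spark.kubernetes.driver.podTemplateFile": "/opt/livy/templates/driver.yaml",
          "spark.kubernetes.executor.podTemplateFile": "/opt/livy/templates/executor.yaml"
        }
      }'
```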
@jahstreet Thank you for the response. Let me explore options.
Can the Livy maintainers shed some light on the remaining steps to merging this PR?
@jpugliesi This one can be closed but is still kept for reference as the initial idea of the feature. The first PR to merge should be #249, as the first step towards integrating Livy with the K8s RM.
@jahstreet One question about configuring the Spark UI in Livy/sparkmagic: with this current implementation, it seems that a sparkmagic notebook's ... Have you figured out a way to construct this notebook link to point to a proxy that will either:
Appreciate your help!
Hi @jpugliesi, the routing should indeed work as you described, and if it does not then there is a bug in the implementation. Unfortunately, since the maintainers are not looking to continue with this work, I'm not going to keep maintaining the PRs either. The best I can advise is to look into the #249 and #252 PRs and into the https://github.com/JahstreetOrg/spark-on-kubernetes-helm and https://github.com/JahstreetOrg/spark-on-kubernetes-docker repositories, where the latest state of the project is stored along with some guidance on using it.
@jahstreet Huge thanks for your efforts on this; it saved us a few days. I could have saved a few more had I found it sooner :-D Anyway, we're seeing some weird behaviour when the Spark driver connects to Livy RPC. We see the following in the Livy logs: The local (L) side is fine, as that's where Livy's RPC server is running, but the remote (R) side is incorrect. The remote should be the driver pod, not what we're seeing. To get around this, I ended up making the following change, and now, instead of connecting to the remote side of the channel, it tries to connect to the hostname communicated inside the RPC messages. Now I wonder:
Hi @prongs, thank you for joining the party 🎉.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #167 +/- ##
=============================================
- Coverage 68.48% 26.21% -42.27%
+ Complexity 840 358 -482
=============================================
Files 103 104 +1
Lines 5940 6378 +438
Branches 898 959 +61
=============================================
- Hits 4068 1672 -2396
- Misses 1312 4364 +3052
+ Partials 560 342 -218
☔ View full report in Codecov by Sentry.
NOTE: this PR is deprecated and kept for discussion history only. Please refer to #249 for the latest state of the work.
What changes were proposed in this pull request?
This PR is a new feature proposal: full support for Spark on Kubernetes (inspired by the SparkYarnApp implementation).
Since Spark on Kubernetes was released quite a while ago, it seems a good idea to include Kubernetes support in the Livy project as well. It can solve many problems related to working with Spark on Kubernetes and can fully replace YARN when running atop a Kubernetes cluster:
basePath support for Spark UI and History Server, as well as lots of auth integrations available: https://github.com/kubernetes/ingress-nginx
Dockerfiles repo: https://github.com/jahstreet/spark-on-kubernetes-docker
Helm charts: https://github.com/jahstreet/spark-on-kubernetes-helm
Associated JIRA: https://issues.apache.org/jira/browse/LIVY-588
Design concept: https://github.com/jahstreet/spark-on-kubernetes-helm/blob/develop/README.md
How was this patch tested?
Tested manually on an AKS cluster (Azure Kubernetes Service), Kubernetes v1.11.8:
What do you think about that?