
[LIVY-588]: Full support for Spark on Kubernetes #167

Open · wants to merge 1 commit into base: master

Conversation

@jahstreet (Contributor) commented Apr 9, 2019

NOTE: this PR is deprecated and kept for discussion history only. Please refer to #249 for the latest state of the work.

What changes were proposed in this pull request?

This PR is a new feature proposal: full support for Spark on Kubernetes (inspired by SparkYarnApp implementation).

Since Spark on Kubernetes was released quite a while ago, it would be a good idea to include Kubernetes support in the Livy project as well. It can solve many problems related to working with Spark on Kubernetes and can fully replace YARN when running atop a Kubernetes cluster:

  • Livy UI gets a cached logs/diagnostics page
  • Livy UI shows links to the Spark UI and Spark History Server
  • With the Kubernetes Ingress resource, Livy can be configured to serve as an orchestrator of Spark Apps atop Kubernetes (the PR includes an Nginx Ingress support option to create routes to the Spark UI)
  • Nginx Ingress solves basePath support for the Spark UI and History Server, and has lots of auth integrations available: https://github.com/kubernetes/ingress-nginx
  • The Livy UI can be integrated with Grafana Loki logs (the PR provides a solution for that)

Dockerfiles repo: https://github.com/jahstreet/spark-on-kubernetes-docker
Helm charts: https://github.com/jahstreet/spark-on-kubernetes-helm

Associated JIRA: https://issues.apache.org/jira/browse/LIVY-588

Design concept: https://github.com/jahstreet/spark-on-kubernetes-helm/blob/develop/README.md
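
For illustration, a hedged sketch of what pointing Livy at a Kubernetes master could look like inside the Livy container. The spark-defaults keys are standard Spark-on-Kubernetes settings; livy.server.kubernetes.ingress.create is the ingress option discussed later in this thread, and the exact set of Livy-side keys may differ in the final patch:

    # /opt/spark/conf/spark-defaults.conf (inside the Livy container)
    spark.master                      k8s://https://<k8s-api-server>:443
    spark.submit.deployMode           cluster
    spark.kubernetes.container.image  <spark-image>
    spark.kubernetes.namespace        default

    # /opt/livy/conf/livy.conf: expose routes to the Spark UI through an Nginx Ingress
    livy.server.kubernetes.ingress.create = true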

How was this patch tested?

Tested manually on an AKS (Azure Kubernetes Service) cluster, Kubernetes v1.11.8.

What do you think about it?

@jahstreet (Contributor Author)

@vanzin please take a look.

@jahstreet changed the title from "[WIP][LIVY-KUBERNETES]: Full support for Spark on Kubernetes" to "[LIVY-588][WIP]: Full support for Spark on Kubernetes" on Apr 10, 2019
@vanzin commented Apr 10, 2019

Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.

@jahstreet (Contributor Author)

> Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.

Well, then I'll try to prepare as much as I can until you become available.
I hope someone from the community will be able to share feedback on the work done.

@codecov-io commented Apr 10, 2019

Codecov Report

Merging #167 into master will decrease coverage by 3.47%.
The diff coverage is 26.71%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #167      +/-   ##
============================================
- Coverage      68.6%   65.12%   -3.48%     
- Complexity      904      940      +36     
============================================
  Files           100      102       +2     
  Lines          5666     6291     +625     
  Branches        850      946      +96     
============================================
+ Hits           3887     4097     +210     
- Misses         1225     1614     +389     
- Partials        554      580      +26
Impacted Files Coverage Δ Complexity Δ
...e/livy/server/interactive/InteractiveSession.scala 68.75% <0%> (-0.37%) 46 <0> (+2)
...rver/src/main/scala/org/apache/livy/LivyConf.scala 96.46% <100%> (+0.6%) 22 <1> (+1) ⬆️
...ala/org/apache/livy/utils/SparkKubernetesApp.scala 20.36% <20.36%> (ø) 0 <0> (?)
...main/scala/org/apache/livy/server/LivyServer.scala 32.43% <33.33%> (-3.53%) 11 <0> (ø)
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java 79.25% <50%> (+1.28%) 45 <0> (+4) ⬆️
...rc/main/scala/org/apache/livy/utils/SparkApp.scala 67.5% <55.55%> (-8.5%) 1 <0> (ø)
...in/scala/org/apache/livy/repl/SQLInterpreter.scala 62.5% <0%> (-7.88%) 9% <0%> (+2%)
...ain/scala/org/apache/livy/utils/SparkYarnApp.scala 66.01% <0%> (-7.23%) 40% <0%> (+7%)
...n/scala/org/apache/livy/server/AccessManager.scala 75.47% <0%> (-5.38%) 46% <0%> (+2%)
...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala 52.94% <0%> (-2.95%) 7% <0%> (ø)
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dee3cc...7f6ef8a. Read the comment docs.

@garo commented Apr 17, 2019

I'm going to experiment with this a bit: we're running Spark on Kubernetes widely and we are looking to also migrate our notebook usage on top of Kubernetes. The benefits we are seeing from Kubernetes are the elasticity with the associated cost savings, and the ability to track and analyse the resource usage of individual jobs closely.

From my quick glance at the source, I will probably be missing more extensive support for customizing the created drivers (I assume that Livy creates the drivers as pods in the cluster, which then create the executors). In our current usage of Spark on Kubernetes we supply about 20 different --conf options to the driver, some of which carry job-specific information such as name and owner.

@jahstreet (Contributor Author)

> I'm going to experiment with this a bit: we're running Spark on Kubernetes widely and we are looking to also migrate our notebook usage on top of Kubernetes. [...]
>
> From my quick glance at the source, I will probably be missing more extensive support for customizing the created drivers. [...]

Sounds cool, I will be glad to assist you during the experiments. Maybe you can share the cases you are looking for a solution to; I'm sure this would be helpful for designing the requirements for the features to implement within this work.

By the way, in the near future I'll prepare guidelines for the deployment, customization, and usage options of Livy on Kubernetes. I will share the progress on that.

@garo commented Apr 18, 2019

I built Livy on my own machine based on your branch and the Dockerfile in your repository. I got it running so that it created the driver pod, but I was unable to fully start the driver due to using my own spark image, which requires some configuration parameters to be passed in.

Here's some feedback:

  • Helm chart doesn't allow to specify "serviceAccount" property for Livy.

  • Couldn't find a way to set the namespace which Livy must use. It seems to want to search all pods in all namespaces. I also need to set the namespace where the pods are created (it seems to be fixed to "default").

  • Could you provide a way to fully customise the driver pod specification? I would want to set custom volumes and volume-mounts, environment variables, labels, sidecar containers and possibly even customise the command line arguments for the driver.

  • Also a way to provide custom spark configuration settings for the driver pod would be required.

  • Support for macros for both customising the driver pod and the extra spark configuration options. I would at least need the id of the livy session (eg. "livy-session-2-9SZP8Ijv") to be inserted to both the pod template and the spark configuration options.

Unfortunately I don't know Scala really well, so I couldn't easily dig into the code to determine how this works, and thus I'm unable to provide you with more detailed recommendations.

@jahstreet (Contributor Author) commented Apr 18, 2019

@garo Thanks for the review.

Here are some explanations of your questions:

  • The first version of the chart was done without RBAC support. I've just finished an RBAC support solution for the Livy chart but have not merged it yet; you can refer to the feature branch https://github.com/jahstreet/spark-on-kubernetes-helm/blob/charts/livy/rbac-support/charts/livy/values.yaml:

    serviceAccount:
      # Specifies whether a service account should be created
      create: true
      # The name of the service account to use.
      # If not set and create is true, a name is generated using the fullname template
      name:

  • Livy searches for the Driver Pod in all namespaces (theoretically the user may want to submit the job to any namespace) only the first time, to initialize the KubernetesApplication object; then it uses that object (which contains the namespace field) to get the Spark Pods' states and logs and looks for that information only within the single target namespace (I've added comments to the lines where this logic is implemented).

  • By default Livy should submit the Spark App to the default namespace (if it does not do so, then I need to make a fix ;) ). You can change that behavior by adding spark.kubernetes.namespace=<desired_namespace> to /opt/spark/conf/spark-defaults.conf in the Livy container. The Livy entrypoint can set spark-defaults configs from env variables, so you can set the Livy container env LIVY_SPARK_KUBERNETES_NAMESPACE=<desired_namespace> to change the Spark Apps' default namespace. In the new version of the Livy chart I set it to .Release.Namespace. And of course you can pass it as an additional conf on App submission within the POST request to Livy: { ... "conf": { "spark.kubernetes.namespace": "<desired_namespace>" }, ... } to overwrite the defaults (see the sketch after this list).

  • Please refer to the Livy customization explanations: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/master/charts/livy#customizing-livy-server
    Following that approach you can set any config defaults for both Livy and Spark. If you need to overwrite some, do that on job submission in the POST request body.

  • To customize the Driver Pod spec we would need a custom build of Spark installed in the Livy image (Livy just runs spark-submit). I refer to the official releases of Apache Spark and do not see available options for that at present (including adding sidecars). But it has configs to set volumes and volume mounts, environment variables, and labels (https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties), which we can set as default values as I described before.

  • Customizing the command line arguments for the driver: do you mean application args? You can pass them on job submission in the POST request body: https://livy.incubator.apache.org/docs/latest/rest-api.html
    Or do you need custom spark-submit script options?

  • Why do you need the id of the Livy session and what kind of macros for both customizing the driver pod and the extra spark configuration options do you mean?

Could you provide an example of a job you want to run? I hope I will be able to show you the available solutions using that example.
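
A minimal sketch of the namespace-override options described above (the namespace value is a placeholder; the jar path and endpoint match the examples elsewhere in this thread):

    # Option 1: bake the default into spark-defaults.conf through the image entrypoint env var
    export LIVY_SPARK_KUBERNETES_NAMESPACE=spark-jobs   # written to /opt/spark/conf/spark-defaults.conf on container start

    # Option 2: override per submission in the POST request body
    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
            "className": "org.apache.spark.examples.SparkPi",
            "conf": { "spark.kubernetes.namespace": "spark-jobs" }
          }' "http://localhost:8998/batches"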

@garo commented Apr 18, 2019

Thank you very much for the detailed response! I'm just leaving for my Easter holiday, so I am not going to be able to try again until after that.

However, I created this gist showing how we create the Spark drivers in our current workflow: we run Azkaban (like a glorified cron service) which runs our Spark applications. Each application (i.e. a scheduled cron execution) starts a Spark driver pod in Kubernetes. If you look at this gist https://gist.github.com/garo/90c6e69d2430ef7d93ca9f564ba86059 there is first a build of spark-submit configuration parameters, followed by the YAML for the driver pod.

So I naturally tried to think about how I can use Livy to launch the same image with the same kind of settings. I think that with your explanations I can implement most if not all of these settings except the run_id.

Let's continue this discussion after Easter. Have a great week!

@jahstreet (Contributor Author) commented Apr 18, 2019

Just to clarify, so that we are on the same page...
When you send a request to Livy, e.g.:

kubectl exec livy-pod -- curl -H 'Content-Type: application/json' -X POST \
  -d '{
        "name": "spark-pi",
        "proxyUser": "livy_user",
        "numExecutors": 2,
        "conf": {
          "spark.kubernetes.container.image": "sasnouskikh/spark:2.4.1-hadoop_3.2.0",
          "spark.kubernetes.container.image.pullPolicy": "Always",
          "spark.kubernetes.namespace": "default"
        },
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
        "className": "org.apache.spark.examples.SparkPi",
        "args": ["1000000"]
      }' "http://localhost:8998/batches"

Under the hood livy just runs spark-submit for you:

spark-submit \
  --master k8s://https://<k8s_api_server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=sasnouskikh/spark:2.4.1-hadoop_3.2.0 \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.namespace=default \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar 1000000

Starting from Spark 2.4.0, spark-submit in cluster mode creates the Driver Pod, whose entrypoint runs spark-submit in client mode, just like you try to do in the gist.
So I do not see why you would want to deploy a customized Driver Pod in that particular case.
Most of the --conf options can be moved to defaults and you will have a neat JSON body.

The Pushgateway sidecar may be deployed as a separate Pod; just configure the Prometheus sink with the right pushgateway address.
All other configs for Driver Pod customization are already covered by the docs for Spark on Kubernetes.

Have a good week!

@ghost commented Apr 23, 2019

I'm getting the following error

19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO Client: Deployed Spark application livy-session-0 into Kubernetes.
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Shutdown hook called
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-62b7810e-667d-47e7-9940-72f8cd5f91e9
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 app state changed from RUNNING to FINISHED
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 session state change from starting to dead
19/04/23 16:26:10 DEBUG AbstractByteBuf: -Dio.netty.buffer.bytebuf.checkAccessible: true
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.maxRecords: 4
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 262144
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.linkCapacity: 16
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (41 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (98 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (275 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (45 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: SASL negotiation finished with QOP auth.
19/04/23 16:26:10 DEBUG ContextLauncher: New RPC client connected from [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008].
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress (94 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: [RegistrationHandler] Received RPC message: type=CALL id=0 payload=org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress
19/04/23 16:26:10 DEBUG ContextLauncher: Received driver info for client [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008]: livy-session-0-1556036763266-driver/10000.
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$NullMessage (2 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: Channel [id: 0x2ae2b51a, L:/10.233.94.163:10000 ! R:/10.233.94.164:39008] became inactive.
19/04/23 16:26:10 ERROR RSCClient: Failed to connect to context.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
	at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 ERROR RSCClient: RPC error.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
	at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 INFO RSCClient: Failing pending job d509417c-c894-416d-8218-625b278da8b7 due to shutdown.

Spark is running in a different namespace than Livy. The Service is also created just before this message appears, so it does not seem to be an ordering error. Am I doing something wrong?

@jahstreet (Contributor Author)

@lukatera
Good day,

At first glance, I see that you are either using a Livy build that is not from this PR (I've fixed a similar issue in that commit), or your Livy and/or Spark is not configured appropriately.

I need to know more about your environment to move further. Could you please additionally provide some of the following:

  • What is your Kubernetes installation and version?
  • What Docker images do you use? What version of Spark is running (this Livy was tested with Spark 2.4.0+, 2.3.* wasn't good enough and had some unpleasant bugs)? From what commit have you built Livy (if you did so)?
  • What are the livy.conf and livy-client.conf content (/opt/livy/conf/...)?
  • What are the Spark Job configs: kubectl describe configmap <spark-driver-pod-conf-map> -n <spark-job-namespace>? What is the JSON body you post to create a session?
  • What are the Spark Driver Pod logs?
  • If you use Helm charts - what are the versions and what are the custom values you provide on install?
  • Maybe some more debugging info you feel may be related?

Currently I run Livy build from this PR's branch with the provided Helm charts and Docker images both on Minikube for Windows and on Azure AKS without issues.

Will be happy to help, thanks for the feedback.
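
A sketch of commands for collecting the information from the checklist above (pod, configmap, release, and namespace names are placeholders):

    kubectl -n <livy-namespace> exec <livy-pod> -- cat /opt/livy/conf/livy.conf /opt/livy/conf/livy-client.conf
    kubectl -n <spark-job-namespace> describe configmap <spark-driver-pod-conf-map>
    kubectl -n <spark-job-namespace> logs <spark-driver-pod>
    helm get values <livy-release-name>   # if installed from the Helm charts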

@ghost commented Apr 24, 2019


Thanks for the help! I was checking out master branch from your repo instead of this specific one. All good now!

@jahstreet (Contributor Author)

@lukatera
Cool, nice to know that. Do not hesitate to ask if you face any problems on that.

@igorcalabria

Great PR! One suggestion: maybe add the authenticated Livy user to both the driver and executor pods' labels. It should be simple enough since Spark already supports arbitrary labels through the submit command option spark.kubernetes.driver.label.[LabelName].
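
For context, a hedged sketch of what that could look like at submit time, using the standard Spark-on-Kubernetes label confs and the spark-submit example from earlier in this thread (the livy-user label key and the user value are purely illustrative):

    spark-submit \
      --master k8s://https://<k8s_api_server>:443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.kubernetes.container.image=sasnouskikh/spark:2.4.1-hadoop_3.2.0 \
      --conf spark.kubernetes.driver.label.livy-user=<authenticated-user> \
      --conf spark.kubernetes.executor.label.livy-user=<authenticated-user> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar 1000000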

@jahstreet (Contributor Author)

@igorcalabria
Thanks for the feedback; what mechanism of getting the Livy user value do you propose? I see an option of setting those labels with the proxyUser value on Spark job submission from the POST request to Livy. Did you mean that?

@igorcalabria

> @igorcalabria
> Thanks for the feedback; what mechanism of getting the Livy user value do you propose? I see an option of setting those labels with the proxyUser value on Spark job submission from the POST request to Livy. Did you mean that?

@jahstreet
It could be that, but I was thinking about the authenticated user (via Kerberos) making the requests. To give you more context, this could be great for resource usage tracking, especially if Livy has more info available about the principal, like groups or even teams.

I'm not familiar with livy's codebase, but I'm guessing that the param we want is owner on the Session classes:

@jahstreet (Contributor Author)

@igorcalabria
Oh, I see, will try that, thanks.

@arunmahadevan (Contributor) left a comment

Thanks for the patch. I played around and things seem to work though I faced a few issues with the batch mode.

Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.

@jahstreet (Contributor Author)

> Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.

Well, that's a good idea. Once this patch is accepted and merged I would love to take care of that.
Thanks for your feedback.

@igorcalabria

@jahstreet There's a minor issue when an interactive session is recovered from the filesystem. After a restart, Livy correctly recovers the session, but it stops displaying the Spark master's URL on the "Sessions" tab. The config used was pretty standard:

    livy.server.recovery.mode = recovery
    livy.server.recovery.state-store = filesystem
    livy.server.recovery.state-store.url = ...

@ghost commented Jun 6, 2019

Livy impersonation seems to not be working. I'm trying to use it with Jupyter and sparkmagic with no luck.

%%configure -f
{
    "proxyUser": "customUser"
}

However, I'm not familiar with Livy enough to say how this should work and if it requires kerberized HDFS cluster.

If I set HADOOP_USER_NAME env variable on the driver and the executor, it runs stuff on top of hadoop as that user.

I saw this in the driver logs however:

19/06/06 14:53:11 DEBUG UserGroupInformation: PrivilegedAction as:customUser (auth:PROXY) via root (auth:SIMPLE) from:org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:150)
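
For reference, one way to set that env var on the driver and executors through standard Spark-on-Kubernetes conf keys (a sketch; whether this is an acceptable substitute for proper impersonation is a separate question):

    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "kind": "pyspark",
            "proxyUser": "customUser",
            "conf": {
              "spark.kubernetes.driverEnv.HADOOP_USER_NAME": "customUser",
              "spark.executorEnv.HADOOP_USER_NAME": "customUser"
            }
          }' "http://localhost:8998/sessions"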

@arunmahadevan (Contributor) left a comment

I think the new configs also need to be added to the Livy conf template (https://github.com/apache/incubator-livy/blob/master/conf/livy.conf.template) with sufficient comments.

Separate documentation explaining the setup (like the Running Spark on Kubernetes guide) would definitely be helpful.

@jahstreet (Contributor Author)

> Livy impersonation seems to not be working. I'm trying to use it with Jupyter and sparkmagic with no luck. [...]
>
> If I set the HADOOP_USER_NAME env variable on the driver and the executor, it runs stuff on top of Hadoop as that user. [...]

Actually I'm not familiar with Livy impersonation and do not know how it should behave. Maybe someone can clarify that?

@jerryshao (Contributor)

@jahstreet thanks a lot for your contribution. I'm wondering, do you have a design doc about K8s support in Livy?

@jahstreet (Contributor Author)

> Hi @jahstreet, any update on Spark 3.0? The build failed on the spark-3.0 profile.

Hi, in progress, you can track it here. Once finished I'll update the PR and fix the build.

@cekicbaris commented Sep 12, 2020

Hi @jahstreet ,

First of all, thanks for your efforts on Livy on K8S. We are testing it in our landscape. It works fine for most cases, but occasionally we get the following error message. Any idea why it happens? It creates new driver and executor pods without a session in Livy, even though the original session is alive and operational.

2020-09-12T20:05:05.271309681Z 2020-09-12 20:05:05 WARN RSCClient:127 - Client RPC channel closed unexpectedly.
2020-09-12T20:05:05.271758816Z 2020-09-12 20:05:05 WARN RSCClient:234 - Error stopping RPC.
2020-09-12T20:05:05.271776249Z io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@1ee211ac(uncancellable)
2020-09-12T20:05:05.27178287Z at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:394)
2020-09-12T20:05:05.271788878Z at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
2020-09-12T20:05:05.271794021Z at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:230)
2020-09-12T20:05:05.271798608Z at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:129)
2020-09-12T20:05:05.271803528Z at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:28)
2020-09-12T20:05:05.271808597Z at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:336)
2020-09-12T20:05:05.271813574Z at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:117)
2020-09-12T20:05:05.271818427Z at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:28)
2020-09-12T20:05:05.271823729Z at org.apache.livy.rsc.rpc.Rpc.close(Rpc.java:310)
2020-09-12T20:05:05.271828282Z at org.apache.livy.rsc.RSCClient.stop(RSCClient.java:232)
2020-09-12T20:05:05.271833421Z at org.apache.livy.rsc.RSCClient$2$1.onSuccess(RSCClient.java:129)
2020-09-12T20:05:05.271838476Z at org.apache.livy.rsc.RSCClient$2$1.onSuccess(RSCClient.java:123)
2020-09-12T20:05:05.271843788Z at org.apache.livy.rsc.Utils$2.operationComplete(Utils.java:108)
2020-09-12T20:05:05.271849251Z at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:518)
2020-09-12T20:05:05.271853952Z at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:492)
2020-09-12T20:05:05.271886201Z at io.netty.util.concurrent.DefaultPromise.notifyListenersWithStackOverFlowProtection(DefaultPromise.java:431)
2020-09-12T20:05:05.271892906Z at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
2020-09-12T20:05:05.271897634Z at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:108)
2020-09-12T20:05:05.271902407Z at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
2020-09-12T20:05:05.271907054Z at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:995)
2020-09-12T20:05:05.271911776Z at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:621)
2020-09-12T20:05:05.271916658Z at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:599)
2020-09-12T20:05:05.271921391Z at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:543)
2020-09-12T20:05:05.271927411Z at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.closeOnRead(AbstractNioByteChannel.java:71)
2020-09-12T20:05:05.271932031Z at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:158)
2020-09-12T20:05:05.271935259Z at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
2020-09-12T20:05:05.271938426Z at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505)
2020-09-12T20:05:05.271941673Z at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419)
2020-09-12T20:05:05.27194499Z at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391)
2020-09-12T20:05:05.271948247Z at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
2020-09-12T20:05:05.27195149Z at java.lang.Thread.run(Thread.java:748)

    appTagLabel: String = SPARK_APP_TAG_LABEL,
    appIdLabel: String = SPARK_APP_ID_LABEL
    ): Seq[KubernetesApplication] = {
    client.pods.inAnyNamespace
@idzikovsky (Contributor) commented Sep 18, 2020

Hi!
In my opinion it would be good to have the ability to optionally configure which namespace Livy should check here, so it would be possible to bind one Livy Server to one namespace; then the Deployment with the Livy Server wouldn't need permissions to list pods cluster-wide (and there are no other places where Livy needs cluster-wide permissions at all).

Does this make sense?
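
For reference, a hedged sketch of the namespace-scoped permissions such an option would allow instead of a cluster-wide binding (role, namespace, and service account names are placeholders):

    kubectl -n <livy-namespace> create role livy-pods \
      --verb=get,list,watch,create,delete --resource=pods,pods/log
    kubectl -n <livy-namespace> create rolebinding livy-pods \
      --role=livy-pods --serviceaccount=<livy-namespace>:<livy-serviceaccount>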

@jahstreet (Contributor Author)

Hey, truly agree. This feature is enabled here: https://github.com/apache/incubator-livy/pull/249/files#diff-486ae8357b2836f1addaafe102d02287R169-R219 . Going to port the changes from #249 to this one a bit later.

@kyprifog commented Sep 24, 2020

@jahstreet does this mean that the spark-on-kubernetes-helm chart only supports Kubernetes API <= 1.15.X? I am getting @JagadeeshNagella's first error now and I can't tell what's changed other than the Kubernetes version.

@KamalGalrani

I was able to set up Livy using the Helm chart, but when I create a session it fails. I am using the default configuration with Minikube.

Create session payload

{
    "kind": "pyspark",
    "name": "test-session1234",
    "conf": {
      "spark.kubernetes.namespace": "livy"
    }
}
20/09/25 04:00:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [livy]  failed.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
	at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
	at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894)
	at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865)
	at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
	at okio.Okio$1.write(Okio.java:79)
	at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
	at okio.RealBufferedSink.flush(RealBufferedSink.java:224)
	at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203)
	at okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:515)
	at okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:505)
	at okhttp3.internal.connection.RealConnection.startHttp2(RealConnection.java:298)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:287)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:168)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
	at okhttp3.RealCall.execute(RealCall.java:92)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:411)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:819)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:334)
	... 17 more
20/09/25 04:01:00 INFO ShutdownHookManager: Shutdown hook called
20/09/25 04:01:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-343d41df-d58c-4ed4-8a03-2eabbc21da1d

Kubernetes Diagnostics: 
Operation: [list]  for kind: [Pod]  with name: [null]  in namespace: [null]  failed.
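
A quick way to check whether this is an RBAC problem for the service account Livy submits with (the namespace is taken from the error above; the service account name is an assumption about the default chart install). Given the broken-pipe TLS failure, the Spark-vs-Kubernetes client compatibility discussed just below may also be the cause:

    kubectl auth can-i create pods -n livy --as=system:serviceaccount:livy:<livy-serviceaccount>
    kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:livy:<livy-serviceaccount>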

@jahstreet (Contributor Author)

> kubernetes helm chart only supports kubernetes api <= 1.15.X

@kyprifog indeed. I'm working on Spark 3 support in https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade, where this limit is extended to ">=1.11.0 <= 1.17.0".

@jahstreet (Contributor Author)

@cekicbaris, it would be nice to see the Spark Driver logs during this failure; I believe it might be related to Livy <-> Spark Driver communication. It also might be that the networking is not very stable in your env; I'm not sure whether Livy does retries.

@cekicbaris commented Sep 25, 2020

> @cekicbaris, it would be nice to see the Spark Driver logs during this failure; I believe it might be related to Livy <-> Spark Driver communication. [...]

@jahstreet
Do you mind giving some advice on how to check the networking? It is running on AWS EKS on Kubernetes 1.15, in the livy namespace. The livy service account has enough auth.

One more thing: if the session times out, the interactive session is deleted from Livy, but the driver and executor pods are still running. If I delete the session with a DELETE request to the REST API, then it also deletes the pods.
Here is a timed-out session log:

2020-09-23T07:19:29.853571515Z 2020-09-23 07:19:29 INFO  InteractiveSessionManager:39 - Deleting InteractiveSession 2 because it was inactive for more than 3600000.0 ms.
2020-09-23T07:19:29.853690768Z 2020-09-23 07:19:29 INFO  InteractiveSession:39 - Stopping InteractiveSession 2...
2020-09-23T07:19:29.994715243Z 2020-09-23 07:19:29 WARN  RpcDispatcher:191 - [ClientProtocol] Closing RPC channel with 1 outstanding RPCs.
2020-09-23T07:19:30.013570228Z 2020-09-23 07:19:30 INFO  InteractiveSession:39 - Stopped InteractiveSession 2.
2020-09-23T07:19:39.119654462Z 2020-09-23 07:19:39 ERROR SparkKubernetesApp:56 - Unknown Kubernetes state unknown for app with tag livy-session-2-dfceC0pO.
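
A manual cleanup sketch for the timed-out case described above (the DELETE endpoint is the standard Livy REST API; the spark-app-tag label key is an assumption about how this implementation tags the driver pods, and the tag value is taken from the log line above):

    # deleting the session through the REST API also removes the pods
    curl -X DELETE "http://localhost:8998/sessions/2"

    # for already-orphaned pods, delete them by the session tag directly
    kubectl -n <spark-namespace> delete pods -l spark-app-tag=livy-session-2-dfceC0pO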

@jahstreet (Contributor Author)

> Do you mind giving some advice on how to check the networking?

I would leave that as a last resort; it's better to start by checking the Driver logs at the moment of the failure and/or more logs from Livy, to get an idea of what was happening before and after the failure. Otherwise it is good to ask your network SREs for advice on that; they definitely know more about your setup.

> if the session times out, the interactive session is deleted from Livy but the driver and executor pods are still running.

I remember this bug in this PR, though I believe it has been fixed in #249 . This PR is out of sync and not supported anymore, but once I get some free time from my work I plan to backport the changes to this PR, and the issue should be gone then. Also, once I finish Spark 3 support in the Helm chart I will update the images, which will solve that as well. Please refer to: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade .

@kyprifog commented Sep 25, 2020

@jahstreet I'm pretty sure I was using something in the 1.17 range before and it still worked, can you confirm?

I tried three different (what I thought were stable) versions of 1.15.X yesterday and couldn't get the cluster up and running with any of them, I think something has drifted with the AWS api since then that doesn't work with older kubernetes versions, so I feel like I would have better luck with 1.17.X unless you have a version that has been working for you with AWS.

Btw does your comment mean that spark 2.4.5 support is working with 1.17? Maybe 2.4.5 just doesn't work with what I'm using now (1.18.8). 2.4.5 is good enough for my purposes.

This compatibility matrix is kind of challenging because downgrading the K8s API on a cluster is somewhat involved. Luckily I have almost everything in Terraform.

@jahstreet (Contributor Author)

> Maybe 2.4.5 just doesn't work with what I'm using now (1.18.8). 2.4.5 is good enough for my purposes.

It shouldn't work with K8s APIs > 1.15.3 (according to the fabric8 compatibility matrix). If it works for some commands, you are lucky, but you can then hit the moment when it won't work.

The alternative is to upgrade the Spark fabric8 client dependency version and make a custom image build; please refer to this comment to get an idea of what should be changed: https://stackoverflow.com/a/60052900/7947644 .
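
A rough sketch of that custom-build route (the version numbers are illustrative; the exact dependency change is described in the linked Stack Overflow answer):

    git clone https://github.com/apache/spark && cd spark && git checkout v2.4.5
    # bump the fabric8 kubernetes-client version in resource-managers/kubernetes/core/pom.xml
    # (see the linked Stack Overflow answer for the exact change)
    ./dev/make-distribution.sh --name custom-k8s --tgz -Pkubernetes
    cd dist && ./bin/docker-image-tool.sh -r <your-registry> -t 2.4.5-custom-k8s build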

@kyprifog commented Sep 25, 2020

@jahstreet you said that you are working on Spark 3.0.0 support in that branch, but does that mean 2.4.5 is already working with K8s 1.17+ on that branch? I'm trying to avoid downgrading to 1.15, and upgrading fabric8 seems like a not-so-savory option either.

@cekicbaris commented Sep 25, 2020

> I would leave that as a last resort; it's better to start by checking the Driver logs at the moment of the failure [...]
>
> I remember this bug in this PR, though I believe it has been fixed in #249. [...]

@jahstreet, thanks for the quick reply. Please let me know how I can help/contribute.
In the meantime, I tried to deploy the https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade chart, but the driver failed to launch. Is it because the images are not updated yet?

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.IncompatibleClassChangeError: Found interface org.objectweb.asm.MethodVisitor, but class was expected
	at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:349)
	at org.apache.livy.rsc.driver.RSCDriver.initializeServer(RSCDriver.java:210)
	at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:343)
	at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:93)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.objectweb.asm.MethodVisitor, but class was expected
	at org.apache.livy.shaded.kryo.reflectasm.ConstructorAccess.insertConstructor(ConstructorAccess.java:128)
	at org.apache.livy.shaded.kryo.reflectasm.ConstructorAccess.get(ConstructorAccess.java:98)
	at org.apache.livy.shaded.kryo.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1271)
	at org.apache.livy.shaded.kryo.kryo.Kryo.newInstantiator(Kryo.java:1127)
	at org.apache.livy.shaded.kryo.kryo.Kryo.newInstance(Kryo.java:1136)
	at org.apache.livy.shaded.kryo.kryo.serializers.FieldSerializer.create(FieldSerializer.java:562)
	at org.apache.livy.shaded.kryo.kryo.serializers.FieldSerializer.read(FieldSerializer.java:538)
	at org.apache.livy.shaded.kryo.kryo.Kryo.readClassAndObject(Kryo.java:813)
	at org.apache.livy.client.common.Serializer.deserialize(Serializer.java:67)
	at org.apache.livy.rsc.rpc.KryoMessageCodec.decode(KryoMessageCodec.java:70)
	at io.netty.handler.codec.ByteToMessageCodec$1.decode(ByteToMessageCodec.java:42)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:748)

@jahstreet (Contributor Author)

@kyprifog

> but does that mean 2.4.5 is already working with K8s 1.17+ on that branch?

Nope, Spark 2.4.x doesn't work with K8s > 1.15 w/o fabric8 upgrade. I'm not going to support that either.

@cekicbaris

> Is it because the images are not updated yet?

The images are not yet finished and tested. I'll leave the announcement here and in #249 once released.

@kyprifog commented Oct 5, 2020

@jahstreet I probably won't use this until #249 is done because downgrading my K8s clusters proved to be a PITA. Is there anything aside from testing holding up that PR? Anywhere you are getting stuck that I can try to push it along?

@jahstreet (Contributor Author) commented Oct 6, 2020

@kyprifog, good news for you. Yesterday I upgraded the Helm charts to Spark 3.0.1, which unlocks K8s API 1.18.0 usage. Feel free to try it out with this guide.
In the meantime I'm going to backport the required changes to this PR and #249.
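
A minimal install sketch against a local clone of the charts repository (Helm 3 syntax; the release and namespace names are placeholders, and the repository's own guide should be followed for the full set of values):

    git clone https://github.com/jahstreet/spark-on-kubernetes-helm
    cd spark-on-kubernetes-helm
    helm install livy charts/livy --namespace livy --create-namespace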

@kyprifog commented Oct 6, 2020

Will do! I'll let you know if I hit any snags, thanks for all the work you've put into this.

@kyprifog commented Dec 22, 2020

@jahstreet Pulling in my comment from here: JahstreetOrg/spark-on-kubernetes-helm#54. It seems that spark-submit in Spark 3.0 is the one responsible for managing the forwarding of AWS keys, so I think that maybe this is getting missed in the corresponding Livy implementation. I'm going to dig deeper and confirm / make the appropriate fix.

Scratch that, this actually works how I would expect using spark conf. Will look into ENV variable AWS config later, but not high priority, and out of scope for incubator-livy.

@JagadeeshNagella commented Jan 26, 2021

@jahstreet ,

I have a requirement to use the pod templates feature: https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
I am able to use it in spark-submit:

    --conf spark.kubernetes.driver.podTemplateFile=spark-driver-pod-template.yaml
    --conf spark.kubernetes.executor.podTemplateFile=spark-executor-pod-template.yaml

How can I add these configurations in a Livy request? spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile need to point to local files accessible to the spark-submit process. For a single tenant I can build a Livy image with hard-coded templates, but with multiple tenants we need these files to come from the client. How can we pass these configs? Can you check this?

@jahstreet (Contributor Author)

Hi @JagadeeshNagella, unfortunately it is not possible with the setup you've described. The files should be available to the Livy container and local to its filesystem. So in your case I see several options (a sketch of the first one follows below):

  • Set up a process to copy the .yaml files over to the Livy container and then refer to them in the job request
  • Attach a shared folder to the Livy container and set up a process to copy files over to it for each user (I hope it is possible to configure ACLs for the shared folder's subdirectories)
  • Refer to the file from shared storage (e.g. HDFS or S3), though I'm not sure Spark currently supports that

This is what I can currently come up with. I would also suggest reaching out to the Spark on Kubernetes community for advice.
Best.
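
For the first option, a minimal sketch (the pod name, namespace, and target directory are assumptions):

    # copy the templates into the Livy container's local filesystem
    kubectl cp spark-driver-pod-template.yaml <livy-namespace>/<livy-pod>:/opt/livy/templates/spark-driver-pod-template.yaml
    kubectl cp spark-executor-pod-template.yaml <livy-namespace>/<livy-pod>:/opt/livy/templates/spark-executor-pod-template.yaml

    # then reference the local paths in the job request
    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
            "className": "org.apache.spark.examples.SparkPi",
            "conf": {
              "spark.kubernetes.driver.podTemplateFile": "/opt/livy/templates/spark-driver-pod-template.yaml",
              "spark.kubernetes.executor.podTemplateFile": "/opt/livy/templates/spark-executor-pod-template.yaml"
            }
          }' "http://localhost:8998/batches"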

@JagadeeshNagella

@jahstreet thank you for the response. Let me explore options.

@jpugliesi

Can the livy maintainers shed some light on remaining steps to merging this PR?

@jahstreet (Contributor Author)

@jpugliesi this one can be closed but is still kept for reference as the initial idea of the feature. The first PR to merge should be #249, as the first step towards integrating Livy with the K8s resource manager.

@jpugliesi commented Mar 23, 2021

@jahstreet One question about configuring the Spark UI in livy/sparkmagic - with this current implementation, it seems that a sparkmagic notebook's Spark UI link will link to the Kubernetes ingress host configured if livy.server.kubernetes.ingress.create=true. Unfortunately, this link will not be updated to point to the url of the history server upon spark session completion.

Have you figured out a way to construct this notebook link to point to a proxy that will either:

  1. Route to Spark driver's Spark UI, if the driver/ingress still exists, or
  2. Route to the Spark history server for the Spark app id, in case the session is no longer in progress

Appreciate your help!

@jahstreet (Contributor Author)

Hi @jpugliesi, the routing indeed should work as you described, and if it does not then there is a bug in the implementation. Unfortunately, since the maintainers are not looking to continue with this work, I'm not going to keep maintaining the PRs either. The best I can advise is to look into the #249 and #252 PRs and into the https://github.com/JahstreetOrg/spark-on-kubernetes-helm and https://github.com/JahstreetOrg/spark-on-kubernetes-docker repositories, where the latest state of the project is stored along with some guidance around using it.
I hope to get to it sometime this year to update it to the latest state of things.
Best.

@prongs commented Jul 6, 2021

@jahstreet Huge Thanks for your efforts on this. Saved us a few days. I could have saved some more had I found it sooner :-D

Anyway, we're seeing some weird behaviour when the spark driver is connecting to the Livy RPC. We see the following in the Livy logs:

[image: screenshot of the Livy logs]

The Local (L) side is fine, as that's where livy's RPC server is running, but the Remote (R) side is incorrect. The remote should be coming as the driver pod and not 127.0.0.1. Would you have any idea about this?

To get around this, I ended up making the following change.

[image: screenshot of the change]

And now, instead of connecting to the remote side of the channel, it tries to connect to the hostname communicated inside the RPC messages.

Now I wonder

  • Why is that not already the case? Why are we not using the hostname communicated in the message?
  • How does it work for you and others on this thread without this change?

@jahstreet (Contributor Author)

Hi @prongs, thank you for joining the party 🎉.
This PR is outdated and kept just for the discussions (probably I should reflect that in the description).
The work done here was split into smaller pieces, and the first one now lives in #249. Please check this block of code to see how I approached that problem: https://github.com/apache/incubator-livy/pull/249/files#diff-43114318c4b009c2404f7eb326a84c184fb1501a3237c49a771df851d0f6f328R172-R177 . Though I think your solution looks nicer indeed; it's worth looking deeper into. 👍

@codecov-commenter

Codecov Report

Attention: Patch coverage is 8.10811% with 408 lines in your changes missing coverage. Please review.

Project coverage is 26.21%. Comparing base (97cf2f7) to head (1c6bee8).
Report is 46 commits behind head on master.

Files with missing lines Patch % Lines
...ala/org/apache/livy/utils/SparkKubernetesApp.scala 0.00% 388 Missing ⚠️
...rc/main/scala/org/apache/livy/utils/SparkApp.scala 50.00% 8 Missing and 1 partial ⚠️
...main/scala/org/apache/livy/server/LivyServer.scala 0.00% 4 Missing and 1 partial ⚠️
...e/livy/server/interactive/InteractiveSession.scala 0.00% 1 Missing and 3 partials ⚠️
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (97cf2f7) and HEAD (1c6bee8). Click for more details.

HEAD has 3 fewer uploads than BASE (BASE 97cf2f7: 4 uploads; HEAD 1c6bee8: 1 upload).
Additional details and impacted files
@@              Coverage Diff              @@
##             master     #167       +/-   ##
=============================================
- Coverage     68.48%   26.21%   -42.27%     
+ Complexity      840      358      -482     
=============================================
  Files           103      104        +1     
  Lines          5940     6378      +438     
  Branches        898      959       +61     
=============================================
- Hits           4068     1672     -2396     
- Misses         1312     4364     +3052     
+ Partials        560      342      -218     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.
