
[LIVY-588]: Full support for Spark on Kubernetes #167

Open · wants to merge 1 commit into base: master

Conversation

@jahstreet (Contributor) commented Apr 9, 2019

NOTE: this PR is deprecated and kept for discussion history only. Please refer to #249 for the latest state of the work.

What changes were proposed in this pull request?

This PR is a new feature proposal: full support for Spark on Kubernetes (inspired by SparkYarnApp implementation).

Since Spark on Kubernetes was released quite a while ago, it would be a good idea to include Kubernetes support in the Livy project as well. It can solve many problems related to working with Spark on Kubernetes and can fully replace YARN when running atop a Kubernetes cluster:

  • Livy UI gets a cached logs/diagnostics page
  • Livy UI shows links to the Spark UI and Spark History Server
  • With the Kubernetes Ingress resource, Livy can be configured to serve as an orchestrator of Spark Apps atop Kubernetes (the PR includes an Nginx Ingress support option to create routes to the Spark UI)
  • Nginx Ingress solves basePath support for the Spark UI and History Server, and has lots of auth integrations available: https://github.com/kubernetes/ingress-nginx
  • The Livy UI can be integrated with Grafana Loki logs (the PR provides a solution for that)

Dockerfiles repo: https://github.com/jahstreet/spark-on-kubernetes-docker
Helm charts: https://github.com/jahstreet/spark-on-kubernetes-helm

Associated JIRA: https://issues.apache.org/jira/browse/LIVY-588

Design concept: https://github.com/jahstreet/spark-on-kubernetes-helm/blob/develop/README.md
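
For illustration, a hedged sketch of what pointing Livy at a Kubernetes master could look like inside the Livy container. The spark-defaults keys are standard Spark-on-Kubernetes settings; livy.server.kubernetes.ingress.create is the ingress option discussed later in this thread, and the exact set of Livy-side keys may differ in the final patch:

    # /opt/spark/conf/spark-defaults.conf (inside the Livy container)
    spark.master                      k8s://https://<k8s-api-server>:443
    spark.submit.deployMode           cluster
    spark.kubernetes.container.image  <spark-image>
    spark.kubernetes.namespace        default

    # /opt/livy/conf/livy.conf: expose routes to the Spark UI through an Nginx Ingress
    livy.server.kubernetes.ingress.create = true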

How was this patch tested?

Tested manually on an AKS (Azure Kubernetes Service) cluster, Kubernetes v1.11.8.

What do you think about it?

@jahstreet (Contributor Author)

@vanzin please take a look.

@jahstreet changed the title from "[WIP][LIVY-KUBERNETES]: Full support for Spark on Kubernetes" to "[LIVY-588][WIP]: Full support for Spark on Kubernetes" on Apr 10, 2019
@vanzin commented Apr 10, 2019

Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.

@jahstreet (Contributor Author)

> Just to set expectations, it's very unlikely I'll be able to look at this PR (or any other really) any time soon.

Well, then I'll try to prepare as much as I can until you become available.
I hope someone from the community will be able to share feedback on the work done.

@codecov-io commented Apr 10, 2019

Codecov Report

Merging #167 into master will decrease coverage by 3.47%.
The diff coverage is 26.71%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #167      +/-   ##
============================================
- Coverage      68.6%   65.12%   -3.48%     
- Complexity      904      940      +36     
============================================
  Files           100      102       +2     
  Lines          5666     6291     +625     
  Branches        850      946      +96     
============================================
+ Hits           3887     4097     +210     
- Misses         1225     1614     +389     
- Partials        554      580      +26
Impacted Files Coverage Δ Complexity Δ
...e/livy/server/interactive/InteractiveSession.scala 68.75% <0%> (-0.37%) 46 <0> (+2)
...rver/src/main/scala/org/apache/livy/LivyConf.scala 96.46% <100%> (+0.6%) 22 <1> (+1) ⬆️
...ala/org/apache/livy/utils/SparkKubernetesApp.scala 20.36% <20.36%> (ø) 0 <0> (?)
...main/scala/org/apache/livy/server/LivyServer.scala 32.43% <33.33%> (-3.53%) 11 <0> (ø)
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java 79.25% <50%> (+1.28%) 45 <0> (+4) ⬆️
...rc/main/scala/org/apache/livy/utils/SparkApp.scala 67.5% <55.55%> (-8.5%) 1 <0> (ø)
...in/scala/org/apache/livy/repl/SQLInterpreter.scala 62.5% <0%> (-7.88%) 9% <0%> (+2%)
...ain/scala/org/apache/livy/utils/SparkYarnApp.scala 66.01% <0%> (-7.23%) 40% <0%> (+7%)
...n/scala/org/apache/livy/server/AccessManager.scala 75.47% <0%> (-5.38%) 46% <0%> (+2%)
...cala/org/apache/livy/scalaapi/ScalaJobHandle.scala 52.94% <0%> (-2.95%) 7% <0%> (ø)
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dee3cc...7f6ef8a. Read the comment docs.

@garo commented Apr 17, 2019

I'm going to experiment with this a bit: we're running Spark on Kubernetes widely and we are looking to also migrate our notebook usage on top of Kubernetes. The benefits we are seeing from Kubernetes are the elasticity with the associated cost savings, and the ability to track and analyse the resource usage of individual jobs closely.

From my quick glance at the source, I will probably be missing more extensive support for customizing the created drivers (I assume that Livy creates the drivers as pods in the cluster, which then create the executors). In our current usage of Spark on Kubernetes we supply about 20 different --conf options to the driver, some of which carry job-specific information such as name and owner.

@jahstreet (Contributor Author)

> I'm going to experiment with this a bit: we're running Spark on Kubernetes widely and we are looking to also migrate our notebook usage on top of Kubernetes. [...]
>
> From my quick glance at the source, I will probably be missing more extensive support for customizing the created drivers. [...]

Sounds cool, I will be glad to assist you during the experiments. Maybe you can share the cases you are looking for a solution to; I'm sure this would be helpful for designing the requirements for the features to implement within this work.

By the way, in the near future I'll prepare guidelines for the deployment, customization, and usage options of Livy on Kubernetes. I will share the progress on that.

@garo commented Apr 18, 2019

I built Livy on my own machine based on your branch and the Dockerfile in your repository. I got it running so that it created the driver pod, but I was unable to fully start the driver due to using my own spark image, which requires some configuration parameters to be passed in.

Here's some feedback:

  • Helm chart doesn't allow to specify "serviceAccount" property for Livy.

  • Couldn't find a way to set the namespace which Livy must use. It seems to want to search all pods in all namespaces. I also need to set the namespace where the pods are created (it seems to be fixed to "default").

  • Could you provide a way to fully customise the driver pod specification? I would want to set custom volumes and volume-mounts, environment variables, labels, sidecar containers and possibly even customise the command line arguments for the driver.

  • Also a way to provide custom spark configuration settings for the driver pod would be required.

  • Support for macros for both customising the driver pod and the extra spark configuration options. I would at least need the id of the livy session (eg. "livy-session-2-9SZP8Ijv") to be inserted to both the pod template and the spark configuration options.

Unfortunately I don't know Scala really well, so I couldn't easily dig into the code to determine how this works, and thus I'm unable to provide you with more detailed recommendations.

@jahstreet (Contributor Author) commented Apr 18, 2019

@garo Thanks for the review.

Here are some explanations of your questions:

  • The first version of the chart was done without RBAC support. I've just finished an RBAC support solution for the Livy chart but have not merged it yet; you can refer to the feature branch https://github.com/jahstreet/spark-on-kubernetes-helm/blob/charts/livy/rbac-support/charts/livy/values.yaml:

    serviceAccount:
      # Specifies whether a service account should be created
      create: true
      # The name of the service account to use.
      # If not set and create is true, a name is generated using the fullname template
      name:

  • Livy searches for the Driver Pod in all namespaces (theoretically the user may want to submit the job to any namespace) only the first time, to initialize the KubernetesApplication object; then it uses that object (which contains the namespace field) to get the Spark Pods' states and logs and looks for that information only within the single target namespace (I've added comments to the lines where this logic is implemented).

  • By default Livy should submit the Spark App to the default namespace (if it does not do so, then I need to make a fix ;) ). You can change that behavior by adding spark.kubernetes.namespace=<desired_namespace> to /opt/spark/conf/spark-defaults.conf in the Livy container. The Livy entrypoint can set spark-defaults configs from env variables, so you can set the Livy container env LIVY_SPARK_KUBERNETES_NAMESPACE=<desired_namespace> to change the Spark Apps' default namespace. In the new version of the Livy chart I set it to .Release.Namespace. And of course you can pass it as an additional conf on App submission within the POST request to Livy: { ... "conf": { "spark.kubernetes.namespace": "<desired_namespace>" }, ... } to overwrite the defaults (see the sketch after this list).

  • Please refer to the Livy customization explanations: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/master/charts/livy#customizing-livy-server
    Following that approach you can set any config defaults for both Livy and Spark. If you need to overwrite some, do that on job submission in the POST request body.

  • To customize the Driver Pod spec we would need a custom build of Spark installed in the Livy image (Livy just runs spark-submit). I refer to the official releases of Apache Spark and do not see available options for that at present (including adding sidecars). But it has configs to set volumes and volume mounts, environment variables, and labels (https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties), which we can set as default values as I described before.

  • Customizing the command line arguments for the driver: do you mean application args? You can pass them on job submission in the POST request body: https://livy.incubator.apache.org/docs/latest/rest-api.html
    Or do you need custom spark-submit script options?

  • Why do you need the id of the Livy session and what kind of macros for both customizing the driver pod and the extra spark configuration options do you mean?

Could you provide an example of a job you want to run? I hope I will be able to show you the available solutions using that example.
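
A minimal sketch of the namespace-override options described above (the namespace value is a placeholder; the jar path and endpoint match the examples elsewhere in this thread):

    # Option 1: bake the default into spark-defaults.conf through the image entrypoint env var
    export LIVY_SPARK_KUBERNETES_NAMESPACE=spark-jobs   # written to /opt/spark/conf/spark-defaults.conf on container start

    # Option 2: override per submission in the POST request body
    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
            "className": "org.apache.spark.examples.SparkPi",
            "conf": { "spark.kubernetes.namespace": "spark-jobs" }
          }' "http://localhost:8998/batches"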

@garo commented Apr 18, 2019

Thank you very much for the detailed response! I'm just leaving for my Easter holiday, so I am not going to be able to try again until after that.

However, I created this gist showing how we create the Spark drivers in our current workflow: we run Azkaban (like a glorified cron service) which runs our Spark applications. Each application (i.e. a scheduled cron execution) starts a Spark driver pod in Kubernetes. If you look at this gist https://gist.github.com/garo/90c6e69d2430ef7d93ca9f564ba86059 there is first a build of spark-submit configuration parameters, followed by the YAML for the driver pod.

So I naturally tried to think about how I can use Livy to launch the same image with the same kind of settings. I think that with your explanations I can implement most if not all of these settings except the run_id.

Let's continue this discussion after Easter. Have a great week!

@jahstreet (Contributor Author) commented Apr 18, 2019

Just to clarify, so that we are on the same page...
When you send a request to Livy, e.g.:

kubectl exec livy-pod -- curl -H 'Content-Type: application/json' -X POST \
  -d '{
        "name": "spark-pi",
        "proxyUser": "livy_user",
        "numExecutors": 2,
        "conf": {
          "spark.kubernetes.container.image": "sasnouskikh/spark:2.4.1-hadoop_3.2.0",
          "spark.kubernetes.container.image.pullPolicy": "Always",
          "spark.kubernetes.namespace": "default"
        },
        "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
        "className": "org.apache.spark.examples.SparkPi",
        "args": ["1000000"]
      }' "http://localhost:8998/batches"

Under the hood livy just runs spark-submit for you:

spark-submit \
  --master k8s://https://<k8s_api_server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=sasnouskikh/spark:2.4.1-hadoop_3.2.0 \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.namespace=default \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar 1000000

Starting from Spark 2.4.0, spark-submit in cluster mode creates the Driver Pod, whose entrypoint runs spark-submit in client mode, just like you try to do in the gist.
So I do not see why you would want to deploy a customized Driver Pod in that particular case.
Most of the --conf options can be moved to defaults and you will have a neat JSON body.

The Pushgateway sidecar may be deployed as a separate Pod; just configure the Prometheus sink with the right pushgateway address.
All other configs for Driver Pod customization are already covered by the docs for Spark on Kubernetes.

Have a good week!

@ghost commented Apr 23, 2019

I'm getting the following error

19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO Client: Deployed Spark application livy-session-0 into Kubernetes.
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Shutdown hook called
19/04/23 16:26:04 INFO LineBufferedStream: 19/04/23 16:26:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-62b7810e-667d-47e7-9940-72f8cd5f91e9
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 app state changed from RUNNING to FINISHED
19/04/23 16:26:04 DEBUG InteractiveSession: InteractiveSession 0 session state change from starting to dead
19/04/23 16:26:10 DEBUG AbstractByteBuf: -Dio.netty.buffer.bytebuf.checkAccessible: true
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
19/04/23 16:26:10 DEBUG ResourceLeakDetector: -Dio.netty.leakDetection.maxRecords: 4
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 262144
19/04/23 16:26:10 DEBUG Recycler: -Dio.netty.recycler.linkCapacity: 16
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (41 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (98 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (275 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Handling SASL challenge message...
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: Sending SASL challenge response...
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$SaslMessage (45 bytes)
19/04/23 16:26:10 DEBUG RpcServer$SaslServerHandler: SASL negotiation finished with QOP auth.
19/04/23 16:26:10 DEBUG ContextLauncher: New RPC client connected from [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008].
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress (94 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: [RegistrationHandler] Received RPC message: type=CALL id=0 payload=org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress
19/04/23 16:26:10 DEBUG ContextLauncher: Received driver info for client [id: 0x2ae2b51a, L:/10.233.94.163:10000 - R:/10.233.94.164:39008]: livy-session-0-1556036763266-driver/10000.
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
19/04/23 16:26:10 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$NullMessage (2 bytes)
19/04/23 16:26:10 DEBUG RpcDispatcher: Channel [id: 0x2ae2b51a, L:/10.233.94.163:10000 ! R:/10.233.94.164:39008] became inactive.
19/04/23 16:26:10 ERROR RSCClient: Failed to connect to context.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
	at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 ERROR RSCClient: RPC error.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1206)
	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:525)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:510)
	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:492)
	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:949)
	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:208)
	at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at java.lang.Thread.run(Thread.java:748)
19/04/23 16:26:10 INFO RSCClient: Failing pending job d509417c-c894-416d-8218-625b278da8b7 due to shutdown.

Spark is running in a different namespace than Livy. The Service is also created just before this message appears, so it does not seem to be an ordering error. Am I doing something wrong?

@jahstreet (Contributor Author)

@lukatera
Good day,

At first glance, I see that you are either using a Livy build that is not from this PR (I've fixed a similar issue in that commit), or your Livy and/or Spark is not configured appropriately.

I need to know more about your environment to move further. Could you please additionally provide some of the following:

  • What is your Kubernetes installation and version?
  • What Docker images do you use? What version of Spark is running (this Livy was tested with Spark 2.4.0+, 2.3.* wasn't good enough and had some unpleasant bugs)? From what commit have you built Livy (if you did so)?
  • What are the livy.conf and livy-client.conf content (/opt/livy/conf/...)?
  • What are the Spark Job configs: kubectl describe configmap <spark-driver-pod-conf-map> -n <spark-job-namespace>? What is the JSON body you post to create a session?
  • What are the Spark Driver Pod logs?
  • If you use Helm charts - what are the versions and what are the custom values you provide on install?
  • Maybe some more debugging info you feel may be related?

Currently I run Livy build from this PR's branch with the provided Helm charts and Docker images both on Minikube for Windows and on Azure AKS without issues.

Will be happy to help, thanks for the feedback.
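
A sketch of commands for collecting the information from the checklist above (pod, configmap, release, and namespace names are placeholders):

    kubectl -n <livy-namespace> exec <livy-pod> -- cat /opt/livy/conf/livy.conf /opt/livy/conf/livy-client.conf
    kubectl -n <spark-job-namespace> describe configmap <spark-driver-pod-conf-map>
    kubectl -n <spark-job-namespace> logs <spark-driver-pod>
    helm get values <livy-release-name>   # if installed from the Helm charts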

@ghost commented Apr 24, 2019


Thanks for the help! I was checking out master branch from your repo instead of this specific one. All good now!

@jahstreet (Contributor Author)

@lukatera
Cool, nice to know that. Do not hesitate to ask if you face any problems on that.

@igorcalabria

Great PR! One suggestion: maybe add the authenticated Livy user to both the driver and executor pods' labels. It should be simple enough since Spark already supports arbitrary labels through the submit command option spark.kubernetes.driver.label.[LabelName].
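
For context, a hedged sketch of what that could look like at submit time, using the standard Spark-on-Kubernetes label confs and the spark-submit example from earlier in this thread (the livy-user label key and the user value are purely illustrative):

    spark-submit \
      --master k8s://https://<k8s_api_server>:443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.kubernetes.container.image=sasnouskikh/spark:2.4.1-hadoop_3.2.0 \
      --conf spark.kubernetes.driver.label.livy-user=<authenticated-user> \
      --conf spark.kubernetes.executor.label.livy-user=<authenticated-user> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar 1000000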

@jahstreet (Contributor Author)

@igorcalabria
Thanks for the feedback; what mechanism of getting the Livy user value do you propose? I see an option of setting those labels with the proxyUser value on Spark job submission from the POST request to Livy. Did you mean that?

@igorcalabria

> @igorcalabria
> Thanks for the feedback; what mechanism of getting the Livy user value do you propose? I see an option of setting those labels with the proxyUser value on Spark job submission from the POST request to Livy. Did you mean that?

@jahstreet
It could be that, but I was thinking about the authenticated user (via Kerberos) making the requests. To give you more context, this could be great for resource usage tracking, especially if Livy has more info available about the principal, like groups or even teams.

I'm not familiar with livy's codebase, but I'm guessing that the param we want is owner on the Session classes:

@jahstreet (Contributor Author)

@igorcalabria
Oh, I see, will try that, thanks.

@arunmahadevan (Contributor) left a comment

Thanks for the patch. I played around and things seem to work though I faced a few issues with the batch mode.

Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.

@jahstreet (Contributor Author)

> Shouldn't the Livy Docker/Helm charts also be part of the Livy repository, since it's most likely that users would want to run Livy in a K8s container while launching Spark on K8s? Maybe it can be added as a follow-up task.

Well, that's a good idea. Once this patch is accepted and merged I would love to take care of that.
Thanks for your feedback.

@igorcalabria

@jahstreet There's a minor issue when an interactive session is recovered from the filesystem. After a restart, Livy correctly recovers the session, but it stops displaying the Spark master's URL on the "Sessions" tab. The config used was pretty standard:

    livy.server.recovery.mode = recovery
    livy.server.recovery.state-store = filesystem
    livy.server.recovery.state-store.url = ...

@ghost commented Jun 6, 2019

Livy impersonation seems to not be working. I'm trying to use it with Jupyter and sparkmagic with no luck.

%%configure -f
{
    "proxyUser": "customUser"
}

However, I'm not familiar with Livy enough to say how this should work and if it requires kerberized HDFS cluster.

If I set HADOOP_USER_NAME env variable on the driver and the executor, it runs stuff on top of hadoop as that user.

I saw this in the driver logs however:

19/06/06 14:53:11 DEBUG UserGroupInformation: PrivilegedAction as:customUser (auth:PROXY) via root (auth:SIMPLE) from:org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:150)
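
For reference, one way to set that env var on the driver and executors through standard Spark-on-Kubernetes conf keys (a sketch; whether this is an acceptable substitute for proper impersonation is a separate question):

    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "kind": "pyspark",
            "proxyUser": "customUser",
            "conf": {
              "spark.kubernetes.driverEnv.HADOOP_USER_NAME": "customUser",
              "spark.executorEnv.HADOOP_USER_NAME": "customUser"
            }
          }' "http://localhost:8998/sessions"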

@arunmahadevan (Contributor) left a comment

I think the new configs also need to be added to the Livy conf template (https://github.com/apache/incubator-livy/blob/master/conf/livy.conf.template) with sufficient comments.

Separate documentation explaining the setup (like the Running Spark on Kubernetes guide) would definitely be helpful.

@jahstreet (Contributor Author)

> Livy impersonation seems to not be working. I'm trying to use it with Jupyter and sparkmagic with no luck. [...]
>
> If I set the HADOOP_USER_NAME env variable on the driver and the executor, it runs stuff on top of Hadoop as that user. [...]

Actually I'm not familiar with Livy impersonation and do not know how it should behave. Maybe someone can clarify that?

@jerryshao (Contributor)

@jahstreet thanks a lot for your contribution. I'm wondering, do you have a design doc about K8s support in Livy?

@jahstreet (Contributor Author)

> Hi @jahstreet, any update on Spark 3.0? The build failed on the spark-3.0 profile.

Hi, in progress, you can track it here. Once finished I'll update the PR and fix the build.

@cekicbaris commented Sep 12, 2020

Hi @jahstreet ,

First of all, thanks for your efforts on Livy on K8S. We are testing it in our landscape. It works fine for most cases, but occasionally we get the following error message. Any idea why it happens? It creates new driver and executor pods without a session in Livy, even though the original session is alive and operational.

2020-09-12T20:05:05.271309681Z 2020-09-12 20:05:05 WARN RSCClient:127 - Client RPC channel closed unexpectedly.
2020-09-12T20:05:05.271758816Z 2020-09-12 20:05:05 WARN RSCClient:234 - Error stopping RPC.
2020-09-12T20:05:05.271776249Z io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@1ee211ac(uncancellable)
2020-09-12T20:05:05.27178287Z at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:394)
2020-09-12T20:05:05.271788878Z at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
2020-09-12T20:05:05.271794021Z at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:230)
2020-09-12T20:05:05.271798608Z at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:129)
2020-09-12T20:05:05.271803528Z at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:28)
2020-09-12T20:05:05.271808597Z at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:336)
2020-09-12T20:05:05.271813574Z at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:117)
2020-09-12T20:05:05.271818427Z at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:28)
2020-09-12T20:05:05.271823729Z at org.apache.livy.rsc.rpc.Rpc.close(Rpc.java:310)
2020-09-12T20:05:05.271828282Z at org.apache.livy.rsc.RSCClient.stop(RSCClient.java:232)
2020-09-12T20:05:05.271833421Z at org.apache.livy.rsc.RSCClient$2$1.onSuccess(RSCClient.java:129)
2020-09-12T20:05:05.271838476Z at org.apache.livy.rsc.RSCClient$2$1.onSuccess(RSCClient.java:123)
2020-09-12T20:05:05.271843788Z at org.apache.livy.rsc.Utils$2.operationComplete(Utils.java:108)
2020-09-12T20:05:05.271849251Z at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:518)
2020-09-12T20:05:05.271853952Z at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:492)
2020-09-12T20:05:05.271886201Z at io.netty.util.concurrent.DefaultPromise.notifyListenersWithStackOverFlowProtection(DefaultPromise.java:431)
2020-09-12T20:05:05.271892906Z at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
2020-09-12T20:05:05.271897634Z at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:108)
2020-09-12T20:05:05.271902407Z at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
2020-09-12T20:05:05.271907054Z at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:995)
2020-09-12T20:05:05.271911776Z at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:621)
2020-09-12T20:05:05.271916658Z at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:599)
2020-09-12T20:05:05.271921391Z at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:543)
2020-09-12T20:05:05.271927411Z at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.closeOnRead(AbstractNioByteChannel.java:71)
2020-09-12T20:05:05.271932031Z at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:158)
2020-09-12T20:05:05.271935259Z at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
2020-09-12T20:05:05.271938426Z at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505)
2020-09-12T20:05:05.271941673Z at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419)
2020-09-12T20:05:05.27194499Z at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391)
2020-09-12T20:05:05.271948247Z at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
2020-09-12T20:05:05.27195149Z at java.lang.Thread.run(Thread.java:748)

    appTagLabel: String = SPARK_APP_TAG_LABEL,
    appIdLabel: String = SPARK_APP_ID_LABEL
    ): Seq[KubernetesApplication] = {
    client.pods.inAnyNamespace
@idzikovsky (Contributor) commented Sep 18, 2020

Hi!
In my opinion it would be good to have the ability to optionally configure which namespace Livy should check here, so it would be possible to bind one Livy Server to one namespace; then the Deployment with the Livy Server wouldn't need permissions to list pods cluster-wide (and there are no other places where Livy needs cluster-wide permissions at all).

Does this make sense?
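
For reference, a hedged sketch of the namespace-scoped permissions such an option would allow instead of a cluster-wide binding (role, namespace, and service account names are placeholders):

    kubectl -n <livy-namespace> create role livy-pods \
      --verb=get,list,watch,create,delete --resource=pods,pods/log
    kubectl -n <livy-namespace> create rolebinding livy-pods \
      --role=livy-pods --serviceaccount=<livy-namespace>:<livy-serviceaccount>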

@jahstreet (Contributor Author)

Hey, truly agree. This feature is enabled here: https://github.com/apache/incubator-livy/pull/249/files#diff-486ae8357b2836f1addaafe102d02287R169-R219 . Going to port the changes from #249 to this one a bit later.

@kyprifog commented Sep 24, 2020

@jahstreet does this mean that the spark-on-kubernetes-helm chart only supports Kubernetes API <= 1.15.X? I am getting @JagadeeshNagella's first error now and I can't tell what's changed other than the Kubernetes version.

@KamalGalrani

I was able to set up Livy using the Helm chart, but when I create a session it fails. I am using the default configuration with Minikube.

Create session payload

{
    "kind": "pyspark",
    "name": "test-session1234",
    "conf": {
      "spark.kubernetes.namespace": "livy"
    }
}
20/09/25 04:00:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [livy]  failed.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
	at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
	at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894)
	at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865)
	at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
	at okio.Okio$1.write(Okio.java:79)
	at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
	at okio.RealBufferedSink.flush(RealBufferedSink.java:224)
	at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203)
	at okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:515)
	at okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:505)
	at okhttp3.internal.connection.RealConnection.startHttp2(RealConnection.java:298)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:287)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:168)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
	at okhttp3.RealCall.execute(RealCall.java:92)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:411)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:819)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:334)
	... 17 more
20/09/25 04:01:00 INFO ShutdownHookManager: Shutdown hook called
20/09/25 04:01:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-343d41df-d58c-4ed4-8a03-2eabbc21da1d

Kubernetes Diagnostics: 
Operation: [list]  for kind: [Pod]  with name: [null]  in namespace: [null]  failed.
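
A quick way to check whether this is an RBAC problem for the service account Livy submits with (the namespace is taken from the error above; the service account name is an assumption about the default chart install). Given the broken-pipe TLS failure, the Spark-vs-Kubernetes client compatibility discussed just below may also be the cause:

    kubectl auth can-i create pods -n livy --as=system:serviceaccount:livy:<livy-serviceaccount>
    kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:livy:<livy-serviceaccount>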

@jahstreet (Contributor Author)

> kubernetes helm chart only supports kubernetes api <= 1.15.X

@kyprifog indeed. I'm working on Spark 3 support in https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade, where this limit is extended to ">=1.11.0 <= 1.17.0".

@jahstreet (Contributor Author)

@cekicbaris, it would be nice to see the Spark Driver logs during this failure; I believe it might be related to Livy <-> Spark Driver communication. It also might be that the networking is not very stable in your env; I'm not sure whether Livy does retries.

@cekicbaris commented Sep 25, 2020

> @cekicbaris, it would be nice to see the Spark Driver logs during this failure; I believe it might be related to Livy <-> Spark Driver communication. [...]

@jahstreet
Do you mind giving some advice on how to check the networking? It is running on AWS EKS on Kubernetes 1.15, in the livy namespace. The livy service account has enough auth.

One more thing: if the session times out, the interactive session is deleted from Livy, but the driver and executor pods are still running. If I delete the session with a DELETE request to the REST API, then it also deletes the pods.
Here is a timed-out session log:

2020-09-23T07:19:29.853571515Z 2020-09-23 07:19:29 INFO  InteractiveSessionManager:39 - Deleting InteractiveSession 2 because it was inactive for more than 3600000.0 ms.
2020-09-23T07:19:29.853690768Z 2020-09-23 07:19:29 INFO  InteractiveSession:39 - Stopping InteractiveSession 2...
2020-09-23T07:19:29.994715243Z 2020-09-23 07:19:29 WARN  RpcDispatcher:191 - [ClientProtocol] Closing RPC channel with 1 outstanding RPCs.
2020-09-23T07:19:30.013570228Z 2020-09-23 07:19:30 INFO  InteractiveSession:39 - Stopped InteractiveSession 2.
2020-09-23T07:19:39.119654462Z 2020-09-23 07:19:39 ERROR SparkKubernetesApp:56 - Unknown Kubernetes state unknown for app with tag livy-session-2-dfceC0pO.
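
A manual cleanup sketch for the timed-out case described above (the DELETE endpoint is the standard Livy REST API; the spark-app-tag label key is an assumption about how this implementation tags the driver pods, and the tag value is taken from the log line above):

    # deleting the session through the REST API also removes the pods
    curl -X DELETE "http://localhost:8998/sessions/2"

    # for already-orphaned pods, delete them by the session tag directly
    kubectl -n <spark-namespace> delete pods -l spark-app-tag=livy-session-2-dfceC0pO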

@jahstreet (Contributor Author)

> Do you mind giving some advice on how to check the networking?

I would leave that as a last resort; it's better to start by checking the Driver logs at the moment of the failure and/or more logs from Livy, to get an idea of what was happening before and after the failure. Otherwise it is good to ask your network SREs for advice on that; they definitely know more about your setup.

> if the session times out, the interactive session is deleted from Livy but the driver and executor pods are still running.

I remember this bug in this PR, though I believe it has been fixed in #249 . This PR is out of sync and not supported anymore, but once I get some free time from my work I plan to backport the changes to this PR, and the issue should be gone then. Also, once I finish Spark 3 support in the Helm chart I will update the images, which will solve that as well. Please refer to: https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade .

@kyprifog commented Sep 25, 2020

@jahstreet I'm pretty sure I was using something in the 1.17 range before and it still worked, can you confirm?

I tried three different (what I thought were stable) versions of 1.15.X yesterday and couldn't get the cluster up and running with any of them, I think something has drifted with the AWS api since then that doesn't work with older kubernetes versions, so I feel like I would have better luck with 1.17.X unless you have a version that has been working for you with AWS.

Btw does your comment mean that spark 2.4.5 support is working with 1.17? Maybe 2.4.5 just doesn't work with what I'm using now (1.18.8). 2.4.5 is good enough for my purposes.

This compatibility matrix is kind of challenging because downgrading the K8s API on a cluster is somewhat involved. Luckily I have almost everything in Terraform.

@jahstreet (Contributor Author)

> Maybe 2.4.5 just doesn't work with what I'm using now (1.18.8). 2.4.5 is good enough for my purposes.

It shouldn't work with K8s APIs > 1.15.3 (according to the fabric8 compatibility matrix). If it works for some commands, you are lucky, but you can then hit the moment when it won't work.

The alternative is to upgrade the Spark fabric8 client dependency version and make a custom image build; please refer to this comment to get an idea of what should be changed: https://stackoverflow.com/a/60052900/7947644 .
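
A rough sketch of that custom-build route (the version numbers are illustrative; the exact dependency change is described in the linked Stack Overflow answer):

    git clone https://github.com/apache/spark && cd spark && git checkout v2.4.5
    # bump the fabric8 kubernetes-client version in resource-managers/kubernetes/core/pom.xml
    # (see the linked Stack Overflow answer for the exact change)
    ./dev/make-distribution.sh --name custom-k8s --tgz -Pkubernetes
    cd dist && ./bin/docker-image-tool.sh -r <your-registry> -t 2.4.5-custom-k8s build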

@kyprifog commented Sep 25, 2020

@jahstreet you said that you are working on Spark 3.0.0 support in that branch, but does that mean 2.4.5 is already working with K8s 1.17+ on that branch? I'm trying to avoid downgrading to 1.15, and upgrading fabric8 seems like a not-so-savory option either.

@cekicbaris commented Sep 25, 2020

> I would leave that as a last resort; it's better to start by checking the Driver logs at the moment of the failure [...]
>
> I remember this bug in this PR, though I believe it has been fixed in #249. [...]

@jahstreet, thanks for the quick reply. Please let me know how I can help/contribute.
In the meantime, I tried to deploy the https://github.com/jahstreet/spark-on-kubernetes-helm/tree/spark-3.0.0-upgrade chart, but the driver failed to launch. Is it because the images are not updated yet?

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.IncompatibleClassChangeError: Found interface org.objectweb.asm.MethodVisitor, but class was expected
	at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:349)
	at org.apache.livy.rsc.driver.RSCDriver.initializeServer(RSCDriver.java:210)
	at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:343)
	at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:93)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.objectweb.asm.MethodVisitor, but class was expected
	at org.apache.livy.shaded.kryo.reflectasm.ConstructorAccess.insertConstructor(ConstructorAccess.java:128)
	at org.apache.livy.shaded.kryo.reflectasm.ConstructorAccess.get(ConstructorAccess.java:98)
	at org.apache.livy.shaded.kryo.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1271)
	at org.apache.livy.shaded.kryo.kryo.Kryo.newInstantiator(Kryo.java:1127)
	at org.apache.livy.shaded.kryo.kryo.Kryo.newInstance(Kryo.java:1136)
	at org.apache.livy.shaded.kryo.kryo.serializers.FieldSerializer.create(FieldSerializer.java:562)
	at org.apache.livy.shaded.kryo.kryo.serializers.FieldSerializer.read(FieldSerializer.java:538)
	at org.apache.livy.shaded.kryo.kryo.Kryo.readClassAndObject(Kryo.java:813)
	at org.apache.livy.client.common.Serializer.deserialize(Serializer.java:67)
	at org.apache.livy.rsc.rpc.KryoMessageCodec.decode(KryoMessageCodec.java:70)
	at io.netty.handler.codec.ByteToMessageCodec$1.decode(ByteToMessageCodec.java:42)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:748)

@jahstreet (Contributor Author)

@kyprifog

> but does that mean 2.4.5 is already working with K8s 1.17+ on that branch?

Nope, Spark 2.4.x doesn't work with K8s > 1.15 w/o fabric8 upgrade. I'm not going to support that either.

@cekicbaris

> Is it because the images are not updated yet?

The images are not yet finished and tested. I'll leave the announcement here and in #249 once released.

@kyprifog commented Oct 5, 2020

@jahstreet I probably won't use this until #249 is done because downgrading my K8s clusters proved to be a PITA. Is there anything aside from testing holding up that PR? Anywhere you are getting stuck that I can try to push it along?

@jahstreet (Contributor Author) commented Oct 6, 2020

@kyprifog, good news for you. Yesterday I upgraded the Helm charts to Spark 3.0.1, which unlocks K8s API 1.18.0 usage. Feel free to try it out with this guide.
In the meantime I'm going to backport the required changes to this PR and #249.
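
A minimal install sketch against a local clone of the charts repository (Helm 3 syntax; the release and namespace names are placeholders, and the repository's own guide should be followed for the full set of values):

    git clone https://github.com/jahstreet/spark-on-kubernetes-helm
    cd spark-on-kubernetes-helm
    helm install livy charts/livy --namespace livy --create-namespace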

@kyprifog commented Oct 6, 2020

Will do! I'll let you know if I hit any snags, thanks for all the work you've put into this.

@kyprifog commented Dec 22, 2020

@jahstreet Pulling in my comment from here: JahstreetOrg/spark-on-kubernetes-helm#54. It seems that spark-submit in Spark 3.0 is the one responsible for managing the forwarding of AWS keys, so I think that maybe this is getting missed in the corresponding Livy implementation. I'm going to dig deeper and confirm / make the appropriate fix.

Scratch that, this actually works how I would expect using spark conf. Will look into ENV variable AWS config later, but not high priority, and out of scope for incubator-livy.

@JagadeeshNagella commented Jan 26, 2021

@jahstreet ,

I have a requirement to use the pod templates feature: https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
I am able to use it in spark-submit:

    --conf spark.kubernetes.driver.podTemplateFile=spark-driver-pod-template.yaml
    --conf spark.kubernetes.executor.podTemplateFile=spark-executor-pod-template.yaml

How can I add these configurations in a Livy request? spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile need to point to local files accessible to the spark-submit process. For a single tenant I can build a Livy image with hard-coded templates, but with multiple tenants we need these files to come from the client. How can we pass these configs? Can you check this?

@jahstreet (Contributor Author)

Hi @JagadeeshNagella, unfortunately it is not possible with the setup you've described. The files should be available to the Livy container and local to its filesystem. So in your case I see several options (a sketch of the first one follows below):

  • Set up a process to copy the .yaml files over to the Livy container and then refer to them in the job request
  • Attach a shared folder to the Livy container and set up a process to copy files over to it for each user (I hope it is possible to configure ACLs for the shared folder's subdirectories)
  • Refer to the file from shared storage (e.g. HDFS or S3), though I'm not sure Spark currently supports that

This is what I can currently come up with. I would also suggest reaching out to the Spark on Kubernetes community for advice.
Best.
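
For the first option, a minimal sketch (the pod name, namespace, and target directory are assumptions):

    # copy the templates into the Livy container's local filesystem
    kubectl cp spark-driver-pod-template.yaml <livy-namespace>/<livy-pod>:/opt/livy/templates/spark-driver-pod-template.yaml
    kubectl cp spark-executor-pod-template.yaml <livy-namespace>/<livy-pod>:/opt/livy/templates/spark-executor-pod-template.yaml

    # then reference the local paths in the job request
    curl -H 'Content-Type: application/json' -X POST \
      -d '{
            "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.1.jar",
            "className": "org.apache.spark.examples.SparkPi",
            "conf": {
              "spark.kubernetes.driver.podTemplateFile": "/opt/livy/templates/spark-driver-pod-template.yaml",
              "spark.kubernetes.executor.podTemplateFile": "/opt/livy/templates/spark-executor-pod-template.yaml"
            }
          }' "http://localhost:8998/batches"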

@JagadeeshNagella

@jahstreet thank you for the response. Let me explore options.

@jpugliesi

Can the livy maintainers shed some light on remaining steps to merging this PR?

@jahstreet (Contributor Author)

@jpugliesi this one can be closed but is still kept for reference as the initial idea of the feature. The first PR to merge should be #249, as the first step towards integrating Livy with the K8s resource manager.

@jpugliesi commented Mar 23, 2021

@jahstreet One question about configuring the Spark UI in livy/sparkmagic - with this current implementation, it seems that a sparkmagic notebook's Spark UI link will link to the Kubernetes ingress host configured if livy.server.kubernetes.ingress.create=true. Unfortunately, this link will not be updated to point to the url of the history server upon spark session completion.

Have you figured out a way to construct this notebook link to point to a proxy that will either:

  1. Route to Spark driver's Spark UI, if the driver/ingress still exists, or
  2. Route to the Spark history server for the Spark app id, in case the session is no longer in progress

Appreciate your help!

@jahstreet (Contributor Author)

Hi @jpugliesi, the routing indeed should work as you described, and if it does not then there is a bug in the implementation. Unfortunately, since the maintainers are not looking to continue with this work, I'm not going to keep maintaining the PRs either. The best I can advise is to look into the #249 and #252 PRs and into the https://github.com/JahstreetOrg/spark-on-kubernetes-helm and https://github.com/JahstreetOrg/spark-on-kubernetes-docker repositories, where the latest state of the project is stored along with some guidance around using it.
I hope to get to it sometime this year to update it to the latest state of things.
Best.

@prongs commented Jul 6, 2021

@jahstreet Huge Thanks for your efforts on this. Saved us a few days. I could have saved some more had I found it sooner :-D

Anyway, we're seeing some weird behaviour when the spark driver is connecting to the Livy RPC. We see the following in the Livy logs:

[image: screenshot of the Livy logs]

The Local (L) side is fine, as that's where livy's RPC server is running, but the Remote (R) side is incorrect. The remote should be coming as the driver pod and not 127.0.0.1. Would you have any idea about this?

To get around this, I ended up making the following change.

[image: screenshot of the change]

And now, instead of connecting to the remote side of the channel, it tries to connect to the hostname communicated inside the RPC messages.

Now I wonder

  • Why is that not already the case? Why are we not using the hostname communicated in the message?
  • How does it work for you and others on this thread without this change?

@jahstreet (Contributor Author)

Hi @prongs, thank you for joining the party 🎉.
This PR is outdated and kept just for the discussions (probably I should reflect that in the description).
The work done here was split into smaller pieces, and the first one now lives in #249. Please check this block of code to see how I approached that problem: https://github.com/apache/incubator-livy/pull/249/files#diff-43114318c4b009c2404f7eb326a84c184fb1501a3237c49a771df851d0f6f328R172-R177 . Though I think your solution looks nicer indeed; it's worth looking deeper into. 👍

@codecov-commenter

Codecov Report

Attention: Patch coverage is 8.10811% with 408 lines in your changes missing coverage. Please review.

Project coverage is 26.21%. Comparing base (97cf2f7) to head (1c6bee8).
Report is 46 commits behind head on master.

Files with missing lines Patch % Lines
...ala/org/apache/livy/utils/SparkKubernetesApp.scala 0.00% 388 Missing ⚠️
...rc/main/scala/org/apache/livy/utils/SparkApp.scala 50.00% 8 Missing and 1 partial ⚠️
...main/scala/org/apache/livy/server/LivyServer.scala 0.00% 4 Missing and 1 partial ⚠️
...e/livy/server/interactive/InteractiveSession.scala 0.00% 1 Missing and 3 partials ⚠️
...ain/java/org/apache/livy/rsc/driver/RSCDriver.java 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (97cf2f7) and HEAD (1c6bee8). Click for more details.

HEAD has 3 fewer uploads than BASE (BASE 97cf2f7: 4 uploads; HEAD 1c6bee8: 1 upload).
Additional details and impacted files
@@              Coverage Diff              @@
##             master     #167       +/-   ##
=============================================
- Coverage     68.48%   26.21%   -42.27%     
+ Complexity      840      358      -482     
=============================================
  Files           103      104        +1     
  Lines          5940     6378      +438     
  Branches        898      959       +61     
=============================================
- Hits           4068     1672     -2396     
- Misses         1312     4364     +3052     
+ Partials        560      342      -218     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.
