Replies: 9 comments
-
Did you try to set DOTNET_ASSEMBLY_SEARCH_PATHS to point to your assemblies that contain your UDFs?
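For reference, here is a minimal sketch of a driver program that points the search path at the publish folder before any UDF is used. The path and the assumption that Microsoft.Spark reads the variable lazily (at UDF serialization time) are illustrative only; in practice the variable is usually exported in the shell before calling spark-submit.

```csharp
// Minimal sketch (assumed path and timing): point DOTNET_ASSEMBLY_SEARCH_PATHS
// at the folder containing the published UDF assemblies before any UDF is used.
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        // Normally exported in the shell before spark-submit; setting it here
        // assumes the variable is read when the first UDF is serialized.
        Environment.SetEnvironmentVariable(
            "DOTNET_ASSEMBLY_SEARCH_PATHS",
            @"C:\app\publish\netcoreapp3.1");

        SparkSession spark = SparkSession.Builder().GetOrCreate();

        // A simple UDF defined in this assembly; it can only be deserialized
        // if this assembly is resolvable via the search path or the working directory.
        var upper = Udf<string, string>(s => s?.ToUpper());

        DataFrame df = spark.Range(0, 3).WithColumn("s", Col("id").Cast("string"));
        df.Select(upper(df["s"])).Show();

        spark.Stop();
    }
}
```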
-
Thank you for the follow-up @imback82! Setting DOTNET_ASSEMBLY_SEARCH_PATHS does work in my local environment, but that technique is difficult to apply in several scenarios. For example, when using a zip of assemblies, the path doesn't exist ahead of time; it is created dynamically by the DotnetRunner. And in environments like Databricks, it's difficult to even anticipate where the binaries end up being placed or unzipped. Also of note: is the assembly search path expected to be relative to the local path where the driver is running? What about worker nodes? In other words, does this serialization only happen on the driver? Thanks again for your help!
-
When using a zip of assemblies, you can assume the assemblies to be unzipped in the current working directory of the driver (the serialization only happens on the driver). For example, say you are using WASB to store your zipped assemblies and you use the
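To verify that, a small diagnostic sketch (nothing project-specific; it only prints standard process information) can be added to the driver program to confirm where the driver is actually executing and which search paths are in effect:

```csharp
// Diagnostic sketch: print where the driver process is running and what
// assembly search paths (if any) are currently configured.
using System;

static class PathDiagnostics
{
    public static void Print()
    {
        Console.WriteLine($"Current directory: {Environment.CurrentDirectory}");
        Console.WriteLine($"Base directory:    {AppDomain.CurrentDomain.BaseDirectory}");
        Console.WriteLine("DOTNET_ASSEMBLY_SEARCH_PATHS: " +
            (Environment.GetEnvironmentVariable("DOTNET_ASSEMBLY_SEARCH_PATHS") ?? "<not set>"));
    }
}
```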
-
Well, I don't think Databricks uses YARN, so @AeroXuk suggested a few ways in #193 and #194, and we may need to revisit those PRs to make the experience better on Databricks.
-
I think for Databricks, how about uploading the required assemblies to DBFS and setting the environment variable
-
It is likely possible, but we'd have to do some bootstrapping per job on the Databricks cluster side, which is kind of out of band. That is, we'd like to treat the infrastructure as a single function, with job-specific things injected as job parameters. In the approach described above, we'd have to find ways to do job-specific IaC for the cluster, which is not ideal. I've actually looked at the DotnetRunner code and did a POC that changes the behavior to unzip into the current working directory rather than an assembly subdirectory. That worked, but we didn't want to permanently fork your Scala runner code for this purpose :)
-
@zzhu-bh do you want to create a PR for your POC so we can check whether it makes sense to natively support this scenario?
-
Thanks @imback82 - here's the PR: #622 - please let me know your thoughts. This is the quickest way I could think of to address the issue I've described. Not sure if it's ideal, but it might be better than changing the current working directory dynamically. Btw, I want to reiterate that I'm really amazed by the awesome responses and engagement from your team on this. Our engineering team at Bright Health was concerned about the support we might get by adopting Spark .NET, but what I've seen so far really puts my mind at ease. As director of engineering, I've been exploring this project's current state of maturity as part of our adoption process. While there are lots of potential improvements / integrations that would make this project even better (better support for C# Interactive, Spark 3.0.0 :), more streamlined integration with Databricks and notebooks...), the way this is being supported today by your team makes me believe without a doubt that this project will be successful, and that it will become widely adopted as a first-order option for creating Spark data pipelines using .NET. Keep up the great work! Regards, Zheng
-
I'm having the same problem running on a YARN cluster. Setting
-
Describe the bug
When UDFs exist in a Spark .NET job, the spark-submit call must be made from the exact same directory as the published assemblies, or you will get a serialization error on the UDF lambdas and/or functions.
For example, the following will fail (with the error shown below):
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local netcoreapp3.1\microsoft-spark-2.4.x-0.12.1.jar netcoreapp3.1\spark01test.exe
[2020-08-09T15:37:23.1261501Z] [DESKTOP-7HK312O] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: System.Runtime.Serialization.SerializationException: Unable to load type System.Collections.Generic.List`1[[spark01test.Rule, spark01test, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]] required for deserialization.
but this will succeed:
cd netcoreapp3.1/
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.12.1.jar spark01test.exe
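For context, here is a minimal sketch of the kind of job that hits this. The Rule type's member and the job structure are assumptions reconstructed for illustration; only the type name spark01test.Rule comes from the stack trace above.

```csharp
// Hypothetical spark01test job: the UDF closure captures a List<Rule>, so
// spark01test.dll must be resolvable when the lambda is deserialized. If
// spark-submit is not run from the publish directory (and no search path is
// configured), the SerializationException shown above is thrown.
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace spark01test
{
    public class Rule
    {
        public string Pattern { get; set; }   // assumed member; not from the original report
    }

    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession.Builder().GetOrCreate();

            var rules = new List<Rule> { new Rule { Pattern = "1" } };

            // Capturing 'rules' means List<Rule> is serialized with the UDF,
            // which is exactly the type the deserializer fails to load.
            var matchesAnyRule = Udf<string, bool>(
                s => rules.Exists(r => s != null && s.Contains(r.Pattern)));

            DataFrame df = spark.Range(0, 5).WithColumn("s", Col("id").Cast("string"));
            df.Select(matchesAnyRule(df["s"])).Show();

            spark.Stop();
        }
    }
}
```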
This error seems like one you can work around, except there are quite a few sinister issues associated with it:
To Reproduce
Error and reproduction steps are described above.
Expected behavior
UDFs work / behave the same way as non-UDF code in Spark jobs.
Desktop (please complete the following information):
Also tried on Ubuntu 18.04 LTS and Ubuntu 20.04 LTS, via VMware as well as WSL 1 and WSL 2 on the Windows environment above.
Additional context