[BUG]: Must run spark-submit in the same directory as published assembly or UDF's fail with serialization errors #615

zzhu8192a · 2020-08-09T15:46:16Z

zzhu8192a
Aug 9, 2020

Describe the bug
When UDF's exist in the Spark .NET job, the spark-submit call must happen in the exact same directory as the published assemblies, or you will get a serialization error on the UDF lambda's and/or functions.

i.e. the following will fail (with errors shown below)
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local netcoreapp3.1\microsoft-spark-2.4.x-0.12.1.jar netcoreapp3.1\spark01test.exe

[2020-08-09T15:37:23.1261501Z] [DESKTOP-7HK312O] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: System.Runtime.Serialization.SerializationException: Unable to load type System.Collections.Generic.List`1[[spark01test.Rule, spark01test, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]] required for deserialization.

but this will succeed:

cd netcoreapp3.1/
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.12.1.jar spark01test.exe

This error seems like one you can workaround, except there are quite a few sinister issues associated with it:

if you build an assembly.zip, the way the current DotnetRunner.scala works, it creates a subdirectory containing the assembly, but doesn't change current working directory to the subdirectory. So UDF's will fail with above errors when you run with assembly.zip.
If you try to run things like python notebook (Jupyter) using C# interactive libs, you can't control current directory. UDF's will fail there (non-UDF spark API's work just fine)
Debugging and running unit testing (Mstest or xUnit) can also fail on UDF's by default.

To Reproduce
error and reproduction steps described above.

Expected behavior
UDF's work / behave the same way as non-UDF's in Spark Jobs

Desktop (please complete the following information):

OS: tried on Windows 10 x64 2004 with .NET core SDK 3.1.101, 302 and Visual Studio 16.6, 16.7
Also tried on Ubuntu 18.04 LTS and Ubuntu 20.04 LTS via both Vmware and WSL 1 and WSL 2 on above Windows environment
Version 0.12.1

Additional context

imback82 · 2020-08-09T17:11:12Z

imback82
Aug 9, 2020

Did you try to set DOTNET_ASSEMBLY_SEARCH_PATHS to point to your assemblies that contain your UDFs?

0 replies

zzhu8192a · 2020-08-09T17:52:47Z

zzhu8192a
Aug 9, 2020
Author

Thank you for the follow up @imback82 ! So setting the DOTNET_ASSEMBLY_SEARCH_PATHS does work in my local environment, but that technique is difficult to apply in several scenarios. e.g. using a zip of assembly, the path doesn't exist ahead of time, as it's dynamically created by the DotnetRunner. Also, using environments like Databricks for example, it's difficult to even anticipate exactly where the binaries are being put or unzipped to.

Also of note, is the expectation that the assembly search path is relative the local path on where the driver is running? What about worker nodes? i.e. is this serialization only happening on the driver?

Thanks again for your help!

0 replies

Niharikadutta · 2020-08-10T04:02:33Z

Niharikadutta
Aug 10, 2020
Collaborator

When using a zip of assemblies, you can assume the assemblies to be unzipped in the current working directory of the driver (the serialization only happens on the driver). For example, say you are using WASB to store your zipped assemblies and you use the --archives option to pass the zip folder to the spark-submit command, you can set the DOTNET_ASSEMBLY_SEARCH_PATHS like so:

spark-submit --master yarn --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS="./app/zipAssemblies" --archives wasbs://<containerName>@accountName/folder_path/zipAssemblies.zip#app --class org.apache.spark.deploy.dotnet.DotnetRunner <microsoft-spark.jar> <wasb_zip_assemblies> <app.exe> <app arguments>
#app would create a folder nameD app in the current working directory where zipAssemblies woUld get extracted, and you can set that as the path for DOTNET_ASSEMBLY_SEARCH_PATHS.

0 replies

imback82 · 2020-08-10T04:12:54Z

imback82
Aug 10, 2020

Well, I don't think Databricks uses YARN, so --conf spark.yarn.appMasterEnv will probably not work. @suhsteve, do you remember the easiest way to achieve this?

@AeroXuk suggested few ways in #193 and #194, and we may need to revisit those PRs to make the experience better in Databricks.

0 replies

suhsteve · 2020-08-10T22:22:57Z

suhsteve
Aug 10, 2020
Maintainer

I think for databricks, how about uploading the required assemblies to dbfs and set environment variable DOTNET_ASSEMBLY_SEARCH_PATHS to this path ?

0 replies

zzhu8192a · 2020-08-10T22:38:11Z

zzhu8192a
Aug 10, 2020
Author

It is likely possible, but we'd have to do some bootstrapping per job for the databricks cluster side, which is kind of out of band. i.e. we'd like to treat the infrastructure as a single function, and job specific things get injected as job parameters. But in the above manner described, we'd have to find ways to do job specific IAC for the cluster, which is not ideal.

I've actually looked at the DotnetRunner code, and did a POC by changing the behavior to unzip to current working dir rather than assembly subdir. That worked. But we didn't want to permanently fork your scala runner code for this purpose :)

0 replies

imback82 · 2020-08-11T05:54:33Z

imback82
Aug 11, 2020

@zzhu-bh do you want to create a PR for your POC so we can check whether it makes to sense to natively support the scenario?

0 replies

zzhu8192a · 2020-08-12T05:00:35Z

zzhu8192a
Aug 12, 2020
Author

Thanks @imback82 - Here's the PR: #622 - please let me know your thoughts. This the quickest way I thought of for addressing this issue I've described. Not sure if it's ideal, but might be better than changing current working directory dynamically.

Btw, I want to reiterate again that I'm really amazed with the awesome responses and engagement from your team on this. Our engineering team at Bright Health was concerned with the support we may get by adopting Spark .NET, but what I've seen so far really puts my mind at ease. As director of engineering, I've been exploring this project's current state of maturity as part of our adoption process. And while there are lots of potential improvements / integration that would make this project even better (better support for C# interactive, Spark 3.0.0 :), more streamlined integration with Databricks and notebooks... the way this is being supported today by your team makes me believe without a doubt that this project will be successful, and that it will become widely adopted as a first order option for creating Spark data pipelines using .NET. Keep up the great work!

Regards,

Zheng

0 replies

ovstberg · 2020-10-20T06:47:55Z

ovstberg
Oct 20, 2020

I'm having the same problem running on a yarn cluster. Setting DOTNET_ASSEMBLY_SEARCH_PATHS does not solve the problem. For some reason setting --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS does not have any effect on where my worker nodes search for assemblies, but export DOTNET_ASSEMBLY_SEARCH_PATHS=<./myzip.zip> on the edge node from where i spark-submit does. Looking through the logs from the worker nodes I can see that the dotnet worker search for assemblies in workingdirectory/./myzip.zip. I have also checked the folders on the worker nodes and workingdirectory/myzip.zip exists and contains all my dll's.. Still, the dotnetworker cannot find the assemblies used in my UDF's.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: Must run spark-submit in the same directory as published assembly or UDF's fail with serialization errors #615

{{title}}

Replies: 9 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

[BUG]: Must run spark-submit in the same directory as published assembly or UDF's fail with serialization errors #615

zzhu8192a Aug 9, 2020

Replies: 9 comments

imback82 Aug 9, 2020

zzhu8192a Aug 9, 2020 Author

Niharikadutta Aug 10, 2020 Collaborator

imback82 Aug 10, 2020

suhsteve Aug 10, 2020 Maintainer

zzhu8192a Aug 10, 2020 Author

imback82 Aug 11, 2020

zzhu8192a Aug 12, 2020 Author

ovstberg Oct 20, 2020

zzhu8192a
Aug 9, 2020

imback82
Aug 9, 2020

zzhu8192a
Aug 9, 2020
Author

Niharikadutta
Aug 10, 2020
Collaborator

imback82
Aug 10, 2020

suhsteve
Aug 10, 2020
Maintainer

zzhu8192a
Aug 10, 2020
Author

imback82
Aug 11, 2020

zzhu8192a
Aug 12, 2020
Author

ovstberg
Oct 20, 2020