-
One other point... by trial-and-error I found an approach for getting UDFs working on Databricks, but it is very different from the one described in the README. I got things working by tinkering with the environment variable DOTNET_ASSEMBLY_SEARCH_PATHS and pointing it at files in /dbfs/FileStore/MyAppWhatever. That seemed to work as far as my simple testing was concerned, but obviously I don't want to keep using it if it isn't best practice. It may have been only by accident that it worked at all, and I kept wondering what circumstances would cause my related assemblies to stop being loaded properly. I finally found the README, thankfully before heading too far down the wrong path. Hopefully the approach in the README will also be able to support any *.deps.json annotations that may be needed.
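To make that concrete, here is a minimal sketch of the kind of app where this mattered. The helper type MyCompany.Helpers.Text, the JSON file, and the column name are made up for illustration; only the DBFS path and the environment variable are the ones from my actual setup, and this is my workaround rather than the documented approach.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MyAppWhatever
{
    public static class Program
    {
        public static void Main()
        {
            SparkSession spark = SparkSession.Builder().GetOrCreate();
            DataFrame df = spark.Read().Json("/dbfs/FileStore/MyAppWhatever/people.json");

            // The lambda below is not run by the driver; it is serialized and executed by
            // Microsoft.Spark.Worker on each executor. Because it calls into a companion
            // assembly (MyCompany.Helpers.dll, a hypothetical dependency), the worker
            // process has to be able to load that DLL. In the workaround above, the
            // cluster environment variable DOTNET_ASSEMBLY_SEARCH_PATHS pointed at
            // /dbfs/FileStore/MyAppWhatever, which is where the worker found it.
            var normalize = Udf<string, string>(s => MyCompany.Helpers.Text.Normalize(s));

            df.Select(normalize(df["name"])).Show();
        }
    }
}
```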
-
@Niharikadutta do you happen to know how to update the docs on docs.microsoft.com?
-
By trial-and-error I was able to get UDFs working on Databricks.
It wasn't until recently that I noticed there were actually some instructions for it here:
https://github.com/dotnet/spark/blob/main/deployment/README.md#databricks
Previously I had only been following the instructions at docs.microsoft.com, which are found at the following locations...
... those docs are very vague about UDFs. That is unfortunate, since UDFs are where we get most of the benefits of Spark. Notice that those docs simply say...
Microsoft.Spark.Worker helps Apache Spark execute your app, such as any user-defined functions (UDFs) you may have written.
Is there someone who can make a change to those Databricks-specific docs and get them to link to the README.md in this GitHub project? That would have saved me from flailing around for so long. The most important part of the GitHub README is the section about deploying the application assemblies that are used by the workers (below).
Oddly, the instructions at docs.microsoft.com don't refer at all to these assemblies that are needed by Microsoft.Spark.Worker. The omission is significant because it gives the false impression that all workers will have direct access to the contents of the app.zip that was deployed. Anyone who has worked with Databricks/Scala will know that the jars you add to the cluster are available to both the driver and the workers, so it is natural to assume the same about the app.zip.
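As far as I understand, the jar analogy breaks down specifically for UDFs: built-in column functions are translated into JVM expressions and run inside the executors' JVM, so they work even when the workers have never seen your .NET assemblies, while a UDF's lambda is executed by Microsoft.Spark.Worker in a .NET process on each executor. A hedged sketch (the file path and column name are made up):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

SparkSession spark = SparkSession.Builder().GetOrCreate();
DataFrame df = spark.Read().Json("/dbfs/FileStore/people.json");  // hypothetical path

// Built-in functions like Upper() are evaluated by the JVM on the executors,
// so no .NET assemblies are needed on the workers for this to run.
df.Select(Upper(df["name"])).Show();

// The UDF below is serialized and executed by Microsoft.Spark.Worker on each
// executor, so the worker must be able to load the app assembly containing it
// (plus anything that assembly references); hence the deployment steps in the README.
var greet = Udf<string, string>(name => $"hello, {name}");
df.Select(greet(df["name"])).Show();
```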