How can we properly include MLeap dependencies in a PySparkProcessor Job? #2339

GroovyDan opened this issue on May 11, 2021
What did you find confusing? Please describe.
I have been trying to include MLeap in my PySparkProcessor job so that I can serialize a Spark pipeline for later use in a serving container, which expects the MLeap bundle format. The documentation on how to do this appears to be outdated: I can't get the dependencies right and keep receiving different errors. What is the correct way to do this? Can we get an updated example with a base image, mleap_spark_assembly.jar, and MLeap version that work together?
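
For concreteness, the serialization step I want to run inside the job script looks roughly like this. It is a minimal sketch based on the MLeap PySpark bindings used in the abalone example linked below; the toy data and bundle path are placeholders:

```python
import mleap.pyspark  # noqa: F401 -- importing registers serializeToBundle on Spark models
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mleap-serialization").getOrCreate()

# Toy data standing in for the real training DataFrame.
train_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 6.0, 9.0)],
    ["x1", "x2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)

# Serialize the fitted pipeline to an MLeap bundle; this is the call that
# fails when the MLeap jars are missing or mismatched with the Spark version.
model.serializeToBundle("jar:file:/tmp/model.zip", model.transform(train_df))
```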

Describe how documentation can be improved
I found documentation that I tried to follow here:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html#Distributed-Data-Processing-using-Apache-Spark-and-SageMaker-Processing
but it has no specific instructions for including MLeap. There is an example of including MLeap in a Glue job, but following the same steps for a PySparkProcessor job did not work:
https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.html#Serializing-the-trained-Spark-ML-Model-with-MLeap
An updated example showing which MLeap version and which base image to use would be very helpful.

Additional context
After attempting to follow the documentation, I also tried a few other things to get it to work:

- downloading the needed .jar files from Maven and including them via the submit_jars parameter;
- using the configuration option to specify MLeap as a dependency via the spark.jars.packages property;
- extending one of the base Spark processing containers to install the needed Python package and then including the .jars in the run command.

None of these worked; a sketch of the first two approaches follows the linked thread. That thread describes some of the issues I ran into, and I referred to it often while troubleshooting the serialization:

combust/mleap-docs#8
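
Here is a minimal sketch of the first two approaches, using the sagemaker SDK's PySparkProcessor. The framework version, role ARN, script name, jar path, and MLeap Maven coordinates are placeholders I could not confirm; the version pairing is exactly what I'm asking about:

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="mleap-serialization",
    framework_version="3.0",  # assumed Spark version
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# Approach 1: pass jars downloaded from Maven via submit_jars.
spark_processor.run(
    submit_app="preprocess.py",  # hypothetical entry-point script
    submit_jars=["jars/mleap-spark-assembly.jar"],  # local jar from Maven
)

# Approach 2: resolve MLeap at runtime through spark.jars.packages.
spark_processor.run(
    submit_app="preprocess.py",
    configuration=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                # assumed coordinates; the right version for the container is unclear
                "spark.jars.packages": "ml.combust.mleap:mleap-spark_2.12:0.17.0",
            },
        }
    ],
)
```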
