How can we properly include MLeap dependencies in a PySparkProcessor Job? #2339

GroovyDan opened this issue on May 11, 2021
What did you find confusing? Please describe.
I have been trying to include MLeap in my PySparkProcessor job so that I can serialize a Spark pipeline for later use in a serving container, which expects the MLeap bundle format. The documentation on how to do this appears to be outdated: I can't get the dependencies right and keep receiving different errors. What is the correct way to do this? Can we get an updated example with a base image, mleap_spark_assembly.jar, and MLeap version that work together?
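
For concreteness, the serialization step I want to run inside the job script looks roughly like this. It is a minimal sketch based on the MLeap PySpark bindings used in the abalone example linked below; the toy data and bundle path are placeholders:

```python
import mleap.pyspark  # noqa: F401 -- importing registers serializeToBundle on Spark models
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mleap-serialization").getOrCreate()

# Toy data standing in for the real training DataFrame.
train_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 6.0, 9.0)],
    ["x1", "x2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)

# Serialize the fitted pipeline to an MLeap bundle; this is the call that
# fails when the MLeap jars are missing or mismatched with the Spark version.
model.serializeToBundle("jar:file:/tmp/model.zip", model.transform(train_df))
```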

Describe how documentation can be improved
I found documentation that I tried to follow here:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html#Distributed-Data-Processing-using-Apache-Spark-and-SageMaker-Processing
but it has no specific instructions for including MLeap. There is an example of including MLeap in a Glue job, but following the same steps for a PySparkProcessor job did not work:
https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.html#Serializing-the-trained-Spark-ML-Model-with-MLeap
An updated example showing which MLeap version and which base image to use would be very helpful.

Additional context
After attempting to follow the documentation, I also tried a few other things to get it to work:

- downloading the needed .jar files from Maven and including them via the submit_jars parameter;
- using the configuration option to specify MLeap as a dependency via the spark.jars.packages property;
- extending one of the base Spark processing containers to install the needed Python package and then including the .jars in the run command.

None of these worked; a sketch of the first two approaches follows the linked thread. That thread describes some of the issues I ran into, and I referred to it often while troubleshooting the serialization:

combust/mleap-docs#8
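
Here is a minimal sketch of the first two approaches, using the sagemaker SDK's PySparkProcessor. The framework version, role ARN, script name, jar path, and MLeap Maven coordinates are placeholders I could not confirm; the version pairing is exactly what I'm asking about:

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="mleap-serialization",
    framework_version="3.0",  # assumed Spark version
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# Approach 1: pass jars downloaded from Maven via submit_jars.
spark_processor.run(
    submit_app="preprocess.py",  # hypothetical entry-point script
    submit_jars=["jars/mleap-spark-assembly.jar"],  # local jar from Maven
)

# Approach 2: resolve MLeap at runtime through spark.jars.packages.
spark_processor.run(
    submit_app="preprocess.py",
    configuration=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                # assumed coordinates; the right version for the container is unclear
                "spark.jars.packages": "ml.combust.mleap:mleap-spark_2.12:0.17.0",
            },
        }
    ],
)
```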
