
Mleap and python 3.8 #842

Closed
drei34 opened this issue Jan 26, 2023 · 27 comments
@drei34

drei34 commented Jan 26, 2023

Hi,

I have more of a question than a specific issue.

I'm trying to use Python 3.8. Does mleap support it, and if so, which version should I run? I know some changes were made that seem to suggest it does, but I'm not sure whether they have made it into a stable release yet.

Thank you!

@jsleight
Contributor

I've been using mleap with Python 3.8 for quite a while. Both mleap v0.20 and v0.21.x should work; I can't remember whether 0.19 did or not (probably yes).
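(As a quick sanity check, here is a minimal sketch for confirming which mleap and pyspark versions are actually installed in the driver's Python environment; it assumes both were installed with pip:)

# Sketch: print the installed mleap and pyspark versions (Python 3.8+).
from importlib.metadata import version

print("mleap  :", version("mleap"))
print("pyspark:", version("pyspark"))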

@drei34
Author

drei34 commented Feb 10, 2023

@jsleight Thanks! Have you ever used serializeToBundle? I'm on Python 3.8 and mleap 0.20, with Java 8 and Scala 2.12. I create the Spark session as below, but I can't serialize a pipeline. I pasted some code below.


def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0"
    ).enableHiveSupport().getOrCreate()

spark = gen_spark_session()

dfTrainFake = spark.createDataFrame([
    (7,20,3,6,1,10,3,53948,245351,1),
    (7,20,3,6,1,10,3,53948,245351,1)
], ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId','label'])
print(type(dfTrainFake))

# Do some stuff
pipelineModel.serializeToBundle(locMLeapModelPath, trainPredictions)

@jsleight
Contributor

I suspect you need an additional jar or two. I have all of these as dependencies:

ml.combust.mleap:mleap-spark_2.12
ml.combust.mleap:mleap-core_2.12
ml.combust.mleap:mleap-runtime_2.12
ml.combust.mleap:mleap-avro_2.12
ml.combust.mleap:mleap-spark-extension_2.12
ml.combust.mleap:mleap-spark-base_2.12
ml.combust.bundle:bundle-ml_2.12
ml.combust.bundle:bundle-hdfs_2.12
ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

That full list is probably overkill, but you probably need one or both of the bundle ones.
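(For illustration, a minimal sketch of that dependency list expressed as a spark.jars.packages string; the 0.20.0 version is an assumption matching the config earlier in the thread:)

# Sketch: build the spark.jars.packages value from the artifact list above (0.20.0 assumed).
MLEAP_VERSION = "0.20.0"
packages = ",".join(
    f"{artifact}:{MLEAP_VERSION}"
    for artifact in [
        "ml.combust.mleap:mleap-spark_2.12",
        "ml.combust.mleap:mleap-spark-base_2.12",
        "ml.combust.bundle:bundle-ml_2.12",
        "ml.combust.bundle:bundle-hdfs_2.12",
    ]
)
# ...then pass it via .config("spark.jars.packages", packages) on the session builder.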

@drei34
Author

drei34 commented Feb 10, 2023

Unfortunately this still does not work; I get the same error. I build the Spark session the same way as before (is spark.jars.packages the right way to do this?). I also ran the two commands below to set the Python paths, since it was complaining about a Python 3.7 vs. Python 3.8 mismatch (I can probably add these to .bashrc). Is the only thing left to try all of the jars listed above?

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python
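(A minimal sketch of the same fix applied from Python instead of the shell; the interpreter path is the one from the exports above, and it must be set before the SparkSession is created:)

# Sketch: point both driver and workers at the same Python before building the session.
import os

conda_python = "/opt/conda/miniconda3/bin/python"  # path taken from the exports above
os.environ["PYSPARK_PYTHON"] = conda_python
os.environ["PYSPARK_DRIVER_PYTHON"] = conda_python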


@drei34
Author

drei34 commented Feb 13, 2023

I added all of these except the ones below, and still no luck:

ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12
ml.combust.mleap:mleap-avro_2.12

I am doing this:

def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark_extension_2.12:0.20.0,"
        "ml.combust.mleap:mleap-runtime_2.12:0.20.0,"
        "ml.combust.mleap:mleap-core_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
        ).enableHiveSupport().getOrCreate()
       
spark = gen_spark_session()

@drei34
Author

drei34 commented Feb 14, 2023

@jsleight Is it possible the problem is the interaction between conda and Spark? I'm thinking the issue might be like the one in the link below: they carefully set some environment variables to make that error go away (it's not the same error I have, but the complaint is likewise about the JVM missing something, so I suspect it's related). Thanks in advance, and if you know someone who might recognize this issue, I'm sort of stuck. Also, it seems I have py4j-0.10.9-src.zip; I assume that's OK?

Additionally, are these jars needed to build and ultimately serialize an XGBoost model in your pipeline?

ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

ENV variables:
https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/

@jsleight
Contributor

jsleight commented Feb 14, 2023

Hmmm, I can't reproduce this. Can you post a complete minimal example?
E.g., using Python 3.8.0, pyspark 3.2.2, mleap 0.20.0, and py4j 0.10.9.5, this code works for me.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@drei34
Author

drei34 commented Feb 14, 2023

Thanks! So I have pyspark 3.1.3, mleap 0.20.0, and Python 3.8.15. I am on a Google Dataproc 2.0 box and I put the jars above in /usr/lib/spark/jars.

I ran your code (we can sort out the pyspark version mismatch separately) and it gave me this JVM error: Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM. I have py4j 0.10.9; is that also what you have? My speculation is this: https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/ but I have not been able to fix my problem. What are your environment variables? Are you using Anaconda?

@drei34
Author

drei34 commented Feb 14, 2023

Specifically, if I close the terminal I get this error first:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 477, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I fix this by setting the variables below, and then I get the JVM error.

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python
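(As an alternative sketch, Spark exposes equivalent configs; note the driver-side value generally needs to be in place before the driver Python starts, so spark-defaults.conf or the environment is usually the safer route:)

# Sketch: equivalent Spark configs (spark.pyspark.python mainly affects the executors;
# the driver interpreter is normally already fixed by the time this code runs).
import pyspark

builder = (
    pyspark.sql.SparkSession.builder
    .config("spark.pyspark.python", "/opt/conda/miniconda3/bin/python")
    .config("spark.pyspark.driver.python", "/opt/conda/miniconda3/bin/python")
)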


@jsleight
Contributor

Additionally, are these jars needed to build and ultimately serialize a xgboost model in your pipeline?

You'll need the xgboost jar to serialize an XGBoost model. The tensorflow jar is only needed if you want to serialize a TensorFlow model.
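(For example, a hedged sketch of a session that could serialize an XGBoost stage; the coordinates mirror the ones listed earlier in the thread, with 0.20.0 assumed:)

# Sketch: include the xgboost artifact alongside the other mleap jars (0.20.0 assumed).
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.mleap:mleap-xgboost-spark_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
    )
    .getOrCreate()
)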

@jsleight jsleight reopened this Feb 14, 2023
@jsleight
Contributor

I'm using pip and virtualenv, and I don't have any relevant env variables set.

I have py4j 0.10.9.5, but it is just being pulled in as a dependency of pyspark.

Mleap 0.20.0 is built against Spark 3.2.0, which might cause some of these problems.

@drei34
Author

drei34 commented Feb 14, 2023

OK, thank you again. I'll try 0.19.0; that should work with Spark 3.1.3, right? It seems so from the website: https://github.com/combust/mleap

Update: I downgraded mleap to 0.19.0 and pulled the old jars from Maven, but I still get the same error.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.19.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.19.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.19.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.19.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@jsleight
Contributor

https://github.com/combust/mleap#mleapspark-version has the version compatibility table. 0.19.0 was built against Spark 3.0.2. These compatibilities are just the versions that are explicitly tested, so other combinations might work, but I don't have conclusive evidence one way or the other.

The class not being in the JVM would support your idea of something being weird with your env variables.

@drei34
Author

drei34 commented Feb 14, 2023

Gotcha, I'll keep looking. Another thing that's weird is where the error is raised from inside py4j. Is there some request being sent somewhere? It complains that the answer from the Java side is empty, so I wonder if some request is being blocked due to an IP setting (very unsure).


@drei34
Author

drei34 commented Feb 15, 2023

Digging around more, I think this should work: pyspark --packages ml.combust.mleap:mleap-spark_2.12:0.16.0 predict.py, where I am specifying the Scala and mleap versions. This is from combust/mleap-docs#8 (comment) and is similar to specifying the packages when building the session, I think. I made the predict.py file the same as above, but still no luck.

I also moved the mleap version down, but that seems like a bad sign: I should not have to go back to mleap versions from years ago to make this work. This is Python 3.8, mleap 0.16, PySpark 3.1.3. It seems like this should all be OK, but I'm at a loss as to what to do about this error. I guess the only thing left is to make a conda env with Python 3.8.0, pyspark 3.2.2, mleap 0.20.0, and py4j 0.10.9.5 and hope it works. I believe there is something wrong with the paths on the Google machines, or something; I can't come up with anything else on this frustrating issue.


import pyspark
import pyspark.ml
 
spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.16.0,"
        "ml.combust.mleap:mleap-runtime_2.12:0.16.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.16.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.16.0"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)
 
# Works just fine on Dataproc image 1.4, not on 2.0.
pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)
 
from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@jsleight
Contributor

In my experience, the py4j "Answer from Java side is empty" errors usually mean that Spark's JVM died. The test case I ran above was with Java 8, but I'm pretty sure Java 11 will work too (and Java 11 certainly works with the latest mleap). Notably, Java 17 won't work because Spark doesn't support that Java version.
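(A minimal sketch for checking, through py4j, which Java the running Spark JVM is actually on; it assumes an active SparkSession named spark:)

# Sketch: ask the driver JVM for its Java version and home directory.
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("java.version"))
print(jvm.java.lang.System.getProperty("java.home"))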

@drei34
Author

drei34 commented Feb 15, 2023

Actually, looking further up the stack trace, I see this error as the root cause: Exception in thread "Thread-4" java.lang.NoClassDefFoundError: ml/combust/bundle/serializer/SerializationFormat ...

I added the jars directly, as below, and this is still happening. I'm really unsure why. Would it be possible to Zoom with anyone on the team about this?


import pyspark
import pyspark.ml

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars",
        "/usr/lib/spark/jars/mleap-spark-base_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/mleap-spark_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/mleap-runtime_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/bundle-ml_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/bundle-hdfs_2.12-0.16.0.jar"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']

df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

@drei34
Author

drei34 commented Feb 15, 2023

Basically my error looks like the one here, and the issue seems to be some path problem: combust/mleap-docs#8 (comment)

@jsleight
Contributor

It is interesting that your error complains about a path with '/' in it. I think it is usually shown with '.' separators? Maybe I'm misremembering, though.

@drei34
Author

drei34 commented Feb 21, 2023

So the jar bundle-ml_2.12-0.19.0.jar (or 0.16.0) does have this class in it, and the path shown does use '/'. Another question: I know the jars need to be added to the Spark jars folder, but do they also need to be registered with pyspark in some way? It's as if pyspark does not see the right thing (pyspark 3.1.3, mleap <= 0.19.0).

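(One hedged way to check whether the jars in that folder actually reached the running JVM; this sketch assumes an active SparkSession named spark:)

# Sketch: list what the Spark JVM reports for explicit jars and maven packages.
print(spark.conf.get("spark.jars", ""))
print(spark.conf.get("spark.jars.packages", ""))
print(spark.sparkContext._jsc.sc().listJars().mkString("\n"))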

@drei34
Author

drei34 commented Feb 21, 2023

My speculation is that py4j is not using the right jars in some way. Conda might have its own py4j and that could be the confusion, but I'm still unsure. Just wondering if you have seen this before.
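(A small sketch for checking which py4j the driver Python actually imports, in case conda is shadowing the one bundled with Spark:)

# Sketch: show where py4j was imported from and, if available, its version.
import py4j

print(getattr(py4j, "__version__", "unknown version"), py4j.__file__)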

@drei34
Author

drei34 commented Feb 21, 2023

The error seems to come from /usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j, and I can see imports happening there. I put some print statements in the java_gateway.py file in that directory, since that's where the exception is raised, to see what is being imported. The weird part is that most of the imports (I'm still using your example) look like org.apache.spark.ml.classification.LogisticRegression, but the mleap ones do not have the org.apache.spark prefix, even though mleap was installed via pip. Why would the prefix not be added to the path for those? Basically the exception complains that ml.combust.mleap.spark.SimpleSparkSerializer is not found, but I think it has access to org.apache.spark.ml.combust.mleap.spark.SimpleSparkSerializer and that's what it should be using. I'm trying to figure out why this is happening, and also adding this hack fix to see whether it goes anywhere.
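(Before patching java_gateway.py, a hedged sketch that asks the driver JVM directly whether it can load the class, which separates a classpath problem from a py4j path-handling problem; assumes an active SparkSession named spark:)

# Sketch: try to load the serializer class in the driver JVM.
jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName("ml.combust.mleap.spark.SimpleSparkSerializer")
    print("class is visible on the driver classpath")
except Exception as err:  # py4j raises if the class cannot be found
    print("class not visible to the JVM:", err)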

@drei34
Author

drei34 commented Feb 21, 2023

@jsleight Yeah, so I fixed that import problem (I know there's a better solution, but adding the prefix does bring in the needed class), but then I get another error: 'JavaPackage' object is not callable. I'm unsure why; this error seems to come up when versions are mismatched, but I don't think that's the case here. I'm on Java 8, Scala 2.12, mleap 0.19.0, PySpark 3.1.3, and Python 3.8.15.


@drei34
Author

drei34 commented Feb 21, 2023

This is my Spark session, by the way; I think I added all the right jars:

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars",
        "/usr/lib/spark/jars/mleap-spark-base_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-spark_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-runtime_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/bundle-ml_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/bundle-hdfs_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-spark-extension_2.12-0.19.0.jar"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

@jsleight
Contributor

In my experience, the 'JavaPackage' object is not callable error has always come from the Spark JVM not having the jars it needs.

When I run your code examples above (using mleap 0.19.0), it works for me.

@drei34
Author

drei34 commented Feb 21, 2023

Is there a way to check that the session I create has what it needs? Given the jars above, I should have everything. The builder does return a session, so those jars are at least where they need to be.
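(One hedged way to answer that from the running session: in py4j, a fully qualified name that the JVM cannot resolve comes back as a JavaPackage, which is exactly what makes the later call fail with 'JavaPackage' object is not callable. A minimal sketch, assuming an active SparkSession named spark:)

# Sketch: if the jar is visible this resolves to a JavaClass; otherwise py4j hands back a JavaPackage.
klass = spark.sparkContext._jvm.ml.combust.mleap.spark.SimpleSparkSerializer
print(type(klass))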

@drei34
Author

drei34 commented Mar 7, 2023

See #845

@drei34 drei34 closed this as completed Mar 7, 2023