Mleap and python 3.8 #842
I've been using MLeap with Python 3.8 for quite a while. Both MLeap v0.20 and v0.21.x should work; I can't remember if 0.19 did or not (probably yes).
@jsleight Thanks! Have you ever used serialize-to-bundle? I'm on Python 3.8 and MLeap 0.20, with Java 8 and Scala 2.12. I create the context as below, but I can't serialize a pipeline. I pasted some of the code below:

```python
spark = gen_spark_session()
dfTrainFake = spark.createDataFrame([
    # Do some stuff
```
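What I'm attempting is roughly the following (a simplified sketch with a toy dataframe, not my exact code; `gen_spark_session` is replaced by a plain builder call):

```python
# Simplified sketch of the attempt; the real pipeline and data are larger.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

import mleap.pyspark  # noqa: F401  (registers serializeToBundle on fitted models)
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

spark = SparkSession.builder.getOrCreate()  # stand-in for gen_spark_session()

dfTrainFake = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="categoryIndex"),
])
model = pipeline.fit(dfTrainFake)

# This is the call that blows up for me:
model.serializeToBundle("jar:file:/tmp/pipeline.zip", model.transform(dfTrainFake))
```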
I suspect you need one or more additional jars. I have all of these as dependencies:
That full list is probably overkill, but you probably need one or both of the bundle ones.
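If it helps, pulling them in via `spark.jars.packages` is one option. A sketch, assuming the Maven coordinate that matches your MLeap version (the bundle artifacts should come in transitively, though I haven't double-checked that):

```python
from pyspark.sql import SparkSession

# Sketch: pin the version that matches the MLeap Python package you installed.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "ml.combust.mleap:mleap-spark_2.12:0.20.0")
    .getOrCreate()
)
```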
So unfortunately this still does not work; I get the same error. I did as below to get the Spark session (is it right to use spark.jars.packages?). I also ran the two commands below to set the paths, because it was complaining about a Python 3.7 vs. Python 3.8 mismatch (I can probably add those to .bashrc). Is the only thing left to try all of the jars.packages above?
I added all of these except the below and still no luck.
I am doing this:
@jsleight Is it possible the problem is conda and Spark? I.e., I am thinking the issue might be as in the link below: they very carefully select some env variables to make this error go away (note it's not the same error as mine, but the complaint is also the JVM missing something, so I think it's related). Thanks in advance, or do you know someone who might know what the issue is? I'm sort of stuck. Also, it seems that I have py4j-0.10.9-src.zip; I imagine this is OK? Additionally, are these jars needed to build and ultimately serialize an xgboost model in your pipeline?
ENV variables:
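Roughly along these lines (illustrative values, not my exact settings):

```python
import os

# Illustrative values only; the real paths come from the Dataproc image.
os.environ["SPARK_HOME"] = "/usr/lib/spark"
os.environ["PYSPARK_PYTHON"] = "/opt/conda/default/bin/python"         # executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/conda/default/bin/python"  # driver
# The py4j zip name has to match the one shipped with the installed Spark.
os.environ["PYTHONPATH"] = (
    "/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.9-src.zip"
)
```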
Hmmm, I can't reproduce this. Can you post a complete minimal example?
Thanks! So I have pyspark 3.1.3, mleap 0.20.0, and Python 3.8.15. I am on a Google Dataproc 2.0 box and I put the jars above in /usr/lib/spark/jars. I took your code (let's fix that) and it gave me this JVM error:
Specifically, if I close a terminal I get this error first; I fix it by setting the variables as below, and then I get the JVM one. I.e. this:
You'll need the xgboost jar to serialize an xgboost model. The tensorflow jar is only needed if you want to serialize a tensorflow model.
I'm using pip and virtualenv and don't have any relevant env variables. I have py4j 0.10.9.5, but it is just being pulled in as a dependency of pyspark. MLeap 0.20.0 is built for Spark 3.2.0, which might cause some of these problems.
OK, thank you again - I will try 0.19.0. That should work with 3.1.3, right? It seems so from the website: https://github.com/combust/mleap

Update: I downgraded mleap to 0.19.0 and pulled the old jars from Maven, but I still have the same error.
https://github.com/combust/mleap#mleapspark-version has the version compatibility; 0.19.0 was built for Spark 3.0.2. All of these compatibilities are just the versions that are explicitly tested, so other combinations might work, but I don't have conclusive evidence one way or the other. The class not being in the JVM would support your idea of something being weird with your env variables.
Digging around more, I think this should work:
My experience with the py4j …
Actually, looking further up in the stack trace, I see this error as the root. I added the jars directly as below, and it is still happening. Really unsure why. Would it be possible to zoom with anyone on the team over this?

```python
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        'spark.jars',
        '/usr/lib/spark/jars/mleap-spark-base_2.12-0.16.0.jar,'
        '/usr/lib/spark/jars/mleap-spark_2.12-0.16.0.jar,'
        '/usr/lib/spark/jars/mleap-runtime_2.12-0.16.0.jar,'
        '/usr/lib/spark/jars/bundle-ml_2.12-0.16.0.jar,'
        '/usr/lib/spark/jars/bundle-hdfs_2.12-0.16.0.jar',
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType',
            'adUnit', 'pageType', 'zip', 'advertiserId']

df = spark.createDataFrame(...)               # truncated in the original paste
pipeline = pyspark.ml.Pipeline(stages=[...])  # truncated in the original paste

from mleap.pyspark import spark_support
```
Basically my error looks like this, and the issue seems to be some path problem: combust/mleap-docs#8 (comment)
It is interesting that your error is complaining about a path with …
So the jar bundle-ml_2.12-0.19.0.jar (or 0.16.0) has this class in it, and the path seems like it uses '/'. Another question I have: I know the jars need to be added to the Spark jars folder, but do they also need to be made visible to pyspark in some way? It's as if pyspark does not see the right thing (pyspark 3.1.3, mleap <= 0.19.0).
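My understanding (which may be off) is that pyspark only sees whatever the driver JVM it launches sees, so something like the following should be equivalent to dropping the jars into /usr/lib/spark/jars. A sketch, with the jar names from my earlier paste bumped to 0.19.0:

```python
import os

# Must be set before the first SparkSession/SparkContext is created;
# an already-running JVM will not pick the jars up.
jars = ",".join([
    "/usr/lib/spark/jars/mleap-spark_2.12-0.19.0.jar",
    "/usr/lib/spark/jars/mleap-spark-base_2.12-0.19.0.jar",
    "/usr/lib/spark/jars/mleap-runtime_2.12-0.19.0.jar",
    "/usr/lib/spark/jars/bundle-ml_2.12-0.19.0.jar",
    "/usr/lib/spark/jars/bundle-hdfs_2.12-0.19.0.jar",
])
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {jars} pyspark-shell"
```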
My speculation is that py4j is not using the right jars in some way ... conda might have its own py4j and that is the confusion. But I'm still unsure. Just wondering if you have seen this before ...
The error seems to be from:
@jsleight Yeah, so I fixed that problem with the import (I know there's a better solution, but adding the prefix seems to bring in the needed class), but then I get another error.
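For reference, the import-based workaround looks roughly like this on my end (an approximation, not my exact code):

```python
# Importing via the full module path is what seems to register the
# serializer and make the underlying Scala class visible through py4j.
import mleap.pyspark  # noqa: F401
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401
```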
This is my Spark context, btw ... I added all the right jars, I think.
In my experience the … When I run your code examples above (using mleap 0.19.0), it works for me.
Is there a way to check that the context I pass has what it needs? I mean, given the jars above, I should have everything. This does give me a context, so those jars are where they need to be.
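Something along these lines is what I had in mind, if it's even the right thing to look at (my own guess, not an official recipe):

```python
# `spark` is the SparkSession created above.
sc = spark.sparkContext
print(sc.getConf().get("spark.jars", ""))           # jars passed explicitly
print(sc.getConf().get("spark.jars.packages", ""))  # maven packages, if any

# Ask the driver JVM directly whether the MLeap serializer class resolves;
# this raises a py4j error if the class really is missing from the classpath.
spark._jvm.java.lang.Class.forName("ml.combust.mleap.spark.SimpleSparkSerializer")
```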
See #845 |
Hi,
I have more of a question than a specific issue.
I was trying to use Python 3.8, but my question is: does MLeap support it? Which version would work? I know some changes were made that seem to suggest it does, but I'm unsure whether they are in the stable release yet.
Thank you!