
[BUG] Most recent opensearch-spark jar causing errors with Pyspark #536

Closed

Marwen94 opened this issue Dec 8, 2024 · 6 comments

Labels: bug (Something isn't working), untriaged

Marwen94 commented Dec 8, 2024

What is the bug?

We get the following error when reading from AWS OpenSearch (not Serverless) using opensearch-spark-30_2.13-1.1.0.jar (the latest as of today):

 java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: org.opensearch.spark.sql.DefaultSource15 Unable to get public no-arg constructor
	at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:584)
	at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:675)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1234)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1266)
	at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1301)
	at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1386)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at scala.collection.Iterator.foreach(Iterator.scala:943)

How can one reproduce the bug?

pyspark --jars /Downloads/opensearch-spark-30_2.13-1.1.0.jar script_read_from_OS.py

script_read_from_OS.py:

os_options = {
    "opensearch.nodes": OPENSEARCH_NODES,
    "opensearch.port": "443",
    "opensearch.resource": INDEX,
    "opensearch.net.http.auth.user": USERNAME,
    "opensearch.net.http.auth.pass": PASS,
    "opensearch.net.ssl": "true",
    "opensearch.nodes.wan.only": "true",
}
df = spark.read.format("org.opensearch.spark.sql").options(**os_options).load()

What is the expected behavior?

We expect to be able to do a simple read from the OpenSearch index.

What is your host/environment?

macOS
Spark 3.2.4
Scala 2.13.15
OpenSearch version 1.2

Do you have any screenshots?

(Screenshot attached: 2024-12-08 at 21:36:50)

Do you have any additional context?

I also tested this in AWS Glue versions 3 and 4, where I get a slightly different error:

An error occurred while calling o125.load. org.apache.spark.sql.sources.DataSourceRegister: Provider org.opensearch.spark.sql.DefaultSource15 could not be instantiated

I tried the solutions discussed in several older issues, but with no luck:

#243
#153 (comment)

This issue is similar to #505; I opened a new one because the previous one is inactive and to give more context.
Thank you!

Marwen94 added the bug and untriaged labels on Dec 8, 2024
Xtansia removed the untriaged label on Dec 8, 2024

Xtansia (Collaborator) commented Dec 8, 2024

@Marwen94 The root exception in your stack trace, NoClassDefFoundError: scala/$less$colon$less, does imply some kind of Scala version mismatch. I believe AWS Glue only supports Scala 2.12, and Spark 3.x is also built against Scala 2.12 by default unless you specifically use the Scala 2.13 distribution. Could you please confirm whether the issue persists when using the 2.12 version of the OpenSearch connector?
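
One quick way to confirm which Scala version a Spark installation was built with, as a sketch run inside the pyspark shell (the _jvm gateway call below relies on scala.util.Properties from the Scala standard library):

# Sketch: print the Scala version the running Spark was built with; it must
# match the connector jar's _2.12 / _2.13 suffix.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# e.g. "version 2.12.15" -> the _2.13 connector jar cannot load here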

Marwen94 commented Dec 9, 2024

Hello @Xtansia, indeed it worked using the 2.12 build, both locally and in Glue. Thanks.

It is odd, though, that the 2.13 version did not work locally with Scala 2.12.

Marwen94 closed this as completed on Dec 9, 2024

Xtansia (Collaborator) commented Dec 9, 2024

@Marwen94 In Scala's versioning scheme, 2.12 and 2.13 are different major versions and as such are not inter-compatible: https://docs.scala-lang.org/overviews/core/binary-compatibility-of-scala-releases.html#guarantees-and-versioning
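
Concretely, only the major.minor prefix (the Scala binary version) matters when picking the artifact. A sketch of deriving the expected jar suffix from the running session, again via scala.util.Properties:

# Sketch: versionNumberString() returns e.g. "2.12.15"; the binary version
# used in artifact names is the "2.12" prefix.
scala_full = spark.sparkContext._jvm.scala.util.Properties.versionNumberString()
scala_binary = ".".join(scala_full.split(".")[:2])
print(f"use an opensearch-spark-30_{scala_binary} jar")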

Marwen94 commented Dec 9, 2024

@Xtansia, I actually get this Scala type issue when reading from OpenSearch:

My script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Opensearch') \
    .config('spark.jars', 'Downloads/opensearch-spark-30_2.12-1.3.0.jar') \
    .getOrCreate()
sc = spark.sparkContext

os_options = {
    "opensearch.nodes": OPENSEARCH_NODES,
    "opensearch.port": "443",
    "opensearch.resource": INDEX,
    "opensearch.net.http.auth.user": USERNAME,
    "opensearch.net.http.auth.pass": PASS,
    "opensearch.net.ssl": "true",
    "opensearch.nodes.wan.only": "true",
    "opensearch.read.field.empty.as.null": "true",
    "opensearch.read.field.as.array.exclude": "true",
}

df = spark.read.format("org.opensearch.spark.sql").options(**os_options).load()

df.count()

Error:

pType, anyToMicros, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 44, time), TimestampType, ObjectType(class java.lang.Object)), true, false, true) AS time#644
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 45, type), StringType, ObjectType(class java.lang.String)), true, false, true) AS type#645.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1561)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:193)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:389)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	... 1 more
Caused by: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_68$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_0.writeFields_14_5$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_0.writeFields_14$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:207)
	... 18 more

My OpenSearch index contains nested structs and arrays.

Any ideas?

Thank you!

Marwen94 reopened this on Dec 9, 2024

Xtansia (Collaborator) commented Dec 9, 2024

You will need to set opensearch.read.field.as.array.include to a comma-separated list of the names of all fields that are arrays.
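
For example, a minimal sketch assuming the index's array fields are named tags and events (hypothetical names; substitute the actual array fields from your mapping):

os_options = {
    # ... connection options as before ...
    # Hypothetical field names: list every field that is an array in the index.
    "opensearch.read.field.as.array.include": "tags,events",
}
df = spark.read.format("org.opensearch.spark.sql").options(**os_options).load()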

Marwen94 commented

That indeed works, thank you so much @Xtansia
