What is the bug?
The Spark connector (Flint) has a problem parsing JSON. In FlintJacksonParser, it calls a method on a Spark class to create a JsonFactory:
https://github.com/opensearch-project/opensearch-spark/blob/main/flint-spark-integration/src/main/scala/org/apache/spark/sql/flint/json/FlintJacksonParser.scala#L53
The Flint jar currently shades Jackson. The Spark method returns an instance of com.fasterxml.jackson.core.JsonFactory, but since Jackson was shaded, FlintJacksonParser expects an instance of shaded.flint.com.fasterxml.jackson.core.JsonFactory.
This mismatch breaks at least the spark-shell whenever Flint needs to parse JSON.
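At the JVM level this is a linkage failure rather than a missing class. A minimal sketch of what happens at the call site (the val name here is illustrative; "options" is an org.apache.spark.sql.catalyst.json.JSONOptions):

// After shading, Flint's bytecode asks the JVM for a method with the descriptor
//   ()Lshaded/flint/com/fasterxml/jackson/core/JsonFactory;
// but Spark's real buildJsonFactory() has the descriptor
//   ()Lcom/fasterxml/jackson/core/JsonFactory;
// The return type is part of the descriptor used during method resolution,
// so linkage fails with the NoSuchMethodError shown below, even though a
// method named buildJsonFactory() exists on JSONOptions.
private lazy val factory = options.buildJsonFactory()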
Not shading the Jackson libraries, and instead relying on the ones bundled with the Spark distribution, fixes the problem.
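A minimal sketch of both sides in sbt-assembly terms, assuming the Flint build uses a Jackson relocation rule along these lines (the exact rule and version in the real build may differ):

// Problematic: this relocation rewrites Flint's own references to
// shaded.flint.com.fasterxml.jackson..., while Spark keeps handing back
// the unshaded com.fasterxml.jackson... classes.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shaded.flint.com.fasterxml.jackson.@1").inAll
)

// Fix: drop that rule and rely on the Jackson that ships with Spark, e.g.
// by scoping the dependency as "provided" (Spark 3.5.3 bundles Jackson 2.15.x):
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % "2.15.2" % "provided"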
How can one reproduce the bug?
Steps to reproduce the behavior:
Start an OpenSearch server
Configure a Spark server with the Flint extension, and add the settings it needs to connect to the OpenSearch server (see the sketch after these steps)
Run spark-shell on the host running the Spark server
Try to show the results of a query against an OpenSearch index. For example:
spark.sql("SELECT * FROM dev.default.test").show()
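A spark-shell invocation along these lines exercises the failing path (the catalog class and spark.datasource.flint.* keys are assumed from the opensearch-spark documentation; host and port are placeholders to adjust for your setup):

spark-shell \
  --jars flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar \
  --conf spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog \
  --conf spark.datasource.flint.host=localhost \
  --conf spark.datasource.flint.port=9200 \
  --conf spark.datasource.flint.scheme=http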
What is the expected behavior?
The results should be formatted and displayed. For example:
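A representative spark-shell table (the schema and rows below are illustrative, not the actual contents of dev.default.test):

+---+-------+
| id|   name|
+---+-------+
|  1|example|
+---+-------+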
What is your host/environment?
Plugins: Spark 3.5.3 with flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar
Do you have any screenshots?
N/A
Do you have any additional context?
Stack Trace:
java.lang.NoSuchMethodError: 'shaded.flint.com.fasterxml.jackson.core.JsonFactory org.apache.spark.sql.catalyst.json.JSONOptions.buildJsonFactory()'
at org.apache.spark.sql.flint.json.FlintJacksonParser.<init>(FlintJacksonParser.scala:53)
at org.apache.spark.sql.flint.FlintPartitionReader.parser$lzycompute(FlintPartitionReader.scala:31)
at org.apache.spark.sql.flint.FlintPartitionReader.parser(FlintPartitionReader.scala:31)
at org.apache.spark.sql.flint.FlintPartitionReader.safeParser$lzycompute(FlintPartitionReader.scala:39)
at org.apache.spark.sql.flint.FlintPartitionReader.safeParser(FlintPartitionReader.scala:37)
at org.apache.spark.sql.flint.FlintPartitionReader.next(FlintPartitionReader.scala:53)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
at scala.Option.exists(Option.scala:376)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
24/12/06 22:30:05 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (9585a65ef886 executor driver): java.lang.NoSuchMethodError: 'shaded.flint.com.fasterxml.jackson.core.JsonFactory org.apache.spark.sql.catalyst.json.JSONOptions.buildJsonFactory()'
[stack trace identical to the one above]
24/12/06 22:30:05 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (9585a65ef886 executor driver): java.lang.NoSuchMethodError: 'shaded.flint.com.fasterxml.jackson.core.JsonFactory org.apache.spark.sql.catalyst.json.JSONOptions.buildJsonFactory()'
[stack trace identical to the one above]