This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[DNM] Support spark320 #669

Closed · wants to merge 3 commits
Conversation

zhouyuan
Collaborator

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/native-sql-engine/issues

Then could you also rename the commit message and pull request title to the following format?

[NSE-${ISSUES_ID}] ${detailed message}

See also:

@zhouyuan
Collaborator Author

@PHILO-HE

@PHILO-HE PHILO-HE self-assigned this Jan 19, 2022
@PHILO-HE
Collaborator

I will re-pick this work.
An initial thought: this patch will not be merged into master directly. Instead, we will separate the divergence between Spark 3.1 and 3.2 into modules, and the classes under these modules will be called through a shim layer that recognizes the Spark version.
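A minimal sketch of that shim-layer idea, with hypothetical names rather than Gazelle's actual API: each supported Spark version gets its own shim module implementing a common trait, and a loader selects one at runtime. Class-name matching is used here only so the sketch compiles against either Spark version; real shim modules would be compiled against their own Spark dependency and use plain type checks.

import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical shim interface; the real project may expose different methods.
trait SparkShims {
  def isAqeShuffleRead(plan: SparkPlan): Boolean
}

// Spark 3.1 shim: the AQE shuffle-read operator is CustomShuffleReaderExec.
class Spark31Shims extends SparkShims {
  def isAqeShuffleRead(plan: SparkPlan): Boolean =
    plan.getClass.getSimpleName == "CustomShuffleReaderExec"
}

// Spark 3.2 shim: the operator was renamed to AQEShuffleReadExec.
class Spark32Shims extends SparkShims {
  def isAqeShuffleRead(plan: SparkPlan): Boolean =
    plan.getClass.getSimpleName == "AQEShuffleReadExec"
}

object SparkShimLoader {
  // Pick the shim matching the running Spark version.
  lazy val shims: SparkShims = SPARK_VERSION match {
    case v if v.startsWith("3.1") => new Spark31Shims
    case v if v.startsWith("3.2") => new Spark32Shims
    case v => throw new UnsupportedOperationException(s"Unsupported Spark version: $v")
  }
}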

@@ -48,11 +48,11 @@ case class ColumnarCustomShuffleReaderExec(
partitionSpecs.map(_.asInstanceOf[PartialMapperPartitionSpec].mapIndex).toSet.size ==
partitionSpecs.length) {
child match {
-case ShuffleQueryStageExec(_, s: ColumnarShuffleExchangeAdaptor) =>
+case ShuffleQueryStageExec(_, s: ColumnarShuffleExchangeAdaptor, _) =>
Collaborator

Refactor the code to fix compatibility issues.
The shim-layer approach is also kept, but commented out.

@@ -77,7 +77,7 @@ class ShuffledColumnarBatchRDD(
override def getPreferredLocations(partition: Partition): Seq[String] = {
val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
partition.asInstanceOf[ShuffledColumnarBatchRDDPartition].spec match {
-case CoalescedPartitionSpec(startReducerIndex, endReducerIndex) =>
+case CoalescedPartitionSpec(startReducerIndex, endReducerIndex, _) =>
Collaborator

Refactor the code to fix compatibility issues.

@@ -36,12 +36,15 @@ import org.apache.spark.sql.execution.python.ArrowEvalPythonExec
import org.apache.spark.sql.execution.python.ColumnarArrowEvalPythonExec
import org.apache.spark.sql.execution.window.WindowExec

case class RowGuard(child: SparkPlan) extends SparkPlan {
Collaborator

Pending.

Collaborator

UnaryExecNode is a subclass of SparkPlan.

-def children: Seq[SparkPlan] = Seq(child)
+//def children: Seq[SparkPlan] = Seq(child)

+override protected def withNewChildInternal(newChild: SparkPlan): RowGuard =
Collaborator

Fixed.

@@ -70,7 +73,7 @@ case class ColumnarGuardRule() extends Rule[SparkPlan] {
ColumnarArrowEvalPythonExec(plan.udfs, plan.resultAttrs, plan.child, plan.evalType)
case plan: BatchScanExec =>
if (!enableColumnarBatchScan) return false
-new ColumnarBatchScanExec(plan.output, plan.scan)
+new ColumnarBatchScanExec(plan.output, plan.scan, plan.runtimeFilters)
Collaborator

Fixed through shim layer.

@@ -133,9 +136,9 @@ case class ColumnarGuardRule() extends Rule[SparkPlan] {
left match {
case exec: BroadcastExchangeExec =>
new ColumnarBroadcastExchangeExec(exec.mode, exec.child)
-case BroadcastQueryStageExec(_, plan: BroadcastExchangeExec) =>
+case BroadcastQueryStageExec(_, plan: BroadcastExchangeExec, _) =>
Collaborator

Fixed through code refactor.

@@ -147,9 +150,9 @@ case class ColumnarGuardRule() extends Rule[SparkPlan] {
right match {
case exec: BroadcastExchangeExec =>
new ColumnarBroadcastExchangeExec(exec.mode, exec.child)
-case BroadcastQueryStageExec(_, plan: BroadcastExchangeExec) =>
+case BroadcastQueryStageExec(_, plan: BroadcastExchangeExec, _) =>
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

Fixed through code refactor.

@@ -239,7 +242,7 @@ case class ColumnarGuardRule() extends Rule[SparkPlan] {
case p if !supportCodegen(p) =>
// insert row guard them recursively
p.withNewChildren(p.children.map(insertRowGuardOrNot))
case p: CustomShuffleReaderExec =>
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

Fixed through shim layer.

plan.child match {
case shuffle: ColumnarShuffleExchangeAdaptor =>
logDebug(s"Columnar Processing for ${plan.getClass} is currently supported.")
CoalesceBatchesExec(
ColumnarCustomShuffleReaderExec(plan.child, plan.partitionSpecs))
-case ShuffleQueryStageExec(_, shuffle: ColumnarShuffleExchangeAdaptor) =>
+case ShuffleQueryStageExec(_, shuffle: ColumnarShuffleExchangeAdaptor, _) =>
Collaborator

Refactor the code, similar to ColumnarCustomShuffleReaderExec.

BroadcastQueryStageExec(
curPlan.id,
BroadcastExchangeExec(
val newBroadcast = BroadcastExchangeExec(
Collaborator

Fixed through shim layer.

curPlan.id,
BroadcastExchangeExec(
curPlan.id, newBroadcast, newBroadcast.doCanonicalize)
case ReusedExchangeExec(_, originalBroadcastPlan: ColumnarBroadcastExchangeAdaptor) =>
Collaborator

Similar to the above.

@@ -25,8 +25,8 @@ import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

-class ColumnarBatchScanExec(output: Seq[AttributeReference], @transient scan: Scan)
-extends BatchScanExec(output, scan) {
+class ColumnarBatchScanExec(output: Seq[AttributeReference], @transient scan: Scan, runtimeFilters: Seq[Expression])
Collaborator

Move this class into the respective shim layers for Spark 3.1/3.2.
Code related to the Gazelle config is removed to get rid of the GazellePluginConfig dependency; otherwise, a cyclic dependency would need to be handled.
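For illustration, a sketch of what the Spark 3.2 shim copy could look like, assuming Spark 3.2's three-argument BatchScanExec constructor and omitting all Gazelle-specific members:

import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.connector.read.Scan
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Spark 3.2 shim variant: BatchScanExec gained a runtimeFilters parameter in
// Spark 3.2, so this constructor only compiles there; the Spark 3.1 shim
// keeps the two-argument form.
class ColumnarBatchScanExec(
    output: Seq[AttributeReference],
    @transient scan: Scan,
    runtimeFilters: Seq[Expression])
  extends BatchScanExec(output, scan, runtimeFilters) {
  // Columnar overrides omitted in this sketch.
}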

@@ -254,12 +254,6 @@
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
Collaborator

Pending. Must this be deleted?

@@ -31,6 +31,27 @@
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
<repository>
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

Only applicable for the Spark snapshot. It is not needed for the release version.

@@ -241,10 +241,12 @@ object RowToColumnConverter {
* populate with [[RowToColumnConverter]], but the performance requirements are different and it
* would only be to reduce code.
*/
-case class RowToArrowColumnarExec(child: SparkPlan) extends UnaryExecNode {
+trait RowToArrowColumnarTransition extends UnaryExecNode
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

This looks unnecessary. Keep the following extends relation unchanged.

// Try to push down filters when filter push-down is enabled.
val pushed = if (enableParquetFilterPushDown) {
val parquetSchema = footerFileMetaData.getSchema
val parquetFilters = new ParquetFilters(parquetSchema, pushDownDate, pushDownTimestamp,
-pushDownDecimal, pushDownStringStartWith, pushDownInFilterThreshold, isCaseSensitive)
+pushDownDecimal, pushDownStringStartWith, pushDownInFilterThreshold, isCaseSensitive,
Collaborator

Fixed through shim layer.

@@ -137,6 +137,9 @@ object ArrowWriteExtension {
private case class ColumnarToFakeRowLogicAdaptor(child: LogicalPlan)
extends OrderPreservingUnaryNode {
override def output: Seq[Attribute] = child.output
+override protected def withNewChildInternal(newChild: LogicalPlan): ColumnarToFakeRowLogicAdaptor =
Collaborator

This method is added for both Spark 3.1 & 3.2; the override keyword is intentionally omitted for compatibility reasons.
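A sketch of the pattern this comment describes, with the override keyword left out so the same source compiles on both versions:

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OrderPreservingUnaryNode}

private case class ColumnarToFakeRowLogicAdaptor(child: LogicalPlan)
  extends OrderPreservingUnaryNode {
  override def output: Seq[Attribute] = child.output

  // `override` is deliberately omitted: on Spark 3.2 this implements the
  // abstract withNewChildInternal (Scala allows omitting `override` when
  // implementing an abstract member), while on Spark 3.1 no such method
  // exists in the parent, where `override` would fail to compile.
  protected def withNewChildInternal(newChild: LogicalPlan): ColumnarToFakeRowLogicAdaptor =
    copy(child = newChild)
}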

@@ -94,6 +94,10 @@ class ArrowFileFormat extends FileFormat with DataSourceRegister with Serializab
override def close(): Unit = {
writeQueue.close()
}

override def path(): String = {
Collaborator

Fixed through shim layer.
TODO: it seems we can put the new interface method here and omit the override keyword.

@@ -99,7 +99,7 @@ case class ColumnarConditionProjectExec(
}
}

-def isNullIntolerant(expr: Expression): Boolean = expr match {
+override def isNullIntolerant(expr: Expression): Boolean = expr match {
Collaborator

This implementation is the same as the one in its parent class, PredicateHelper, in Spark 3.2. To keep the code workable for Spark 3.1, we just changed the method name. For Spark 3.1, isNullIntolerant is not an interface method or abstract method.

@@ -55,6 +55,57 @@ import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide
import org.apache.spark.sql.execution.joins.{HashJoin,ShuffledJoin,BaseJoinExec}
import org.apache.spark.sql.execution.joins.HashedRelationInfo

+trait ColumnarShuffledJoin extends BaseJoinExec {
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

This looks the same as ShuffledJoin in Spark 3.2. Let's not add this trait.

@@ -59,7 +59,7 @@ case class ColumnarBroadcastHashJoinExec(
nullAware: Boolean = false)
extends BaseJoinExec
with ColumnarCodegenSupport
-with ShuffledJoin {
+with ColumnarShuffledJoin {
Collaborator

I think we can still let it extend ShuffledJoin. This needs to be verified in compilation and tests.


override lazy val outputPartitioning: Partitioning = {
joinType match {
-case _: InnerLike if broadcastHashJoinOutputPartitioningExpandLimit > 0 =>
+case _: InnerLike if conf.broadcastHashJoinOutputPartitioningExpandLimit > 0 =>
Collaborator

Fixed through shim layer.

@@ -307,7 +306,7 @@ case class ColumnarHashAggregateExec(
val aggregateFunc = exp.aggregateFunction
val out_res = aggregateFunc.children.head.asInstanceOf[Literal].value
aggregateFunc match {
-case Sum(_) =>
+case Sum(_, _) =>
Collaborator

Just use type matching.
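For example, a sketch assuming Spark 3.2's Sum gained a second constructor parameter (failOnError):

import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateFunction, Sum}

// Matching on the type rather than on constructor arity compiles against
// both Spark 3.1 (Sum(child)) and Spark 3.2 (Sum(child, failOnError)).
def isSum(func: AggregateFunction): Boolean = func match {
  case _: Sum => true
  case _      => false
}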

@@ -271,11 +271,7 @@ object ColumnarExpressionConverter extends Logging {
columnarDivide,
expr)
}
-case oaps: com.intel.oap.expression.ColumnarScalarSubquery =>
Collaborator

ColumnarScalarSubquery is useless, so we removed the code that uses it. This is not relevant to Spark 3.1/3.2 compatibility.

@@ -1,120 +0,0 @@
/*
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

Deleted

@@ -69,7 +69,7 @@ case class ColumnarPreOverrides() extends Rule[SparkPlan] {
ColumnarArrowEvalPythonExec(plan.udfs, plan.resultAttrs, columnarChild, plan.evalType)
case plan: BatchScanExec =>
logDebug(s"Columnar Processing for ${plan.getClass} is currently supported.")
-new ColumnarBatchScanExec(plan.output, plan.scan)
+new ColumnarBatchScanExec(plan.output, plan.scan, plan.runtimeFilters)
Collaborator

Fixed through shim layer.

@@ -44,7 +44,7 @@ case class ColumnarPreOverrides() extends Rule[SparkPlan] {
var isSupportAdaptive: Boolean = true

def replaceWithColumnarPlan(plan: SparkPlan): SparkPlan = plan match {
-case RowGuard(child: CustomShuffleReaderExec) =>
+case RowGuard(child: AQEShuffleReadExec) =>
Collaborator

@PHILO-HE PHILO-HE Feb 23, 2022

In Spark 3.2, CustomShuffleReaderExec is renamed to AQEShuffleReadExec.
Part of ColumnarCustomShuffleReaderExec's code is ported from CustomShuffleReaderExec/AQEShuffleReadExec.

Collaborator

ShufflePartitionUtils.scala also imports CustomShuffleReaderExec, so it needs to be fixed as well.



@transient
private lazy val maxBroadcastRows = mode match {
Collaborator

Fixed through shim layer.
TODO: check whether the update for Spark 3.2 is applicable to 3.1.

@@ -44,7 +44,7 @@ case class ColumnarPreOverrides() extends Rule[SparkPlan] {
var isSupportAdaptive: Boolean = true

def replaceWithColumnarPlan(plan: SparkPlan): SparkPlan = plan match {
-case RowGuard(child: CustomShuffleReaderExec) =>
+case RowGuard(child: AQEShuffleReadExec) =>
Collaborator

Add guard logic to check the type in the shim layer.

@@ -204,17 +204,17 @@ case class ColumnarPreOverrides() extends Rule[SparkPlan] {
logDebug(s"Columnar Processing for ${plan.getClass} is currently supported.")
plan

-case plan: CustomShuffleReaderExec if columnarConf.enableColumnarShuffle =>
+case plan: AQEShuffleReadExec if columnarConf.enableColumnarShuffle =>
Collaborator

Add guard logic to check the type in the shim layer.
plan.child and plan.partitionSpecs should be obtained through the shim layer.
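A hypothetical sketch of such shim accessors (names illustrative, not Gazelle's actual API):

import org.apache.spark.sql.execution.{ShufflePartitionSpec, SparkPlan}
import org.apache.spark.sql.execution.adaptive.AQEShuffleReadExec

// The rule would call these instead of naming the version-specific class.
trait AqeShuffleReadShims {
  def isAqeShuffleRead(plan: SparkPlan): Boolean
  def getChild(plan: SparkPlan): SparkPlan
  def getPartitionSpecs(plan: SparkPlan): Seq[ShufflePartitionSpec]
}

// Spark 3.2 implementation; the 3.1 shim would do the same with
// CustomShuffleReaderExec.
class Spark32AqeShuffleReadShims extends AqeShuffleReadShims {
  def isAqeShuffleRead(plan: SparkPlan): Boolean =
    plan.isInstanceOf[AQEShuffleReadExec]
  def getChild(plan: SparkPlan): SparkPlan =
    plan.asInstanceOf[AQEShuffleReadExec].child
  def getPartitionSpecs(plan: SparkPlan): Seq[ShufflePartitionSpec] =
    plan.asInstanceOf[AQEShuffleReadExec].partitionSpecs
}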

def output: Seq[Attribute] = child.output
protected def doExecute(): RDD[InternalRow] = {
throw new UnsupportedOperationException
}
def children: Seq[SparkPlan] = Seq(child)
Collaborator

@PHILO-HE PHILO-HE Feb 24, 2022

For Spark 3.2, a parent class, UnaryLike, has already defined a val called children.
For Spark 3.1, UnaryExecNode already contains this children member as well. So if we let this class extend UnaryExecNode for both Spark 3.1 & 3.2, the children method here can be deleted.
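A sketch of the result (class name hypothetical): with UnaryExecNode as the parent on both versions, no local children definition is needed, since Spark 3.2's UnaryLike supplies it as a final lazy val and Spark 3.1's UnaryExecNode provides it as well.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

case class FakeRowAdaptorSketch(child: SparkPlan) extends UnaryExecNode {
  def output: Seq[Attribute] = child.output

  protected def doExecute(): RDD[InternalRow] =
    throw new UnsupportedOperationException

  // No `children` definition here: UnaryExecNode supplies it on both versions.
  // `override` is omitted on withNewChildInternal for the same cross-version
  // reason discussed above (the method only exists on Spark 3.2).
  protected def withNewChildInternal(newChild: SparkPlan): SparkPlan =
    copy(child = newChild)
}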

@@ -84,7 +84,7 @@ class ColumnarShuffleWriter[K, V](
override def write(records: Iterator[Product2[K, V]]): Unit = {
if (!records.hasNext) {
partitionLengths = new Array[Long](dep.partitioner.numPartitions)
-shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, null)
+shuffleBlockResolver.writeMetadataFileAndCommit(dep.shuffleId, mapId, partitionLengths, null, null)
Collaborator

Fixed through shim layer.

@@ -108,7 +108,6 @@ class ColumnarShuffleManager(conf: SparkConf) extends ShuffleManager with Loggin
shuffleExecutorComponents)
case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
new SortShuffleWriter(
-shuffleBlockResolver,
Collaborator

Fixed through shim layer.

@transient
private[sql] lazy val relationFuture: java.util.concurrent.Future[broadcast.Broadcast[Any]] = {
SQLExecution.withThreadLocalCaptured[broadcast.Broadcast[Any]](
-sqlContext.sparkSession,
+session,
Collaborator

Fixed through shim layer.
Both are accessible in the parent class, SparkPlan: for Spark 3.1 it is sqlContext, but for Spark 3.2 it is session.

releasedOrClosed: AtomicBoolean,
context: TaskContext): Iterator[ColumnarBatch] = {

-new ReaderIterator(stream, writerThread, startTime, env, worker, releasedOrClosed, context) {
+new ReaderIterator(stream, writerThread, startTime, env, worker, pid, releasedOrClosed, context) {
Collaborator

Fixed by introducing an abstract child class in the shim layer.

@@ -1,729 +0,0 @@
/*
Collaborator

Moved into the Spark 3.1 shim layer. For Spark 3.2, use the Spark 3.2 dependency.

@@ -1,933 +0,0 @@
/*
Collaborator

Moved into the Spark 3.1 shim layer. For Spark 3.2, use the Spark 3.2 dependency.

@zhouyuan
Collaborator Author

zhouyuan commented Mar 1, 2022

replaced by #742

@zhouyuan zhouyuan closed this Mar 1, 2022