[SPARK-37019][SQL] Add codegen support to array higher-order functions #34558

Kimahriman · 2021-11-11T13:55:00Z

What changes were proposed in this pull request?

This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.

A few notes:

Disabled subexpression elimination for lambda functions (this was already the case because it was CodegenFallback). I plan to explore supporting subexpression elimination inside lambda functions later on, as it will require special handling.
I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)

Why are the changes needed?

To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

AmplabJenkins · 2021-11-11T14:08:36Z

Can one of the admins verify this patch?

Kimahriman · 2022-01-02T14:50:13Z

@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts?

Tagar · 2022-03-15T05:31:46Z

@Kimahriman just out of curiosity, how much did the performance improve?

Kimahriman · 2022-03-15T11:54:15Z

It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd.

I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. #34727 is also part of that goal.

jaceklaskowski

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

jaceklaskowski · 2023-04-03T19:24:52Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+    }
+
+    val result = f(lambdaVars)
+    namedLambdas.foreach(v => currentLambdaVars.remove(v.variableName))


nit: s/v/lambda? Don't want to ask for namedLambda as used earlier, but v does not fit really.

I'd also consider namedLambdas.map(_.variableName).foreach(currentLambdaVars.remove)

Yeah, I never know exactly when using _ does and doesn't work with a function call, and I struggle with the cleanest way to write things like this. I think I like the second one
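For readers comparing the two spellings discussed above, here is a minimal sketch showing that they remove the same keys. `Lambda` is a hypothetical stand-in for `NamedLambdaVariable` with only the one field used here, and the map values are plain integers rather than `ExprCode`; neither is taken from the PR.

```scala
import scala.collection.mutable

object RemoveStyles {
  // Hypothetical stand-in for NamedLambdaVariable; only the field used here.
  final case class Lambda(variableName: String)

  // Style 1: explicit lambda parameter, as in the original diff (with a clearer name than `v`).
  def removeWithLambdaArg(vars: mutable.Map[String, Int], lambdas: Seq[Lambda]): Unit =
    lambdas.foreach(lambda => vars.remove(lambda.variableName))

  // Style 2: project the names first, then pass `remove` as a method reference.
  // `vars.remove` is eta-expanded to a String => Option[Int] function.
  def removeWithMethodRef(vars: mutable.Map[String, Int], lambdas: Seq[Lambda]): Unit =
    lambdas.map(_.variableName).foreach(vars.remove)

  def main(args: Array[String]): Unit = {
    val lambdas = Seq(Lambda("x_1"), Lambda("i_2"))
    val a = mutable.HashMap("x_1" -> 1, "i_2" -> 2, "other" -> 3)
    val b = a.clone()
    removeWithLambdaArg(a, lambdas)
    removeWithMethodRef(b, lambdas)
    println(a == b) // both maps end up with only the "other" entry
  }
}
```

The second form builds an intermediate sequence of names, which is negligible here since a higher-order function has at most a handful of lambda variables.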

jaceklaskowski · 2023-04-03T19:27:16Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+  }
+
+  def getLambdaVar(name: String): ExprCode = {
+    currentLambdaVars.getOrElse(name, {


nit: Are these curly brackets required?

I saw this format in other parts of the code, but I also have seen it without, so can remove
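On the question of whether the braces are required: the default argument of `getOrElse` is a by-name parameter, so a single expression can be passed with or without a block. The sketch below illustrates this with a plain string map; the map contents and the error message are assumptions for illustration, not the PR's code.

```scala
import scala.collection.mutable

object GetOrElseBraces {
  // Hypothetical registry; the real map holds ExprCode values.
  val vars: mutable.Map[String, String] = mutable.HashMap("x" -> "value_x")

  // With braces: the default is a block expression, evaluated only on a miss.
  def withBraces(name: String): String =
    vars.getOrElse(name, {
      throw new IllegalStateException(s"Lambda variable $name not found")
    })

  // Without braces: the same by-name default, still evaluated only on a miss.
  def withoutBraces(name: String): String =
    vars.getOrElse(
      name,
      throw new IllegalStateException(s"Lambda variable $name not found"))

  def main(args: Array[String]): Unit = {
    println(withBraces("x"))
    println(withoutBraces("x"))
    println(scala.util.Try(withoutBraces("missing")).isFailure)
  }
}
```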

jaceklaskowski · 2023-04-03T19:28:30Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+
+  // We need to include the Expr ID in the Codegen variable name since several tests bypass
+  // `UnresolvedNamedLambdaVariable.freshVarName`
+  lazy val variableName = s"${name}_${exprId.id}"


It does not seem consistent with simpleString. Is this intentional?

Hmm must have thought it was used in the actual generated code or something but doesn't look like it is. In fact I don't think variableName is really needed at all, can just use exprId.id as the map key, and name for the code comment
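A rough sketch of the idea in this reply, keying the registry by expression ID and keeping the name only for readability, might look like the following. `ExprCode`, `NamedLambdaVariable` and `Context` are simplified, hypothetical stand-ins rather than Spark's classes, and the `try`/`finally` cleanup is an assumption of this sketch, not something the diff shows.

```scala
import scala.collection.mutable

object LambdaVarScope {
  // Hypothetical, simplified stand-ins for Spark's ExprCode and NamedLambdaVariable.
  final case class ExprCode(isNull: String, value: String)
  final case class NamedLambdaVariable(name: String, exprId: Long)

  final class Context {
    // Lambda variables currently in scope, keyed by expression ID.
    private val currentLambdaVars = mutable.HashMap.empty[Long, ExprCode]
    private var counter = 0

    private def freshName(prefix: String): String = {
      counter += 1
      s"${prefix}_$counter"
    }

    // Registers the variables, runs `f` with their codes, then unregisters them.
    def withLambdaVars[T](lambdas: Seq[NamedLambdaVariable])(f: Seq[ExprCode] => T): T = {
      val codes = lambdas.map { lambda =>
        // The human-readable name is used only to make the generated names legible.
        val code = ExprCode(freshName("isNull"), freshName(lambda.name))
        currentLambdaVars.put(lambda.exprId, code)
        code
      }
      try f(codes)
      finally lambdas.foreach(lambda => currentLambdaVars.remove(lambda.exprId))
    }

    def getLambdaVar(id: Long): ExprCode =
      currentLambdaVars.getOrElse(
        id,
        throw new IllegalStateException(s"Lambda variable $id is not in scope"))
  }

  def main(args: Array[String]): Unit = {
    val ctx = new Context
    val x = NamedLambdaVariable("x", 7L)
    ctx.withLambdaVars(Seq(x)) { codes =>
      println(ctx.getLambdaVar(7L) == codes.head) // lookup works inside the scope
    }
    println(scala.util.Try(ctx.getLambdaVar(7L)).isFailure) // and fails outside it
  }
}
```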

jaceklaskowski · 2023-04-03T19:34:42Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

@@ -350,6 +445,49 @@ case class ArrayTransform(
    result
  }

+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    ctx.withLambdaVars(Seq(elementVar) ++ indexVar, { lambdaExprs =>


nit: Be consistent with {s; here, { is before an input argument to a function literal while a few lines below it's after =>.

I think I did that because of the instances where I used the case expression to pull out the single element list, will update these for consistency

Kimahriman · 2023-04-04T20:44:54Z

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping

cloud-fan · 2023-05-19T04:12:38Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

  private def childrenToRecurse(expr: Expression): Seq[Expression] = expr match {
    case _: CodegenFallback => Nil
    case c: ConditionalExpression => c.alwaysEvaluatedInputs.map(skipForShortcut)
+    case h: HigherOrderFunction => h.arguments


do we need to do the same for commonChildrenToRecurse?

I don't think so. That only cares if the current expression is a ConditionalExpression. The default is Nil if it's not that so I don't think it needs any special handling for HOFs

cloud-fan · 2023-05-19T04:19:14Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

@@ -130,6 +134,23 @@ case class LambdaFunction(

  override def eval(input: InternalRow): Any = function.eval(input)

+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val functionCode = function.genCode(ctx)


seems we can just return function.genCode(ctx)?

The code here is only to optimize the non-nullable case, which should be handled by the function body codegen already.

I'm trying to wrap my head around this again. I think they might be doing different things? Like for ArrayTransform, the nullSafeCodeGen is only optimizing the non-nullable case of the result of ArrayTransform, whereas this is optimizing the non-nullable case of the result of the lambda function being called for each element in the ArrayTransform

But maybe this already handles that:

val resultNull = if (function.nullable) Some(functionCode.isNull.toString) else None
val resultAssignment = CodeGenerator.setArrayElement(arrayData, dataType.elementType, i, copy, isNull = resultNull)

I'll need to think about this a little more

Okay nevermind I see what you mean now, it's just assigning the same values to new variables for no reason. Changed to just return function.genCode(ctx)
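To make the conclusion of this thread concrete, here is a toy model of the difference between wrapping the body's generated code in fresh variables and passing it through unchanged. `ExprCode` is a hypothetical three-field stand-in, and the Java snippets inside the strings are invented for illustration; none of this is the PR's actual generated code.

```scala
object PassThroughGenCode {
  // Hypothetical stand-in for Spark's ExprCode: generated Java plus result variable names.
  final case class ExprCode(code: String, isNull: String, value: String)

  // Stand-in for function.genCode(ctx): the lambda body already yields its own result variables.
  def bodyGenCode(): ExprCode =
    ExprCode("int value_1 = x + 1; boolean isNull_1 = false;", "isNull_1", "value_1")

  // Wrapping version: copies the body's results into new variables for no benefit.
  def wrapped(): ExprCode = {
    val body = bodyGenCode()
    val code = s"${body.code} boolean isNull_2 = ${body.isNull}; int value_2 = ${body.value};"
    ExprCode(code, "isNull_2", "value_2")
  }

  // Pass-through version: the caller uses the body's variables directly.
  def passThrough(): ExprCode = bodyGenCode()

  def main(args: Array[String]): Unit = {
    println(wrapped().code)
    println(passThrough().code)
  }
}
```

In this model the wrapped version emits two extra assignments and computes nothing new, which is the redundancy described above.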

cloud-fan · 2023-05-19T04:24:03Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+      elementVar: NamedLambdaVariable, index: String): String = {
+    val elementType = elementVar.dataType
+    val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value)
+    val extractElement = CodeGenerator.getValue(arrayName, elementType, index)


With codegen, I don't think we need a global atomic reference to represent the lambda variable. We can just generate local variables in the java code.

The atomic references are for cases where you might have a CodegenFallback expression inside your lambda function. I think the only way an expression.eval(input) will work is if this atomic reference is set to the lambda variable value as well. I could potentially try to detect if the function has a CodegenFallback expression, but I wasn't sure if there were other reasons some expression would fall back to interpreted inside the lambda function that I couldn't detect.

Finally got back around to this, added tests showing why the atomic reference is needed in case of CodegenFallback inside a lambda expression
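The argument for keeping the atomic reference can be illustrated with a small toy model. All names below (`LambdaVar`, `FallbackPlusOne`, `transform`, the `setAtomic` flag) are hypothetical and only mimic the shape of the problem: a "compiled" loop holds each element in a local variable, while a fallback expression can only read the lambda variable through its interpreted `eval()`.

```scala
import java.util.concurrent.atomic.AtomicReference

object LambdaFallback {
  // Toy lambda variable: interpreted evaluation reads the current value from an AtomicReference.
  final class LambdaVar {
    val value = new AtomicReference[Any]()
    def eval(): Any = value.get()
  }

  // Toy fallback expression: it cannot see local variables of the compiled loop,
  // only what the lambda variable's eval() returns.
  final class FallbackPlusOne(v: LambdaVar) {
    def eval(): Int = v.eval() match {
      case n: Int => n + 1
      case _ => throw new IllegalStateException("lambda variable was never set")
    }
  }

  // Toy "compiled" transform loop; setAtomic controls whether the element is also
  // published to the lambda variable for interpreted children.
  def transform(
      input: Array[Int],
      v: LambdaVar,
      body: FallbackPlusOne,
      setAtomic: Boolean): Array[Int] = {
    val out = new Array[Int](input.length)
    var i = 0
    while (i < input.length) {
      val element = input(i) // the local variable that generated code would use
      if (setAtomic) v.value.set(element)
      out(i) = body.eval()
      i += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val v = new LambdaVar
    val body = new FallbackPlusOne(v)
    println(transform(Array(1, 2, 3), v, body, setAtomic = true).mkString(","))
    val unset = new LambdaVar
    val attempt = scala.util.Try(
      transform(Array(1, 2, 3), unset, new FallbackPlusOne(unset), setAtomic = false))
    println(attempt.isFailure)
  }
}
```

In this model, skipping the `set` call is only safe when nothing in the lambda body falls back to interpreted evaluation, which is the condition the reply above says is hard to detect reliably.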

jaceklaskowski

More nitting 😉

jaceklaskowski · 2023-06-22T09:00:11Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   */
+  var currentLambdaVars: mutable.Map[Long, ExprCode] = mutable.HashMap.empty
+
+  def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable],


Can you put namedLambdas... on a separate line?

jaceklaskowski · 2023-06-22T09:01:57Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+
+  def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable],
+      f: Seq[ExprCode] => ExprCode): ExprCode = {
+    val lambdaVars = namedLambdas.map { namedLambda =>


nit: Replace namedLambda with lambda?

jaceklaskowski · 2023-06-22T09:03:04Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+  }
+
+  def getLambdaVar(id: Long): ExprCode = {
+    currentLambdaVars.getOrElse(id,


Can you move id to its own new line? 🙏

jaceklaskowski · 2023-06-22T09:05:10Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  protected def nullSafeCodeGen(
+      ctx: CodegenContext,
+      ev: ExprCode,
+      f: String => String): ExprCode = {


I'd be very happy if you use this parameter formatting style in the other places in your PR. Makes reading so much easier, esp. with functions with 5+ params. Can you make such change? 🙏

jaceklaskowski · 2023-06-22T09:11:41Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+      })
+    })
+  }
+


Looks like a copy and paste of exists, doesn't it? Can we have a parent class for some sharing? Unless I'm mistaken, the generated code block is the exact copy except $forall (vs $exists).

I'll have to look at this a little bit. It's tricky because there's a few places they are different (default true vs false, !'d for one and not the other, the followThreeValuedLogic flag for exists)
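As a sketch of how the differences listed in this reply could be expressed as parameters of one shared routine, the model below works on already-evaluated lambda results, with `None` standing for SQL NULL. The helper `quantify` and its parameters are hypothetical, and the assumption that `forall` always applies three-valued logic while `exists` makes it optional is this sketch's reading of the thread, not code from the PR.

```scala
object Quantifiers {
  // Each element is one lambda result: Some(true), Some(false), or None for SQL NULL.
  // stopOn is the value that short-circuits the loop (true for exists, false for forall).
  def quantify(
      results: Seq[Option[Boolean]],
      stopOn: Boolean,
      threeValued: Boolean): Option[Boolean] = {
    var foundNull = false
    var stopped = false
    val it = results.iterator
    while (it.hasNext && !stopped) {
      it.next() match {
        case None => foundNull = true
        case Some(r) => if (r == stopOn) stopped = true
      }
    }
    if (stopped) Some(stopOn)
    else if (threeValued && foundNull) None // unknown: a NULL could have decided the result
    else Some(!stopOn) // the default: false for exists, true for forall
  }

  def exists(results: Seq[Option[Boolean]], followThreeValuedLogic: Boolean): Option[Boolean] =
    quantify(results, stopOn = true, threeValued = followThreeValuedLogic)

  // Assumption of this sketch: forall always applies three-valued logic.
  def forall(results: Seq[Option[Boolean]]): Option[Boolean] =
    quantify(results, stopOn = false, threeValued = true)

  def main(args: Array[String]): Unit = {
    println(exists(Seq(Some(false), None), followThreeValuedLogic = true))
    println(exists(Seq(Some(false), None), followThreeValuedLogic = false))
    println(forall(Seq(Some(true), None)))
  }
}
```

Whether the generated Java for the two expressions can share a template as cleanly as this model suggests is a separate question, since the codegen also has to handle variable naming and null checks per element.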

Kimahriman · 2023-06-23T00:37:28Z

More nitting 😉

Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips!

chris-twiner · 2024-10-10T12:40:16Z

@Kimahriman just out of curiosity, how much did the performance improve?

I just wanted to add to the above response that I've implemented a compilation scheme here, as part of Quality, and we saw perf boosts of up to 40%; beyond that, adding further lambdas made the cost of code generation higher than the saving. It's definitely usage dependent, though: the more work done in the function, the higher the cost (and therefore the potential saving from compilation). A small boost is also noticeable from removing the atomic under similarly ideal circumstances.

edit - the source

github-actions bot added the SQL label Nov 11, 2021

Kimahriman mentioned this pull request Nov 11, 2021

[SPARK-37019][SQL] Add codegen support to array transform #34294

Closed

Kimahriman force-pushed the array-hof-codegen branch from f46fb71 to 217960e Compare December 3, 2021 23:17

Kimahriman force-pushed the array-hof-codegen branch from 217960e to ce082f3 Compare December 23, 2021 14:35

Kimahriman force-pushed the array-hof-codegen branch from ce082f3 to d4a2f63 Compare January 2, 2022 14:45

Kimahriman force-pushed the array-hof-codegen branch from d4a2f63 to aaa4be4 Compare January 28, 2022 20:56

Kimahriman force-pushed the array-hof-codegen branch from aaa4be4 to 3da6342 Compare April 25, 2022 00:08

Kimahriman force-pushed the array-hof-codegen branch from 3da6342 to c3236c0 Compare May 7, 2022 14:42

Kimahriman force-pushed the array-hof-codegen branch from c3236c0 to 2e9f4d3 Compare June 5, 2022 17:48

Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from 9cec788 to 8b898b0 Compare July 12, 2022 11:44

Kimahriman force-pushed the array-hof-codegen branch from 8b898b0 to 194e457 Compare August 28, 2022 22:17

Kimahriman force-pushed the array-hof-codegen branch from 194e457 to 1a52017 Compare September 23, 2022 23:56

Kimahriman force-pushed the array-hof-codegen branch from 1a52017 to b71b633 Compare October 18, 2022 12:01

Kimahriman force-pushed the array-hof-codegen branch from b71b633 to a565a82 Compare November 6, 2022 13:32

Kimahriman force-pushed the array-hof-codegen branch from a565a82 to 92d9a9f Compare November 22, 2022 00:13

Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from a565a82 to 92d9a9f Compare November 30, 2022 23:36

Kimahriman force-pushed the array-hof-codegen branch from 92d9a9f to 03c2dc6 Compare January 1, 2023 14:36

Kimahriman force-pushed the array-hof-codegen branch from 03c2dc6 to 572b666 Compare February 4, 2023 13:36

Kimahriman force-pushed the array-hof-codegen branch from 572b666 to 4a7dba9 Compare March 3, 2023 12:54

Kimahriman mentioned this pull request Mar 18, 2023

[SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() #40473

Closed

Kimahriman force-pushed the array-hof-codegen branch from 4a7dba9 to 5b13cd2 Compare March 30, 2023 11:46

jaceklaskowski reviewed Apr 3, 2023

Kimahriman force-pushed the array-hof-codegen branch from dcfb17c to a565a82 Compare April 24, 2023 17:04

Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from dcfb17c to 05f05cf Compare April 29, 2023 13:30

Kimahriman force-pushed the array-hof-codegen branch from 05f05cf to 8250106 Compare May 6, 2023 14:34

cloud-fan reviewed May 19, 2023

Kimahriman force-pushed the array-hof-codegen branch from d8e4452 to 6f99d94 Compare June 21, 2023 11:13

jaceklaskowski reviewed Jun 22, 2023

Kimahriman closed this Jun 23, 2023

Kimahriman reopened this Jun 23, 2023

Kimahriman force-pushed the array-hof-codegen branch from 674ce58 to b272364 Compare August 13, 2023 12:57

Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from bad044e to 1998566 Compare October 4, 2023 11:41

Kimahriman force-pushed the array-hof-codegen branch from 1998566 to 301e427 Compare January 1, 2024 13:25

Kimahriman force-pushed the array-hof-codegen branch from 301e427 to 4d04e41 Compare January 21, 2024 00:00

Kimahriman force-pushed the array-hof-codegen branch from 4d04e41 to 72505bd Compare March 16, 2024 15:23

Kimahriman force-pushed the array-hof-codegen branch from 72505bd to d3bb6fc Compare May 15, 2024 11:14

Kimahriman force-pushed the array-hof-codegen branch from d3bb6fc to 9da319d Compare August 16, 2024 11:43

Kimahriman force-pushed the array-hof-codegen branch from 9da319d to 6f8115f Compare October 1, 2024 11:13

Kimahriman force-pushed the array-hof-codegen branch from 6f8115f to 5c82d9c Compare November 25, 2024 12:28

Kimahriman added 5 commits February 8, 2025 08:55

Add codegen support to array functions

f49836d

Remove unnecessary variableName and clean up some formatting

60ce914

Remove unnecessary extra variable copies

0b1fe32

Improve some styling

a4068d0

Add tests for codegen fallback inside HOF

419b1fd

Kimahriman force-pushed the array-hof-codegen branch from 5c82d9c to 419b1fd Compare February 8, 2025 13:55
