-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37019][SQL] Add codegen support to array higher-order functions #34558
base: master
Are you sure you want to change the base?
Conversation
Can one of the admins verify this patch? |
f46fb71
to
217960e
Compare
217960e
to
ce082f3
Compare
ce082f3
to
d4a2f63
Compare
@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts? |
d4a2f63
to
aaa4be4
Compare
@Kimahriman just out of curiosity, how much did the performance improve? |
It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd. I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. #34727 is also part of that goal. |
aaa4be4
to
3da6342
Compare
3da6342
to
c3236c0
Compare
c3236c0
to
2e9f4d3
Compare
9cec788
to
8b898b0
Compare
8b898b0
to
194e457
Compare
194e457
to
1a52017
Compare
1a52017
to
b71b633
Compare
b71b633
to
a565a82
Compare
a565a82
to
92d9a9f
Compare
a565a82
to
92d9a9f
Compare
92d9a9f
to
03c2dc6
Compare
03c2dc6
to
572b666
Compare
572b666
to
4a7dba9
Compare
4a7dba9
to
5b13cd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).
} | ||
|
||
val result = f(lambdaVars) | ||
namedLambdas.foreach(v => currentLambdaVars.remove(v.variableName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/v/lamba? Don't want to ask for namedLambda
as used earlier, but v
does not fit really.
I'd also consider namedLambdas.map(_.variableName).foreach(currentLambdaVars.remove)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I never know all the exact ways that using _
does and doesn't work when you use a function call, struggle with the cleanest way to write things like this. I like the second one I think
} | ||
|
||
def getLambdaVar(name: String): ExprCode = { | ||
currentLambdaVars.getOrElse(name, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Are these curly brackets required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this format in other parts of the code, but I also have seen it without, so can remove
|
||
// We need to include the Expr ID in the Codegen variable name since several tests bypass | ||
// `UnresolvedNamedLambdaVariable.freshVarName` | ||
lazy val variableName = s"${name}_${exprId.id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not seem consistent with simpleString
. Is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm must have thought it was used in the actual generated code or something but doesn't look like it is. In fact I don't think variableName
is really needed at all, can just use exprId.id
as the map key, and name
for the code comment
@@ -350,6 +445,49 @@ case class ArrayTransform( | |||
result | |||
} | |||
|
|||
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | |||
ctx.withLambdaVars(Seq(elementVar) ++ indexVar, { lambdaExprs => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Be consistent with {
s; here, {
is before an input argument to a function literal while a few lines below it's after =>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did that because of the instances where I used the case expression to pull out the single element list, will update these for consistency
Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping |
dcfb17c
to
a565a82
Compare
dcfb17c
to
05f05cf
Compare
05f05cf
to
8250106
Compare
private def childrenToRecurse(expr: Expression): Seq[Expression] = expr match { | ||
case _: CodegenFallback => Nil | ||
case c: ConditionalExpression => c.alwaysEvaluatedInputs.map(skipForShortcut) | ||
case h: HigherOrderFunction => h.arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do the same for commonChildrenToRecurse
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. That only cares if the current expression is a ConditionalExpression
. The default is Nil
if it's not that so I don't think it needs any special handling for HOFs
@@ -130,6 +134,23 @@ case class LambdaFunction( | |||
|
|||
override def eval(input: InternalRow): Any = function.eval(input) | |||
|
|||
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | |||
val functionCode = function.genCode(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we can just return function.genCode(ctx)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is only to optimize the non-nullable case, which should be handled by the function body codegen already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to wrap my ahead around this again. I think they might be doing different things? Like for ArrayTransform
, the nullSafeCodeGen
is only optimizing the non-nullable case of the result of ArrayTransform
, whereas this is optimizing the non-nullable case of the result of the lambda function being called for each element in the ArrayTransform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe this already handles that:
val resultNull = if (function.nullable) Some(functionCode.isNull.toString) else None
val resultAssignment = CodeGenerator.setArrayElement(arrayData, dataType.elementType,
i, copy, isNull = resultNull)
I'll need to think about this a little more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay nevermind I see what you mean now, it's just assigning the same values to new variables for no reason. Changed to just return function.genCode(ctx)
elementVar: NamedLambdaVariable, index: String): String = { | ||
val elementType = elementVar.dataType | ||
val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value) | ||
val extractElement = CodeGenerator.getValue(arrayName, elementType, index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With codegen, I don't think we need a global atomic reference to represent the lambda variable. We can just generate local variables in the java code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The atomic references are for cases where you might have a CodegenFallback
expression inside your lambda function. I think the only way an expression.eval(input)
will work is if this atomic reference is set to the lambda variable value as well. I could potentially try to detect if the function has a CodegenFallback
expression, but I wasn't sure if there were other reasons some expression would fall back to interpreted inside the lambda function that I couldn't detect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally got back around to this, added tests showing why the atomic reference is needed in case of CodegenFallback
inside a lambda expression
d8e4452
to
6f99d94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More nitting 😉
*/ | ||
var currentLambdaVars: mutable.Map[Long, ExprCode] = mutable.HashMap.empty | ||
|
||
def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put namedLambdas...
on a separate line?
|
||
def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable], | ||
f: Seq[ExprCode] => ExprCode): ExprCode = { | ||
val lambdaVars = namedLambdas.map { namedLambda => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Replace namedLambda
to lambda
?
} | ||
|
||
def getLambdaVar(id: Long): ExprCode = { | ||
currentLambdaVars.getOrElse(id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move id
to its own new line? 🙏
protected def nullSafeCodeGen( | ||
ctx: CodegenContext, | ||
ev: ExprCode, | ||
f: String => String): ExprCode = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be very happy if you use this parameter formatting style in the other places in your PR. Makes reading so much easier, esp. with functions with 5+ params. Can you make such change? 🙏
}) | ||
}) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a copy and paste of exists
, doesn't it? Can we have a parent class for some sharing? Unless I'm mistaken, the generated code block is the exact copy except $forall
(vs $exists
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to look at this a little bit. It's tricky because there's a few places they are different (default true vs false, !
'd for one and not the other, the followThreeValuedLogic
flag for exists)
Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips! |
674ce58
to
b272364
Compare
bad044e
to
1998566
Compare
1998566
to
301e427
Compare
301e427
to
4d04e41
Compare
4d04e41
to
72505bd
Compare
72505bd
to
d3bb6fc
Compare
d3bb6fc
to
9da319d
Compare
9da319d
to
6f8115f
Compare
I just wanted to add to the above response that I've implemented a compilation scheme here, as part of Quality, and we saw perf boosts of up to 40%, after that adding further lambdas triggered the cost of code generation being higher than the saving. It's definitely usage dependant though, the more work done in the function the higher the cost (and therefore potential saving by compilation), a small boost is noticeable on removal of the atomic under similar ideal circumstances. edit - the source |
6f8115f
to
5c82d9c
Compare
5c82d9c
to
419b1fd
Compare
What changes were proposed in this pull request?
This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.
A few notes:
Why are the changes needed?
To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen
Does this PR introduce any user-facing change?
No, only performance improvements.
How was this patch tested?
Existing unit tests, let me know if there's other codegen-specific unit tests I should add.