[WIP][Spark-SQL] Optimize the Constant Folding for Expression #482
Conversation
Can one of the admins verify this patch?
Jenkins, test this please.
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14385/
```scala
// always be null
object alwaysNull extends Nullability(-1)
// never be null
object neverNull extends Nullability(1)
```
As a naming issue, objects like this should be capitalized (PossibleNull, AlwaysNull, NeverNull). It might also be good to rename PossibleNull to PossiblyNull.

I'm not sure if we want to overload the nullability interface in this way. This nullability bit is currently mostly intended for use when determining the storage layout of tuples (i.e., do we need to keep a null bit around, or can we operate on primitives). What is going on here is something like constraint propagation + null domination, which we might be able to accomplish more generally. Also, I'm not sure if it's a good idea to provide a default implementation of nullability, as there are quite a few operators whose nullability isn't defined by their children (many aggregate operators, for example). Perhaps there is a simpler way to do this using an improved rule in the optimizer.
@marmbrus I will review the nullability for aggregate operators etc.; those may also be foldable by statically analyzing nullability (for example, sum(a) where "a" is an expression that is always null). I was thinking of not extending the original nullability of expressions; however, only the "never null" and "always null" states are useful in the constant folding rule, while the boolean nullability can only represent "never null" or "possibly null". If we don't want to extend nullability, I think we need to find some other way to represent the "always null" state for an expression.
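The three states under discussion can be sketched as a small lattice. This is an illustrative sketch only: the names AlwaysNull, NeverNull, and PossiblyNull follow the naming suggestion above and are not Catalyst's actual API.

```scala
// Hypothetical three-state nullability for expressions; not Catalyst's
// real nullability interface, just an illustration of the discussion.
object NullabilityLattice {
  sealed abstract class Nullability
  case object AlwaysNull   extends Nullability // e.g. a null literal
  case object NeverNull    extends Nullability // e.g. a non-null literal
  case object PossiblyNull extends Nullability // e.g. a nullable column

  // How the states combine for a null-strict operator like `a + b`:
  // any always-null input dominates; two never-null inputs stay never-null.
  def combine(l: Nullability, r: Nullability): Nullability =
    (l, r) match {
      case (AlwaysNull, _) | (_, AlwaysNull) => AlwaysNull
      case (NeverNull, NeverNull)            => NeverNull
      case _                                 => PossiblyNull
    }
}
```

Under this sketch, constant folding could replace any subtree whose combined state is AlwaysNull with a null literal, which is the "null domination" behavior described above.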
I have two options for the changes, but "constraint propagation" still seems the best. Not sure if you have a better idea on this.
I like the option of doing it similar to BooleanSimplification. That seems nice and explicit. I also think this rule will play nicely with more general constraint propagation when we add that in the future.
@marmbrus, can you please review this again? I've rewritten the code as rule-based.
```scala
// Skip redundant folding of literals.
case l: Literal => l
// if it's foldable
```
I think we can omit this comment since the line beneath reads `if e.foldable`.
Thanks for working on this! I think the rule-based version is a lot clearer.
Thank you @marmbrus, I've updated the code a little bit. I've created a new rule, "NullPropagation"; it transforms the expression tree from bottom to top and leaves the constant folding rule unchanged.
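The shape of such a bottom-up, rule-based rewrite can be sketched on a toy expression tree. Everything here is an illustrative stand-in, not the actual Catalyst classes: a null-strict operator collapses to a null literal when either operand is a null literal, and IsNull over a known literal folds to a boolean literal.

```scala
object NullPropagationSketch {
  // Toy expression tree; names mirror Catalyst but are not its real API.
  sealed trait Expr
  case class Literal(value: Any) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr

  // Bottom-up rewrite: children are propagated first, then the parent is
  // folded if its children make the result statically known.
  def propagate(e: Expr): Expr = e match {
    case Add(l, r) =>
      (propagate(l), propagate(r)) match {
        // A null-strict operator with a null operand is always null.
        case (Literal(null), _) | (_, Literal(null)) => Literal(null)
        case (pl, pr)                                => Add(pl, pr)
      }
    case IsNull(child) =>
      propagate(child) match {
        case Literal(null) => Literal(true)
        case Literal(_)    => Literal(false)
        case pc            => IsNull(pc)
      }
    case other => other
  }
}
```

Applied to the PR's motivating query shape, `isnull(key + null)` rewrites to a true literal without touching any rows, which matches the optimized plan shown in the commit message below.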
test this please |
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14534/
Can you please fix the style issues?
Merged build finished. All automated tests passed.
All automated tests passed.
```
@@ -114,37 +114,37 @@ package object dsl {
  def attr = analysis.UnresolvedAttribute(s)

  /** Creates a new AttributeReference of type boolean */
  def boolean = AttributeReference(s, BooleanType, nullable = false)()
```
What is the rationale for changing all of these?
I think `def boolean` / `def string` should return an Attribute with nullability = true; that does not harm correctness, whereas the current value may give a wrong hint to the NullPropagation optimization rule.

For example:

```scala
val row = new GenericRow(Array[Any]("a", null))
val c1 = 'a.string.at(0)
val c2 = 'b.string.at(1) // nullable should be true, otherwise it doesn't reflect the real situation
assert(evaluate(IsNull(c2), row) == true)
```
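A minimal sketch of why a wrong `nullable = false` hint is dangerous for such a rule. `Attr` and `foldIsNull` are made-up names for illustration, not Spark's API:

```scala
object NullableHintSketch {
  // Made-up attribute carrying a nullability hint; not Catalyst's
  // AttributeReference.
  case class Attr(name: String, nullable: Boolean)

  // A folding rule that trusts the hint: if the attribute claims it can
  // never be null, IsNull(attr) folds to false without looking at data;
  // otherwise the real value is checked.
  def foldIsNull(a: Attr, actualValue: Any): Boolean =
    if (!a.nullable) false        // trusts the (possibly wrong) hint
    else actualValue == null      // checks the real value
}
```

With `nullable = false` the fold reports "not null" even when the stored value really is null, which is exactly the wrong hint described in the comment above.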
Good point.
@marmbrus I have updated the code per your comment; can you review/retest it?
test this please |
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
```scala
  baseValue(o)
}
val value = child.eval(input)
if(value == null) {
```
Space after `if`, below too.
Fixed. Can you retest this please?
test this please. |
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Thanks for doing this :) @pwendell, can you please merge?
I've merged this. Thanks!
Currently, expressions do not support the "constant null" well in constant folding; e.g. Sum(a, 0) actually always produces Literal(0, NumericType) at runtime. For example:

```
explain select isnull(key+null) from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None
== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None
== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've created a new optimization rule called NullPropagation for this kind of constant folding.

Author: Cheng Hao <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)

(cherry picked from commit 3eb53bd)
Signed-off-by: Reynold Xin <[email protected]>
Added StreamingContext.awaitTermination to streaming examples

StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if the main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when Spark Streaming applications were launched from the command line. However, when launched in Yarn-standalone mode, this did not work, as the driver effectively got terminated when the main function exited, so Spark Streaming examples did not work on Yarn. This addition to the examples ensures that they work on Yarn and also teaches that StreamingContext.awaitTermination() is necessary for Spark Streaming programs to keep running. The true bug fix of making sure all threads started by Spark Streaming are daemon threads is left for post-0.9.