[SPARK-31115][SQL] Detect known Janino bug janino-compiler/janino#113 and automatically apply a workaround as a fail-back by avoiding the switch statement in generated code #27872
Conversation
…generated code to avoid Janino bug
// the query fails with switch statement, whereas it passes with if-else statement.
// Note that the value depends on the Spark logic as well - different Spark versions may
// require a different value to ensure the test fails with the switch statement.
val numNewFields = 100
Originally I was crafting the patch against Spark 2.3 - in Spark 2.3, setting this to 100 throws an exception which is not from the Janino bug, but from either hitting the 64KB limit or the limit on method signature parameters. (That's why I added the details on the exceptions thrown when the value exceeds the upper limit.)
For Spark 2.3, 70 is the value that makes the switch statement fail and the if ~ else if ~ else statement pass.
Test build #119662 has finished for PR 27872 at commit

Retest this, please

Test build #119656 has finished for PR 27872 at commit

Test build #119670 has finished for PR 27872 at commit
-    if (canBeComputedUsingSwitch && hset.size <= SQLConf.get.optimizerInSetSwitchThreshold) {
+    val sqlConf = SQLConf.get
+    if (canBeComputedUsingSwitch && hset.size <= sqlConf.optimizerInSetSwitchThreshold &&
+        sqlConf.codegenUseSwitchStatement) {
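For context, here is a minimal sketch of the two code shapes at stake, shown as the Java source strings codegen would emit; the literal values and variable names are made up for illustration, not actual Spark output:

```scala
object InSetCodegenShapes {
  // The switch-based shape: concise and fast, but it is the form that can
  // trigger janino-compiler/janino#113 under some circumstances.
  val switchBased: String =
    """switch (inputValue) {
      |  case 1: case 5: case 9: result = true; break;
      |  default: result = false;
      |}""".stripMargin

  // The if-else chain the fail-back regenerates when the switch form
  // fails to compile.
  val ifElseBased: String =
    """if (inputValue == 1 || inputValue == 5 || inputValue == 9) {
      |  result = true;
      |} else {
      |  result = false;
      |}""".stripMargin
}
```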
Instead of an added configuration, could this be a try / catch? First try with switch; if it fails, then log an error and use if-else? That avoids users having to set a config, and then unset it again when we have the right fix from Janino.
Yeah, that would be ideal. Thanks! I overthought it a bit and assumed it would be hard to deal with, but it looks like it's not that complicated. Let me update the code and PR information.
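A rough sketch of that retry idea, assuming hypothetical names (GeneratedSource, isKnownJaninoBug, and genCodeAndCompile are illustrative stand-ins, not Spark's actual codegen API):

```scala
object CodegenRetrySketch {
  // Hypothetical stand-in for generated source plus the flag used to produce it.
  final case class GeneratedSource(code: String, usesSwitch: Boolean)

  def generate(useSwitch: Boolean): GeneratedSource =
    GeneratedSource(if (useSwitch) "/* switch-based */" else "/* if-else chain */", useSwitch)

  // Stub: the real code would hand the source to Janino here.
  def compile(src: GeneratedSource): Class[_] =
    throw new UnsupportedOperationException("compile stub")

  // Placeholder predicate: the real patch would inspect the compilation
  // exception to recognize janino-compiler/janino#113; the check is elided.
  def isKnownJaninoBug(e: Throwable): Boolean = false

  // Try the switch-based code first; on the known Janino failure, log an
  // error and regenerate with an if-else chain, so users never touch a config.
  def genCodeAndCompile(): Class[_] = {
    try compile(generate(useSwitch = true))
    catch {
      case e: Exception if isKnownJaninoBug(e) =>
        Console.err.println(s"Hit known Janino bug, regenerating without switch: ${e.getMessage}")
        compile(generate(useSwitch = false))
    }
  }
}
```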
@@ -178,27 +179,48 @@ case class ExpandExec(
         |${ev.code}
         |${outputColumns(col).isNull} = ${ev.isNull};
         |${outputColumns(col).value} = ${ev.value};
-      """.stripMargin
+       """.stripMargin
This looks like the right indentation, so I fixed it.
@@ -647,7 +651,7 @@ case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: Int)
   }

   ${ctx.registerComment(
-    s"""Codegend pipeline for stage (id=$codegenStageId)
+    s"""Codegen pipeline for stage (id=$codegenStageId)
Fixed a typo as well.
cc. @dongjoon-hyun and @maropu as well, as they showed reactions in #27860
It may cause a performance regression in some cases, @HeartSaVioR. cc @kiszk, @rednaxelafx, and @gatorsmile, too.
It would be appreciated if we could elaborate on the possible performance regression (even as "thinking out loud"); "may" and "some cases" are too ambiguous. If the query doesn't fall into the case of hitting the Janino issue, it shouldn't bring a perf regression, as it tries to generate code with the switch statement first. If the query does fall into the case of hitting the Janino issue, the query has either been failing or falling back to non-codegen. Taking the alternative is definitely better than failing the query; the remaining question is whether using an if ~ else if chain (plus the cost of regenerating and compiling code) is worse than going through non-codegen.
I just enriched the PR title/description to make clear that the workaround is applied as a fail-back. I'm sorry if anyone was confused by the lack of information.
Hi @HeartSaVioR, thanks for the work! Btw, do you know of a user query to reproduce this issue? If this issue occurs frequently on the user side, it might be worth adding the workaround. If it's a corner case, I personally think it's OK to just fall back to interpreter mode, to avoid the maintenance overhead of the workaround.
 * is due to the known bug, it generates workaround code via touching flag in CodegenContext and
 * compile again.
 */
private def doGenCodeAndCompile(): CompileResult = {
How do we handle the non-whole-stage codegen case? e.g., GenerateMutableProjection via CodeGeneratorWithInterpretedFallback?
switch is now used specifically in ExpandExec and InSet; originally what I tracked was only ExpandExec, which doesn't fall into that case if I understand correctly. Maybe InSet's upper/lower limit configuration means it wouldn't trigger the issue - should we just apply this to ExpandExec only?
If so, I personally think we'd better fix the generated code of ExpandExec (I'm not sure now that it's worth fixing this issue).
Test build #119682 has finished for PR 27872 at commit
I have one, but I cannot share it since the query is from an actual customer. If you're OK with just the generated code, I've attached the file in the Janino issue janino-compiler/janino#113. To share the characteristics that directly trigger the Janino bug: there are 70+ columns, grouped by 2 columns, with 60 aggregate functions applied (30 non-distinct, 30 distinct).

I'm not sure whether it's an edge case that happens very rarely; I'd like to hear more voices, but if the consensus is to treat this as a rare case, I'd agree we should avoid the band-aid fix. Let's see others' voices as well.
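For a feel of that shape, here is a minimal sketch with placeholder column names, row count, and aggregates, rather than the customer's actual query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, countDistinct}

object JaninoBugQueryShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("shape").getOrCreate()

    // 70+ derived columns plus two grouping columns, all from a synthetic id.
    val wide = spark.range(1000).select(
      (Seq((col("id") % 10).as("g1"), (col("id") % 7).as("g2")) ++
        (1 to 70).map(i => (col("id") * i).as(s"c$i"))): _*)

    // 30 non-distinct + 30 distinct aggregations; multiple distinct
    // aggregates are what pull ExpandExec into the physical plan.
    val aggs = (1 to 30).map(i => count(col(s"c$i")).as(s"cnt$i")) ++
      (31 to 60).map(i => countDistinct(col(s"c$i")).as(s"dcnt$i"))

    wide.groupBy("g1", "g2").agg(aggs.head, aggs.tail: _*).show()
    spark.stop()
  }
}
```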
retest this, please

Test build #119681 has finished for PR 27872 at commit

Test build #119686 has finished for PR 27872 at commit

Test build #119693 has finished for PR 27872 at commit
I have mixed feelings about this PR.

The good part: it does do what it advertised, and should not result in performance regressions, since code that would have passed compilation stays as-is; only code that triggers the error goes to the fallback mode, which does come with slightly bigger code size but wouldn't really affect runtime performance if the generated code is still JIT-compiled by an optimizing JIT.

The less-good part: I feel like the codegen system could make good use of a proper retry framework, so that it could do things it can't do right now. The code presented here is a one-off case of retry, and I'm somewhat concerned about future PRs piling on top of it to implement retry for other scenarios, letting the code "evolve organically" without a good base design first.

If the main focus is to buy some time before we can get Janino to release 3.0.16, my hunch is that it's possible to slightly tune the codegen for ExpandExec to generate code just small enough to work around the customer issue bugging you right now - e.g. if a part of the problem is caused by the case body, the null-literal codegen ExprCode(code = EmptyBlock, isNull = TrueLiteral, JavaCode.defaultLiteral(dataType)) (from Spark 2.4 / 3.0 / master: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L69) could be tightened.

WDYT @HeartSaVioR ?
    }
  )

  val aggExprs: Array[Column] = Range(1, numNewFields).map { idx =>
nit: How about (1 to numNewFields)?
I was using the step parameter and eventually removed it. Given we seem to be between neutral and negative on adopting this patch, I'll defer addressing the nit for now.
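For reference on the nit itself, the two forms differ at the upper bound, which changes how many aggregate columns the test builds:

```scala
object RangeBoundsDemo extends App {
  val exclusive = Range(1, 100) // 1 to 99: the end value is excluded
  val inclusive = 1 to 100      // 1 to 100: the end value is included
  assert(exclusive.size == 99 && inclusive.size == 100)
}
```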
Thanks for spending your time analyzing the patch and providing a detailed opinion, @rednaxelafx. The rationale for this patch is to address the known issue "before" we reach the real solution, as the real solution is out of our control; no one can say when Janino 3.0.16 will be released, or even whether it will be released at all. The Janino maintainer refactored everything by himself in 3.1.x, which makes things unstable - the recent version of Janino makes lots of Spark UTs fail, which makes me less confident that we can migrate to Janino 3.1.x. I totally agree the patch is a band-aid fix and not designed to be extended (I didn't intend it to be extendable, though), and there are better options available (from more to less preferable, personally)

WDYT? Would it be better to raise this for discussion on the dev@ list?
Btw, I'm happy if the approach of this patch (retry) sparks new ideas for improvement. That alone would be worth it.
Sounds good, though yeah, ideally we get a 3.0.16 update instead to fix it. Maybe worth waiting a beat, but not holding up 3.0.
My 2 cents go to option 4 while we are waiting for 3.0.16. I am also neutral on the other options, except 2.
UPDATE: I've managed to fix all the Spark test failures we've seen with Janino 3.1.1 - see #27860. (That is, I dealt with option 1.) We still need to wait for these patches to be merged and released (so this patch is not outdated yet), but we're in a better situation since both 3.1.2 and 3.0.16 will work; it's worth waiting a bit more to see how things go on the Janino side.
Thanks for doing this fix! Among the options, I'd prefer to tune ExpandExec like @rednaxelafx proposed, to work around this issue in the short term. Then we can either wait for a future 3.0.16 release, or for Janino 3.1.2, which fixes all test failures in Spark.
Honestly, I'm not sure I got the details of @rednaxelafx's proposal - what I can imagine for only touching ExpandExec to get around this is option 4 from my proposal, and if he meant some other trick/technique then I don't get it. It would be nice if someone could help me understand by elaborating a bit.
I think option 4 looks fine to me. Btw, regarding splitting the large code into pieces in ExpandExec: I just want to know the actual performance numbers of this approach. I think splitting large code into small parts might improve performance.
To reproduce the issue, could you build a simple query that you can show us, based on your private customer's query? I think the query would help us understand the issue better.
This patch is intended to add a workaround for the bug until the actual fix lands in Janino; it would be nice if we could develop the "improvement" ideas in a separate thread. I'm not a JVM expert, so it's not clear to me how much the JIT would help (if all we expect from making methods smaller is the method getting inlined, that would be technically the same as before). That's why I've also been investigating going forward with option 1; this patch can simply be abandoned if we are OK with #27860 (after an official Janino release, for sure).

I've added a UT which fails on the master branch. I've also added comments around the UT to explain the details. I don't think it requires a very complicated query; lots of columns and a sufficient number of distinct aggregation functions will trigger the bug.
My suggestion for tuning the codegen may have been too cryptic; sorry about not providing more background explanation. Let me explain why I arrived at the ExprCode suggestion and the assumptions behind it.
From what @HeartSaVioR was able to share in janino-compiler/janino#113, specifically in the attachment https://github.com/janino-compiler/janino/files/4295191/error-codegen.log, we can see that a large portion of the switch...case body is filled with code like the following:

/* 4677 */ case 0:
/* 4678 */ final long expand_value129 = -1L;
/* 4679 */ expand_isNull66 = true;
/* 4680 */ expand_value66 = expand_value129;
/* 4681 */
/* 4682 */ final long expand_value130 = -1L;
/* 4683 */ expand_isNull67 = true;
/* 4684 */ expand_value67 = expand_value130;
/* 4685 */
/* 4686 */ final long expand_value131 = -1L;
/* 4687 */ expand_isNull68 = true;
/* 4688 */ expand_value68 = expand_value131;
/* 4689 */
/* 4690 */ final long expand_value132 = -1L;
/* 4691 */ expand_isNull69 = true;
/* 4692 */ expand_value69 = expand_value132;

If we zoom in on the 3 statements here:

/* 4678 */ final long expand_value129 = -1L;
/* 4679 */ expand_isNull66 = true;
/* 4680 */ expand_value66 = expand_value129;
A full Java compiler like javac treats a final local variable initialized with a compile-time constant as a constant, and folds references to it away. Janino, on the other hand, doesn't handle all of the Java-language-level constant expression constructs; specifically, it does not support local constant declarations, so it compiles the code above as 3 separate assignments.
So every "chunk" of code above costs a few more bytes of bytecode. That may be small for just one chunk, but the codegen for ExpandExec emits one such chunk per output column per projection, so it adds up.

So where does this "value = -1L; isNull = true" thingy come from? We can infer that it's generated from Literal expressions with null values. We can easily make Spark 2.3.x generate slightly better code by improving:

if (value == null) {
ev.isNull = "true"
ev.copy(s"final $javaType ${ev.value} = ${ctx.defaultValue(dataType)};")
}

with something like:

if (value == null) {
ev.copy(
isNull = "true",
value = ctx.defaultValue(dataType)
)
}

which would improve the example above into:

/* 4678 */ // THIS IS GONE: final long expand_value129 = -1L;
/* 4679 */ expand_isNull66 = true;
/* 4680 */ expand_value66 = -1L;

so that Janino can generate less bytecode for this particular query.
Thanks for the detailed explanation, @rednaxelafx! Btw, as I shared earlier, I have been investigating Janino itself - I fixed all the bugs in 3.1.x that made Spark UTs fail, and thankfully the Janino maintainer made a branch for 3.0.x as well. Maybe we can get both the 3.0.16 and 3.1.2 versions this week (I've asked in janino-compiler/janino#115 (comment)), and then we can simply upgrade Janino and abandon this PR. I'm running UTs in #27860 for Janino 3.1.2 (custom jar via Jitpack) as well as #27932 for Janino 3.0.16 (custom jar via Jitpack). Once both builds run fine, I'll forward the results to the Janino maintainer.
Both builds ran fine - I've shared the results with the Janino maintainer. Looks like he is open to releasing 3.0.16, but most likely as a one-time release; 3.0.x would never be LTS. So it would be ideal if we feel OK going with Janino 3.1.2, but if we're concerned about stability we could just go with 3.0.16.
I've finally updated my two WIP PRs to use official releases, which invalidates the purpose of this PR. Let's move on, but let's decide which version would be better first: 3.1.2 vs 3.0.16. (For sure, it looks like the Janino maintainer wants to see us adopt 3.1.x, but the decision is up to us.)

#27860 for Janino 3.1.2
#27932 for Janino 3.0.16

cc to everyone who commented here, to see preferences on the Janino version: @squito @dongjoon-hyun @maropu @kiszk @rednaxelafx @srowen @viirya
I don't have a strong preference, but if 3.1.2 works, it seems like the time to jump to it is Spark 3.0, all else equal?
@HeartSaVioR thank you so much for sorting it out! It's great to see the problems fixed on the Janino side. My preference on the version to upgrade to is on the conservative side. It'd be nice if we could take a baby step at this stage of the Spark 3.0 release... so +1 on 3.0.16 from me for 3.0 (and maybe 2.4 too), and 3.1.2 on master.
But if we use Janino v3.0.16 for Spark v3.0.0, won't future Janino 3.1.x bugfix releases be out of reach for our 3.0.x maintenance releases?
Looks like the Janino maintainer explicitly mentioned the 3.0.x line is deprecated in the Janino changelog.
Upgrading Janino to 3.0.16 in Spark 2.4.x wouldn't be an issue, as I expect only a few more Spark 2.4.x releases, but for Spark 3.0.0 we expect many further bugfix releases, which makes @maropu's point look valid to me. I'm not sure whether there's a policy on upgrading dependencies based on semver - like bumping a dependency's minor version in a bugfix release - but if that's considered bad practice in the community, we may be better off avoiding the deprecated version line. Let's collect more voices here.
Hmm, that's a hard choice. Janino's builtin tests are far from having good coverage; having hacked my own fork of Janino once, I know the code fairly well, and I feel somewhat uneasy about the huge refactorings... Sticking to the "deprecated" Janino 3.0.x branch for Spark 3.0.x doesn't sound like that bad a situation to me. Spark 3.0 took a long time, but Spark 3.1 shouldn't be that far away. That said, I wouldn't be too sad if we end up taking Janino 3.1.2 into Spark 3.0.0.
I'd tend to agree, though I also understand Janino's status - there's only one maintainer, and it's hard to anticipate all usages. Spark's generated code is far, far from the hand-written code anyone would have imagined ;) Maybe it would be ideal to construct regression tests on the Spark side - either a set of generated code samples, or E2E tests.

I'll try to sort it out if it happens, but a third party is outside our control. A 3.0.17 release is unlikely to happen, whereas a 3.1.3 is likely; it would be easier to persuade the maintainer to cut a new release there.
Semver still applies, in that behavior changes - and the shaded copy - can affect apps. But there's no issue w.r.t. semver in updating to a new minor release, especially in a major Spark release. But yes, the issues are: how risky is the update in general? Breakage is bad in any Spark release. How much does it help? If the current version isn't maintained, when do you have to update and take that risk, and is that better at a major vs. minor release boundary? That is, if there is any perceived risk, would you take it now or in a minor Spark release?
What changes were proposed in this pull request?
This patch proposes to detect whether the generated code hits the known Janino bug janino-compiler/janino#113, based on the exception thrown during codegen compilation, and to re-generate workaround code that avoids the switch statement, as a "fail-back".

The bug is triggered under certain circumstances when using a switch statement (please refer to the Janino issue for more details), but the switch statement is still tried first, as it makes the generated code more concise and also more efficient from a performance point of view. So if the generated code doesn't hit the Janino bug, the behavior is the same as before the patch.

The conditions are below:
From the new test, the generated code for expand_doConsume in the normal path is as follows:

and after applying the workaround automatically as a fail-back, the new generated code for expand_doConsume is as follows:

Why are the changes needed?
We got a report of a user's query failing because Janino throws an error while compiling the generated code. The issue is here: janino-compiler/janino#113. It contains the generated code, the symptom (error), and an analysis of the bug, so please refer to the link for more details.
We provided a patch to Janino via janino-compiler/janino#114, and Janino 3.1.1 was released containing the patch, but we realized lots of unit tests fail once we apply Janino 3.1.1.
We have asked the Janino maintainer about releasing Janino 3.0.16 (see janino-compiler/janino#115), but given there's no guarantee of getting it, we'd better have our own workaround.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UT. Confirmed that changing ExpandExec to not use CodegenContext.disallowSwitchStatement made the new test fail.