-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Prevent StackOverflowError when parsing large statements #33902
Conversation
Catch StackOverflowError exception and return a descriptive message to the client. This prevents large statement from killing the cluster. Fixes: elastic#32942
Pinging @elastic/es-search-aggs |
I'm not sure that this is a safe thing to do. |
@nik9000 With this code we return an error message to the client, and the ES node is not affected. |
try { | ||
return visitor.apply(new AstBuilder(paramTokens), tree); | ||
} catch (StackOverflowError e) { | ||
throw new ParsingException("{} is too large to parse (causes stack overflow)", name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a message of the form "{} cannot be parsed" be more in line with our error messages in general? Meaning, the message says right away that something is wrong, at the start of the message. Also, the failure happens because the message is too large (in length) or because it has too many tokens?
} | ||
|
||
private <T> T invokeParser(String sql, List<SqlTypedParamValue> params, Function<SqlBaseParser, ParserRuleContext> parseFunction, | ||
private <T> T invokeParser(String name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you name the name
parameter in a more descriptive way? As I see it, it's only used for error reporting purposes, to adapt the message to what type of entity failed to be parsed. Maybe querySourceType
?
I'm not sure that this is a safe thing to do.
That is the part I'm not sure about. My instinct is that stuff like stack overflows and out of memory put the application into a weird state. I'm happy to be convinced I'm wrong about it though. I know we've had issues in the past with out of memory leaving ES in a half functional state. So we removed catching it. And we lumped stack overflow in with that initiative. When this has come up in other places we've mostly worked around it by untwisting the recursion into loops. If this is deep into ANTLR's code we really can't do much about it though. So I'm not sure what to do. |
I've run across this problem writing antlr grammars elsewhere, and the general way to solve it in my experience is to switch from recursive matching to multi-matching. So instead of:
you do
|
Indeed, in ANTLR the pattern is to catch the @romseygeek Thanks for the suggestions - it's worth a try though that makes the grammar even more complicated and it's not always obvious when a pattern is recursive or not. I wonder if listeners (or something similar) can be used to stop the parsing before it blows up. |
This problem cannot be solved using ANTLR as it requires recursion for anything more than an extremely simple grammar. There is no way around this since expressions can be of an indeterminate length. I view the problem as what's worse - letting a node die because of a SQL query or at least attempting to recover and catching a SOE. The trade off of catching an SOE seems better here. |
Also note that stack overflow in this cannot have any negative side effects. Normally the side effects to be aware of are partial initialization of objects. Here we are constructing and using an AstBuilder. This object is the only one that could be partially initialized, but in the catch case we do not save it. |
5b46812
to
d62deb1
Compare
d62deb1
to
37fb6bf
Compare
Please check the approach here: ff67d02 |
@elasticmachine retest this please |
My hesitation with this CircuitBreaker approach is that we cannot reset the counter for each part of the tree. This means that if we limit the elements to 100 for example, then in the query: It will also unnecessarily catch expressions like: |
If @rjernst is ok with catching StackOverflowError here then I am as well, so long as we leave a big comment making it clear about why it is ok here. It'd be nice to link to some documentation about how StackOverflowError works and what guarantees it gives us. |
There are no guarantees the circuit breaker presented here won't cause an SOF. Each stack frame can have a varying size based on the input expression. The stack frame size can also be controlled via JVM parameter so there's no guarantee of being able to calculate based on the limited number of recursions. |
I agree with that but I after searching and reading a bit about cases of catching the https://stackoverflow.com/questions/28551767/is-it-safe-to-catch-stackoverflowerror-in-java According to the latter: |
@elasticmachine retest this please. |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Implement circuit breaker logic in the parser which catches expressions that can blow up the tree and result in StackOverflowError being thrown. Co-authored-by: Costin Leau <[email protected]>
Backported to |
@matriv Sorry I was on vacation and could not respond. I think catching StackOverflowError is the correct thing to do. We can prove based on what a stack overflow means that the state of the system has not been corrupted. We know this because we do not hang onto any objects that were being created when the SOE occurred (which is just the AstBuilder), and there are no statics that were potentially being modified. Additionally, the commit message for this PR is deceptive because an SOE can still occur, it is just less likely now. |
We don't have any guarantees regarding the method execution itself or it's side-effects. |
@costin Is there anything that will change your mind that catching the SOE is the better approach to this issue? |
This is true, but stack overflow occurs from deep recursion. Classloading would have happened at the top of the recursion tree, long before we are deep enough for stack overflow.
JIT happens in a separate system thread. It does not stop the world in order to compile a method to native code. |
Even though catching stack overflow errors is sometimes ok, the challenging part is to keep this code correct over time. Eg. unrelated refactorings might mistakenly fold other statements under the |
@jdconrad
I'm not sure that applies for lazy inner classes (init-on-demand idiom) that can appear at the tail of a deep recursion; e.g
Right however it's not the compilation of the JIT but the effect of the SOE on the native (JIT-ed) stack - I'm not aware of any guarantees on that front. Regarding the breaker - the default stack size for ES is 1M while the breaker looks for a parsing depth/width of 100 calls, we can limit this further if need be. Thoughts? |
I can see how this could be a problem, given the current design of the sql parser. In painless, all classes are loaded up front, through the whitelist. I imagine something similar could be done in sql.
I'm not sure what you mean. The JIT'd code's stack would be structured the exact same as the original code, otherwise the jvm could never safely switch to the JIT'd method's implementation in the middle of recursive calls.
The ideas here all sound plausible (although I would avoid checking the stack trace as I think the generation is dynamic, thus incurring cost to look at it). But I don't think anything can guarantee there is no stack overflow (without essentially creating our own parallel stack overflow tracking). @jpountz re: catching SOE
I think this is always the challenge with any pieces of code that have inter dependencies. I can see how the sql implementation is much more complicated than painless, but in the end, the idea seems to be to convert a text representation of the sql query into something that can be sent as an elasticsearch query. Given the query should be the only output, I think we can make catching SOE work. But I don't care that much, as long as the messaging of this change does not claim to "prevent" SOE, when it only makes it less likely. |
Implement circuit breaker logic in the parser which catches expressions that can blow up the tree and result in StackOverflowError being thrown. Co-authored-by: Costin Leau <[email protected]>
Catch StackOverflowError exception and return a descriptive message
to the client. This prevents large statement from killing the cluster.
Fixes: #32942