SQL: Prevent StackOverflowError when parsing large statements #33902

Merged
merged 12 commits into from
Sep 25, 2018

Conversation

matriv
Contributor

@matriv matriv commented Sep 20, 2018

Catch StackOverflowError and return a descriptive message
to the client. This prevents a large statement from killing the cluster.

Fixes: #32942

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@nik9000
Member

nik9000 commented Sep 20, 2018

Catch StackOverflowError

I'm not sure that this is a safe thing to do.

@matriv
Contributor Author

matriv commented Sep 20, 2018

@nik9000 With this code we return an error message to the client, and the ES node is not affected.
I've looked for another way of handling this with ANTLR4 but haven't found one. Any suggestions?

try {
    return visitor.apply(new AstBuilder(paramTokens), tree);
} catch (StackOverflowError e) {
    throw new ParsingException("{} is too large to parse (causes stack overflow)", name);
Contributor

Would a message of the form "{} cannot be parsed" be more in line with our error messages in general? That way the message says right at the start that something is wrong. Also, does the failure happen because the statement is too large (in length) or because it has too many tokens?

}

private <T> T invokeParser(String sql, List<SqlTypedParamValue> params, Function<SqlBaseParser, ParserRuleContext> parseFunction,
private <T> T invokeParser(String name,
Contributor

Can you give the name parameter a more descriptive name? As I see it, it's only used for error reporting, to adapt the message to the type of entity that failed to parse. Maybe querySourceType?

@nik9000
Member

nik9000 commented Sep 20, 2018

Catch StackOverflowError

I'm not sure that this is a safe thing to do.

the ES node is not affected.

That is the part I'm not sure about. My instinct is that stuff like stack overflows and out of memory put the application into a weird state. I'm happy to be convinced I'm wrong about it though.

I know we've had issues in the past with out of memory leaving ES in a half functional state. So we removed catching it. And we lumped stack overflow in with that initiative.

When this has come up in other places we've mostly worked around it by untwisting the recursion into loops. If this is deep into ANTLR's code we really can't do much about it though. So I'm not sure what to do.

@romseygeek
Contributor

When this has come up in other places we've mostly worked around it by untwisting the recursion into loops

I've run across this problem writing antlr grammars elsewhere, and the general way to solve it in my experience is to switch from recursive matching to multi-matching. So instead of:

expression: part | part OR expression | ...

you do

expression: orexpression | ....
orexpression: part OR part (OR part)*
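To make the difference concrete, here is a minimal hand-written sketch in Java of the two rule shapes. It is a toy recursive-descent parser, not ANTLR-generated code and not the Elasticsearch SQL grammar; the class and method names are made up for illustration, and the iterative rule is simplified to `part (OR part)*`. The recursive form uses one stack frame per OR, while the multi-match form uses a loop and a constant number of frames.

```java
import java.util.List;

// Hypothetical toy parser; names are illustrative only.
final class OrChainSketch {

    // Right-recursive rule:  expression : part | part OR expression
    // Returns the deepest recursion level reached; it grows with the number of parts.
    static int parseRecursive(List<String> tokens, int pos, int depth) {
        // tokens.get(pos) is assumed to be a part
        if (pos + 1 < tokens.size() && tokens.get(pos + 1).equals("OR")) {
            return parseRecursive(tokens, pos + 2, depth + 1);
        }
        return depth;
    }

    // Multi-match rule:  orexpression : part (OR part)*
    // Returns the number of parts matched; the loop never adds stack frames.
    static int parseIterative(List<String> tokens) {
        int parts = 1;
        int pos = 0;
        while (pos + 1 < tokens.size() && tokens.get(pos + 1).equals("OR")) {
            pos += 2;
            parts++;
        }
        return parts;
    }
}
```

For the tokens `a OR b OR c`, `parseRecursive(tokens, 0, 1)` reaches depth 3, whereas `parseIterative(tokens)` matches the same 3 parts within a single frame.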

@costin
Member

costin commented Sep 20, 2018

/cc @jdconrad @jpountz

Indeed, in ANTLR the pattern is to catch the Errors, which is bad since all guarantees about the JVM are off. I share Nik's opinion: it's better to just let the Error through since we don't know whether the JVM is still usable or not; it's annoying to the user but much better than, say, data corruption or who knows what else.

@romseygeek Thanks for the suggestions - it's worth a try, though it makes the grammar even more complicated and it's not always obvious whether a pattern is recursive or not.
If I recall correctly ANTLR generates recursive-descent parsers (not sure if there's a hard limit to that before backtracking) - if that's the case the error can still occur, though the query has to be significantly bigger.
Furthermore, due to the nature of SQL it might not even be possible (things like WITH or subselects are recursive by nature, though in practice I reckon they are no more than a few dozen levels deep, and that only in extreme cases).

I wonder if listeners (or something similar) can be used to stop the parsing before it blows up.

@jdconrad
Contributor

This problem cannot be solved using ANTLR as it requires recursion for anything more than an extremely simple grammar. There is no way around this since expressions can be of an indeterminate length. I view the problem as what's worse - letting a node die because of a SQL query or at least attempting to recover and catching a SOE. The trade off of catching an SOE seems better here.

@rjernst
Member

rjernst commented Sep 20, 2018

Also note that a stack overflow in this case cannot have any negative side effects. Normally the side effects to be aware of are partial initialization of objects. Here we are constructing and using an AstBuilder. This object is the only one that could be partially initialized, but in the catch case we do not save it.
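The argument above can be sketched as follows. This is a hypothetical, self-contained Java example, not the Elasticsearch parser: `nestingDepth` stands in for the recursive tree walk and `TooLargeException` stands in for ParsingException. All state touched by the recursion is local to the call, so once the error is caught nothing from the failed attempt is kept. Whether that reasoning extends to a real parser is exactly what this thread debates.

```java
// Hypothetical sketch; class and method names are assumptions for illustration.
final class CatchOverflowSketch {

    // Stand-in for a parsing exception reported to the client.
    static final class TooLargeException extends RuntimeException {
        TooLargeException(String message) {
            super(message);
        }
    }

    // Deliberately recursive: one stack frame per leading '('.
    static int nestingDepth(String s, int pos) {
        if (pos < s.length() && s.charAt(pos) == '(') {
            return 1 + nestingDepth(s, pos + 1);
        }
        return 0;
    }

    // Catches the overflow at the top of the recursion, after the stack has unwound,
    // and converts it into an ordinary exception. No partially built object is retained.
    static int parse(String statement) {
        try {
            return nestingDepth(statement, 0);
        } catch (StackOverflowError e) {
            throw new TooLargeException("statement is too large to parse (causes stack overflow)");
        }
    }
}
```

A shallow input such as `"((x"` returns 2, while an input with millions of opening parentheses is expected to surface as `TooLargeException` rather than as a raw `StackOverflowError`.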

@matriv
Contributor Author

matriv commented Sep 21, 2018

Please check the approach here: ff67d02

@matriv
Contributor Author

matriv commented Sep 21, 2018

@elasticmachine retest this please

@matriv
Contributor Author

matriv commented Sep 21, 2018

My hesitation with this CircuitBreaker approach is that we cannot reset the counter for each part of the tree. This means that if we limit the elements to 100 for example, then in the query: SELECT a=b OR a=b OR... FROM t WHERE a=b AND ..., noElements in the SELECT clause + noElements in the WHERE clause must be <= 100.

It will also unnecessarily catch expressions like: SELECT a1=b1 OR a2=b2, a3=b2 AND a4=b4, ...
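The limitation described above can be illustrated with a small Java sketch. It is not the listener from the linked commit; `BreakerSketch`, `enterElement` and `visitClause` are hypothetical names. The point is only that a single counter shared by the whole statement, and never reset per clause, makes the limit apply to the sum of all clauses.

```java
// Hypothetical sketch of a statement-wide element counter; names are assumptions.
final class BreakerSketch {

    static final class TooComplexException extends RuntimeException {
        TooComplexException(String message) {
            super(message);
        }
    }

    private final int maxElements;
    private int seen = 0; // shared by every clause of the statement, never reset

    BreakerSketch(int maxElements) {
        this.maxElements = maxElements;
    }

    // Called once for each expression element the parser enters.
    void enterElement() {
        if (++seen > maxElements) {
            throw new TooComplexException("statement has more than " + maxElements + " elements");
        }
    }

    // Simulates parsing one clause (SELECT list, WHERE, ...) with the given number of elements.
    void visitClause(int elementCount) {
        for (int i = 0; i < elementCount; i++) {
            enterElement();
        }
    }
}
```

With a limit of 100, a SELECT clause of 60 elements followed by a WHERE clause of 40 passes, but 60 followed by 41 trips the breaker even though neither clause is deep on its own.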

@nik9000
Member

nik9000 commented Sep 21, 2018

If @rjernst is ok with catching StackOverflowError here then I am as well, so long as we leave a big comment making it clear about why it is ok here. It'd be nice to link to some documentation about how StackOverflowError works and what guarantees it gives us.

@matriv
Contributor Author

matriv commented Sep 21, 2018

@nik9000 Check the new implementation: 7e09985 (@costin 's idea)

@jdconrad
Contributor

There are no guarantees the circuit breaker presented here won't cause an SOE. Each stack frame can have a varying size based on the input expression. The thread stack size can also be controlled via a JVM parameter, so there's no guarantee of being able to calculate a safe limit based on the number of recursions.

@matriv
Contributor Author

matriv commented Sep 21, 2018

There are no guarantees the circuit breaker presented here won't cause an SOE. Each stack frame can have a varying size based on the input expression. The thread stack size can also be controlled via a JVM parameter, so there's no guarantee of being able to calculate a safe limit based on the number of recursions.

I agree with that, but after searching and reading a bit about cases of catching StackOverflowError, I'm not sure we can have guarantees that the JVM can continue to function without problems:

https://stackoverflow.com/questions/28551767/is-it-safe-to-catch-stackoverflowerror-in-java
https://softwareengineering.stackexchange.com/questions/209099/is-it-ever-okay-to-catch-stackoverflowerror-in-java
https://stackoverflow.com/questions/22128485/why-is-it-possible-to-recover-from-a-stackoverflowerror
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.5.2
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.3

According to the latter: This specification cannot predict where internal errors or resource limitations may be encountered and does not mandate precisely when they can be reported. Thus, any of the VirtualMachineError subclasses defined below may be thrown at any time during the operation of the Java Virtual Machine:

@matriv
Contributor Author

matriv commented Sep 22, 2018

@elasticmachine retest this please.

@costin
Member

costin commented Sep 24, 2018

LGTM

@matriv
Contributor Author

matriv commented Sep 24, 2018

@nik9000 @jdconrad @rjernst So, what do you think about the 2 solutions (catching StackOverflowError vs the CircuitBreakerListener) ?

Contributor

@astefan astefan left a comment

LGTM

Member

@costin costin left a comment

LGTM

@matriv matriv merged commit 5840be6 into elastic:master Sep 25, 2018
@matriv matriv changed the title from "SQL: Handle StackOverflowError when parsing large statements" to "SQL: Prevent StackOverflowError when parsing large statements" Sep 25, 2018
@matriv matriv deleted the mt/fix-32942 branch September 25, 2018 17:23
matriv pushed a commit that referenced this pull request Sep 25, 2018
Implement circuit breaker logic in the parser which catches expressions
that can blow up the tree and result in StackOverflowError being thrown.

Co-authored-by: Costin Leau <[email protected]>
@matriv
Contributor Author

matriv commented Sep 25, 2018

Backported to 6.x with 6a4e841

@rjernst
Member

rjernst commented Sep 25, 2018

@matriv Sorry I was on vacation and could not respond. I think catching StackOverflowError is the correct thing to do. We can prove based on what a stack overflow means that the state of the system has not been corrupted. We know this because we do not hang onto any objects that were being created when the SOE occurred (which is just the AstBuilder), and there are no statics that were potentially being modified.

Additionally, the commit message for this PR is deceptive because an SOE can still occur, it is just less likely now.

@costin
Member

costin commented Sep 26, 2018

We don't have any guarantees regarding the method execution itself or its side effects.
Take lazy loading: it's possible that class-loading occurs during the tree creation and some of the classes have static initializers for constants and such; we don't know whether these have been partially initialized or not.
Or JIT-ing which might occur especially in a highly recursive scenario - is the jitted code affected? Does its virtual stack still hold? It is being unwound so there's some safety, but then again, there's also an Error being thrown.
If I recall correctly, in HotSpot Java methods share the stack with C/C++ native code and the VM itself (JNI calls can also throw SOE), so I would expect that invariants simply no longer hold and shutting things down is the only option.

@jdconrad
Contributor

@costin Is there anything that will change your mind that catching the SOE is the better approach to this issue?

@rjernst
Member

rjernst commented Sep 26, 2018

Take lazy loading: it's possible that class-loading occurs during the tree creation

This is true, but stack overflow occurs from deep recursion. Classloading would have happened at the top of the recursion tree, long before we are deep enough for stack overflow.

Or JIT-ing which might occur especially in a highly recursive scenario

JIT happens in a separate system thread. It does not stop the world in order to compile a method to native code.

@jpountz
Contributor

jpountz commented Sep 26, 2018

Even though catching stack overflow errors is sometimes ok, the challenging part is to keep this code correct over time. Eg. unrelated refactorings might mistakenly fold other statements under the try statement. It makes me like the circuit breaker idea a bit better: if the threshold is reasonable enough then chances that you hit a stack overflow without hitting the circuit breaker first are very thin. @romseygeek's idea is probably worth exploring as well.

@costin
Member

costin commented Sep 27, 2018

@jdconrad
Catching the SOE was the initial approach I used and what I found in most ANTLR examples. @jpountz brought this up and I couldn't come up with strong guarantees that the SOE has no side effects, especially in a long-running JVM like ES.
I'm happy to be proven otherwise.

@rjernst

Classloading would have happened at the top of the recursion tree, long before we are deep enough for stack overflow.

I'm not sure that applies to lazily loaded inner classes (the init-on-demand idiom), which can appear at the tail of a deep recursion; e.g. WHERE A OR B OR C .... OR (1000 AND ABS(POWER(1-2, 1) > 10) - the OR chain can fill the stack while the nested expression at the end, which triggers the error, can cause classes to be loaded that haven't been loaded before.
This can be alleviated however it requires the code to be analyzed and kept in check in the future which is far from ideal.
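For readers unfamiliar with the init-on-demand idiom mentioned here, a minimal Java sketch follows. The names `LazyInitSketch`, `Holder` and `lookup` are hypothetical and unrelated to the SQL plugin. The JVM runs the holder's static initializer on first access, wherever in the call stack that access happens, which is why it could in principle run at the bottom of a deep recursion. As far as the Java Language Specification goes, a class whose initializer fails with an Error is left in an erroneous state and later uses fail with NoClassDefFoundError.

```java
// Hypothetical sketch of the initialization-on-demand holder idiom.
final class LazyInitSketch {

    // Set by the holder's static initializer so the timing can be observed.
    static boolean holderInitialized = false;

    private static final class Holder {
        // Runs only when Holder is first used, not when LazyInitSketch is loaded.
        static final double[] TABLE = build();

        private static double[] build() {
            holderInitialized = true;
            return new double[] {1.0, 2.0};
        }
    }

    // First call triggers Holder's static initializer on the caller's stack.
    static double lookup(int index) {
        return Holder.TABLE[index];
    }
}
```

In a parser, the equivalent of `lookup` might be reached for the first time only by a function call at the tail of a long OR chain, when very little stack is left.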

JIT happens in a separate system thread. It does not stop the world in order to compile a method to native code.

Right however it's not the compilation of the JIT but the effect of the SOE on the native (JIT-ed) stack - I'm not aware of any guarantees on that front.

Regarding the breaker - the default stack size for ES is 1M while the breaker looks for a parsing depth/width of 100 calls, we can limit this further if need be.
We could make this dynamic assuming we have the hooks to do so - basically count the current stack size and bail out if we're too close. @danielmitterdorfer pointed to the StackWalker API; another alternative would be to check how big the stack trace (`Thread.currentThread().getStackTrace()`) is and only allow it to grow by X every Y calls, but then again this has its own cost.
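A rough sketch of the dynamic check, assuming Java 9+ for `StackWalker`: the recursion below measures the current frame count every `checkEvery` calls and bails out once the stack has grown by more than `maxExtraFrames` since parsing started. `StackDepthSketch`, `descend` and `TooDeepException` are hypothetical names, and frame count is only a proxy for stack usage since frames differ in size.

```java
import java.util.stream.Stream;

// Hypothetical sketch; not part of the merged change.
final class StackDepthSketch {

    static final class TooDeepException extends RuntimeException {
        TooDeepException(String message) {
            super(message);
        }
    }

    // Number of frames currently on this thread's stack (walking the stack has a cost).
    static long currentDepth() {
        return StackWalker.getInstance().walk(Stream::count);
    }

    // Stand-in for a recursive parse step; checks the depth only every checkEvery calls.
    static int descend(int remaining, long baseDepth, long maxExtraFrames, int checkEvery) {
        if (remaining % checkEvery == 0 && currentDepth() - baseDepth > maxExtraFrames) {
            throw new TooDeepException("stack grew by more than " + maxExtraFrames + " frames");
        }
        if (remaining == 0) {
            return 0;
        }
        return 1 + descend(remaining - 1, baseDepth, maxExtraFrames, checkEvery);
    }

    // Records the depth at the start, then recurses the requested number of levels.
    static int guardedDescend(int levels, long maxExtraFrames, int checkEvery) {
        return descend(levels, currentDepth(), maxExtraFrames, checkEvery);
    }
}
```

With `maxExtraFrames = 100` and `checkEvery = 10`, a 50-level descent completes while a 500-level descent is rejected at the first check past the limit. Replacing `currentDepth()` with `Thread.currentThread().getStackTrace().length` would give the other variant mentioned above.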

Thoughts?

@rjernst
Member

rjernst commented Oct 4, 2018

I'm not sure that applies to lazily loaded inner classes (the init-on-demand idiom), which can appear at the tail of a deep recursion; e.g. WHERE A OR B OR C .... OR (1000 AND ABS(POWER(1-2, 1) > 10) - the OR chain can fill the stack while the nested expression at the end, which triggers the error, can cause classes to be loaded that haven't been loaded before.

I can see how this could be a problem, given the current design of the sql parser. In painless, all classes are loaded up front, through the whitelist. I imagine something similar could be done in sql.

Right however it's not the compilation of the JIT but the effect of the SOE on the native (JIT-ed) stack - I'm not aware of any guarantees on that front.

I'm not sure what you mean. The JIT'd code's stack would be structured the exact same as the original code, otherwise the jvm could never safely switch to the JIT'd method's implementation in the middle of recursive calls.

Regarding the breaker ...

The ideas here all sound plausible (although I would avoid checking the stack trace as I think the generation is dynamic, thus incurring cost to look at it). But I don't think anything can guarantee there is no stack overflow (without essentially creating our own parallel stack overflow tracking).

@jpountz re: catching SOE

the challenging part is to keep this code correct over time

I think this is always the challenge with any pieces of code that have interdependencies. I can see how the sql implementation is much more complicated than painless, but in the end, the idea seems to be to convert a text representation of the sql query into something that can be sent as an elasticsearch query. Given that the query should be the only output, I think we can make catching SOE work. But I don't care that much, as long as the messaging of this change does not claim to "prevent" SOE when it only makes it less likely.

kcm pushed a commit that referenced this pull request Oct 30, 2018
Implement circuit breaker logic in the parser which catches expressions
that can blow up the tree and result in StackOverflowError being thrown.

Co-authored-by: Costin Leau <[email protected]>