SQL: Prevent StackOverflowError when parsing large statements #33902

matriv · 2018-09-20T15:44:01Z

Catch StackOverflowError and return a descriptive message
to the client. This prevents a large statement from killing the cluster.

Fixes: #32942

elasticmachine · 2018-09-20T15:44:05Z

Pinging @elastic/es-search-aggs

nik9000 · 2018-09-20T15:55:33Z

Catch StackOverflowError

I'm not sure that this is a safe thing to do.

matriv · 2018-09-20T16:00:08Z

@nik9000 With this code we return an error message to the client, and the ES node is not affected.
I've searched for any other way of handling it with ANTLR4 but haven't found such. Any suggestions?

astefan · 2018-09-20T15:54:13Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/parser/SqlParser.java

+        try {
+            return visitor.apply(new AstBuilder(paramTokens), tree);
+        } catch (StackOverflowError e) {
+            throw new ParsingException("{} is too large to parse (causes stack overflow)", name);


Would a message of the form "{} cannot be parsed" be more in line with our error messages in general? That way the message says right at the start that something is wrong. Also, does the failure happen because the statement is too large (in length) or because it has too many tokens?

astefan · 2018-09-20T16:04:19Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/parser/SqlParser.java

    }

-    private <T> T invokeParser(String sql, List<SqlTypedParamValue> params, Function<SqlBaseParser, ParserRuleContext> parseFunction,
+    private <T> T invokeParser(String name,


Can you name the name parameter in a more descriptive way? As I see it, it's only used for error reporting purposes, to adapt the message to what type of entity failed to be parsed. Maybe querySourceType?

nik9000 · 2018-09-20T16:06:36Z

Catch StackOverflowError

I'm not sure that this is a safe thing to do.

the ES node is not affected.

That is the part I'm not sure about. My instinct is that stuff like stack overflows and out of memory put the application into a weird state. I'm happy to be convinced I'm wrong about it though.

I know we've had issues in the past with out of memory leaving ES in a half functional state. So we removed catching it. And we lumped stack overflow in with that initiative.

When this has come up in other places we've mostly worked around it by untwisting the recursion into loops. If this is deep into ANTLR's code we really can't do much about it though. So I'm not sure what to do.

romseygeek · 2018-09-20T16:40:33Z

When this has come up in other places we've mostly worked around it by untwisting the recursion into loops

I've run across this problem writing antlr grammars elsewhere, and the general way to solve it in my experience is to switch from recursive matching to multi-matching. So instead of:

expression: part | part OR expression | ...

you do

expression: orexpression | ....
orexpression: part OR part (OR part)*
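
The difference between the two grammar shapes can be sketched outside ANTLR with a small hand-written parser. This is only an illustration under stated assumptions: `OrChainParser`, `countRecursive` and `countIterative` are hypothetical names, the input is a flat token list, and nothing here is taken from the SQL grammar. The recursive variant uses one stack frame per `OR`; the multi-match variant uses a loop and constant stack.

```java
import java.util.List;

final class OrChainParser {
    // Right-recursive shape: expression := part | part OR expression
    // Every OR adds one stack frame, so depth grows with the length of the chain.
    static int countRecursive(List<String> tokens, int pos) {
        if (pos >= tokens.size() || tokens.get(pos).equals("OR")) {
            throw new IllegalArgumentException("expected a part at token " + pos);
        }
        if (pos + 1 >= tokens.size()) {
            return 1; // last part of the chain
        }
        if (!tokens.get(pos + 1).equals("OR")) {
            throw new IllegalArgumentException("expected OR at token " + (pos + 1));
        }
        return 1 + countRecursive(tokens, pos + 2);
    }

    // Multi-match shape: expression := part (OR part)*
    // The chain is consumed in a loop, so stack usage does not depend on its length.
    static int countIterative(List<String> tokens) {
        int pos = 0;
        int parts = 0;
        while (true) {
            if (pos >= tokens.size() || tokens.get(pos).equals("OR")) {
                throw new IllegalArgumentException("expected a part at token " + pos);
            }
            parts++;
            pos++;
            if (pos >= tokens.size()) {
                return parts;
            }
            if (!tokens.get(pos).equals("OR")) {
                throw new IllegalArgumentException("expected OR at token " + pos);
            }
            pos++;
        }
    }
}
```

Both methods return the number of parts in a chain such as `a OR b OR c`. The sketch only covers a flat chain; constructs that nest (parentheses, subselects) would still need recursion.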

costin · 2018-09-20T17:26:18Z

/cc @jdconrad @jpountz

Indeed, in ANTLR the pattern is to catch the Errors, which is bad since all guarantees about the JVM are off. I am of the same opinion as Nik: it's better to just let the Error through since we don't know if the JVM is still usable or not; it's annoying to the user but much better than, say, data corruption or who knows what else.

@romseygeek Thanks for the suggestions. It's worth a try, though that makes the grammar even more complicated and it's not always obvious when a pattern is recursive or not.
If I recall correctly, ANTLR generates recursive-descent parsers (not sure if there's a hard limit to that before backtracking); if that's the case, the error can still occur, though the query has to be significantly bigger.
Furthermore, due to the nature of SQL it might not even be possible (things like WITH or subselects are recursive by nature, though in practice I reckon they are no more than several dozen deep, and that in extreme cases).

I wonder if listeners (or something similar) can be used to stop the parsing before it blows up.

jdconrad · 2018-09-20T20:37:05Z

This problem cannot be solved using ANTLR as it requires recursion for anything more than an extremely simple grammar. There is no way around this since expressions can be of an indeterminate length. I view the problem as what's worse - letting a node die because of a SQL query or at least attempting to recover and catching a SOE. The trade off of catching an SOE seems better here.

rjernst · 2018-09-20T20:47:04Z

Also note that stack overflow in this case cannot have any negative side effects. Normally the side effects to be aware of are partial initialization of objects. Here we are constructing and using an AstBuilder. This object is the only one that could be partially initialized, but in the catch case we do not save it.
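
A minimal sketch of the catch-and-translate approach under discussion, assuming stand-in names: `GuardedParse`, `buildTree` and the nested `ParsingException` are hypothetical and only mimic the shape of the diff above, not the real `SqlParser` or `AstBuilder`. The point it illustrates is that nothing created inside the `try` escapes when the error is caught.

```java
final class GuardedParse {
    // Hypothetical stand-in for the parser's own exception type.
    static final class ParsingException extends RuntimeException {
        ParsingException(String message) {
            super(message);
        }
    }

    // Stand-in for the recursive tree walk: recursion depth equals `depth`.
    static int buildTree(int depth) {
        return depth == 0 ? 0 : 1 + buildTree(depth - 1);
    }

    // Runs the recursive walk and converts a stack overflow into a parsing error.
    // Only the result of a successful walk is returned; on overflow every
    // intermediate value is discarded together with the unwound frames.
    static int parse(String name, int depth) {
        try {
            return buildTree(depth);
        } catch (StackOverflowError e) {
            throw new ParsingException(name + " is too large to parse (causes stack overflow)");
        }
    }
}
```

With a small depth, `GuardedParse.parse("statement", 10)` returns normally; with a depth far beyond what the thread stack can hold, the caller sees a `ParsingException` instead of the raw error. The sketch says nothing about state outside the `try` block, which is the part debated in this thread.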

matriv · 2018-09-21T10:12:43Z

Please check the approach here: ff67d02

matriv · 2018-09-21T10:15:25Z

@elasticmachine retest this please

matriv · 2018-09-21T11:41:27Z

My hesitation with this CircuitBreaker approach is that we cannot reset the counter for each part of the tree. This means that if we limit the elements to 100 for example, then in the query: SELECT a=b OR a=b OR... FROM t WHERE a=b AND ..., noElements in the SELECT clause + noElements in the WHERE clause must be <= 100.

It will also unnecessarily catch expressions like: SELECT a1=b1 OR a2=b2, a3=b2 AND a4=b4, ...
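
The limitation described above can be sketched with a plain counter. This is an illustration only: `ExpressionCounterBreaker`, `enterExpression` and `visitClause` are hypothetical names and do not reproduce the code in ff67d02. Because the counter is never reset, two clauses that are each under the limit can still trip the breaker together.

```java
final class ExpressionCounterBreaker {
    // Hypothetical exception raised when the breaker trips.
    static final class TooLargeException extends RuntimeException {
        TooLargeException(String message) {
            super(message);
        }
    }

    private final int limit;
    private int seen; // total elements seen so far, never reset between clauses

    ExpressionCounterBreaker(int limit) {
        this.limit = limit;
    }

    // Called once for every expression element the parser enters.
    void enterExpression() {
        seen++;
        if (seen > limit) {
            throw new TooLargeException("statement has more than " + limit + " expression elements");
        }
    }

    // Feeds the elements of one clause (for example SELECT or WHERE) through the breaker.
    void visitClause(int elements) {
        for (int i = 0; i < elements; i++) {
            enterExpression();
        }
    }
}
```

With a limit of 100, a clause of 60 elements passes, but a second clause of 60 elements in the same statement trips the breaker, since 60 + 60 exceeds the shared limit.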

nik9000 · 2018-09-21T12:05:20Z

If @rjernst is ok with catching StackOverflowError here then I am as well, so long as we leave a big comment making it clear about why it is ok here. It'd be nice to link to some documentation about how StackOverflowError works and what guarantees it gives us.

matriv · 2018-09-21T12:55:55Z

@nik9000 Check the new implementation: 7e09985 (@costin 's idea)

jdconrad · 2018-09-21T16:19:49Z

There are no guarantees the circuit breaker presented here won't cause an SOF. Each stack frame can have a varying size based on the input expression. The stack frame size can also be controlled via JVM parameter so there's no guarantee of being able to calculate based on the limited number of recursions.

matriv · 2018-09-21T17:33:12Z

There are no guarantees the circuit breaker presented here won't cause an SOF. Each stack frame can have a varying size based on the input expression. The stack frame size can also be controlled via JVM parameter so there's no guarantee of being able to calculate based on the limited number of recursions.

I agree with that, but after searching and reading a bit about cases of catching the StackOverflowError, I'm not sure we can have guarantees that the JVM can continue to function without problems:

https://stackoverflow.com/questions/28551767/is-it-safe-to-catch-stackoverflowerror-in-java
https://softwareengineering.stackexchange.com/questions/209099/is-it-ever-okay-to-catch-stackoverflowerror-in-java
https://stackoverflow.com/questions/22128485/why-is-it-possible-to-recover-from-a-stackoverflowerror
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.5.2
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.3

According to the latter: "This specification cannot predict where internal errors or resource limitations may be encountered and does not mandate precisely when they can be reported. Thus, any of the VirtualMachineError subclasses defined below may be thrown at any time during the operation of the Java Virtual Machine."

matriv · 2018-09-22T18:15:27Z

@elasticmachine retest this please.

costin · 2018-09-24T09:40:19Z

LGTM

matriv · 2018-09-24T09:43:05Z

@nik9000 @jdconrad @rjernst So, what do you think about the 2 solutions (catching StackOverflowError vs the CircuitBreakerListener) ?

astefan

LGTM

costin

LGTM

Implement circuit breaker logic in the parser which catches expressions that can blow up the tree and result in StackOverflowError being thrown. Co-authored-by: Costin Leau <[email protected]>

matriv · 2018-09-25T17:33:46Z

Backported to 6.x with 6a4e841

rjernst · 2018-09-25T22:34:32Z

@matriv Sorry I was on vacation and could not respond. I think catching StackOverflowError is the correct thing to do. We can prove based on what a stack overflow means that the state of the system has not been corrupted. We know this because we do not hang onto any objects that were being created when the SOE occurred (which is just the AstBuilder), and there are no statics that were potentially being modified.

Additionally, the commit message for this PR is deceptive because an SOE can still occur, it is just less likely now.

costin · 2018-09-26T09:35:45Z

We don't have any guarantees regarding the method execution itself or its side-effects.
Take lazy loading: it's possible that class-loading occurs during the tree creation and some of the classes have static initializers for constants and such; we don't know whether these have been partially initialized or not.
Or JIT-ing, which might occur especially in a highly recursive scenario: is the jitted code affected? Does its virtual stack still hold? It is being unwound so there's some safety, but then again, there's also an Error being thrown.
If I recall correctly, in HotSpot methods share the stack frames with C/C++ native code and the VM itself (JNI calls can also throw SOE), so I would expect that invariants simply no longer hold and shutting things down is the only option.

jdconrad · 2018-09-26T15:22:32Z

@costin Is there anything that will change your mind that catching the SOE is the better approach to this issue?

rjernst · 2018-09-26T15:30:13Z

Take lazy loading: it's possible that class-loading occurs during the tree creation

This is true, but stack overflow occurs from deep recursion. Classloading would have happened at the top of the recursion tree, long before we are deep enough for stack overflow.

Or JIT-ing which might occur especially in a highly recursive scenario

JIT happens in a separate system thread. It does not stop the world in order to compile a method to native code.

jpountz · 2018-09-26T16:33:45Z

Even though catching stack overflow errors is sometimes ok, the challenging part is to keep this code correct over time. Eg. unrelated refactorings might mistakenly fold other statements under the try statement. It makes me like the circuit breaker idea a bit better: if the threshold is reasonable enough then chances that you hit a stack overflow without hitting the circuit breaker first are very thin. @romseygeek's idea is probably worth exploring as well.

costin · 2018-09-27T12:24:31Z

@jdconrad
Catching the SOE was the initial approach I used and what I found in most ANTLR examples. @jpountz brought this up and I couldn't come up with strong guarantees that the SOE has no side-effects, especially in a long-running JVM like ES.
I'm happy to be proven otherwise.

@rjernst

Classloading would have happened at the top of the recursion tree, long before we are deep enough for stack overflow.

I'm not sure that applies for lazy inner classes (init-on-demand idiom) that can appear at the tail of a deep recursion; e.g. WHERE A OR B OR C .... OR (1000 AND ABS(POWER(1-2, 1) > 10) - the OR can fill the stack while the nested expression at the end, which triggers the error, can cause classes to be loaded that haven't been loaded before.
This can be alleviated; however, it requires the code to be analyzed and kept in check in the future, which is far from ideal.
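
For readers unfamiliar with the init-on-demand idiom mentioned here, a minimal sketch follows. `LazyHolderDemo`, `Holder` and the `holderInitialized` flag are hypothetical names used only to make the timing visible. The nested class is initialized on first access, which can be at any stack depth.

```java
final class LazyHolderDemo {
    // Flag used only to observe when Holder's static initializer runs.
    static boolean holderInitialized = false;

    // Init-on-demand holder: the JVM initializes this class on first use,
    // not when LazyHolderDemo itself is initialized.
    private static final class Holder {
        static final int[] TABLE = build();

        private static int[] build() {
            holderInitialized = true;
            return new int[] {1, 2, 3};
        }
    }

    // First call triggers Holder's static initializer on the caller's stack.
    static int first() {
        return Holder.TABLE[0];
    }
}
```

If an Error such as StackOverflowError is thrown while a static initializer like `build()` is running, the Java Language Specification leaves the class in an erroneous state and later uses fail with `NoClassDefFoundError`, which is the kind of lingering side effect being raised.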

JIT happens in a separate system thread. It does not stop the world in order to compile a method to native code.

Right however it's not the compilation of the JIT but the effect of the SOE on the native (JIT-ed) stack - I'm not aware of any guarantees on that front.

Regarding the breaker - the default stack size for ES is 1M while the breaker looks for a parsing depth/width of 100 calls, we can limit this further if need be.
We could make this dynamic assuming we have the hooks to do so - basically count the current stack size and bail out if we're too close. @danielmitterdorfer pointed to the stack walker API; another alternative would be to check how big the stack trace (`Thread.currentThread().getStackTrace()`) is and only allow it to grow by X every Y calls, but then again this has its own cost.
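
A rough sketch of the two ways of measuring the current depth mentioned above, assuming Java 9+ for `StackWalker`. `StackDepthProbe` and its methods are hypothetical names. Both approaches count frames, not bytes, so they say nothing about how large each frame is.

```java
final class StackDepthProbe {
    // Counts the frames of the current thread with the StackWalker API (Java 9+).
    static long depthViaStackWalker() {
        return StackWalker.getInstance().walk(frames -> frames.count());
    }

    // Counts the frames by materializing the full stack trace, which is costlier.
    static int depthViaStackTrace() {
        return Thread.currentThread().getStackTrace().length;
    }

    // Recurses `extraCalls` times, then reports the depth seen at the bottom.
    static long depthAfter(int extraCalls) {
        return extraCalls == 0 ? depthViaStackWalker() : depthAfter(extraCalls - 1);
    }
}
```

A breaker built on this could compare the measured depth against a budget and bail out early, at the price of walking the stack on every check.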

Thoughts?

rjernst · 2018-10-04T00:16:37Z

I'm not sure that applies for lazy inner classes (init-on-demand idiom) that can appear at the tail of a deep recursion; e.g. WHERE A OR B OR C .... OR (1000 AND ABS(POWER(1-2, 1) > 10) - the OR can fill the stack while the nested expression at the end, which triggers the error, can cause classes to be loaded that haven't been loaded before.

I can see how this could be a problem, given the current design of the sql parser. In painless, all classes are loaded up front, through the whitelist. I imagine something similar could be done in sql.

Right however it's not the compilation of the JIT but the effect of the SOE on the native (JIT-ed) stack - I'm not aware of any guarantees on that front.

I'm not sure what you mean. The JIT'd code's stack would be structured the exact same as the original code, otherwise the jvm could never safely switch to the JIT'd method's implementation in the middle of recursive calls.

Regarding the breaker ...

The ideas here all sound plausible (although I would avoid checking the stack trace as I think the generation is dynamic, thus incurring cost to look at it). But I don't think anything can guarantee there is no stack overflow (without essentially creating our own parallel stack overflow tracking).

@jpountz re: catching SOE

the challenging part is to keep this code correct over time

I think this is always the challenge with any pieces of code that have interdependencies. I can see how the sql implementation is much more complicated than painless, but in the end, the idea seems to be to convert a text representation of the sql query into something that can be sent as an elasticsearch query. Given the query should be the only output, I think we can make catching SOE work. But I don't care that much, as long as the messaging of this change does not claim to "prevent" SOE, when it only makes it less likely.

SQL: Handle StackOverflowError when parsing large statements

bc842da

matriv added >bug v7.0.0 :Analytics/SQL SQL querying v6.5.0 labels Sep 20, 2018

matriv requested review from costin and astefan September 20, 2018 15:44

Fixed imports

782a71d

astefan reviewed Sep 20, 2018

matriv force-pushed the mt/fix-32942 branch from 5b46812 to d62deb1 Compare September 21, 2018 10:06

matriv added 2 commits September 21, 2018 12:11

Introduce CircuitBreaker in the Parser

ff67d02

Merge remote-tracking branch 'upstream/master' into mt/fix-32942

37fb6bf

matriv force-pushed the mt/fix-32942 branch from d62deb1 to 37fb6bf Compare September 21, 2018 10:12

matriv added 2 commits September 21, 2018 14:53

Full implementation of tree depth Circuit Breaker

7e09985

Merge remote-tracking branch 'upstream/master' into mt/fix-32942

7e26d1c

Added one more test for complex tree

2642fe7

matriv added 3 commits September 21, 2018 18:50

Added more comments

50f58c6

Merge remote-tracking branch 'upstream/master' into mt/fix-32942

89f90fc

fix tests

4e2418b

remove unused import

1bfb98e

Merge remote-tracking branch 'upstream/master' into mt/fix-32942

abe742e

astefan approved these changes Sep 25, 2018

costin approved these changes Sep 25, 2018

matriv merged commit 5840be6 into elastic:master Sep 25, 2018

matriv changed the title ~~SQL: Handle StackOverflowError when parsing large statements~~ SQL: Prevent StackOverflowError when parsing large statements Sep 25, 2018

matriv deleted the mt/fix-32942 branch September 25, 2018 17:23

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

astefan mentioned this pull request Apr 4, 2019

SQL: document the limitations around parsing #40836

Closed
