-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade ANTLR to version 4.11.1 #12016
Conversation
Signed-off-by: Andriy Redko <[email protected]>
@rmuir @uschindler what kind of (performance? jmh?) testing would help to discard / prove that moving to 4.11.x makes / does not make sense. You have definitely seen traps in the past, I would very appreciate pointers / guidance to explore the possibility of any regressions, thank you in advance. |
the only way i know to prevent the traps is to do like painless and "enable picky mode" which fails test instead of doing slow things. and to have 100% test coverage of grammar! |
Thanks a lot, @rmuir , I will take it from there |
@reta I remember doing this adds overhead, that's why it is a boolean there. so it really just needs to be something we do from tests. for example it could be a package-private setter or similar? |
As far as inspecting coverage, I suspect it is pretty good. But there is instructions in https://github.com/apache/lucene/blob/main/help/tests.txt on how to generate reports. |
here is coverage report using the current antlr. I guess i dont know why so much is missing here: But yeah, if we exercise the possibilities of grammar from tests, AND tests use picky mode, then build will fail if grammar is ambiguous. It is definitely a PITA compared to antlr 3 :( |
cc: @jdconrad who might remember a lot more about this than me |
In general, sorry if i discouraged before, it is really just a frustrating situation If you get stuck, just leave the PR open. I will try to dig into this too, I have been through it before. I do agree we can't stay on old releases forever. But the crazy shit they did between antlr 3.x and 4.x caused me pain:
So I don't wish that on anyone. Currently the expressions is great because it is simple and performs well: Users can represent needs in a simple javascript-like fashion and trust that it has same performance as writing custom java code. |
Thanks for encouraging @rmuir ! I will be working on the matter this week and share my findings, thank you! |
I agree with @rmuir that having an ambiguity check for tests similar to Painless would be great for expressions. I'm a bit surprised this change didn't require much additionally to the regeneration of the lexer/grammar. One other thing I'm curious about is if this suffers from the same issue that Painless did with multiple nested conditionals where we had to separate them into their own rule in the grammar so they didn't do needless backtracking. So that may be something else worth adding a test for. Thanks for looking into upgrading this @reta! |
Simple package-private I don't want to see 100 new constructors/abstractions added with booleans, just because antlr made a bad decision. Our APIs should not have to suffer for it. Don't care what java developers or static analysis tools or whatever else wants to say about it: minimal abstractions here. |
Hi, I agree with all Robert say. I would also like to make another suggestion (that could also be applied when this has been fixed). To me it looks like a bad decission of antlr, that the tool compiles code and creates some Java/Class files but at runtime it also needs some dependency JAR in exactly the correct version. This is a problem for a library like lucene that gets included into other projects. The same also applies for ASM, although this is no longer a big problem anymore, because ASM's APIs are now very stable and it does not matter if a project uses ASM 8 or ASM 9 if minimum requirements are guaranteed (so the user has more flexibility). In contrast javacc/jflex do not have that problem, because javacc/jflex generate all of the code and you don't need any runtime library to execute and access the AST of bytecode or syntax. In Lucene both (ANTLR and ASM) are dependencies of one artifact: lucene-expressions, so it is only a real problem there. My suggestion would be: Let's shade the ANTLR (and possibly also ASM - until the JDK-included bytecode generator is out of incubation/preview) into the JAR file. With Apache Ant build this was hard to do, but with Grade it is quite easy. Just add another configuration with those two dependencies and use the "Gradle Shadow Plugin" (https://imperceptiblethoughts.com/shadow/) to transform their package names into the Lucene namespace. I know some people don't like this, but I had exactly the same problems with ASM in the forbiddenapis plugin and shading ASM in there was the only way to work around different, old, incompatible versions of ASM inside Maven or Gradle's classpath, while forbiddenapis always needeing the newest one. If you like I could make a PR to demonstrate the shading for Lucene's Gradle build in Expressions. @dweiss : What do you think? Of course we should not use shading anywhere else, but ANTLR and ASM are the candidates that always bring problems. Their size is also small so the overhead is small, but you have a consistent package that is unlikely to break when other projects use different library versions. |
Technically - easy to do. Question is whether we want to do it (I don't see any problems here). |
personally i am against the shading. I think it is a huge antipattern, it hides third-party artifacts completely. think about someone trying to do security or license audit and they have "secret dependencies" hidden from view, as an example. I don't understand the issue myself. if you want to use different antlr version in two places just use two classloaders. thats what we did in elastic/opensearch. but like i said, i'm not opposed to upgrading IF concerns about the new version are taken care of. |
I also dont understand the issue where ppl think they can modify arbitrary versions of lucene dependencies. Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on |
You can - I think what Uwe is describing is a problem for downstream projects where Lucene has antlr x.y.z and some other dependency has antlr a.b.c - then namespaces clash and the conflict is not easily resolved. Classloader separation is possible, of course, but it's hardly an easy alternative. :) I personally don't mind shading artifacts but I do agree they are a pain... even tracking down which version a project is actually using is a problem then (because shaded artifacts don't manifest their versions as clearly as a maven dependency). Corporate environments will hate them for legal reasons (for reasons Rob mentioned). |
Yes, java showed what a disaster it is around supply chains with the log4j vulnerability. Shading should not even be considered as an option. |
Hi, What you can always do: Offer a shaded version without those 2 dependencies as a separate artifact. Often seen on maven as "uber" or "shaded" behind version number. If you want to use shaded version, you know consequences. Setting exact version in Maven POMs is not possible, unfortunately. Maven has some tricks (it won't silently upgrade across major versions, but bugfix releases are automatically applied). I don't know exacty how this is handled internally by Maven resolver. My idea would be: publish "lucene-expressions-shaded-x.y.z" in addition to "lucene-expressions-x.y.z" |
I feel that if we publish a shaded jar we become guilty of "contributing to the delinquency" of terrible supply chain management, by hiding third party dependencies and their versions from view. It causes problems for checkers and even humans if there were ever some issue with one of the dependencies (example log4j). They might miss it entirely and get hacked, as an example. Then they will be mad at us for shading in the first place, even though its arguably their fault because they used a shaded jar. Let's just not hand them the gun. |
another way to say it: shading is a terrible, TERRIBLE idea and you only hear about it in java, because java is the only language with developers that are bad enough to consider it. Please, let's not talk about it ever again. |
Hard to disagree with you @rmuir, I think shading is the last resort which might be taken if no other options are possible, I could count only a few legitimate reasons when it makes sense (speaking in general). I think we should not take this path in Lucene, at least with ANTLR4. |
One other possibility would be to hand-roll the expression lexer/parser. This would get rid of the need for any additional dependencies and generated code. From what I can tell the API has been stable for a long time, so I don't think there would be a problem with maintenance. This would have the added benefit of likely improved performance as ANTLR isn't the fastest, and potentially better error messages. |
javacc? |
I was under the impression this was EOL? If it's still well supported I'm not familiar enough with the code generation to know if this would avoid the pitfalls ANTLR has. |
Signed-off-by: Andriy Redko <[email protected]>
@rmuir first pass over pickiness, all tests are now run w/o diagnostics listener (picky / not picky mode), the rough observations so far are encouraging, there are no significant changes in the test duration between these two modes (some random timings below): I am moving further to compare the numbers against |
we don't need to parameterize the pickiness IMO, we can just turn it on in these tests. Thanks for getting this hooked up. I will look at your branch and try to play with it. |
I pushed a proposal to your branch (only fixing one of the tests in .js to use it). If you are really against it, just revert the commit. But i think it keeps tests simpler. |
👍 No objections, thanks a lot for helping out @rmuir , it keeps tests simpler indeed |
sorry i didnt do the other tests, I gotta run for now. i can do them later tonight or tomorrow if you need but just wanted to prototype it out |
package org.apache.lucene.expressions.js; | ||
|
||
/** Settings for expression compiler for javascript expressions. */ | ||
final class JavascriptCompilerSettings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this additional class only holding one boolean? I think a simple additional PKG private compile method taking the boolean picky would have same effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not really, but it could be useful to add more settings later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I hate mutable instances with getters and setters. 🤬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Uwe, let's move to a boolean (actually only need one package-private ctor with the boolean, the way the new base CompilerTestCase class works).
If we have more options to this thing later, we can deal with the issue at that time. It is all package-private anyway.
Signed-off-by: Andriy Redko <[email protected]>
@rmuir tests have been migrated, thank you |
…n favor of boolean Signed-off-by: Andriy Redko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
I will check it out and inspect the coverage report from .js tests and see if there are any holes. If i find them I will push more tests. I am just really paranoid about some of the slow things antlr4 will do, having fought thru these issues with @jdconrad once before. |
I am a bit afraid because those blobs with DFAs growed quite a lot in the patch. So yes, let's review how it behaves. |
showed as untested in coverage report, only need one codepath thru this stuff
i wrestled with it, I think we are good testing-wise. Lots of impossible branches in the generated code so it looks bad at a glance, but I feel like the functionality is covered. |
Cool, thanks for the "huge whitespace" (public·void·testEnormousExpressionSource) test! |
@rmuir @uschindler thanks a lot for HUGE help here guys! |
I forced regeneration with |
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md "Q: What are the main design decisions in ANTLR4? Ease-of-use over performance. I will worry about performance later." It allows us to move forward with newer antlr but hopefully prevent the associated headaches. Signed-off-by: Andriy Redko <[email protected]> Co-authored-by: Robert Muir <[email protected]>
@rmuir |
See 6700b7e |
Signed-off-by: Andriy Redko [email protected]
Description
The Apache Lucene is using quite old version of ANTLR 4.5.1-1. By itself, it is not a showstopper, but more profound issue is that some ANTLR 3.x bits are used [1]. Since ANTLR 4.10.x (or even earlier), the compatibility layer with
3.x
release line has been dropped in4.x
(see please [2]), which makes Apache Lucene impossile to be used with recent ANTLR 4.10.x+ releases [3]. The sample exception is below.[1] https://github.com/apache/lucene/blob/main/lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java#L189
[2] antlr/antlr4@c68e127
[3] opensearch-project/OpenSearch#4546
Closes #11788