Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade ANTLR to version 4.11.1 #12016

Merged
merged 9 commits into from
Dec 16, 2022
Merged

Upgrade ANTLR to version 4.11.1 #12016

merged 9 commits into from
Dec 16, 2022

Conversation

reta
Copy link
Member

@reta reta commented Dec 13, 2022

Signed-off-by: Andriy Redko [email protected]

Description

The Apache Lucene is using quite old version of ANTLR 4.5.1-1. By itself, it is not a showstopper, but more profound issue is that some ANTLR 3.x bits are used [1]. Since ANTLR 4.10.x (or even earlier), the compatibility layer with 3.x release line has been dropped in 4.x (see please [2]), which makes Apache Lucene impossile to be used with recent ANTLR 4.10.x+ releases [3]. The sample exception is below.

   >         java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 3 (expected 4).
   >             at [email protected]/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:56)
   >             at [email protected]/org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:48)
   >             at [email protected]/org.apache.lucene.expressions.js.JavascriptLexer.<clinit>(JavascriptLexer.java:279)

[1] https://github.com/apache/lucene/blob/main/lucene/expressions/src/java/org/apache/lucene/expressions/js/JavascriptLexer.java#L189
[2] antlr/antlr4@c68e127
[3] opensearch-project/OpenSearch#4546

Closes #11788

@reta
Copy link
Member Author

reta commented Dec 13, 2022

@rmuir @uschindler what kind of (performance? jmh?) testing would help to discard / prove that moving to 4.11.x makes / does not make sense. You have definitely seen traps in the past, I would very appreciate pointers / guidance to explore the possibility of any regressions, thank you in advance.

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

the only way i know to prevent the traps is to do like painless and "enable picky mode" which fails test instead of doing slow things. and to have 100% test coverage of grammar!

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

@reta
Copy link
Member Author

reta commented Dec 13, 2022

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

@reta I remember doing this adds overhead, that's why it is a boolean there. so it really just needs to be something we do from tests. for example it could be a package-private setter or similar?

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

As far as inspecting coverage, I suspect it is pretty good. But there is instructions in https://github.com/apache/lucene/blob/main/help/tests.txt on how to generate reports.

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

here is coverage report using the current antlr. I guess i dont know why so much is missing here:

https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/618/jacoco/org.apache.lucene.expressions.js/

But yeah, if we exercise the possibilities of grammar from tests, AND tests use picky mode, then build will fail if grammar is ambiguous. It is definitely a PITA compared to antlr 3 :(

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

cc: @jdconrad who might remember a lot more about this than me

@rmuir
Copy link
Member

rmuir commented Dec 13, 2022

In general, sorry if i discouraged before, it is really just a frustrating situation

If you get stuck, just leave the PR open. I will try to dig into this too, I have been through it before. I do agree we can't stay on old releases forever.

But the crazy shit they did between antlr 3.x and 4.x caused me pain:

  • made the mistake of not spelunking thru antlr guts to figure out how to prevent performance traps
  • had users report performance bugs (in all cases it was some grammar tweak to fix)
  • fix these performance bugs in subsequent releases
  • get frustrated with the leniency and figure out how to make the shit picky again.

So I don't wish that on anyone. Currently the expressions is great because it is simple and performs well: Users can represent needs in a simple javascript-like fashion and trust that it has same performance as writing custom java code.

@reta
Copy link
Member Author

reta commented Dec 13, 2022

Thanks for encouraging @rmuir ! I will be working on the matter this week and share my findings, thank you!

@jdconrad
Copy link
Contributor

jdconrad commented Dec 13, 2022

I agree with @rmuir that having an ambiguity check for tests similar to Painless would be great for expressions. I'm a bit surprised this change didn't require much additionally to the regeneration of the lexer/grammar. One other thing I'm curious about is if this suffers from the same issue that Painless did with multiple nested conditionals where we had to separate them into their own rule in the grammar so they didn't do needless backtracking. So that may be something else worth adding a test for. Thanks for looking into upgrading this @reta!

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

Simple package-private static method to turn on the pickiness should do it.

I don't want to see 100 new constructors/abstractions added with booleans, just because antlr made a bad decision. Our APIs should not have to suffer for it.

Don't care what java developers or static analysis tools or whatever else wants to say about it: minimal abstractions here.

@uschindler
Copy link
Contributor

uschindler commented Dec 14, 2022

Hi, I agree with all Robert say. I would also like to make another suggestion (that could also be applied when this has been fixed). To me it looks like a bad decission of antlr, that the tool compiles code and creates some Java/Class files but at runtime it also needs some dependency JAR in exactly the correct version. This is a problem for a library like lucene that gets included into other projects. The same also applies for ASM, although this is no longer a big problem anymore, because ASM's APIs are now very stable and it does not matter if a project uses ASM 8 or ASM 9 if minimum requirements are guaranteed (so the user has more flexibility).

In contrast javacc/jflex do not have that problem, because javacc/jflex generate all of the code and you don't need any runtime library to execute and access the AST of bytecode or syntax.

In Lucene both (ANTLR and ASM) are dependencies of one artifact: lucene-expressions, so it is only a real problem there. My suggestion would be: Let's shade the ANTLR (and possibly also ASM - until the JDK-included bytecode generator is out of incubation/preview) into the JAR file. With Apache Ant build this was hard to do, but with Grade it is quite easy. Just add another configuration with those two dependencies and use the "Gradle Shadow Plugin" (https://imperceptiblethoughts.com/shadow/) to transform their package names into the Lucene namespace.

I know some people don't like this, but I had exactly the same problems with ASM in the forbiddenapis plugin and shading ASM in there was the only way to work around different, old, incompatible versions of ASM inside Maven or Gradle's classpath, while forbiddenapis always needeing the newest one.

If you like I could make a PR to demonstrate the shading for Lucene's Gradle build in Expressions. @dweiss : What do you think? Of course we should not use shading anywhere else, but ANTLR and ASM are the candidates that always bring problems. Their size is also small so the overhead is small, but you have a consistent package that is unlikely to break when other projects use different library versions.

@dweiss
Copy link
Contributor

dweiss commented Dec 14, 2022

With Apache Ant this was hard to do but with Grade it is quite easy. Just add another configuration with those two dependencies and use the "Gradle Shadow Plugin" (https://imperceptiblethoughts.com/shadow/) to transform their package names into the Lucene namespace.

Technically - easy to do. Question is whether we want to do it (I don't see any problems here).

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

personally i am against the shading. I think it is a huge antipattern, it hides third-party artifacts completely. think about someone trying to do security or license audit and they have "secret dependencies" hidden from view, as an example.

I don't understand the issue myself. if you want to use different antlr version in two places just use two classloaders. thats what we did in elastic/opensearch.

but like i said, i'm not opposed to upgrading IF concerns about the new version are taken care of.

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

I also dont understand the issue where ppl think they can modify arbitrary versions of lucene dependencies.

Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.

@dweiss
Copy link
Contributor

dweiss commented Dec 14, 2022

Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.

You can - I think what Uwe is describing is a problem for downstream projects where Lucene has antlr x.y.z and some other dependency has antlr a.b.c - then namespaces clash and the conflict is not easily resolved. Classloader separation is possible, of course, but it's hardly an easy alternative. :)

I personally don't mind shading artifacts but I do agree they are a pain... even tracking down which version a project is actually using is a problem then (because shaded artifacts don't manifest their versions as clearly as a maven dependency). Corporate environments will hate them for legal reasons (for reasons Rob mentioned).

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

Yes, java showed what a disaster it is around supply chains with the log4j vulnerability. Shading should not even be considered as an option.

@uschindler
Copy link
Contributor

uschindler commented Dec 14, 2022

Can we specify our dependencies in a different way (e.g. exact version) in our maven stuff so this won't happen? e.g. you can do this in python, and specify that you depend on antlr == x.y.z rather than just depend on antlr.

You can - I think what Uwe is describing is a problem for downstream projects where Lucene has antlr x.y.z and some other dependency has antlr a.b.c - then namespaces clash and the conflict is not easily resolved. Classloader separation is possible, of course, but it's hardly an easy alternative. :)

I personally don't mind shading artifacts but I do agree they are a pain... even tracking down which version a project is actually using is a problem then (because shaded artifacts don't manifest their versions as clearly as a maven dependency). Corporate environments will hate them for legal reasons (for reasons Rob mentioned).

Hi,
as said before this was just a suggestion from my experience with forbiddenapis with some artifacts like "ASM" and "ANTLR". They are always a pain. From the secruity perspective there are problems, but you can also see it like "code copied" - we can of course also do this, but that's a lot more hassle.

What you can always do: Offer a shaded version without those 2 dependencies as a separate artifact. Often seen on maven as "uber" or "shaded" behind version number. If you want to use shaded version, you know consequences.

Setting exact version in Maven POMs is not possible, unfortunately. Maven has some tricks (it won't silently upgrade across major versions, but bugfix releases are automatically applied). I don't know exacty how this is handled internally by Maven resolver.

My idea would be: publish "lucene-expressions-shaded-x.y.z" in addition to "lucene-expressions-x.y.z"
That is also what the gradle-plugin is doing, it adds another coordinate.

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

I feel that if we publish a shaded jar we become guilty of "contributing to the delinquency" of terrible supply chain management, by hiding third party dependencies and their versions from view. It causes problems for checkers and even humans if there were ever some issue with one of the dependencies (example log4j). They might miss it entirely and get hacked, as an example. Then they will be mad at us for shading in the first place, even though its arguably their fault because they used a shaded jar.

Let's just not hand them the gun.

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

another way to say it: shading is a terrible, TERRIBLE idea and you only hear about it in java, because java is the only language with developers that are bad enough to consider it.

Please, let's not talk about it ever again.

@reta
Copy link
Member Author

reta commented Dec 14, 2022

Please, let's not talk about it ever again.

Hard to disagree with you @rmuir, I think shading is the last resort which might be taken if no other options are possible, I could count only a few legitimate reasons when it makes sense (speaking in general). I think we should not take this path in Lucene, at least with ANTLR4.

@jdconrad
Copy link
Contributor

One other possibility would be to hand-roll the expression lexer/parser. This would get rid of the need for any additional dependencies and generated code. From what I can tell the API has been stable for a long time, so I don't think there would be a problem with maintenance. This would have the added benefit of likely improved performance as ANTLR isn't the fastest, and potentially better error messages.

@uschindler
Copy link
Contributor

One other possibility would be to hand-roll the expression lexer/parser. This would get rid of the need for any additional dependencies and generated code. From what I can tell the API has been stable for a long time, so I don't think there would be a problem with maintenance. This would have the added benefit of likely improved performance as ANTLR isn't the fastest, and potentially better error messages.

javacc?

@jdconrad
Copy link
Contributor

javacc?

I was under the impression this was EOL? If it's still well supported I'm not familiar enough with the code generation to know if this would avoid the pitfalls ANTLR has.

@reta
Copy link
Member Author

reta commented Dec 14, 2022

@rmuir first pass over pickiness, all tests are now run w/o diagnostics listener (picky / not picky mode), the rough observations so far are encouraging, there are no significant changes in the test duration between these two modes (some random timings below):

image

I am moving further to compare the numbers against main and to conduct more precise measurements (instead of rough test timings).

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

we don't need to parameterize the pickiness IMO, we can just turn it on in these tests. Thanks for getting this hooked up. I will look at your branch and try to play with it.

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

I pushed a proposal to your branch (only fixing one of the tests in .js to use it). If you are really against it, just revert the commit. But i think it keeps tests simpler.

@reta
Copy link
Member Author

reta commented Dec 14, 2022

I pushed a proposal to your branch (only fixing one of the tests in .js to use it). If you are really against it, just revert the commit. But i think it keeps tests simpler.

👍 No objections, thanks a lot for helping out @rmuir , it keeps tests simpler indeed

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

sorry i didnt do the other tests, I gotta run for now. i can do them later tonight or tomorrow if you need but just wanted to prototype it out

lucene/expressions/src/java/module-info.java Show resolved Hide resolved
package org.apache.lucene.expressions.js;

/** Settings for expression compiler for javascript expressions. */
final class JavascriptCompilerSettings {
Copy link
Contributor

@uschindler uschindler Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this additional class only holding one boolean? I think a simple additional PKG private compile method taking the boolean picky would have same effect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not really, but it could be useful to add more settings later on

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I hate mutable instances with getters and setters. 🤬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Uwe, let's move to a boolean (actually only need one package-private ctor with the boolean, the way the new base CompilerTestCase class works).

If we have more options to this thing later, we can deal with the issue at that time. It is all package-private anyway.

@reta
Copy link
Member Author

reta commented Dec 14, 2022

sorry i didnt do the other tests, I gotta run for now. i can do them later tonight or tomorrow if you need but just wanted to prototype it out

@rmuir tests have been migrated, thank you

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.

@rmuir
Copy link
Member

rmuir commented Dec 14, 2022

I will check it out and inspect the coverage report from .js tests and see if there are any holes. If i find them I will push more tests. I am just really paranoid about some of the slow things antlr4 will do, having fought thru these issues with @jdconrad once before.

@uschindler
Copy link
Contributor

I will check it out and inspect the coverage report from .js tests and see if there are any holes. If i find them I will push more tests. I am just really paranoid about some of the slow things antlr4 will do, having fought thru these issues with @jdconrad once before.

I am a bit afraid because those blobs with DFAs growed quite a lot in the patch. So yes, let's review how it behaves.

@reta reta marked this pull request as ready for review December 14, 2022 23:05
showed as untested in coverage report, only need one codepath thru this
stuff
@rmuir
Copy link
Member

rmuir commented Dec 15, 2022

i wrestled with it, I think we are good testing-wise. Lots of impossible branches in the generated code so it looks bad at a glance, but I feel like the functionality is covered.

@uschindler
Copy link
Contributor

uschindler commented Dec 15, 2022

Cool, thanks for the "huge whitespace" (public·void·testEnormousExpressionSource) test!

@reta
Copy link
Member Author

reta commented Dec 15, 2022

@rmuir @uschindler thanks a lot for HUGE help here guys!

@rmuir
Copy link
Member

rmuir commented Dec 16, 2022

I forced regeneration with ./gradlew -p lucene/expressions regenerate --rerun-tasks just to ensure there were no source code changes and regeneration is idempotent / reproducible.

@rmuir rmuir merged commit 945d7fe into apache:main Dec 16, 2022
@rmuir rmuir added this to the 9.5.0 milestone Dec 16, 2022
asfgit pushed a commit that referenced this pull request Dec 16, 2022
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <[email protected]>
Co-authored-by: Robert Muir <[email protected]>
@reta
Copy link
Member Author

reta commented Dec 16, 2022

@rmuir would it be possible to backport it to 9.5? thank you! nwd, I saw it in lucene_9x already, thanks again

@uschindler
Copy link
Contributor

See 6700b7e
We just cherrypick, no need for PRs (they are only required if backports are complex).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade ANTLR to version 4.11.1
5 participants