
ATN serialization improvements (Java only for demo) #3505

Closed
wants to merge 6 commits

Conversation

@KvanTTT commented Jan 21, 2022

public abstract class ATNDataReader {
protected abstract byte readByte();

public int read() {
Member:

Is there a standard base64 decoder somewhere?

Member:

Oh, I see the base64 below now. Hmm... adding a general comment outside.

@KvanTTT commented Jan 22, 2022

It's a binary reader that uses abstract readByte calls. It mostly reads integers in a compact format (described below). readByte is implemented in ATNDataReaderBase64 (which reads from a base64 string) and in ATNDataReaderByteBuffer.
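For illustration, a minimal sketch of that pattern (assumed shapes, not the PR's actual classes; note that using the JDK decoder here allocates the full byte[] up front, which is exactly the cost the PR's in-place base64 reader avoids):

import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical sketch: the abstract reader yields bytes; subclasses decide
// where the bytes come from.
abstract class ATNDataReaderSketch {
    protected abstract byte readByte();
}

class Base64ReaderSketch extends ATNDataReaderSketch {
    private final byte[] decoded;
    private int pos;
    Base64ReaderSketch(String base64) {
        decoded = Base64.getDecoder().decode(base64); // eager decode for simplicity
    }
    @Override protected byte readByte() { return decoded[pos++]; }
}

class ByteBufferReaderSketch extends ATNDataReaderSketch {
    private final ByteBuffer buffer;
    ByteBufferReaderSketch(ByteBuffer buffer) { this.buffer = buffer; }
    @Override protected byte readByte() { return buffer.get(); }
}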

@parrt commented Jan 22, 2022

Sounds amazing. But I'm trying to remember how this works. Let's see: we go from a list of ints (32 bits) to strings (lists of 16-bit chars) because Java only statically initializes strings (whereas arrays are initialized one element at a time in code, which explodes code size). OK, can you explain how the base64 helps here?

Is it that we are now going to 32-bit ATN data, represented as a few base64 chars?

import java.nio.ByteBuffer;
import java.util.UUID;

public class ATNDataWriter {
Member:

If we go the base64 route, shouldn't we use existing encoders/decoders to avoid bugs? https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html
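For reference, the JDK API in question (a tiny round-trip demo; the string value here is arbitrary):

import java.util.Base64;

class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] data = Base64.getDecoder().decode("QU5UTFI=");    // String -> byte[] ("ANTLR")
        String text = Base64.getEncoder().encodeToString(data);  // byte[] -> String
        System.out.println(new String(data) + " / " + text);     // ANTLR / QU5UTFI=
    }
}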

@KvanTTT commented Jan 22, 2022

Sure we can. But I tried to get rid of excess big-array allocations in the deserializer (there are 2 additional allocations if we use the standard methods). It may affect application startup time.

Member:

Well, if it's a trade between an alloc that gets collected vs. homegrown base64, it seems safer to take the extra GC hit.

@KvanTTT commented Jan 22, 2022

> OK, can you explain how the base64 helps here?

Because it is smaller than the plain char[] representation, at least in source code. Also, it's a more standard format that does not require weird increments and decrements as a storage optimization (no need to think about how the binary data is encoded into a string).

Consider three values: 128 128 128. In the plain char representation they are encoded as \u0080\u0080\u0080, which takes 18 chars in the text file. With base64 they are encoded as gICAgICA, which takes only 8 chars (2 bytes per value * 3 values * 4/3 encoding ratio). That's more than 2 times smaller.

I've checked the LargeLexer test: with base64 encoding, the generated parser takes 856 KB instead of 1331 KB with plain char encoding (but 846 KB instead of 716 KB in compiled form).

Anyway, the output format can be either plain chars or base64 chars, since it does not depend on the binary representation (it's converted in SerializedATN). Different targets may use different output formats.

> Is it that we are now going to 32-bit ATN data, represented as a few base64 chars?

Now the binary data and the string representation are split (a ByteBuffer is used for the binary data, which is transformed into a base64 char[] later). Both output encodings can represent 32-bit ATN data.
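A minimal sketch of that split, assuming the JDK encoder (hypothetical helper, not the PR's ATNDataWriter/SerializedATN plumbing):

import java.nio.ByteBuffer;
import java.util.Base64;

class AtnToBase64Sketch {
    // The binary serialization lives in the ByteBuffer; the text form is
    // produced only at the end, so a target can swap in a different encoding.
    static char[] toBase64Chars(ByteBuffer data) {
        byte[] bytes = new byte[data.remaining()];
        data.get(bytes); // copy the serialized ATN bytes out of the buffer
        return Base64.getEncoder().encodeToString(bytes).toCharArray();
    }
}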

@parrt commented Jan 22, 2022

Verrrrry interesting. Thanks for the explanation. I'll have to think hard on this, but we should keep the big picture in mind. What we have works, and we're suggesting replacing it with a more general solution, with unknown issues/bugs, to solve the uncommon case of > 64k states. Questions:

  1. Do we care about Java file size? Hmm... probably yes, but not that much. It could affect the ability to edit/view Java files in an editor.
  2. Does base64 have line breaks? I don't think so, which would mean a hyper-long line in the editor.
  3. Do we care about .class file size? Probably yes, but not that much; it affects the initial load time of the jar and some string init time if strings get really big.
  4. Do we care about the ATN list-of-states size? Yes, but they are all ints now anyway, so we save nothing at parse time.

Currently we do handle 32-bit Unicode chars, but with two 16-bit \uXXXX values. From IntegerList:

/** Convert the list to a UTF-16 encoded char array. If all values are less
*  than the 0xFFFF 16-bit code point limit then this is just a char array
*  of 16-bit char as usual. For values in the supplementary range, encode
* them as two UTF-16 code units.
*/
public final char[] toCharArray() {...}

This forces 2 chars (32 bits) for all elements of the list if any element needs >16 bits. Why not just always do that? For the existing case, literally nothing changes. For the edge case, we use 2x the string length in the generated Java file and class file, plus the allocated serialized string during class loading.

I wonder if this simple solution is the least intrusive and risky.
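For reference, the JDK itself can produce the two-code-unit form the javadoc above describes; a small demo of that behavior (standard Java, not the IntegerList code):

class Utf16SplitDemo {
    public static void main(String[] args) {
        // U+1F600 is above 0xFFFF, so it needs two UTF-16 code units (a surrogate pair).
        char[] units = Character.toChars(0x1F600);
        System.out.printf("%d units: %04X %04X%n",
                units.length, (int) units[0], (int) units[1]);
        // output: 2 units: D83D DE00 (high and low surrogate)
    }
}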

@ericvergnaud commented

My 2 cents:

  • I care about initialization time. With the growth of serverless, this will become increasingly important. I believe the existing implementation is optimal since it doesn't parse the string; rather, it casts its content (an array of chars) to an array of ints (toInt() is a no-op once optimized by the JIT). Switching to base64 is likely to be slower. The only transcoding that happens is from utf8[] in the class file to char[] in RAM.
  • I care about memory usage. With the growth of mobile, keeping memory low is becoming necessary again, because on those devices adding RAM is not an option. As mentioned by @KvanTTT, the class file is smaller with the current version than it would be using base64, so this proposal doesn't help. Also, I suspect that in Java _serializedATN is never GC'd since it's final, so maybe we could wrap it inside a dynamically loaded child class that provides it (and also does the -2 shift)? (See https://stackoverflow.com/questions/2433261/when-and-how-are-classes-garbage-collected-in-java.)

I would want to see metrics to help reach a decision: load time, and memory footprint after a forced GC.

@parrt commented Jan 23, 2022

Great points, @ericvergnaud. Yes, I can see fast startup/small class file size being important on mobile. The existing mechanism is the smallest/fastest. If we adjust it so the max ATN state can be > 16 bits, then the existing func will simply encode all ints as two 16-bit chars. This provides a new capability for the small group that needs a huge ATN w/o breaking anything. Most mobile code won't need this.

Re GC of the ATN string: if we remove final, we can then set it to null when done, right? It should then be GC'd. I looked via javap, and the only difference of significance in the .class file is a tiny static {} section that sets the _serializedATN field from a constant (3 bytecode instructions). What's the easiest way to test whether it's GC'd? jmap? jcmd? -XX:HeapDumpPath?

Actually, I don't think we care about _serializedATN. Shouldn't the String object just be a wrapper around the constant in the .class constant pool? I.e., just a few bytes of overhead.

I therefore think we should proceed with the smallest tweak that allows 32-bit ATN state numbers while preserving the existing static string mechanism. Also, let's try getting _serializedATN to GC by removing final. Sound good?

@ericvergnaud commented Jan 23, 2022

Sounds great indeed!
No need to test GC with your solution; we know it will work. I would also make the field private, which will signal to suicidal users that their attempt was successful ;-)
I wanted to paste a Braveheart gif here but it seems GH won't allow it...

@KvanTTT commented Jan 23, 2022

Sorry, I'm a bit busy; I'll be able to answer next week. Also, I have to check one thing.

@KvanTTT commented Jan 26, 2022

> Does base64 have line breaks? I don't think so, which would mean a hyper-long line in the editor.

Yes, it can have them. Base64 string encoding does not differ from the current string encoding using plain or escaped chars (at least for Java).

> This forces 2 chars (32 bits) for all elements of the list if any element needs >16 bits. Why not just always do that? For the existing case, literally nothing changes. For the edge case, we use 2x the string length in the generated Java file and class file, plus the allocated serialized string during class loading.

Unfortunately, it's unclear how the deserializer would find out how many bits are used for integer encoding. There are two ways to resolve this:

  1. (Bad) Put the bit-width info at the beginning of the serialized string. It breaks backward compatibility (which actually does not matter much), but the 2x increase in the size of the output string is too excessive.
  2. (Better) Use a dynamic char count per int and detect the size from the leading bits. It preserves backward compatibility at least for not-very-big values (up to 2^14-1 = 16383) and does not significantly increase the size of the output string.

In the previous closed PR, I suggested the following encoding scheme:

encoding                                                    | count (16-bit chars) | type
00xx xxxx xxxx xxxx                                         | 1 | int (14 bits)
01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx                     | 2 | int (30 bits)
1000 0000 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx | 3 | int (32 bits)
1111 1111 1111 1111                                         | 1 | -1 (0xFFFF)
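For illustration, a hedged decoding sketch of this scheme (hypothetical names, not the PR code; it assumes the third row's 1000 0000 0000 0000 marker selects the 3-unit form):

import java.util.PrimitiveIterator;

class CharVarIntSketch {
    // Decode one int from a stream of 16-bit units, per the table above.
    static int readInt(PrimitiveIterator.OfInt units) {
        int c = units.nextInt();                   // first 16-bit unit
        if (c == 0xFFFF) return -1;                // 1111 1111 1111 1111 -> -1
        switch (c >>> 14) {                        // top two bits select the width
            case 0:                                // 00xx...: 14-bit value, 1 unit
                return c;
            case 1:                                // 01xx...: 30-bit value, 2 units
                return ((c & 0x3FFF) << 16) | units.nextInt();
            default:                               // 0x8000 marker: full 32-bit value, 3 units
                return (units.nextInt() << 16) | units.nextInt();
        }
    }
}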

> I wonder if this simple solution is the least intrusive and risky.

We can quite easily change the internal serialization format, since parsers are not compatible across versions anyway (ANTLR reports ANTLR Runtime version %s used for parser compilation does not match the current runtime version for parsers of different versions).

> Yes, I can see fast startup/small class file size being important on mobile. The existing mechanism is the smallest/fastest.

Actually, the smallest/fastest (and clearest) mechanism is loading the data from a resource file (for instance via java.util.Properties). But it complicates compilation and it's not backward compatible.

> This provides a new capability for the small group that needs a huge ATN w/o breaking anything. Most mobile code won't need this.

Maybe most, but I'm not sure about all. Current and future mobile devices may use ANTLR for applications related to natural language processing or AI (?) that require a lot of tokens and ATN states.

> Re GC of the ATN string: if we remove final, we can then set it to null when done, right? It should then be GC'd. I looked via javap, and the only difference of significance in the .class file is a tiny static {} section that sets the _serializedATN field from a constant (3 bytecode instructions). What's the easiest way to test whether it's GC'd? jmap? jcmd? -XX:HeapDumpPath?

I like the idea of removing info that is required only for deserialization. Unfortunately, there are some problems:

  1. It breaks backward compatibility because of the public String getSerializedATN() method. But that doesn't look like a big problem; we can just deprecate the method.
  2. The main problem is Java string literals. Once created in memory, they are unlikely to be collected by the GC during the app's lifetime because they are stored in the string pool.

Consider the following code:

public class Main {
    public static void main(String[] args) {
        String string = "test";
        char[] chars = new char[] { 't', 'e', 's', 't' };
    }

    public static void test() {
        String s2 = "test";
        String s3 = "test2";
    }
}

And its bytecode:

  public static void main(java.lang.String[]);
    Code:
       0: ldc           #2                  // String test
       2: astore_1
       3: iconst_4
       4: newarray       char
       6: dup
       7: iconst_0
       8: bipush        116
      10: castore
      11: dup
      12: iconst_1
      13: bipush        101
      15: castore
      16: dup
      17: iconst_2
      18: bipush        115
      20: castore
      21: dup
      22: iconst_3
      23: bipush        116
      25: castore
      26: astore_2
      27: return

  public static void test();
    Code:
       0: ldc           #2                  // String test
       2: astore_0
       3: ldc           #3                  // String test2
       5: astore_1
       6: return

As we can see, both "test" constants are loaded from constant-pool item #2, despite being in different methods. It means that string literals are stored in separate storage (they are interned). Also, see this StackOverflow question: When will a string be garbage collected in Java.
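A quick way to see that interning behavior (standard Java semantics, unrelated to the PR code):

class InternDemo {
    public static void main(String[] args) {
        String a = "test";                   // ldc from the constant pool
        String b = "test";                   // same pool entry, same object
        System.out.println(a == b);          // true: literals are interned
        String c = new String("test");       // explicit new: a fresh heap object
        System.out.println(a == c);          // false
        System.out.println(a == c.intern()); // true: intern() returns the pooled instance
    }
}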

@ericvergnaud commented
Removing final will free the reference, but not the string constant, so it's probably not good enough. We'd have to either load it from a resource, or from a class which we unload.

@KvanTTT commented Jan 26, 2022

> Removing final will free the reference, but not the string constant, so it's probably not good enough. We'd have to either load it from a resource, or from a class which we unload.

Yes, and an array of chars probably won't help either, because all the data would be stored in a method body (which is unlikely to be collected) instead of the string pool. I vote for the first option, loading from a resource, because dynamic class unloading is too excessive for such a task.

@KvanTTT commented Jan 26, 2022

Also, using char[] instead of String is probably a bit better because it does not require an extra toCharArray call (BTW, the C# target uses char[] instead of String).

> If we remove final, we can then set it to null when done, right?

I suggest getting rid of the field altogether and moving the initialization code to a static constructor (that calls submethods), because setting it to null looks like an antipattern.

@KvanTTT commented Jan 26, 2022

OK, I'll roll back the base64 encoding since it's not a very optimal solution. Moreover, it looks like I've found a way to decrease the size of the source code for the ATN data (it is not necessary to escape most of the symbols) without affecting the compiled files. But the encoding format for big ATNs is still unclear, and I'm waiting for an answer.

@parrt commented Jan 27, 2022

OK, a lot to take in here. I will need time to think. Any idea how many grammars generate more than 2^14 states? Just curious about your encoding mechanism. It might work.

I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

@KvanTTT commented Jan 27, 2022

> Any idea how many grammars generate more than 2^14 states?

I don't know exactly, but I've checked the encoding on our runtime tests and only one test, with a large lexer, was failing. On the other hand, almost all tests are quite small; real grammars are bigger.

> I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

OK. Yes, it would require build workflow changes for all runtimes and a lot of effort. Also, I have an idea of how to decrease the size of the serialized string/array of chars.

@KvanTTT commented Jan 27, 2022

> I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

BTW, Swift already keeps the ATN in a separate file, but does it very inefficiently since it uses JSON instead of a binary encoding. It probably makes sense to use binary files for new targets, or for targets that do not require compilation (JavaScript, Python); they would just read the ATN from a nearby file.

@parrt commented Jan 29, 2022

Regarding having separate files: I definitely prefer not having two files which must be kept in sync. I think some of the target developers simply used the existing serialization mechanism as an expedient, but it's suboptimal from a parser user's build point of view. I've just spent the last hour going through the serialization code for Java because it is a special case... the obvious thing for the other targets is simply to store a static integer array in the generated code. I built a little thing to track the size of the numbers in the generated ATNs... let me write something up and report it here.

@KvanTTT commented Jan 29, 2022

An integer array takes more bytes in source code compared to raw char arrays or strings, especially considering that almost all chars can be left in raw format, not escaped (I'm experimenting with that). Why is Java a special case?

@KvanTTT commented Jan 29, 2022

BTW, Go and C++ use int arrays for ATN data. Maybe that's OK for compiled languages, but it's not very optimal for JavaScript, where the size of the source is more critical.

@parrt commented Jan 29, 2022

> Why is Java a special case?

Only because int/char arrays are initialized with code, which blows out the size of the init method easily and is slow. Strings, in contrast, are in the .class file constant pool. Other languages won't suffer from this.

> not very optimal for JavaScript, where the size of the source is more critical.

Yeah, the size of the generated ATN is something to pay attention to for JS, as it's loaded in source form. Not huge though. The Java grammar ATN in number of integers:

Lexer len getSerialized = 9754
Parser len getSerialized = 16029

Still working on other numbers. Standby: I need a sandwich, haha.

@KvanTTT commented Jan 29, 2022

> Strings, in contrast, are in the .class file constant pool. Other languages won't suffer from this.

At least C# (.NET) works in a similar way. It also has the concept of a string pool and string interning. Not sure about other languages.

> Yeah, the size of the generated ATN is something to pay attention to for JS, as it's loaded in source form. Not huge though. The Java grammar ATN in number of integers:

That's quite relative. Also, there can be other, much bigger grammars. If it's possible to decrease the size without affecting performance, why not do it?

BTW, I'm changing the JSON -> binary (string array) serialization for Swift, and the size reductions are exciting. I'll publish the results a bit later.

@parrt commented Jan 29, 2022

It's always fun to improve performance or reduce size, but in a project like this I would like to leave everything alone that isn't broken, at least for now.

> At least C# (.NET) works in a similar way.

Are you saying that static short arrays (vs. strings) are initialized using a[i] = v for each element of the array like they are in Java? Surely they fixed that problem in C#.

@parrt commented Jan 29, 2022

I'm not worried about size, as we are only talking about 32K without UTF-8 compression for the base Java parser grammar. I'm totally willing to accept that size given that it has worked for over a decade.

@parrt commented Jan 29, 2022

OK, if I'm doing this correctly, it looks to me like one UTF-8 byte (0..127) covers about 47% of all values in the Java parser's serialization. A full 0..255 byte covers about 75% of all values. It looks like we are getting pretty decent compression from UTF-8. Out of the ~16,000 integers, here are the first few values we need to encode, with their counts:

value,count
1,2
2,4758
3,2081
4,152
5,567
6,20
7,395
8,25
9,171

Interestingly, the maximum value is not very large... like 15k. The big numbers at the end are the UUID encoding that Sam put in; BTW, I'm not sure we need this and we could remove it. Not sure what function it serves or what error it prevents, given that we already encode the serialization version number.

value,count
1762,1
1961,1
15335,1
16764,1
22884,1
24715,1
30598,1
33075,1
42794,1
47597,1

So, in the end, I don't think we have a problem with the existing mechanism, except for the original issue we are trying to solve: what happens when we get a really big grammar where the number of ATN states exceeds 65535?

We have code that handles this by manually encoding 32-bit numbers as two 16-bit chars. Take a look at IntegerList.toCharArray(). Oh, OK, I just noticed that code generation doesn't use that. The SerializedATN output model object directly encodes a string via:

serialized = new ArrayList<String>(data.size());
for (int c : data.toArray()) {
	String encoded = factory.getGenerator().getTarget().encodeIntAsCharEscape(c == -1 ? Character.MAX_VALUE : c);
	serialized.add(encoded);
}

This code generation bit would have to be updated to switch between "ints as 16-bit chars" and "ints as 2 x 16-bit chars" depending on the maximum value found in the serialized data. Further, we have to be very careful about how we encode Token.EOF and -1 ints; currently we use Character.MAX_VALUE for both, which might not be correct. I also see a place during serialization of lexical actions where we treat -1 as 0xFFFF in the serialized ATN (see line 263, case Transition.ACTION). Sam also put a note in the code where he shifts the entire serialized integer list up by 2 to improve UTF-8 encoding size... I'm going to look at the generated code size next when I get rid of that shift. Open to suggestions for a way to see the size of the constant pool easily, preferably inside IntelliJ :)

@KvanTTT commented Jan 29, 2022

> Are you saying that static short arrays (vs. strings) are initialized using a[i] = v for each element of the array like they are in Java? Surely they fixed that problem in C#.

I meant string literals and objects; they work in a similar way in Java and in C#.

> OK, if I'm doing this correctly, it looks to me like one UTF-8 byte (0..127) covers about 47% of all values in the Java parser's serialization. A full 0..255 byte covers about 75% of all values. It looks like we are getting pretty decent compression from UTF-8.

Please take a look at my suggestion in #3494. I suggested using 1 byte as the minimal piece of information instead of the current 2 bytes. Values within 0..127 can be encoded as 1 byte. Also, it can encode any 32-bit integer and does not require a 32-bit int to be within the 0..65535 range (which looks quite inconsistent).
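For illustration, a LEB128-style sketch of such a byte-oriented encoding (similar in spirit to the ECMA-335 compressed ints mentioned further below; hypothetical code, the actual #3494 format may differ):

import java.io.ByteArrayOutputStream;

class ByteVarIntSketch {
    // Write an int in 1..5 bytes: 7 payload bits per byte, high bit = "more follows".
    // Values 0..127 therefore take exactly one byte.
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits with continuation flag
            value >>>= 7;
        }
        out.write(value);                     // final byte, high bit clear
    }

    static int readVarInt(byte[] data, int[] pos) {
        int result = 0, shift = 0, b;
        do {
            b = data[pos[0]++] & 0xFF;
            result |= (b & 0x7F) << shift;    // accumulate 7 bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);            // continue while the flag is set
        return result;
    }
}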

> I also see a place during serialization of lexical actions where we treat -1 as 0xFFFF in the serialized ATN.

I've checked: we need 0xFFFF only for -1. Other negative numbers are not used.

> This code generation bit would have to be updated to switch between "ints as 16-bit chars" and "ints as 2 x 16-bit chars" depending on the maximum value found in the serialized data.

You're suggesting putting a "switch flag" into the serialized data, aren't you? The deserializer would have to know about that. Also, it would significantly increase the size of the output data.

> Sam also put a note in the code where he shifts the entire serialized integer list up by 2 to improve UTF-8 encoding size...

It's quite a weird solution; I don't completely understand how it helps optimize the size. Most values are \0, and they become \2 after the increment. But it definitely looks useless with my improvements in the other PR.

@parrt commented Jan 29, 2022

More data. I'm looking at the serialized ATN strings for the lexer and parser for the Java grammar:

_serializedATN_lexer
9754 chars used to store the String
19508 bytes used to store the String
12118 bytes used to store the String in UTF8
_serializedATN_parser
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8

When we look at the unshifted versions, I don't see any difference in the parser, and it's actually smaller in the lexer! I could be making a mistake in my computations here, but the generated strings definitely look to be different by 2:

_serializedATN_lexer
9754 chars used to store the String
19508 bytes used to store the String
12118 bytes used to store the String in UTF8
_serializedATN_lexer_not_shifted
9754 chars used to store the String
19508 bytes used to store the String
12083 bytes used to store the String in UTF8
_serializedATN_parser
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8
_serializedATN_parser_not_shifted
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8

@parrt commented Jan 29, 2022

> Please take a look at my suggestion in #3494. I suggested using 1 byte as the minimal piece of information instead of the current 2 bytes.

This is very similar to what UTF-8 does, which is the format used in the class file. I guess once it's loaded into Java it will be two bytes per character, but that would be the same even with your encoding once it got back into memory. As you can see from the numbers I just posted, we're getting very good compression from simple UTF-8.

@parrt commented Jan 29, 2022

The code I'm using to examine the UTF-8 size looks like this:

System.out.println("_serializedATN_parser_not_shifted");
// test size
try {
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    OutputStreamWriter out = new OutputStreamWriter(bytesOut, "UTF8");
    out.write(_serializedATN_parser_not_shifted);
    out.flush();
    byte[] tstBytes = bytesOut.toByteArray();
    int size = tstBytes.length;
    System.out.println(_serializedATN_parser_not_shifted.length() + " chars used to store the String");
    System.out.println(_serializedATN_parser_not_shifted.length()*2 + " bytes used to store the String");
    System.out.println(size + " bytes used to store the String in UTF8");
    out.close();
}
catch (IOException ioe) {
    System.err.println(ioe);
}

@parrt commented Jan 29, 2022

Would you be willing to make a PR that gets rid of this shifting by 2, @KvanTTT? It's easy, but there are about 10 places to change. There does not seem to be a big problem, as all of the Java tests seem to pass. @ericvergnaud, do you have a problem with us getting rid of this weird premature optimization?

@parrt commented Jan 29, 2022

So, just to finish this off, I think we have a very acceptable solution for all but the biggest grammars. I believe that @ericvergnaud also agreed that we are mostly okay. For the special case of really big grammars, I'm willing to simply support it at this point, and later we can possibly optimize. To support it, all we have to do is generate two 16-bit chars for each integer in the serialization. It potentially (more than) doubles the number of bytes, but I'm okay with that as it's still fairly small. Later, we can figure out an encoding that is not messed up by the UTF-8 encoding of the class files. Other targets will have to be examined to figure out whether they use short char arrays and, if so, switch them to int arrays for this edge case.

In other words, we begin the process by serializing an ATN into a list of integers. Then we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is; otherwise we convert ALL int values to two \uXXXX chars rather than a single char. Does this make sense? Do you want to do the PR or should I?

I just created a tiny PR that is a small bit of cleanup; it would be useful if you could take a quick look at that.
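A sketch of that plan (hypothetical code; it elides the Token.EOF/-1 special casing discussed above and assumes non-negative serialized values):

class WidthSwitchSketch {
    static char[] encode(int[] data) {
        int max = 0;
        for (int v : data) max = Math.max(max, v);
        if (max <= 0xFFFF) {                          // everything fits: one char per int
            char[] out = new char[data.length];
            for (int i = 0; i < data.length; i++) out[i] = (char) data[i];
            return out;
        }
        char[] out = new char[data.length * 2];       // otherwise: two chars per int
        for (int i = 0; i < data.length; i++) {
            out[2 * i]     = (char) (data[i] >>> 16); // high 16 bits
            out[2 * i + 1] = (char) data[i];          // low 16 bits
        }
        return out;
    }
}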

@ericvergnaud commented Jan 30, 2022 via email

@KvanTTT commented Jan 30, 2022

> Then we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is; otherwise we convert ALL int values to two \uXXXX chars rather than a single char. Does this make sense?

Yet another point against such a solution: we have to walk the ATN twice to detect the maximum integer value (because int is used for everything). It duplicates the code, may decrease performance, and looks more complicated. Also, the integer bit-width info (16 or 32) would have to be added to the beginning of the data for the deserializer.

Why can't we just use dynamic integers (at least 2 and 4 bytes), since backward compatibility will be broken anyway after #3516? MessagePack and other binary serializers use dynamic integers; it's a working solution. We can encapsulate the integer-writing code in a separate write method and use it everywhere in the serializer (and the same for the deserializer).

@KvanTTT commented Jan 30, 2022

> Maybe it was premature, but that doesn't necessarily make it wrong... If I understand the history, it was targeted at reducing the serialized size, so we'd have to measure the impact of removing it whilst at the same time trying to reduce the size through other means…

It looks like it's only relevant for the big 0xFFFE and 0xFFFF values, which take several chars in escaped form (\uFFFF). But 0xFFFE is rare, and 0xFFFF is probably more frequent but can be replaced by just a raw literal (I've done that in #3513).

@KvanTTT commented Feb 5, 2022

> OK, a lot to take in here. I will need time to think. Any idea how many grammars generate more than 2^14 states? Just curious about your encoding mechanism. It might work.

I was a bit wrong: actually, it's up to 2^15 states (32768) without breaking compatibility. That's two times less than the current max limit (2^16).

@parrt commented Feb 5, 2022

Looking at the ATN for your MySql grammar, it seems nowhere near the 16-bit limit. Here's the tail end of the histogram of state numbers and counts for the lexer and parser:

$ tail MySql*.csv
==> MySqlLexer-histo.csv <==
13061,2
13062,2
13063,2
13064,2
13065,2
13066,2
13067,2
13068,1
13163,1
65535,3

==> MySqlParser-histo.csv <==
6778,2
6779,2
6780,2
6781,2
6782,2
6783,2
6784,2
6786,1
7974,1
65535,3

It seems extremely rare that we'd have a bigger grammar than MySql, and these are only 20% of the way to 65,535, right?

@KvanTTT commented Feb 6, 2022

> It seems extremely rare that we'd have a bigger grammar than MySql, and these are only 20% of the way to 65,535, right?

I think it depends on the application. For programming languages, 65536 should be enough (but I'm not sure even about that). But I suspect ANTLR can be used for natural language processing, where thousands of tokens are okay, or for other applications I can't even imagine. Also, there are several issues related to this limit; some users require the full range. I like the idea of attracting more users to the great ANTLR tool.

@parrt commented Feb 6, 2022

What concrete use cases have people submitted? I think we should carefully evaluate whether this is really needed.

@KvanTTT commented Feb 6, 2022

Take a look at @sharwell's comment in the latest issue:

> The serialization logic could be rewritten to use compressed integers like the ones used in ECMA-335 (bytecode for .NET), but it wouldn't be a small undertaking. It's arguably a good idea in the long run though.

@ftomassetti:

> I had the same issue with the grammar of a language I was writing. One thing you can do to avoid this is to have one token type for several operators with the same precedence (e.g., relational operators) instead of separate tokens. It seems to help.
>
> It would be nice if this was fixed eventually...

@mullekay:

> Sorry to raise this question again! I am trying to build a large dictionary of about 200,000 words (which use simple regular expressions to cover edge cases). Unfortunately, I am running into the mentioned issue as well. From what I understand, this is because the number of internal ATN states is limited to MAX_VALUE = '\uFFFF'. Is there a recommended way of using the ANTLR lexer as a tagger for large dictionaries? Furthermore, I should mention that the code generates the grammar on the fly.

And others.

@KvanTTT changed the base branch from master to dev on Feb 16, 2022
@KvanTTT commented Feb 20, 2022

I'm closing this in favor of #3546.
