
ATN serialization improvements (Java only for demo) #3505

Closed
wants to merge 6 commits

Conversation

@KvanTTT commented Jan 21, 2022

public abstract class ATNDataReader {
protected abstract byte readByte();

public int read() {
Member:

Is there a standard base64 decoder somewhere?

Member:

Oh, I see the base64 below now. Hmm... adding a general comment outside.

@KvanTTT commented Jan 22, 2022

It's a binary reader that uses abstract readByte calls. It mostly reads integers in a compact format (described below). readByte is implemented in ATNDataReaderBase64 (which reads from a base64 string) and in ATNDataReaderByteBuffer.
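For illustration, a minimal sketch of that pattern (assumed shapes, not the PR's actual classes; note that using the JDK decoder here allocates the full byte[] up front, which is exactly the cost the PR's in-place base64 reader avoids):

import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical sketch: the abstract reader yields bytes; subclasses decide
// where the bytes come from.
abstract class ATNDataReaderSketch {
    protected abstract byte readByte();
}

class Base64ReaderSketch extends ATNDataReaderSketch {
    private final byte[] decoded;
    private int pos;
    Base64ReaderSketch(String base64) {
        decoded = Base64.getDecoder().decode(base64); // eager decode for simplicity
    }
    @Override protected byte readByte() { return decoded[pos++]; }
}

class ByteBufferReaderSketch extends ATNDataReaderSketch {
    private final ByteBuffer buffer;
    ByteBufferReaderSketch(ByteBuffer buffer) { this.buffer = buffer; }
    @Override protected byte readByte() { return buffer.get(); }
}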

@parrt commented Jan 22, 2022

Sounds amazing. But I'm trying to remember how this works. Let's see: we go from a list of ints (32 bits) to strings (lists of 16-bit chars) because Java only statically initializes strings (whereas arrays are initialized one element at a time in code, which explodes code size). OK, can you explain how the base64 helps here?

Is it that we are now going to 32-bit ATN data, represented as a few base64 chars?

import java.nio.ByteBuffer;
import java.util.UUID;

public class ATNDataWriter {
Member:

If we go the base64 route, shouldn't we use existing encoders/decoders to avoid bugs? https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html
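For reference, the JDK API in question (a tiny round-trip demo; the string value here is arbitrary):

import java.util.Base64;

class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] data = Base64.getDecoder().decode("QU5UTFI=");    // String -> byte[] ("ANTLR")
        String text = Base64.getEncoder().encodeToString(data);  // byte[] -> String
        System.out.println(new String(data) + " / " + text);     // ANTLR / QU5UTFI=
    }
}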

@KvanTTT commented Jan 22, 2022

Sure we can. But I tried to get rid of excess big-array allocations in the deserializer (there are 2 additional allocations if we use the standard methods). It may affect application startup time.

Member:

Well, if it's a trade between an alloc that gets collected vs. homegrown base64, it seems safer to take the extra GC hit.

@KvanTTT commented Jan 22, 2022

> OK, can you explain how the base64 helps here?

Because it is smaller than the plain char[] representation, at least in source code. Also, it's a more standard format that does not require weird increments and decrements as a storage optimization (no need to think about how the binary data is encoded into a string).

Consider three values: 128 128 128. In the plain char representation they are encoded as \u0080\u0080\u0080, which takes 18 chars in the text file. With base64 they are encoded as gICAgICA, which takes only 8 chars (2 bytes per value * 3 values * 4/3 encoding ratio). That's more than 2 times smaller.

I've checked the LargeLexer test: with base64 encoding, the generated parser takes 856 KB instead of 1331 KB with plain char encoding (but 846 KB instead of 716 KB in compiled form).

Anyway, the output format can be either plain chars or base64 chars, since it does not depend on the binary representation (it's converted in SerializedATN). Different targets may use different output formats.

> Is it that we are now going to 32-bit ATN data, represented as a few base64 chars?

Now the binary data and the string representation are split (a ByteBuffer is used for the binary data, which is transformed into a base64 char[] later). Both output encodings can represent 32-bit ATN data.
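A minimal sketch of that split, assuming the JDK encoder (hypothetical helper, not the PR's ATNDataWriter/SerializedATN plumbing):

import java.nio.ByteBuffer;
import java.util.Base64;

class AtnToBase64Sketch {
    // The binary serialization lives in the ByteBuffer; the text form is
    // produced only at the end, so a target can swap in a different encoding.
    static char[] toBase64Chars(ByteBuffer data) {
        byte[] bytes = new byte[data.remaining()];
        data.get(bytes); // copy the serialized ATN bytes out of the buffer
        return Base64.getEncoder().encodeToString(bytes).toCharArray();
    }
}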

@parrt commented Jan 22, 2022

Verrrrry interesting. Thanks for the explanation. I'll have to think hard on this, but we should keep the big picture in mind. What we have works, and we're suggesting replacing it with a more general solution, with unknown issues/bugs, to solve the uncommon case of > 64k states. Questions:

  1. Do we care about Java file size? Hmm... probably yes, but not that much. It could affect the ability to edit/view Java files in an editor.
  2. Does base64 have line breaks? I don't think so, which would mean a hyper-long line in the editor.
  3. Do we care about .class file size? Probably yes, but not that much; it affects the initial load time of the jar and some string init time if strings get really big.
  4. Do we care about the ATN list-of-states size? Yes, but they are all ints now anyway, so we save nothing at parse time.

Currently we do handle 32-bit Unicode chars, but with two 16-bit \uXXXX values. From IntegerList:

/** Convert the list to a UTF-16 encoded char array. If all values are less
*  than the 0xFFFF 16-bit code point limit then this is just a char array
*  of 16-bit char as usual. For values in the supplementary range, encode
* them as two UTF-16 code units.
*/
public final char[] toCharArray() {...}

This forces 2 chars (32 bits) for all elements of the list if any element needs >16 bits. Why not just always do that? For the existing case, literally nothing changes. For the edge case, we use 2x the string length in the generated Java file and class file, plus the allocated serialized string during class loading.

I wonder if this simple solution is the least intrusive and risky.
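For reference, the JDK itself can produce the two-code-unit form the javadoc above describes; a small demo of that behavior (standard Java, not the IntegerList code):

class Utf16SplitDemo {
    public static void main(String[] args) {
        // U+1F600 is above 0xFFFF, so it needs two UTF-16 code units (a surrogate pair).
        char[] units = Character.toChars(0x1F600);
        System.out.printf("%d units: %04X %04X%n",
                units.length, (int) units[0], (int) units[1]);
        // output: 2 units: D83D DE00 (high and low surrogate)
    }
}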

@ericvergnaud commented

My 2 cents:

  • I care about initialization time. With the growth of serverless, this will become increasingly important. I believe the existing implementation is optimal since it doesn't parse the string; rather, it casts its content (an array of chars) to an array of ints (toInt() is a no-op once optimized by the JIT). Switching to base64 is likely to be slower. The only transcoding that happens is from utf8[] in the class file to char[] in RAM.
  • I care about memory usage. With the growth of mobile, keeping memory low is becoming necessary again, because on those devices adding RAM is not an option. As mentioned by @KvanTTT, the class file is smaller with the current version than it would be using base64, so this proposal doesn't help. Also, I suspect that in Java _serializedATN is never GC'd since it's final, so maybe we could wrap it inside a dynamically loaded child class that provides it (and also does the -2 shift)? (See https://stackoverflow.com/questions/2433261/when-and-how-are-classes-garbage-collected-in-java.)

I would want to see metrics to help reach a decision: load time, and memory footprint after a forced GC.

@parrt commented Jan 23, 2022

Great points, @ericvergnaud. Yes, I can see fast startup/small class file size being important on mobile. The existing mechanism is the smallest/fastest. If we adjust it so the max ATN state can be > 16 bits, then the existing func will simply encode all ints as two 16-bit chars. This provides a new capability for the small group that needs a huge ATN w/o breaking anything. Most mobile code won't need this.

Re GC of the ATN string: if we remove final, we can then set it to null when done, right? It should then be GC'd. I looked via javap, and the only difference of significance in the .class file is a tiny static {} section that sets the _serializedATN field from a constant (3 bytecode instructions). What's the easiest way to test whether it's GC'd? jmap? jcmd? -XX:HeapDumpPath?

Actually, I don't think we care about _serializedATN. Shouldn't the String object just be a wrapper around the constant in the .class constant pool? I.e., just a few bytes of overhead.

I therefore think we should proceed with the smallest tweak that allows 32-bit ATN state numbers while preserving the existing static string mechanism. Also, let's try getting _serializedATN to GC by removing final. Sound good?

@ericvergnaud commented Jan 23, 2022

Sounds great indeed!
No need to test GC with your solution; we know it will work. I would also make the field private, which will signal to suicidal users that their attempt was successful ;-)
I wanted to paste a Braveheart gif here but it seems GH won't allow it...

@KvanTTT commented Jan 23, 2022

Sorry, I'm a bit busy; I'll be able to answer next week. Also, I have to check one thing.

@KvanTTT commented Jan 26, 2022

> Does base64 have line breaks? I don't think so, which would mean a hyper-long line in the editor.

Yes, it can have them. Base64 string encoding does not differ from the current string encoding using plain or escaped chars (at least for Java).

> This forces 2 chars (32 bits) for all elements of the list if any element needs >16 bits. Why not just always do that? For the existing case, literally nothing changes. For the edge case, we use 2x the string length in the generated Java file and class file, plus the allocated serialized string during class loading.

Unfortunately, it's unclear how the deserializer would find out how many bits are used for integer encoding. There are two ways to resolve this:

  1. (Bad) Put the bit-width info at the beginning of the serialized string. It breaks backward compatibility (which actually does not matter much), but the 2x increase in the size of the output string is too excessive.
  2. (Better) Use a dynamic char count per int and detect the size from the leading bits. It preserves backward compatibility at least for not-very-big values (up to 2^14-1 = 16383) and does not significantly increase the size of the output string.

In the previous closed PR, I suggested the following encoding scheme:

encoding                                                    | count (16-bit chars) | type
00xx xxxx xxxx xxxx                                         | 1 | int (14 bits)
01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx                     | 2 | int (30 bits)
1000 0000 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx | 3 | int (32 bits)
1111 1111 1111 1111                                         | 1 | -1 (0xFFFF)
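For illustration, a hedged decoding sketch of this scheme (hypothetical names, not the PR code; it assumes the third row's 1000 0000 0000 0000 marker selects the 3-unit form):

import java.util.PrimitiveIterator;

class CharVarIntSketch {
    // Decode one int from a stream of 16-bit units, per the table above.
    static int readInt(PrimitiveIterator.OfInt units) {
        int c = units.nextInt();                   // first 16-bit unit
        if (c == 0xFFFF) return -1;                // 1111 1111 1111 1111 -> -1
        switch (c >>> 14) {                        // top two bits select the width
            case 0:                                // 00xx...: 14-bit value, 1 unit
                return c;
            case 1:                                // 01xx...: 30-bit value, 2 units
                return ((c & 0x3FFF) << 16) | units.nextInt();
            default:                               // 0x8000 marker: full 32-bit value, 3 units
                return (units.nextInt() << 16) | units.nextInt();
        }
    }
}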

> I wonder if this simple solution is the least intrusive and risky.

We can quite easily change the internal serialization format, since parsers are not compatible across versions anyway (ANTLR reports ANTLR Runtime version %s used for parser compilation does not match the current runtime version for parsers of different versions).

> Yes, I can see fast startup/small class file size being important on mobile. The existing mechanism is the smallest/fastest.

Actually, the smallest/fastest (and clearest) mechanism is loading the data from a resource file (for instance via java.util.Properties). But it complicates compilation and it's not backward compatible.

> This provides a new capability for the small group that needs a huge ATN w/o breaking anything. Most mobile code won't need this.

Maybe most, but I'm not sure about all. Current and future mobile devices may use ANTLR for applications related to natural language processing or AI (?) that require a lot of tokens and ATN states.

> Re GC of the ATN string: if we remove final, we can then set it to null when done, right? It should then be GC'd. I looked via javap, and the only difference of significance in the .class file is a tiny static {} section that sets the _serializedATN field from a constant (3 bytecode instructions). What's the easiest way to test whether it's GC'd? jmap? jcmd? -XX:HeapDumpPath?

I like the idea of removing info that is required only for deserialization. Unfortunately, there are some problems:

  1. It breaks backward compatibility because of the public String getSerializedATN() method. But that doesn't look like a big problem; we can just deprecate the method.
  2. The main problem is Java string literals. Once created in memory, they are unlikely to be collected by the GC during the app's lifetime because they are stored in the string pool.

Consider the following code:

public class Main {
    public static void main(String[] args) {
        String string = "test";
        char[] chars = new char[] { 't', 'e', 's', 't' };
    }

    public static void test() {
        String s2 = "test";
        String s3 = "test2";
    }
}

And its bytecode:

  public static void main(java.lang.String[]);
    Code:
       0: ldc           #2                  // String test
       2: astore_1
       3: iconst_4
       4: newarray       char
       6: dup
       7: iconst_0
       8: bipush        116
      10: castore
      11: dup
      12: iconst_1
      13: bipush        101
      15: castore
      16: dup
      17: iconst_2
      18: bipush        115
      20: castore
      21: dup
      22: iconst_3
      23: bipush        116
      25: castore
      26: astore_2
      27: return

  public static void test();
    Code:
       0: ldc           #2                  // String test
       2: astore_0
       3: ldc           #3                  // String test2
       5: astore_1
       6: return

As we can see, both "test" constants are loaded from constant-pool item #2, despite being in different methods. It means that string literals are stored in separate storage (they are interned). Also, see this StackOverflow question: When will a string be garbage collected in Java.
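A quick way to see that interning behavior (standard Java semantics, unrelated to the PR code):

class InternDemo {
    public static void main(String[] args) {
        String a = "test";                   // ldc from the constant pool
        String b = "test";                   // same pool entry, same object
        System.out.println(a == b);          // true: literals are interned
        String c = new String("test");       // explicit new: a fresh heap object
        System.out.println(a == c);          // false
        System.out.println(a == c.intern()); // true: intern() returns the pooled instance
    }
}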

@ericvergnaud commented
Removing final will free the reference, but not the string constant, so it's probably not good enough. We'd have to either load it from a resource, or from a class which we unload.

@KvanTTT commented Jan 26, 2022

> Removing final will free the reference, but not the string constant, so it's probably not good enough. We'd have to either load it from a resource, or from a class which we unload.

Yes, and an array of chars probably won't help either, because all the data would be stored in a method body (which is unlikely to be collected) instead of the string pool. I vote for the first option, loading from a resource, because dynamic class unloading is too excessive for such a task.

@KvanTTT commented Jan 26, 2022

Also, using char[] instead of String is probably a bit better because it does not require an extra toCharArray call (BTW, the C# target uses char[] instead of String).

> If we remove final, we can then set it to null when done, right?

I suggest getting rid of the field altogether and moving the initialization code to a static constructor (that calls submethods), because setting it to null looks like an antipattern.

@KvanTTT commented Jan 26, 2022

OK, I'll roll back the base64 encoding since it's not a very optimal solution. Moreover, it looks like I've found a way to decrease the size of the source code for the ATN data (it is not necessary to escape most of the symbols) without affecting the compiled files. But the encoding format for big ATNs is still unclear, and I'm waiting for an answer.

@parrt commented Jan 27, 2022

OK, a lot to take in here. I will need time to think. Any idea how many grammars generate more than 2^14 states? Just curious about your encoding mechanism. It might work.

I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

@KvanTTT commented Jan 27, 2022

> Any idea how many grammars generate more than 2^14 states?

I don't know exactly, but I've checked the encoding on our runtime tests and only one test, with a large lexer, was failing. On the other hand, almost all tests are quite small; real grammars are bigger.

> I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

OK. Yes, it would require build workflow changes for all runtimes and a lot of effort. Also, I have an idea of how to decrease the size of the serialized string/array of chars.

@KvanTTT commented Jan 27, 2022

> I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.

BTW, Swift already keeps the ATN in a separate file, but does it very inefficiently since it uses JSON instead of a binary encoding. It probably makes sense to use binary files for new targets, or for targets that do not require compilation (JavaScript, Python); they would just read the ATN from a nearby file.

@parrt commented Jan 29, 2022

Regarding having separate files: I definitely prefer not having two files which must be kept in sync. I think some of the target developers simply used the existing serialization mechanism as an expedient, but it's suboptimal from a parser user's build point of view. I've just spent the last hour going through the serialization code for Java because it is a special case... the obvious thing for the other targets is simply to store a static integer array in the generated code. I built a little thing to track the size of the numbers in the generated ATNs... let me write something up and report it here.

@KvanTTT commented Jan 29, 2022

An integer array takes more bytes in source code compared to raw char arrays or strings, especially considering that almost all chars can be left in raw format, not escaped (I'm experimenting with that). Why is Java a special case?

@KvanTTT commented Jan 29, 2022

BTW, Go and C++ use int arrays for ATN data. Maybe that's OK for compiled languages, but it's not very optimal for JavaScript, where the size of the source is more critical.

@parrt commented Jan 29, 2022

> Why is Java a special case?

Only because int/char arrays are initialized with code, which blows out the size of the init method easily and is slow. Strings, in contrast, are in the .class file constant pool. Other languages won't suffer from this.

> not very optimal for JavaScript, where the size of the source is more critical.

Yeah, the size of the generated ATN is something to pay attention to for JS, as it's loaded in source form. Not huge though. The Java grammar ATN in number of integers:

Lexer len getSerialized = 9754
Parser len getSerialized = 16029

Still working on other numbers. Standby: I need a sandwich, haha.

@KvanTTT commented Jan 29, 2022

> Strings, in contrast, are in the .class file constant pool. Other languages won't suffer from this.

At least C# (.NET) works in a similar way. It also has the concept of a string pool and string interning. Not sure about other languages.

> Yeah, the size of the generated ATN is something to pay attention to for JS, as it's loaded in source form. Not huge though. The Java grammar ATN in number of integers:

That's quite relative. Also, there can be other, much bigger grammars. If it's possible to decrease the size without affecting performance, why not do it?

BTW, I'm changing the JSON -> binary (string array) serialization for Swift, and the size reductions are exciting. I'll publish the results a bit later.

@parrt commented Jan 29, 2022

It's always fun to improve performance or reduce size, but in a project like this I would like to leave everything alone that isn't broken, at least for now.

> At least C# (.NET) works in a similar way.

Are you saying that static short arrays (vs. strings) are initialized using a[i] = v for each element of the array like they are in Java? Surely they fixed that problem in C#.

@parrt commented Jan 29, 2022

I'm not worried about size, as we are only talking about 32K without UTF-8 compression for the base Java parser grammar. I'm totally willing to accept that size given that it has worked for over a decade.

@parrt commented Jan 29, 2022

OK, if I'm doing this correctly, it looks to me like one UTF-8 byte (0..127) covers about 47% of all values in the Java parser's serialization. A full 0..255 byte covers about 75% of all values. It looks like we are getting pretty decent compression from UTF-8. Out of the ~16,000 integers, here are the first few values we need to encode, with their counts:

value,count
1,2
2,4758
3,2081
4,152
5,567
6,20
7,395
8,25
9,171

Interestingly, the maximum value is not very large... like 15k. The big numbers at the end are the UUID encoding that Sam put in; BTW, I'm not sure we need this and we could remove it. Not sure what function it serves or what error it prevents, given that we already encode the serialization version number.

value,count
1762,1
1961,1
15335,1
16764,1
22884,1
24715,1
30598,1
33075,1
42794,1
47597,1

So, in the end, I don't think we have a problem with the existing mechanism, except for the original issue we are trying to solve: what happens when we get a really big grammar where the number of ATN states exceeds 65535?

We have code that handles this by manually encoding 32-bit numbers as two 16-bit chars. Take a look at IntegerList.toCharArray(). Oh, OK, I just noticed that code generation doesn't use that. The SerializedATN output model object directly encodes a string via:

serialized = new ArrayList<String>(data.size());
for (int c : data.toArray()) {
	String encoded = factory.getGenerator().getTarget().encodeIntAsCharEscape(c == -1 ? Character.MAX_VALUE : c);
	serialized.add(encoded);
}

This code generation bit would have to be updated to switch between "ints as 16-bit chars" and "ints as 2 x 16-bit chars" depending on the maximum value found in the serialized data. Further, we have to be very careful about how we encode Token.EOF and -1 ints; currently we use Character.MAX_VALUE for both, which might not be correct. I also see a place during serialization of lexical actions where we treat -1 as 0xFFFF in the serialized ATN (see line 263, case Transition.ACTION). Sam also put a note in the code where he shifts the entire serialized integer list up by 2 to improve UTF-8 encoding size... I'm going to look at the generated code size next when I get rid of that shift. Open to suggestions for a way to see the size of the constant pool easily, preferably inside IntelliJ :)

@KvanTTT commented Jan 29, 2022

> Are you saying that static short arrays (vs. strings) are initialized using a[i] = v for each element of the array like they are in Java? Surely they fixed that problem in C#.

I meant string literals and objects; they work in a similar way in Java and in C#.

> OK, if I'm doing this correctly, it looks to me like one UTF-8 byte (0..127) covers about 47% of all values in the Java parser's serialization. A full 0..255 byte covers about 75% of all values. It looks like we are getting pretty decent compression from UTF-8.

Please take a look at my suggestion in #3494. I suggested using 1 byte as the minimal piece of information instead of the current 2 bytes. Values within 0..127 can be encoded as 1 byte. Also, it can encode any 32-bit integer and does not require a 32-bit int to be within the 0..65535 range (which looks quite inconsistent).
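For illustration, a LEB128-style sketch of such a byte-oriented encoding (similar in spirit to the ECMA-335 compressed ints mentioned further below; hypothetical code, the actual #3494 format may differ):

import java.io.ByteArrayOutputStream;

class ByteVarIntSketch {
    // Write an int in 1..5 bytes: 7 payload bits per byte, high bit = "more follows".
    // Values 0..127 therefore take exactly one byte.
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits with continuation flag
            value >>>= 7;
        }
        out.write(value);                     // final byte, high bit clear
    }

    static int readVarInt(byte[] data, int[] pos) {
        int result = 0, shift = 0, b;
        do {
            b = data[pos[0]++] & 0xFF;
            result |= (b & 0x7F) << shift;    // accumulate 7 bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);            // continue while the flag is set
        return result;
    }
}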

> I also see a place during serialization of lexical actions where we treat -1 as 0xFFFF in the serialized ATN.

I've checked: we need 0xFFFF only for -1. Other negative numbers are not used.

> This code generation bit would have to be updated to switch between "ints as 16-bit chars" and "ints as 2 x 16-bit chars" depending on the maximum value found in the serialized data.

You're suggesting putting a "switch flag" into the serialized data, aren't you? The deserializer would have to know about that. Also, it would significantly increase the size of the output data.

> Sam also put a note in the code where he shifts the entire serialized integer list up by 2 to improve UTF-8 encoding size...

It's quite a weird solution; I don't completely understand how it helps optimize the size. Most values are \0, and they become \2 after the increment. But it definitely looks useless with my improvements in the other PR.

@parrt commented Jan 29, 2022

More data. I'm looking at the serialized ATN strings for the lexer and parser for the Java grammar:

_serializedATN_lexer
9754 chars used to store the String
19508 bytes used to store the String
12118 bytes used to store the String in UTF8
_serializedATN_parser
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8

When we look at the unshifted versions, I don't see any difference in the parser, and it's actually smaller in the lexer! I could be making a mistake in my computations here, but the generated strings definitely look to be different by 2:

_serializedATN_lexer
9754 chars used to store the String
19508 bytes used to store the String
12118 bytes used to store the String in UTF8
_serializedATN_lexer_not_shifted
9754 chars used to store the String
19508 bytes used to store the String
12083 bytes used to store the String in UTF8
_serializedATN_parser
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8
_serializedATN_parser_not_shifted
16029 chars used to store the String
32058 bytes used to store the String
20666 bytes used to store the String in UTF8

@parrt commented Jan 29, 2022

> Please take a look at my suggestion in #3494. I suggested using 1 byte as the minimal piece of information instead of the current 2 bytes.

This is very similar to what UTF-8 does, which is the format used in the class file. I guess once it's loaded into Java it will be two bytes per character, but that would be the same even with your encoding once it got back into memory. As you can see from the numbers I just posted, we're getting very good compression from simple UTF-8.

@parrt commented Jan 29, 2022

The code I'm using to examine the UTF-8 size looks like this:

System.out.println("_serializedATN_parser_not_shifted");
// test size
try {
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    OutputStreamWriter out = new OutputStreamWriter(bytesOut, "UTF8");
    out.write(_serializedATN_parser_not_shifted);
    out.flush();
    byte[] tstBytes = bytesOut.toByteArray();
    int size = tstBytes.length;
    System.out.println(_serializedATN_parser_not_shifted.length() + " chars used to store the String");
    System.out.println(_serializedATN_parser_not_shifted.length()*2 + " bytes used to store the String");
    System.out.println(size + " bytes used to store the String in UTF8");
    out.close();
}
catch (IOException ioe) {
    System.err.println(ioe);
}

@parrt commented Jan 29, 2022

Would you be willing to make a PR that gets rid of this shifting by 2, @KvanTTT? It's easy, but there are about 10 places to change. There does not seem to be a big problem, as all of the Java tests seem to pass. @ericvergnaud, do you have a problem with us getting rid of this weird premature optimization?

@parrt commented Jan 29, 2022

So, just to finish this off, I think we have a very acceptable solution for all but the biggest grammars. I believe that @ericvergnaud also agreed that we are mostly okay. For the special case of really big grammars, I'm willing to simply support it at this point, and later we can possibly optimize. To support it, all we have to do is generate two 16-bit chars for each integer in the serialization. It potentially (more than) doubles the number of bytes, but I'm okay with that as it's still fairly small. Later, we can figure out an encoding that is not messed up by the UTF-8 encoding of the class files. Other targets will have to be examined to figure out whether they use short char arrays and, if so, switch them to int arrays for this edge case.

In other words, we begin the process by serializing an ATN into a list of integers. Then we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is; otherwise we convert ALL int values to two \uXXXX chars rather than a single char. Does this make sense? Do you want to do the PR or should I?

I just created a tiny PR that is a small bit of cleanup; it would be useful if you could take a quick look at that.
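A sketch of that plan (hypothetical code; it elides the Token.EOF/-1 special casing discussed above and assumes non-negative serialized values):

class WidthSwitchSketch {
    static char[] encode(int[] data) {
        int max = 0;
        for (int v : data) max = Math.max(max, v);
        if (max <= 0xFFFF) {                          // everything fits: one char per int
            char[] out = new char[data.length];
            for (int i = 0; i < data.length; i++) out[i] = (char) data[i];
            return out;
        }
        char[] out = new char[data.length * 2];       // otherwise: two chars per int
        for (int i = 0; i < data.length; i++) {
            out[2 * i]     = (char) (data[i] >>> 16); // high 16 bits
            out[2 * i + 1] = (char) data[i];          // low 16 bits
        }
        return out;
    }
}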

@ericvergnaud commented Jan 30, 2022 via email

@KvanTTT commented Jan 30, 2022

> Then we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is; otherwise we convert ALL int values to two \uXXXX chars rather than a single char. Does this make sense?

Yet another point against such a solution: we have to walk the ATN twice to detect the maximum integer value (because int is used for everything). It duplicates the code, may decrease performance, and looks more complicated. Also, the integer bit-width info (16 or 32) would have to be added to the beginning of the data for the deserializer.

Why can't we just use dynamic integers (at least 2 and 4 bytes), since backward compatibility will be broken anyway after #3516? MessagePack and other binary serializers use dynamic integers; it's a working solution. We can encapsulate the integer-writing code in a separate write method and use it everywhere in the serializer (and the same for the deserializer).

@KvanTTT commented Jan 30, 2022

> Maybe it was premature, but that doesn't necessarily make it wrong... If I understand the history, it was targeted at reducing the serialized size, so we'd have to measure the impact of removing it whilst at the same time trying to reduce the size through other means…

It looks like it's only relevant for the big 0xFFFE and 0xFFFF values, which take several chars in escaped form (\uFFFF). But 0xFFFE is rare, and 0xFFFF is probably more frequent but can be replaced by just a raw literal (I've done that in #3513).

@KvanTTT commented Feb 5, 2022

> OK, a lot to take in here. I will need time to think. Any idea how many grammars generate more than 2^14 states? Just curious about your encoding mechanism. It might work.

I was a bit wrong: actually, it's up to 2^15 states (32768) without breaking compatibility. That's two times less than the current max limit (2^16).

@parrt commented Feb 5, 2022

Looking at the ATN for your MySql grammar, it seems nowhere near the 16-bit limit. Here's the tail end of the histogram of state numbers and counts for the lexer and parser:

$ tail MySql*.csv
==> MySqlLexer-histo.csv <==
13061,2
13062,2
13063,2
13064,2
13065,2
13066,2
13067,2
13068,1
13163,1
65535,3

==> MySqlParser-histo.csv <==
6778,2
6779,2
6780,2
6781,2
6782,2
6783,2
6784,2
6786,1
7974,1
65535,3

It seems extremely rare that we'd have a bigger grammar than MySql, and these are only 20% of the way to 65,535, right?

@KvanTTT commented Feb 6, 2022

> It seems extremely rare that we'd have a bigger grammar than MySql, and these are only 20% of the way to 65,535, right?

I think it depends on the application. For programming languages, 65536 should be enough (but I'm not sure even about that). But I suspect ANTLR can be used for natural language processing, where thousands of tokens are okay, or for other applications I can't even imagine. Also, there are several issues related to this limit; some users require the full range. I like the idea of attracting more users to the great ANTLR tool.

@parrt commented Feb 6, 2022

What concrete use cases have people submitted? I think we should carefully evaluate whether this is really needed.

@KvanTTT commented Feb 6, 2022

Take a look at @sharwell's comment in the latest issue:

> The serialization logic could be rewritten to use compressed integers like the ones used in ECMA-335 (bytecode for .NET), but it wouldn't be a small undertaking. It's arguably a good idea in the long run though.

@ftomassetti:

> I had the same issue with the grammar of a language I was writing. One thing you can do to avoid this is to have one token type for several operators with the same precedence (e.g., relational operators) instead of separate tokens. It seems to help.
>
> It would be nice if this was fixed eventually...

@mullekay:

> Sorry to raise this question again! I am trying to build a large dictionary of about 200,000 words (which use simple regular expressions to cover edge cases). Unfortunately, I am running into the mentioned issue as well. From what I understand, this is because the number of internal ATN states is limited to MAX_VALUE = '\uFFFF'. Is there a recommended way of using the ANTLR lexer as a tagger for large dictionaries? Furthermore, I should mention that the code generates the grammar on the fly.

And others.

@KvanTTT changed the base branch from master to dev on Feb 16, 2022
@KvanTTT commented Feb 20, 2022

I'm closing this in favor of #3546.
