ATN serialized data: remove shifting by 2, remove UUID #3516

KvanTTT · 2022-01-30T13:48:13Z

Changes from antlr/antlr-php-runtime#17 are required in master for PHP

KvanTTT · 2022-01-30T14:10:48Z

It looks like some runtime tests rely on the stable version of ANTLR, which is why some tests are failing. That shouldn't be the case.

KvanTTT · 2022-01-30T15:09:20Z

@ericvergnaud where can I find the original grammar for https://github.com/antlr/antlr4/blob/bca2536f3fb689e6e06b8351c71571f8e837f71e/runtime/Python3/tests/mocks/TestLexer.py ? It looks strange that it's not possible to update the generated data.

@parrt the same for https://github.com/antlr/antlr4/blob/88e10ac51d8c1e744d7dae054264f63f85a12439/runtime-testsuite/resources/org/antlr/v4/test/runtime/InterpDataReaderTest.interp It does not have original grammar.

ericvergnaud · 2022-01-30T15:33:55Z

@ericvergnaud where can I find the original grammar for https://github.com/antlr/antlr4/blob/bca2536f3fb689e6e06b8351c71571f8e837f71e/runtime/Python3/tests/mocks/TestLexer.py ? It looks strange that it's not possible to update the generated data.

That's a question for @youkaichao I believe...

parrt · 2022-01-30T18:07:53Z

Ok, let me pull in this branch and see what it does. Thank you for your work on this guys... standby.

KvanTTT · 2022-01-30T18:13:12Z

It looks like we should either remove some tests or fix them so they work with the current ANTLR version built from source, not with one from a package repository.

parrt · 2022-01-30T18:13:32Z

All Java tests pass. Go too. Weird: Python3 is passing, but it should not be, based on what you showed me in the gists above.

parrt · 2022-01-30T18:19:04Z

@KvanTTT which test is failing for TestLexer.py? I can't get it to fail after I pulled in your branch but it should fail.

parrt · 2022-01-30T18:19:47Z

looks like there are other Python tests that should fail:

├── mocks
│   ├── TestLexer.py
│   └── __init__.py
├── parser
│   ├── __init__.py
│   ├── clexer.py
│   └── cparser.py

KvanTTT · 2022-01-30T18:22:50Z

Yes, because those are extra tests from https://github.com/antlr/antlr4/tree/master/runtime/Python3/tests, not from the runtime-testsuite directory. I can't even regenerate the lexers and parsers because the grammars are not present. Other runtimes also have similar tests.

parrt · 2022-01-30T18:38:43Z

Ok, I will make a PR to fix the Java thing first.

KvanTTT · 2022-01-30T18:43:25Z

What's wrong with Java?

parrt · 2022-01-30T18:46:19Z

The .interp files that do not have corresponding grammars

KvanTTT · 2022-01-30T18:49:03Z

Yep, I was able to change them manually, but that does not look like the correct way.

parrt · 2022-01-30T19:25:57Z

Ok, so I have the Java fixed, but the Python TestLexer.py looks to be in the runtime and not the runtime test suite! Those files shouldn't be run by the unit tests so we should be okay although that area should be cleaned up eventually. I'm back to looking at this PR.

parrt · 2022-01-30T19:49:59Z

Looking at the change list... wow, it dramatically simplifies the code to avoid the UUID. Love it. Looks like Sam was trying to create backward compatibility, but I don't think it's worth it.

runtime/Java/src/org/antlr/v4/runtime/atn/ATNSerializer.java

parrt · 2022-01-30T19:55:14Z

This looks great to me @KvanTTT !!! I just had that one issue about removing a dead comment. oh and also rebase from master once we include my fixes to the missing Java grammar files.

Then once @ericvergnaud signs off on this we can merge. :)

ericvergnaud · 2022-01-30T20:47:11Z

@parrt once we test Swift, i.e. load a newly generated parser using the current runtime, I'll be fine.

parrt · 2022-01-30T20:57:25Z

@KvanTTT can you rebase on master? Then I'll repull.

KvanTTT · 2022-01-30T21:11:17Z

Those files shouldn't be run by the unit tests so we should be okay although that area should be cleaned up eventually. I'm back to looking at this PR.

Should I disable such tests to make CI status green?

parrt · 2022-01-30T21:15:13Z

Should I disable such tests to make CI status green?

Yes, please disable them. I'll make a new issue: #3521

KvanTTT · 2022-01-30T21:22:49Z

Ok, I will be able to do it tomorrow since it's late for me.

parrt · 2022-01-30T21:23:50Z

Great thanks!

KvanTTT · 2022-02-05T18:35:02Z

@KvanTTT can you comment on whether this is consistent with what you were saying on your side?

Yes, it's consistent with my observations.

It does look like Sam might be right and there's some weird modified UTF-8 vs UTF-8 going on inside the class files.

Yes, he is partially right (0 takes 2 bytes instead of 1 and 0xFFFF takes 3 bytes instead of 1), but that does not imply a tripling of the source/class file size as he said (as we see, it's about 11% in the worst case).
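The byte counts mentioned here can be checked with a small sketch. It assumes that `DataOutputStream.writeUTF` is a fair stand-in for how string constants are stored in class files, since both use Java's modified UTF-8. The class name `ModifiedUtf8Size` is hypothetical and not part of ANTLR.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper for illustration only; not part of the ANTLR runtime.
final class ModifiedUtf8Size {
    // Bytes used for one char in Java's modified UTF-8,
    // excluding the 2-byte length prefix that writeUTF emits.
    static int encodedSize(char c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(String.valueOf(c));
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return bytes.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(encodedSize((char) 0));      // 2
        System.out.println(encodedSize((char) 1));      // 1
        System.out.println(encodedSize((char) 2));      // 1
        System.out.println(encodedSize((char) 0xFFFF)); // 3
    }
}
```

Values 1 and 2 each take a single byte, which is where 0xFFFF and 0 would land after a +2 shift.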

One possible path forward is to not do the shift but to remove the UUID code.

I agree the current implementation is not very clear. I can suggest two ways to optimize the ATN data (but only after this PR is resolved): the first is a target-dependent shift (only for Java); the second is an improved, more compact encoding that applies to all runtimes. Both are better than the old solution with the +2 shift, which is only relevant for Java. And both also allow removing more code.

parrt · 2022-02-05T18:39:04Z

Ok, sounds good. Let us proceed as follows (mostly as-is for this PR).

1. Remove the UUID from the ATN, along with the supporting code that handles backward compatibility.
2. Remove the shift by 2 from all targets except Java, which is a special case for a lot of reasons; this gives us the advantage of removing a bunch of code from other targets whose maintainers must be wondering why we are doing it. This means that the Java target's generated code should be identical except for the UUID.

Then we can attend to the possibility of shrinking the ATN serialization strings with your other PR.
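As a rough illustration of the "shift by 2" under discussion, here is a minimal sketch. It assumes the shift is applied to each 16-bit value modulo 0x10000, so that 0 and 0xFFFF land on 2 and 1. The class and method names are hypothetical, and the actual serializer code may differ.

```java
// Hypothetical helper illustrating a +2 / -2 shift on 16-bit values.
final class ShiftByTwo {
    // Shift a 16-bit value up by 2, wrapping around at 0x10000.
    static char shift(int value) {
        return (char) ((value + 2) & 0xFFFF);
    }

    // Undo the shift when reading the data back.
    static int unshift(char stored) {
        return (stored - 2) & 0xFFFF;
    }

    public static void main(String[] args) {
        // 0 and 0xFFFF are the two values singled out in the discussion.
        System.out.println((int) shift(0));                   // 2
        System.out.println((int) shift(0xFFFF));              // 1
        System.out.println(unshift(shift(0xFFFF)) == 0xFFFF); // true
    }
}
```

Under this assumption, both values move into the range that modified UTF-8 stores in a single byte, which appears to be why the shift only pays off for the Java target.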

KvanTTT · 2022-02-05T19:05:27Z

remove shift by 2 from all targets except Java which is a special case for a lot of reasons; this gives us the advantage of removing a bunch of code from other targets that must be wondering why we are doing it. This means that Java target generated code should be identical except for the UUID.

Just to clarify: do you suggest fixing the current PR to add shifting optimization only for Java?

parrt · 2022-02-05T19:08:12Z

I suppose maybe that's the easiest path. Otherwise we would have to make a separate PR that did the shifting, and this one is almost ready to go. But if you prefer, we can go with two PRs.

KvanTTT · 2022-02-05T19:13:56Z

I suppose maybe that's the easiest path. Otherwise we would have to make a separate PR that did the shifting, and this one is almost ready to go. But if you prefer, we can go with two PRs.

Ok, I can do it in this PR. I just thought there are already a lot of changes to review, and maybe it's inconvenient :)

parrt · 2022-02-05T19:27:19Z

Yep. I think I've been keeping up with the changes, but I will take an overall look again.

KvanTTT · 2022-02-05T20:40:14Z

I've implemented Java-specific ATN data optimization in the latest commit.

parrt · 2022-02-05T23:38:30Z

runtime/Java/src/org/antlr/v4/runtime/atn/UnicodeSerializeMode.java

@@ -0,0 +1,6 @@
+package org.antlr.v4.runtime.atn;


I think I'd prefer not creating new files unless there is a strong reason.

Especially in the runtime, because it means target developers will try to implement them.

parrt · 2022-02-05T23:39:40Z

tool-testsuite/test/org/antlr/v4/test/tool/ATNSerializeUtils.java

+
+import java.util.List;
+
+public class ATNSerializeUtils {


I'm not sure a one-method utility class is worth creating a file for.

parrt · 2022-02-05T23:43:12Z

runtime/Java/src/org/antlr/v4/runtime/atn/ATNDataReader.java

+	public ATNDataReader(char[] data) {
+		this.data = data;
+	}
+


Hmm... In principle I like what you are doing here, except we have to think about how it affects all of the other targets. Now we are diverging from the way other targets are implementing things. In other words everybody else has data[p++] still. Of course we might need to generalize this if we are going to handle 32 bit ATN states but I'm not sure we need such a widespread change. If we were starting from scratch I can imagine this, but every change in a PR to a mature product has its costs or risks anyway.

Now we are diverging from the way other targets are implementing things. In other words everybody else has data[p++] still.

I've just encapsulated the reading logic into a separate class. Actually, almost all targets already have a similar implementation that uses readInt methods: see C#, Python, Dart and others.
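To make the comparison concrete, here is a minimal sketch of the two styles: indexing with `data[p++]` directly versus hiding the cursor behind a `read()`-style method. `SketchDataReader` and its method name are hypothetical; this is neither the PR's `ATNDataReader` nor any target's actual `readInt`.

```java
// Hypothetical reader that hides the cursor over the serialized data.
final class SketchDataReader {
    private final char[] data;
    private int p; // index of the next unread element

    SketchDataReader(char[] data) {
        this.data = data;
    }

    // Equivalent to writing data[p++] at the call site.
    int read() {
        return data[p++];
    }

    public static void main(String[] args) {
        char[] data = {4, 0, 7};

        // Style 1: the caller owns the cursor.
        int p = 0;
        int first = data[p++];

        // Style 2: the reader owns the cursor.
        SketchDataReader reader = new SketchDataReader(data);
        System.out.println(first == reader.read()); // true
        System.out.println(reader.read());          // 0
        System.out.println(reader.read());          // 7
    }
}
```

With the cursor in one place, a change to how a value is decoded would touch only `read()` rather than every call site.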

Of course we might need to generalize this if we are going to handle 32 bit ATN states but I'm not sure we need such a widespread change.

Handling of 32 bit ATN states could easily be added to the ATNDataWriter and ATNDataReader. I will demonstrate it in the next PR. I guess some users need such a feature, since there are several user issues about it on the tracker.

If we were starting from scratch I can imagine this, but every change in a PR to a mature product has its costs or risks anyway.

We have a lot of tests, and the ATN data is already not backward compatible because of the changes in this PR. I don't see a strong reason not to add support for the full range.

The code for handling 32 bit integers is quite simple:

public void write(int value) {
    if (value < 0b1000_0000_0000_0000) {
        // Fits in 15 bits: store it as a single 16-bit unit.
        writeInt16(value);
    } else {
        // Low 15 bits with the top bit set as a continuation flag, then the remaining bits.
        writeInt16((value & 0b0111_1111_1111_1111) | (1 << 15));
        writeInt16(value >>> 15);
    }
}

public int read() {
    int value = readInt16();
    return value < 0b1000_0000_0000_0000 && value >= 0
        ? value
        : (readInt16() << 15) | (value & 0b0111_1111_1111_1111);
}
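A self-contained round trip of the encoding above may help when reviewing it. This is a sketch only: `VarInt16Codec`, its list-backed buffer, and its cursor field are hypothetical stand-ins for `writeInt16`/`readInt16`, and the sketch assumes non-negative input values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory codec; stands in for writeInt16/readInt16.
final class VarInt16Codec {
    private final List<Character> buffer = new ArrayList<>();
    private int p; // index of the next unread 16-bit unit

    void write(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        if (value < 0x8000) {
            // Fits in 15 bits: one unit, top bit clear.
            buffer.add((char) value);
        } else {
            // Low 15 bits with the top bit set, then the remaining high bits.
            buffer.add((char) ((value & 0x7FFF) | 0x8000));
            buffer.add((char) (value >>> 15));
        }
    }

    int read() {
        int low = buffer.get(p++);
        if (low < 0x8000) {
            return low;
        }
        int high = buffer.get(p++);
        return (high << 15) | (low & 0x7FFF);
    }

    int size() {
        return buffer.size();
    }

    public static void main(String[] args) {
        int[] values = {5, 0x7FFF, 0x8000, 100_000, Integer.MAX_VALUE};
        VarInt16Codec codec = new VarInt16Codec();
        for (int v : values) {
            codec.write(v);
        }
        for (int v : values) {
            System.out.println(codec.read() == v); // true for each value
        }
        System.out.println(codec.size()); // 8: two single units, three pairs
    }
}
```

Values below 0x8000 keep their single-unit size, so only larger values pay for the second unit.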

parrt · 2022-02-05T23:43:58Z

runtime/Java/src/org/antlr/v4/runtime/atn/ATNDataWriter.java

@@ -0,0 +1,36 @@
+package org.antlr.v4.runtime.atn;


This is another file that's in one target which will diverge from other targets, and then new targets will have yet another file to implement. A lot of these files are really related to the way Java likes to do things.

ATNDataWriter is only relevant for Java, and maybe C++, because other targets don't write the ATN; they only read it.

parrt · 2022-02-05T23:45:07Z

I was kind of expecting you to simply back out the code so that the Java target added then subtracted 2 again. Haha... This seems like kind of a major change, particularly given the comments I have inline. :)

KvanTTT · 2022-02-06T11:56:25Z

I was kind of expecting you to simply back out the code so that the Java target added then subtracted 2 again. Haha... This seems like kind of a major change, particularly given the comments I have inline. :)

It could be possible but the suggested implementation looks more clear and does not require additional allocation.

KvanTTT · 2022-02-06T12:18:32Z

I've amended the latest commit according to your notes.

parrt · 2022-02-06T18:56:28Z

It could be possible but the suggested implementation looks more clear and does not require additional allocation.

I really value all of the effort you put in, but this is an old established project and there are many considerations that I've had to keep in mind over the years. What you are saying is true, that it is clear, but that is not my only consideration. As I mentioned, one of my concerns is keeping the code for the various targets similar and this changes things quite significantly. It may be the case that we need to go to something more general in the future if we need more than 16 bits.

Please consider making the minimal change necessary to the Java target. My intention for this update to the software is simply carving out the UUID from all targets, then removing the +-2 from all targets except Java.

KvanTTT · 2022-02-06T19:22:32Z

Ok, I understand. I've rewritten the latest commit with minimal changes: 60f92b8. Let's postpone the other changes for the future.

KvanTTT · 2022-02-06T19:28:34Z

As I mentioned, one of my concerns is keeping the code for the various targets similar and this changes things quite significantly.

As I said, the code of the various targets differs from Java because they use a readInt method instead of data[p++]. It looks more like my approach with writer/reader classes.

parrt · 2022-02-06T21:00:56Z

As I said, the code of the various targets differs from Java because they use a readInt method instead of data[p++]. It looks more like my approach with writer/reader classes.

Interesting. Ok, I didn't know that. Well, let's keep it as-is for now.

parrt · 2022-02-06T21:06:10Z

Wow! Great work! So much code removed!!! Thanks, @KvanTTT :)

KvanTTT · 2022-02-06T21:11:39Z

Please, don't forget to merge antlr/antlr-php-runtime#17 as well to make our CI status completely green.

KvanTTT · 2022-02-06T21:26:36Z

If you like removing code, you'll probably like the next PR, which optimizes Swift serialization (JSON -> binary) :) #3513

KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from e3b5e76 to 88e10ac on January 30, 2022 14:09

KvanTTT mentioned this pull request Jan 30, 2022

ATN serialization improvements (Java only for demo) #3505

Closed

parrt reviewed Jan 30, 2022

KvanTTT mentioned this pull request Jan 31, 2022

ATN serialized data: remove shifting by 2, remove UUID antlr/antlr-php-runtime#17

Merged

KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch 4 times, most recently from 48645fd to 1314459 on January 31, 2022 17:25

parrt reviewed Feb 5, 2022

KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from 34edfb8 to 0f1da3e on February 6, 2022 12:16

Implement Java-specific ATN data optimization (+-2 shift)

60f92b8

KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from 0f1da3e to 60f92b8 on February 6, 2022 19:21

parrt added atn-analysis comp:runtime type:cleanup labels Feb 6, 2022

parrt added this to the 4.10 milestone Feb 6, 2022

parrt merged commit 6040190 into antlr:master Feb 6, 2022

KvanTTT deleted the atn-remove-uuid-and-shifting-by-2 branch February 6, 2022 21:21
