Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ATN serialized data: remove shifting by 2, remove UUID #3516

Merged
merged 3 commits into from
Feb 6, 2022

Conversation

KvanTTT
Copy link
Member

@KvanTTT KvanTTT commented Jan 30, 2022

Changes from antlr/antlr-php-runtime#17 are required in master for PHP

fix #3515

@KvanTTT KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from e3b5e76 to 88e10ac Compare January 30, 2022 14:09
@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

It looks like some runtime tests rely on the stable version of ANTLR, that's why some tests are failing. It shouldn't be.

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jan 30, 2022

@ericvergnaud where can I find the original grammar for https://github.com/antlr/antlr4/blob/bca2536f3fb689e6e06b8351c71571f8e837f71e/runtime/Python3/tests/mocks/TestLexer.py ? It looks strange that it's not possible to update the generated data.

Thats a question for @youkaichao I believe...

@parrt
Copy link
Member

parrt commented Jan 30, 2022

Ok, let me pull in this branch and see what it does. Thank you for your work on this guys... standby.

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

It looks like we should either remove some tests or fix them for working with the actual ANTLR version from the source, not from a package repository.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

All Java tests pass. Also Go. Weird Python3 is passing but it should not based upon what you showed me in the gists above.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

@KvanTTT which test is failing for TestLexer.py? I can't get it to fail after I pulled in your branch but it should fail.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

looks like there are other Python tests that should fail:

├── mocks
│   ├── TestLexer.py
│   └── __init__.py
├── parser
│   ├── __init__.py
│   ├── clexer.py
│   └── cparser.py

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

Yes, because it's extra tests from https://github.com/antlr/antlr4/tree/master/runtime/Python3/tests, not from the runtime-testsuite directory. I can't even regenerate lexer and parsers because grammars are not presented. Other runtimes also have similar tests.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

Ok, I will make a PR to fix the Java thing first.

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

What's wrong with Java?

@parrt
Copy link
Member

parrt commented Jan 30, 2022

The .interp files that do not have corresponding grammars

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

Yep, I was able to change them manually but it does not look the correct way.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

Ok, so I have the Java fixed, but the Python TestLexer.py looks to be in the runtime and not the runtime test suite! Those files shouldn't be run by the unit tests so we should be okay although that area should be cleaned up eventually. I'm back to looking at this PR.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

looking at the change list...wow, it dramatically simplifies the code to avoid the UU ID. love it. looks like Sam was trying to create backward compatibility but I don't think it's worth it.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

This looks great to me @KvanTTT !!! I just had that one issue about removing a dead comment. oh and also rebase from master once we include my fixes to the missing Java grammar files.

Then once @ericvergnaud signs off on this we can merge. :)

@ericvergnaud
Copy link
Contributor

@parrt one we test Swift i.e. load a newly generated parser using the current runtime I'll be fine.

@parrt
Copy link
Member

parrt commented Jan 30, 2022

@KvanTTT can you rebase on master? Then I'll repull.

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

Those files shouldn't be run by the unit tests so we should be okay although that area should be cleaned up eventually. I'm back to looking at this PR.

Should I disable such tests to make CI status green?

@parrt
Copy link
Member

parrt commented Jan 30, 2022

Should I disable such tests to make CI status green?

Yes, please disable. I'll make new issue: #3521

@KvanTTT
Copy link
Member Author

KvanTTT commented Jan 30, 2022

Ok, I will be able to do it tomorrow since it's late for me.

@parrt
Copy link
Member

parrt commented Jan 30, 2022 via email

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 5, 2022

@KvanTTT can you comment on whether this is consistent with what you were saying on your side?

Yes, it's consistent with my observations.

It does look like Sam might be right and there's some weird modified UTF-8 vs UTF-8 going on inside the class files.

Yes, he is right partially (0 takes 2 bytes instead of 1 and 0xFFFF takes 3 bytes instead of 1), but it does not imply triple size of source/class file as he said (as we see it's about 11% at worse case).

One possible path forward is to not do the shift but to remove the UUID code.

I agree the current implementation is not very clear. I can suggest two solutions how to optimize ATN data (but after this PR status resolving): the first includes target-dependent shift (only for Java) the second includes improved and more compact encoding which is actual for all runtimes. Both of them are better than the old solution with +2 shift that is only actual for Java. And both of them also imply more code removing.

@parrt
Copy link
Member

parrt commented Feb 5, 2022

Ok, sounds good. Let us proceed as follows (mostly as-is for this PR).

  • remove UUID from ATN and supporting code that handles backward-compatible
  • remove shift by 2 from all targets except Java which is a special case for a lot of reasons; this gives us the advantage of removing a bunch of code from other targets that must be wondering why we are doing it. This means that Java target generated code should be identical except for the UUID.

Then we can attend to the possibility of shrinking the ATN serialization strings with your other PR.

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 5, 2022

remove shift by 2 from all targets except Java which is a special case for a lot of reasons; this gives us the advantage of removing a bunch of code from other targets that must be wondering why we are doing it. This means that Java target generated code should be identical except for the UUID.

Just to clarify: do you suggest fixing the current PR to add shifting optimization only for Java?

@parrt
Copy link
Member

parrt commented Feb 5, 2022

I suppose maybe that's the easiest path. Otherwise we would have to make a separate PR that did the shifting and this one is almost ready to go. But if you prefer we can go with to PRs.

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 5, 2022

I suppose maybe that's the easiest path. Otherwise we would have to make a separate PR that did the shifting and this one is almost ready to go. But if you prefer we can go with to PRs.

Ok, I can do it in this PR. Just thought there are already a lot of changes for the review and maybe it's inconvenient :)

@parrt
Copy link
Member

parrt commented Feb 5, 2022 via email

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 5, 2022

I've implemented Java-specific ATN data optimization in the latest commit.

@@ -0,0 +1,6 @@
package org.antlr.v4.runtime.atn;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer not creating new files unless there is a strong reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly in the runtime because it means Target developers will try to implement.


import java.util.List;

public class ATNSerializeUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure creating a one method utility class is worth creating a file.

public ATNDataReader(char[] data) {
this.data = data;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... The principal I like what you are doing here except we have to think about how it affects all of the other targets. Now we are diverging from the way other targets are implementing things. In other words everybody else has data[p++] still. Of course we might need to generalize this if we are going to handle 32 bit ATN states but I not sure we need such a widespread change. If we were starting from scratch I can imagine this, but every change in a PR to a mature product has its costs or risks anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we are diverging from the way other targets are implementing things. In other words everybody else has data[p++] still.

I've just encapsulated reading logic into the separated class. Actually almost all targets already have similar implementation that uses readInt methods: see C#, Python, Dart and other.

Of course we might need to generalize this if we are going to handle 32 bit ATN states but I not sure we need such a widespread change.

Handling of 32 bit ATN states could be easily added to the ATNDataWriter and ATNDataReader. I would demonstrate it in the next PR. I guess some users need such a feature since there are several user issues on the tracker.

If we were starting from scratch I can imagine this, but every change in a PR to a mature product has its costs or risks anyway.

We have a lot of tests and ATN data is already not back-compatible because of changes in this PR. I don't see a strong reason why to not add support of full range.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for handling 32 bit integers is quite simple:

public void write(int value) {
	if (value < 0b1000_0000_0000_0000) {
		writeInt16(value);
	} else {
		writeInt16((value & 0b0111_1111_1111_1111) | (1 << 15));
		writeInt16(value >>> 15);
	}
}

public int read() {
	int value = readInt16();
	return value < 0b1000_0000_0000_0000 && value >= 0
			? value
			: (readInt16() << 15) | (value & 0b0111_1111_1111_1111);
}

@@ -0,0 +1,36 @@
package org.antlr.v4.runtime.atn;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another file that's in one target which will diverge from other targets and then new targets will have yet another file to implement. A lotta these files are really related to the way Java likes to do things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATNDataWriter is only actual for Java, maybe C++ because other targets don't write they only read ATN.

@parrt
Copy link
Member

parrt commented Feb 5, 2022

I was kind of expecting you to simply back out the code so that the job at target added then subtracted 2 again. haha.. Seems like kind of a major change, particularly given the comments I have in line. :)

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

I was kind of expecting you to simply back out the code so that the job at target added then subtracted 2 again. haha.. Seems like kind of a major change, particularly given the comments I have in line. :)

It could be possible but the suggested implementation looks more clear and does not require additional allocation.

@KvanTTT KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from 34edfb8 to 0f1da3e Compare February 6, 2022 12:16
@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

I've amended the latest commit by your notes.

@parrt
Copy link
Member

parrt commented Feb 6, 2022

It could be possible but the suggested implementation looks more clear and does not require additional allocation.

I really value all of the effort you put in, but this is an old established project and there are many considerations that I've had to keep in mind over the years. What you are saying is true that it is clear, but that is not my only consideration. As I mentioned, one of my concerns is keeping the code for the various target similar and this changes things quite significantly. It may be the case that we need to go to something more general in the future if we need more than 16 bits.

Please consider making the minimal change necessary to the java target. My intention for this update to the software is simply carving out the UUID from all targets, then removing +-2 from all target accept Java.

@KvanTTT KvanTTT force-pushed the atn-remove-uuid-and-shifting-by-2 branch from 0f1da3e to 60f92b8 Compare February 6, 2022 19:21
@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

Ok, understand you. I've rewritten the latest commit with minimal changes: 60f92b8 Let's postpone other changes for future.

@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

As I mentioned, one of my concerns is keeping the code for the various target similar and this changes things quite significantly.

As I said code of various targets are differs from Java because they use readInt method intead of data[p++]. It more looks like my approash with writer/reader classes.

@parrt
Copy link
Member

parrt commented Feb 6, 2022

As I said code of various targets are differs from Java because they use readInt method intead of data[p++]. It more looks like my approash with writer/reader classes.

Interesting. Ok, didn't know that. Well let's keep it as-s for now.

@parrt
Copy link
Member

parrt commented Feb 6, 2022

Wow! Great work! So much code removed!!! Thanks, @KvanTTT :)

@parrt parrt merged commit 6040190 into antlr:master Feb 6, 2022
@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

Please, don't forget to merge antlr/antlr-php-runtime#17 as well to make our CI status completely green.

@KvanTTT KvanTTT deleted the atn-remove-uuid-and-shifting-by-2 branch February 6, 2022 21:21
@KvanTTT
Copy link
Member Author

KvanTTT commented Feb 6, 2022

If you like code removing, probably you'd like the next PR with optimization of Swift serialization (JSON -> binary) :) #3513

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean up ATN serialization: rm UUID and shifting by value of 2
4 participants