Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize targets to 80 char strings for ATN serialization segments. #3438

Merged
merged 2 commits into from
Dec 27, 2021

Conversation

parrt
Copy link
Member

@parrt parrt commented Dec 27, 2021

@ericvergnaud @KvanTTT How does this look? Tests appear to pass locally.

@parrt
Copy link
Member Author

parrt commented Dec 27, 2021

Well, @ericvergnaud is the one who cares and suggests 80 so let's go old school :)

@parrt
Copy link
Member Author

parrt commented Dec 27, 2021

crap. i got the -q removal stuck in this PR too. :( oops. already in master. oh well.

@parrt parrt merged commit a16b5ef into antlr:master Dec 27, 2021
@ericvergnaud
Copy link
Contributor

well not sure what's going on but it seems that the value returned by getSerializedATNSegmentLimit is not used...

@parrt
Copy link
Member Author

parrt commented Dec 28, 2021

For which targets? Wow. SerializedATN.getSegments() isn't called anywhere (unless reflection). Looks like constructor does all the work to get string:

	public SerializedATN(OutputModelFactory factory, ATN atn) {
		super(factory);
		IntegerList data = ATNSerializer.getSerialized(atn);
		serialized = new ArrayList<String>(data.size());
		for (int c : data.toArray()) {
			String encoded = factory.getGenerator().getTarget().encodeIntAsCharEscape(c == -1 ? Character.MAX_VALUE : c);
			serialized.add(encoded);
		}
//		System.out.println(ATNSerializer.getDecoded(factory.getGrammar(), atn));
	}

oh! the template can call <segments>. Here's PHP:

private const SERIALIZED_ATN =
	<model.segments:{segment| "<segment; wrap={" .<\n>"}>"}; separator=" .\n">;

and java:

<model.segments:{segment|private static final String _serializedATNSegment<i0> =
	"<segment; wrap={"+<\n><\t>"}>";}; separator="\n">
public static final String _serializedATN = Utils.join(
	new String[] {
		<model.segments:{segment | _serializedATNSegment<i0>}; separator=",\n">
	},
	""
);

@KvanTTT
Copy link
Member

KvanTTT commented Jan 21, 2022

well not sure what's going on but it seems that the value returned by getSerializedATNSegmentLimit is not used...

Actually, it's called internally by StringTemplate, and value from getSerializedATNSegmentLimit is used (I don't know how it works exactly).

@KvanTTT
Copy link
Member

KvanTTT commented Jan 27, 2022

Unfortunately, in the end, it's a very bad change. It's limit not for line length (it's set up internally by StringTemplate during wrapping) but for the maximum length of string or char array. And now in generated code, we have a lot of useless small 80-element arrays for targets that consider ATN segment limit (Dart, C++, PHP).

Also, this limit is actual only for Java target since other targets don't have 65535 limit.

I suggest reverting 80 to Integer.MAX_VALUE by default.

@parrt
Copy link
Member Author

parrt commented Jan 29, 2022

The getSerializedATNSegmentLimit is used to prevent strings longer than 65535 characters for the serialized ATN. If we have a serialized ATN that requires something larger than that, it has to be split into multiple strings that are joined at runtime. An obvious case where this will happen is when the number of ATN states goes beyond 16 bits.

I believe this also has the effect of generating strings on multiple lines instead of one contiguous long line. @ericvergnaud wants to continue keeping it for the various targets he manages because it makes it easier to see the generated code in the editor. In the case of Java, it would slow things down a little bit and cause memory allocation if we set it to 80 or whatever. In my case, I prefer the full 64k string to avoid multiple allocations.

@KvanTTT
Copy link
Member

KvanTTT commented Jan 29, 2022

The getSerializedATNSegmentLimit is used to prevent strings longer than 65535 characters for the serialized ATN.

Yes, that's why it should not be changed to 80. It creates a lot of unwanted small fragments for targets that respect segments (PHP, C++, Dart). I've fixed it in the latest PR. It looks like this parameter is only relevant for Java.

I believe this also has the effect of generating strings on multiple lines instead of one contiguous long line.

It doesn't affect string splitting because StringTemplate is responsible for string wrapping.

In the case of Java, it would slow things down a little bit and cause memory allocation if we set it to 80 or whatever.

Only on the compiler step. Java compiler folds constant strings into a single one in .class file.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jan 30, 2022 via email

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jan 30, 2022 via email

@KvanTTT
Copy link
Member

KvanTTT commented Jan 30, 2022

The limit is not about multiline strings, it's about maximum string length that is only actual for Java because of 65535 limit.
It's worth changing the default value back to Int.MaxValue. Is that ok for you that C++ has a lot of 80-length string arrays for ATN? https://gist.github.com/KvanTTT/58496ebf63e32498ff644c43deb5bd96#file-largelexer-cpp-L1183

@parrt
Copy link
Member Author

parrt commented Jan 30, 2022

My own preference would be to see one big array of integers wrapped appropriately at some line length, rather than that just with lots of different segments. Unless a target has a limit, we should probably set the segment length to max integer.

I think we should separate the idea of line length, which can be handled in templates, from the length of the ATN serialization list of integers. I don't think any language but Java has a string length issue in terms of compilation or storing in a generated binary file. @ericvergnaud points out that most targets support multi line strings so they can be built with appropriate template actions.

In the end, my vote is to set the segment length to be max int and then change the templates to solve the issue of wrapping that Eric originally pointed out. E.g., the java target already does that trivially:

public static final String _serializedATN =
	"<model.serialized; wrap={"+<\n><\t>"}>";

@ericvergnaud how will it affect JavaScript or other targets if we moved to max int? Can we tweak the templates to satisfy your needs?

@parrt
Copy link
Member Author

parrt commented Jan 30, 2022

(ANTLR history/info Is slowly filtering back into my old brain) woot!

@KvanTTT
Copy link
Member

KvanTTT commented Jan 30, 2022

I've checked, all targets have wrapping (over string or arrays).

@parrt
Copy link
Member Author

parrt commented Jan 30, 2022

I've checked, all targets have wrapping (over string or arrays).

excellent. Ok, so this covers the primary concern that @ericvergnaud had originally. I will look at this after we figure out the UU ID and ATN serialization version thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants