Fix #1890 Standalone CR should be recognized as line separator #2519

nixel2007 · 2019-03-24T09:11:55Z

Hello!

I've made changes to treat stand-alone CR as line separator.
This changes were tested on millions of LOC with-out any noticable perfomance regress. Also I use my fork to parse files everyday as a part of static analysis process.

If antlr4 authors want I can make the same changes for all other supported languages.

KvanTTT · 2019-04-07T21:32:18Z

Link to an issue: #1098

nixel2007 · 2019-06-10T10:10:36Z

Ping?

KvanTTT · 2022-01-07T13:41:04Z

@parrt maybe it makes sense to accept the \r line separator because a lot of users encounter the same problem and report it. On the other hand, the fix looks simple. It looks like not such a rare case.

parrt · 2022-01-08T00:55:23Z

If there is an isolated \r it doesn't reset line nor bump char position. doesn't nothing. is it an error? i think only old macs used \r in isolation...

maybe we should count as a newline and reset char position and bump line.

KvanTTT · 2022-01-08T12:02:18Z

As I've already written in the issue comment:

From Wiki: Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II series, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9 uses \r as the new line separator. But there are also more rare line separators as \036, \025 that I have never seen.

But it looks like many users encounter the problem with \r separator. Also all text editors and IDE that I've tested support such line separator (Notepad++, VS Code, IntelliJ, even Windows Notepad). Thus, it's not a very rare line separator.

I suggest supporting it especially since the fix is quite simple.

But this pull request is not okay because it does not have a unit test and fixes only Java runtime (other runtimes should accept \r as well).

nixel2007 · 2022-01-08T12:46:53Z

@KvanTTT

See OP:

If antlr4 authors want I can make the same changes for all other supported languages.

KvanTTT · 2022-01-08T12:59:31Z

Also, BaseRuntimeTest should be improved to accept \r\n and \r line separators (now it converts \r\n to \n that's bad).

KvanTTT

A unit test is required (LexerExec/CarriageReturnAsLineSeparator.txt). Also, readDescriptor should be improved to accept \r and don't change to \n all line separators from the input stream.
Implementation for other runtime is required as we've discussed.

Before implementation, we should wait for @parrt final decision about \r as a line separator.

KvanTTT · 2022-01-08T13:05:00Z

runtime/Java/src/org/antlr/v4/runtime/atn/LexerATNSimulator.java

+		} else if ( curChar == '\r' ) {
+			int nextChar = input.LA(2);
+			if ( nextChar != '\n') {
+				line++;
+				charPositionInLine=0;
+			}
+		} else {


Maybe it's better to change to the following?

Suggested change

} else if ( curChar == '\r' ) {

int nextChar = input.LA(2);

if ( nextChar != '\n') {

line++;

charPositionInLine=0;

}

} else {

} else if ( curChar == '\r' ) {

if (input.LA(2) == '\n') {

input.consume();

}

line++;

charPositionInLine=0;

} else {

It looks more optimal since it does not require reading \n twice.

ericvergnaud · 2022-01-08T15:22:47Z

Hi,
given the very nature of ANTLR, not sure any of this should be hard-coded.
If the input is not text, a CR might NOT mean what it means in a text editor.
It's a bit of a chicken-and-egg situation because the lexer uses CR/LF do detect line breaks before it actually parses them...
Would it make sense to have this be a configuration of the lexer ?
Something like:
lexer.setLineBreaksPattern("\r?\n")
such that users can adapt to whatever situation they need, although the default would be the above pattern?

KvanTTT · 2022-01-08T16:12:11Z

If the input is not text, a CR might NOT mean what it means in a text editor.

I think it's a very rare case when ANTLR is used for not text parsing (In my opinion it's not very suitable for binary parsing) and when \r is not treated as a new line. If so, \r\n and \n also should be defined by the user but they are defined by default.

Single \n is also line break without carriage return but actually, it's always used as \r\n.

I'm basing on the most widespread situations case and ANTLR practical usage. In most cases, \r is a line separator. If it's not, behavior can be changed as you suggest.

Would it make sense to have this be a configuration of the lexer ?

I have no objection but it complicates runtimes. Anyway by default line separator should be (\r?\n|\r) because almost all text tools (if not all) support it and it surprises the user if a tool doesn't support such a separator.

lexer.setLineBreaksPattern("\r?\n")

Also, actually binary files don't have "lines" at all.

ericvergnaud · 2022-01-08T16:39:02Z

If the input is not text, a CR might NOT mean what it means in a text editor.

I think it's a very rare case when ANTLR is used for not text parsing (In my opinion it's not very suitable for binary parsing) and when \r is not treated as a new line. If so, \r\n and \n also should be defined by the user but they are defined by default.

Single \n is also line break without carriage return but actually, it's always used as \r\n.

I'm basing on the most widespread situations case and ANTLR practical usage. In most cases, \r is a line separator. If it's not, behavior can be changed as you suggest.

Would it make sense to have this be a configuration of the lexer ?

I have no objection but it complicates runtimes. Anyway by default line separator should be (\r?\n|\r) because almost all text tools (if not all) support it and it surprises the user if a tool doesn't support such a separator.

lexer.setLineBreaksPattern("\r?\n")

Also, actually binary files don't have "lines" at all.

You'd be surprised how many people use antlr for use cases it wasn't designed for.
It's also a very rare case when text has CR without LF, so changing the current behavior for a rare case, when it can be made optional doesn't sound like the best approach.

parrt · 2022-01-08T17:17:42Z

Hi guys, Thanks very much for helping me think through this problem. Excellent analyses. My thoughts:

\r in isolation is a rare case these days
People parsing \r only text right now must be doing something; the obvious solution is simply to either replace with \n prior to processing or using a custom character stream or file loader
Existing code might rely on current behavior, across multiple targets
Making this change in Java means changing all targets to suit, possibly updating documentation
It's unclear what the effect on run-time speed would be because the Java JIT could be inlining consume() at the moment; adding more code including a conditional could prevent the inlining and cause a virtual method call. Given how much we tuned the java target, I would be very careful about changing something that affects every character consumption

In the end, given the simple solution and potential issues arising from a big change across targets for a rare situation is probably not worth it. Consequently, I'm going to close this one with the recommendation that people use a custom character stream or file loader.

parrt · 2022-01-08T17:18:36Z

Also, BaseRuntimeTest should be improved to accept \r\n and \r line separators (now it converts \r\n to \n that's bad).

a unit test that checked \r\n would be good.

nixel2007 · 2022-01-08T17:48:15Z

Thanks for your opinion, @parrt

My 2 cents: instead of changing character stream (which could lead to position errors) user may override LexerATNSimulator:
https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/java/com/github/_1c_syntax/bsl/parser/CRAwareLexerATNSimulator.java

and pass it to lexer as member:

https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/antlr/BSLLexer.g4#L29-L35

parrt · 2022-01-08T17:50:12Z

Ah. I wondered about subclassing to change consume() directly. thanks!

KvanTTT · 2022-01-08T17:53:46Z

a unit test that checked \r\n would be good.

I started working on it in another pull request: Preserve line separators for input runtime tests dat. But there are some problems because git changes EOL depending on OS and other options. On the other hand, we don't have another input format that allows chars escaping: 1\r\n2\r\n3. I'm thinking about how to implement such a test in the best way.

@nixel2007 yes, CRAwareLexerATNSimulator is a much more preferable option since it does not require changing all input string.

nixel2007 · 2022-01-08T17:55:50Z

But there are some problems because git changes EOL depending on OS and other options.

we fixed EOLs on test files with .gitattributes file

KvanTTT · 2022-01-08T18:00:53Z

Yes, I've tried to ignore EOL changing for the entire directory:

/runtime-testsuite/resources/org/antlr/v4/test/runtime/descriptors/*.txt -text

But maybe it makes sense to set up \r\n just to the certain file:

LineSeparator_CRLF.txt eol=crlf
LineSeparator_LF.txt eol=lf

KvanTTT · 2022-01-09T12:22:12Z

BTW, it looks like ANTLR 3 also does not treat single \r as line separator because line numbers are wrong with such grammars (ANTLR 3 is used for parsing ANTLR 4 grammars).

cxambs · 2023-03-20T09:48:48Z

Thanks for your opinion, @parrt

My 2 cents: instead of changing character stream (which could lead to position errors) user may override LexerATNSimulator: https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/java/com/github/_1c_syntax/bsl/parser/CRAwareLexerATNSimulator.java

and pass it to lexer as member:

https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/antlr/BSLLexer.g4#L29-L35

I am trying this approach in CSharp, but Consume is not a virtual member.
Suggestions?

cxambs · 2023-03-20T11:37:44Z

And, of course, shadowing it won't work as well.
Probably methods from the generated Lexer could be all virtual.
Without that option for now, what are my options?

Allow user to subclass and consume differently. Useful for CR handling as suggested in #2519 Signed-off-by: Alberto Simões <[email protected]>

Allow user to subclass and consume differently. Useful for CR handling as suggested in antlr#2519 Signed-off-by: Alberto Simões <[email protected]>

Allow user to subclass and consume differently. Useful for CR handling as suggested in antlr#2519 Signed-off-by: Alberto Simões <[email protected]> Signed-off-by: Victor Smirnov <[email protected]>

nixel2007 added 2 commits October 25, 2019 01:16

antlr#1890 Standalone CR should be recognized as line separator [java]

84d291f

contributors.txt sign

65ca40b

nixel2007 force-pushed the fix/cr-is-new-line-separator branch from 2c23d4c to 65ca40b Compare October 24, 2019 22:17

nixel2007 mentioned this pull request Nov 22, 2019

Fix antlr/antlr4#1890 Standalone CR should be recognized as line separator tunnelvisionlabs/antlr4#58

Closed

KvanTTT reviewed Jan 8, 2022

View reviewed changes

parrt closed this Jan 8, 2022

cxambs mentioned this pull request Mar 20, 2023

Update LexerATNSimulator.cs with virtual Consume #4197

Merged

parrt pushed a commit that referenced this pull request Mar 21, 2023

Update LexerATNSimulator.cs with virutal Consume

8dcc652

Allow user to subclass and consume differently. Useful for CR handling as suggested in #2519 Signed-off-by: Alberto Simões <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #1890 Standalone CR should be recognized as line separator #2519

Fix #1890 Standalone CR should be recognized as line separator #2519

nixel2007 commented Mar 24, 2019 •

edited

Loading

KvanTTT commented Apr 7, 2019

nixel2007 commented Jun 10, 2019

KvanTTT commented Jan 7, 2022 •

edited

Loading

parrt commented Jan 8, 2022 •

edited

Loading

KvanTTT commented Jan 8, 2022

nixel2007 commented Jan 8, 2022 •

edited

Loading

KvanTTT commented Jan 8, 2022

KvanTTT left a comment •

edited

Loading

KvanTTT Jan 8, 2022 •

edited

Loading

ericvergnaud commented Jan 8, 2022 •

edited

Loading

KvanTTT commented Jan 8, 2022 •

edited

Loading

ericvergnaud commented Jan 8, 2022

parrt commented Jan 8, 2022

parrt commented Jan 8, 2022

nixel2007 commented Jan 8, 2022

parrt commented Jan 8, 2022

KvanTTT commented Jan 8, 2022

nixel2007 commented Jan 8, 2022 •

edited

Loading

KvanTTT commented Jan 8, 2022

KvanTTT commented Jan 9, 2022

cxambs commented Mar 20, 2023

cxambs commented Mar 20, 2023

Fix #1890 Standalone CR should be recognized as line separator #2519

Fix #1890 Standalone CR should be recognized as line separator #2519

Conversation

nixel2007 commented Mar 24, 2019 • edited Loading

KvanTTT commented Apr 7, 2019

nixel2007 commented Jun 10, 2019

KvanTTT commented Jan 7, 2022 • edited Loading

parrt commented Jan 8, 2022 • edited Loading

KvanTTT commented Jan 8, 2022

nixel2007 commented Jan 8, 2022 • edited Loading

KvanTTT commented Jan 8, 2022

KvanTTT left a comment • edited Loading

Choose a reason for hiding this comment

KvanTTT Jan 8, 2022 • edited Loading

Choose a reason for hiding this comment

ericvergnaud commented Jan 8, 2022 • edited Loading

KvanTTT commented Jan 8, 2022 • edited Loading

ericvergnaud commented Jan 8, 2022

parrt commented Jan 8, 2022

parrt commented Jan 8, 2022

nixel2007 commented Jan 8, 2022

parrt commented Jan 8, 2022

KvanTTT commented Jan 8, 2022

nixel2007 commented Jan 8, 2022 • edited Loading

KvanTTT commented Jan 8, 2022

KvanTTT commented Jan 9, 2022

cxambs commented Mar 20, 2023

cxambs commented Mar 20, 2023

nixel2007 commented Mar 24, 2019 •

edited

Loading

KvanTTT commented Jan 7, 2022 •

edited

Loading

parrt commented Jan 8, 2022 •

edited

Loading

nixel2007 commented Jan 8, 2022 •

edited

Loading

KvanTTT left a comment •

edited

Loading

KvanTTT Jan 8, 2022 •

edited

Loading

ericvergnaud commented Jan 8, 2022 •

edited

Loading

KvanTTT commented Jan 8, 2022 •

edited

Loading

nixel2007 commented Jan 8, 2022 •

edited

Loading