Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1890 Standalone CR should be recognized as line separator #2519

Closed

Conversation

nixel2007
Copy link

@nixel2007 nixel2007 commented Mar 24, 2019

Hello!

I've made changes to treat stand-alone CR as line separator.
This changes were tested on millions of LOC with-out any noticable perfomance regress. Also I use my fork to parse files everyday as a part of static analysis process.

If antlr4 authors want I can make the same changes for all other supported languages.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 7, 2019

Link to an issue: #1098

@nixel2007
Copy link
Author

Ping?

@KvanTTT
Copy link
Member

KvanTTT commented Jan 7, 2022

@parrt maybe it makes sense to accept the \r line separator because a lot of users encounter the same problem and report it. On the other hand, the fix looks simple. It looks like not such a rare case.

@parrt
Copy link
Member

parrt commented Jan 8, 2022

If there is an isolated \r it doesn't reset line nor bump char position. doesn't nothing. is it an error? i think only old macs used \r in isolation...

maybe we should count as a newline and reset char position and bump line.

@KvanTTT
Copy link
Member

KvanTTT commented Jan 8, 2022

As I've already written in the issue comment:

From Wiki: Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II series, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9 uses \r as the new line separator. But there are also more rare line separators as \036, \025 that I have never seen.

But it looks like many users encounter the problem with \r separator. Also all text editors and IDE that I've tested support such line separator (Notepad++, VS Code, IntelliJ, even Windows Notepad). Thus, it's not a very rare line separator.

I suggest supporting it especially since the fix is quite simple.

But this pull request is not okay because it does not have a unit test and fixes only Java runtime (other runtimes should accept \r as well).

@nixel2007
Copy link
Author

nixel2007 commented Jan 8, 2022

@KvanTTT

See OP:

If antlr4 authors want I can make the same changes for all other supported languages.

@KvanTTT
Copy link
Member

KvanTTT commented Jan 8, 2022

Also, BaseRuntimeTest should be improved to accept \r\n and \r line separators (now it converts \r\n to \n that's bad).

Copy link
Member

@KvanTTT KvanTTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • A unit test is required (LexerExec/CarriageReturnAsLineSeparator.txt). Also, readDescriptor should be improved to accept \r and don't change to \n all line separators from the input stream.
  • Implementation for other runtime is required as we've discussed.

Before implementation, we should wait for @parrt final decision about \r as a line separator.

Comment on lines +737 to +743
} else if ( curChar == '\r' ) {
int nextChar = input.LA(2);
if ( nextChar != '\n') {
line++;
charPositionInLine=0;
}
} else {
Copy link
Member

@KvanTTT KvanTTT Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to change to the following?

Suggested change
} else if ( curChar == '\r' ) {
int nextChar = input.LA(2);
if ( nextChar != '\n') {
line++;
charPositionInLine=0;
}
} else {
} else if ( curChar == '\r' ) {
if (input.LA(2) == '\n') {
input.consume();
}
line++;
charPositionInLine=0;
} else {

It looks more optimal since it does not require reading \n twice.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Jan 8, 2022

Hi,
given the very nature of ANTLR, not sure any of this should be hard-coded.
If the input is not text, a CR might NOT mean what it means in a text editor.
It's a bit of a chicken-and-egg situation because the lexer uses CR/LF do detect line breaks before it actually parses them...
Would it make sense to have this be a configuration of the lexer ?
Something like:
lexer.setLineBreaksPattern("\r?\n")
such that users can adapt to whatever situation they need, although the default would be the above pattern?

@KvanTTT
Copy link
Member

KvanTTT commented Jan 8, 2022

If the input is not text, a CR might NOT mean what it means in a text editor.

I think it's a very rare case when ANTLR is used for not text parsing (In my opinion it's not very suitable for binary parsing) and when \r is not treated as a new line. If so, \r\n and \n also should be defined by the user but they are defined by default.

Single \n is also line break without carriage return but actually, it's always used as \r\n.

I'm basing on the most widespread situations case and ANTLR practical usage. In most cases, \r is a line separator. If it's not, behavior can be changed as you suggest.

Would it make sense to have this be a configuration of the lexer ?

I have no objection but it complicates runtimes. Anyway by default line separator should be (\r?\n|\r) because almost all text tools (if not all) support it and it surprises the user if a tool doesn't support such a separator.

lexer.setLineBreaksPattern("\r?\n")

Also, actually binary files don't have "lines" at all.

@ericvergnaud
Copy link
Contributor

If the input is not text, a CR might NOT mean what it means in a text editor.

I think it's a very rare case when ANTLR is used for not text parsing (In my opinion it's not very suitable for binary parsing) and when \r is not treated as a new line. If so, \r\n and \n also should be defined by the user but they are defined by default.

Single \n is also line break without carriage return but actually, it's always used as \r\n.

I'm basing on the most widespread situations case and ANTLR practical usage. In most cases, \r is a line separator. If it's not, behavior can be changed as you suggest.

Would it make sense to have this be a configuration of the lexer ?

I have no objection but it complicates runtimes. Anyway by default line separator should be (\r?\n|\r) because almost all text tools (if not all) support it and it surprises the user if a tool doesn't support such a separator.

lexer.setLineBreaksPattern("\r?\n")

Also, actually binary files don't have "lines" at all.

You'd be surprised how many people use antlr for use cases it wasn't designed for.
It's also a very rare case when text has CR without LF, so changing the current behavior for a rare case, when it can be made optional doesn't sound like the best approach.

@parrt
Copy link
Member

parrt commented Jan 8, 2022

Hi guys, Thanks very much for helping me think through this problem. Excellent analyses. My thoughts:

  • \r in isolation is a rare case these days
  • People parsing \r only text right now must be doing something; the obvious solution is simply to either replace with \n prior to processing or using a custom character stream or file loader
  • Existing code might rely on current behavior, across multiple targets
  • Making this change in Java means changing all targets to suit, possibly updating documentation
  • It's unclear what the effect on run-time speed would be because the Java JIT could be inlining consume() at the moment; adding more code including a conditional could prevent the inlining and cause a virtual method call. Given how much we tuned the java target, I would be very careful about changing something that affects every character consumption

In the end, given the simple solution and potential issues arising from a big change across targets for a rare situation is probably not worth it. Consequently, I'm going to close this one with the recommendation that people use a custom character stream or file loader.

@parrt parrt closed this Jan 8, 2022
@parrt
Copy link
Member

parrt commented Jan 8, 2022

Also, BaseRuntimeTest should be improved to accept \r\n and \r line separators (now it converts \r\n to \n that's bad).

a unit test that checked \r\n would be good.

@nixel2007
Copy link
Author

@parrt
Copy link
Member

parrt commented Jan 8, 2022

Ah. I wondered about subclassing to change consume() directly. thanks!

@KvanTTT
Copy link
Member

KvanTTT commented Jan 8, 2022

a unit test that checked \r\n would be good.

I started working on it in another pull request: Preserve line separators for input runtime tests dat. But there are some problems because git changes EOL depending on OS and other options. On the other hand, we don't have another input format that allows chars escaping: 1\r\n2\r\n3. I'm thinking about how to implement such a test in the best way.

@nixel2007 yes, CRAwareLexerATNSimulator is a much more preferable option since it does not require changing all input string.

@nixel2007
Copy link
Author

nixel2007 commented Jan 8, 2022

But there are some problems because git changes EOL depending on OS and other options.

we fixed EOLs on test files with .gitattributes file

@KvanTTT
Copy link
Member

KvanTTT commented Jan 8, 2022

Yes, I've tried to ignore EOL changing for the entire directory:

/runtime-testsuite/resources/org/antlr/v4/test/runtime/descriptors/*.txt -text

But maybe it makes sense to set up \r\n just to the certain file:

LineSeparator_CRLF.txt eol=crlf
LineSeparator_LF.txt eol=lf

@KvanTTT
Copy link
Member

KvanTTT commented Jan 9, 2022

BTW, it looks like ANTLR 3 also does not treat single \r as line separator because line numbers are wrong with such grammars (ANTLR 3 is used for parsing ANTLR 4 grammars).

@cxambs
Copy link
Contributor

cxambs commented Mar 20, 2023

Thanks for your opinion, @parrt

My 2 cents: instead of changing character stream (which could lead to position errors) user may override LexerATNSimulator: https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/java/com/github/_1c_syntax/bsl/parser/CRAwareLexerATNSimulator.java

and pass it to lexer as member:

https://github.com/1c-syntax/bsl-parser/blob/b32b5c83b7afe84831764f6e88a794bce613cda9/src/main/antlr/BSLLexer.g4#L29-L35

I am trying this approach in CSharp, but Consume is not a virtual member.
Suggestions?

@cxambs
Copy link
Contributor

cxambs commented Mar 20, 2023

And, of course, shadowing it won't work as well.
Probably methods from the generated Lexer could be all virtual.
Without that option for now, what are my options?

parrt pushed a commit that referenced this pull request Mar 21, 2023
Allow user to subclass and consume differently. Useful for CR handling as suggested in #2519

Signed-off-by: Alberto Simões <[email protected]>
mdehoog pushed a commit to mdehoog/antlr4 that referenced this pull request Apr 16, 2023
Allow user to subclass and consume differently. Useful for CR handling as suggested in antlr#2519

Signed-off-by: Alberto Simões <[email protected]>
vityaman pushed a commit to vityaman/antlr4 that referenced this pull request Apr 30, 2023
Allow user to subclass and consume differently. Useful for CR handling as suggested in antlr#2519

Signed-off-by: Alberto Simões <[email protected]>
vityaman pushed a commit to vityaman/antlr4 that referenced this pull request Apr 30, 2023
Allow user to subclass and consume differently. Useful for CR handling as suggested in antlr#2519

Signed-off-by: Alberto Simões <[email protected]>
Signed-off-by: Victor Smirnov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants