Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grammar clarifications regarding tabs and carriage returns #38

Open
andrewrk opened this issue Jul 8, 2021 · 20 comments
Open

grammar clarifications regarding tabs and carriage returns #38

andrewrk opened this issue Jul 8, 2021 · 20 comments

Comments

@andrewrk
Copy link
Member

andrewrk commented Jul 8, 2021

NL = New Line (0x0a)
CR = Carriage Return (0x0d)
TAB = Tab (0x09)

Inside Line Comments and Documentation Comments

  • Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.
  • CR directly preceding NL is unambiguously part of the newline sequence. It is accepted by the grammar and removed by zig fmt, leaving only NL.
  • CR anywhere else is rejected by the grammar.

Inside Multi-Line String Literals

  • zig fmt may not mangle multi line string literals, and therefore the control character TAB are rejected by the grammar inside multi-line string literals.
  • CR inside the multiline string literal is also rejected for the same reason
  • However CR directly before NL is interpreted as only a newline and not part of the multiline string. zig fmt will delete the CR.

For string literals that want to include CR, TAB, or any other control sequences, they will need to use regular string literals and the ++ operator, or @embedFile.

Whitespace

  • TAB used as whitespace is unambiguous. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.
  • CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.

astronaute-meme

@vrischmann
Copy link

vrischmann commented Jul 15, 2021

This

zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.

causes a problem when using git on windows, unless I'm missing something.

If you clone this gist (which contains no CR) on Windows with the default git configuration, it produces this:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> xxd .\main.zig
00000000: 666e 2063 7265 6174 6554 6573 7454 6162  fn createTestTab
00000010: 6c65 7328 2920 2176 6f69 6420 7b0d 0a20  les() !void {..
00000020: 2020 2063 6f6e 7374 2062 6172 203d 2026     const bar = &
00000030: 5b5f 5d5b 5d63 6f6e 7374 2075 387b 0d0a  [_][]const u8{..
00000040: 2020 2020 2020 2020 5c5c 666f 6f62 6172          \\foobar
00000050: 0d0a 2020 2020 7d3b 0d0a 2020 2020 5f20  ..    };..    _
00000060: 3d20 6261 723b 0d0a 7d                   = bar;..}

which does not compile:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> zig build-exe .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: '\n'
        \\foobar
                 ^

it is the intended behaviour but IMO it's surprising. Currently as far as I can see you need to either configure git with core.autocrlf=false or use a gitattributes files.

@thejoshwolfe
Copy link
Contributor

It still surprises me that Git's default configuration is to give you the wrong bytes on windows. Can you clarify which client you're using? is it provided by GitHub? by git-scm.com? by a cygwin package manager (probably not this one)?

As much as I'd like to say that Git is wrong (which i kinda did in the last paragraph), the conflict is arising from the concept of a "text file", which many people think is an array of lines rather than a single byte array. Sometimes discussions of character encoding get tangled up in the concept of a "text file" too.

But Zig is not interested in all the complexity of "text files", and specifies always UTF-8, always spaces over tabs, always LF line endings. There's some room for accepting and canonicalizing (in zig fmt) unambiguous alternatives for everyone's convenience, but CRLF line endings are officially the wrong way to end lines in Zig (along with NEL, LS, and PS). While it bothers me emotionally that some Git cient on Windows is wasting our time here talking about CRLF line endings, it seems important to get this right to avoid even more wasted time for everyone in the future.

With that preamble ramble out of the way: i think we should definitely canonicalize CRLF line endings in multiline string literals to LF line endings.

@vrischmann
Copy link

I'm using the installer from git-scm.

@Vexu
Copy link
Member

Vexu commented Jul 15, 2021

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

If mangling the files is an issue, the feature can be put behind a --convert-whitespace flag.

@nektro
Copy link

nektro commented Apr 28, 2022

could even make zig build call zig fmt on the whole directory before running, or otherwise tell them to run it on the whole folder since its likely not the one file thats mangled

@andrewrk
Copy link
Member Author

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

Yes, I'm thinking along the same lines. Make it a hard error, just like unused locals, and then have --autofix (as well as zig fmt) make the compilation "just work" while modifying the source files to comply.

@rockorager
Copy link

rockorager commented Aug 3, 2024

For string literals that want to include [...] any other control sequences, they will need to use regular string literals and the ++ operator, or @embedfile.

This is specific to multi-line string literals, right? I can still have const seq = "\x1b[31m";? Or potentially any other string literal that contains 0x00...0x1F, except for CR, NL, and Tab?

@castholm
Copy link

castholm commented Aug 5, 2024

This is specific to multi-line string literals, right? I can still have const seq = "\x1b[31m";? Or potentially any other string literal that contains 0x00...0x1F, except for CR, NL, and Tab?

Yes, this clarification is specifically for multi-line string literals. Regular string literals can of course still use escape sequences to encode characters/byte sequences like control characters or invalid UTF-8 sequences that are illegal in Zig source code. But multi-line string literals don't support escape sequences and thus have no way to encode HT, LF, CR or other illegal characters, so for those rare cases you are need to concatenate the multi-line string literal with a regular string literal containing the necessary escape sequences, or put the text data in a separate file and import it using @embedfile.

@PanSashko
Copy link

The restriction to use tabs in comments and multi line strings is frustrating.

What is "rendering" and why does it matter?

It is a really annoying change for any project where there are comments or multi line strings with tabs. But what are real benefits for others who do not care?

@thejoshwolfe
Copy link
Contributor

@PanSashko is your project where you want tabs in comments and strings something you can link to? It might help better understand the pros and cons to see a real example.

@PanSashko
Copy link

PanSashko commented Sep 8, 2024

@thejoshwolfe unfortunately I cannot publish it yet. But I can describe 2 examples.

Imagine a pet language that is indent-based like Python. Let's say it supports tabs or spaces for indentation. Then, e.g., parser tests should cover both cases. Some small tests are using multi line strings for embedding source code under test.
An obvious workaround for that case is @embedFile, but why?
With the restriction not all UTF-8 strings can be represented in Zig's multi line strings.

Other example. Imagine a Zig project and 2 developers that work on it. One of them prefers 4-space indents, other one 2-space indents. The easiest option is to use tabs, and that's allowed by Zig, according to this issue. But the comments restriction disables an ability to temporarily comment out a piece of code, as it does not compile.
So the convenient code commenting is broken for any Zig project that uses tabs. As I understand, that is not justified, because regular comments are ignored by compiler anyway.

I think before this change Zig coders from both sides of the "spaces vs tabs war" were happy. Also everyone who did not care was happy. This why I really wonder what is the advantage of this change?

@castholm
Copy link

castholm commented Sep 8, 2024

One problem with rejecting tabs in comments is that you can't easily comment out code. If you are forced to write Zig with tabs for whatever reason, you will need to find a way to make your editor replace the commented out tabs with spaces (and then bring them back when uncommenting). Just selecting lines and prepending them with // won't work if they contain indentation.

@thejoshwolfe
Copy link
Contributor

i certainly agree that it's confusing to get an error when commenting out working code. the original design of zig was to consistently disallow tabs (and all this is true about carriage returns as well), but in order for zig fmt to fix tabs, it had to support parsing them without error. once the default implementation of zig switched to the self hosted one, the main compiler got this tolerance as well.

however, there is still an intention to disallow hard tabs in zig source, whether in a comment, a string literal, or whitespace around tokens. see andrewrk's comment above. this would avoid the confusion when commenting out code, because the error would be much sooner.

What is "rendering" and why does it matter?

the relevant aspect of rendering in this context is the vertical alignment of grapheme clusters across multiple lines when using a monospace font. different editor implementations and configurations are also relevant. i have a hard time explaining why it matters in general, but certainly there are plenty of specific cases where the text is intended to align in a particular way independent of editor configuration.

Imagine a pet language that is indent-based like Python. Let's say it supports tabs or spaces for indentation. Then, e.g., parser tests should cover both cases. Some small tests are using multi line strings for embedding source code under test.

I don't have an easy way to search the zig codebase at the moment, but I'm pretty sure zig's own tests do this. the workaround is to use ++ "\t" ++ style string concatenation to get the invalid characters into the test case, or at least that's an option i would reach for. pretty awkward, but it seems appropriate for how unusual the use case is.

Imagine a Zig project and 2 developers that work on it. One of them prefers 4-space indents, other one 2-space indents. The easiest option is to use tabs,

The intent is that you use zig fmt, and then your preferences yield to the community standard. allowing different preferences for indentation is an explicit antifeature of zig fmt.

I think before this change Zig coders from both sides of the "spaces vs tabs war" were happy

ideally, we get to a point where is no spaces vs tabs war among zig authors, because there will be no tabs in zig. that's what i hope for at least.

If you are forced to write Zig with tabs for whatever reason

@castholm are you in that situation? how did that happen?

@castholm
Copy link

castholm commented Sep 8, 2024

are you in that situation? how did that happen?

I'm not (but I recall from earlier discussion from back when tabs were hard banned that some people suggested they need them for accessibility) and fwiw I'm strictly on the spaces side. I was mainly just trying to make a point that allowing them in source but not comments is the worst of both worlds.
If the end goal is to begrudgingly support tabs but as a second-class citizen (i.e. still have the canonical zig fmt format favor spaces), they should also be supported in regular comments.
If the end goal is to make tabs a compile error, they shouldn't be supported anywhere. But as you mention, since zig fmt is currently the mechanism that normalizes source encoding, for zig fmt to even be able to parse and fix up the code they would need to be allowed.

@PanSashko
Copy link

pretty awkward, but it seems appropriate for how unusual the use case is.

I do not see any point in awkwardness, when it is so easy to fix. But in this case it was working, but now it is intentionally broken.

On other hand, "unambiguous source formatting" looks to me like a fictional benefit. I would be happy to hear at least one practical example when it is required or useful, because I cannot find any.

Recommending zig fmt and "community standards" for everyone is OK. But in this case it causes annoyance for some with no benefits for others.

ianprime0509 added a commit to ianprime0509/zig that referenced this issue Sep 10, 2024
Closes ziglang#21358
Closes ziglang#21360

This commit modifies the `multiline_string_literal_line`, `doc_comment`,
and `container_doc_comment` tokens to no longer include the line ending
as part of the token. This makes it easier to handle line endings (which
may be LF, CRLF, or in edge cases possibly nonexistent) consistently.

In the two issues linked above, Autodoc was already assuming this for
doc comments, and yielding incorrect results when handling files with
CRLF line endings (both in Markdown parsing and source rendering).

Applying the same simplification for multiline string literals also
brings `zig fmt` into conformance with
ziglang/zig-spec#38 regarding formatting of
multiline strings with CRLF line endings: the spec says that `zig fmt`
should remove the CR from such line endings, but this was not previously
the case.
Vexu pushed a commit to ziglang/zig that referenced this issue Sep 10, 2024
Closes #21358
Closes #21360

This commit modifies the `multiline_string_literal_line`, `doc_comment`,
and `container_doc_comment` tokens to no longer include the line ending
as part of the token. This makes it easier to handle line endings (which
may be LF, CRLF, or in edge cases possibly nonexistent) consistently.

In the two issues linked above, Autodoc was already assuming this for
doc comments, and yielding incorrect results when handling files with
CRLF line endings (both in Markdown parsing and source rendering).

Applying the same simplification for multiline string literals also
brings `zig fmt` into conformance with
ziglang/zig-spec#38 regarding formatting of
multiline strings with CRLF line endings: the spec says that `zig fmt`
should remove the CR from such line endings, but this was not previously
the case.
DivergentClouds pushed a commit to DivergentClouds/zig that referenced this issue Sep 24, 2024
Closes ziglang#21358
Closes ziglang#21360

This commit modifies the `multiline_string_literal_line`, `doc_comment`,
and `container_doc_comment` tokens to no longer include the line ending
as part of the token. This makes it easier to handle line endings (which
may be LF, CRLF, or in edge cases possibly nonexistent) consistently.

In the two issues linked above, Autodoc was already assuming this for
doc comments, and yielding incorrect results when handling files with
CRLF line endings (both in Markdown parsing and source rendering).

Applying the same simplification for multiline string literals also
brings `zig fmt` into conformance with
ziglang/zig-spec#38 regarding formatting of
multiline strings with CRLF line endings: the spec says that `zig fmt`
should remove the CR from such line endings, but this was not previously
the case.
@thejoshwolfe
Copy link
Contributor

thejoshwolfe commented Sep 28, 2024

On other hand, "unambiguous source formatting" looks to me like a fictional benefit. I would be happy to hear at least one practical example when it is required or useful, because I cannot find any.

Back in Java 1.5, before openjdk, the source for java.lang.String and other classes was formatted using a mix of tabs and spaces, where the intent was that a tab stood in for 8 spaces (to save disk space or something probably). I can't find the source code for jdk pre openjdk 7 (without creating an oracle account?), but i can recreate it as an example. Starting with this modern excerpt: https://github.com/openjdk/jdk/blob/73ebb848fdb66861e912ea747c039ddd1f7a5f48/src/java.base/share/classes/java/lang/String.java#L552-L581

then replacing 8 spaces with tabs to recreate the style of jdk 1.5, it looks like this in github web/mobile app/whatever you're using, which for me renders the intended way:

    @SuppressWarnings("removal")
    private String(Charset charset, byte[] bytes, int offset, int length) {
	if (length == 0) {
	    this.value = "".value;
	    this.coder = "".coder;
	} else if (charset == UTF_8.INSTANCE) {
	    if (COMPACT_STRINGS) {
		int dp = StringCoding.countPositives(bytes, offset, length);
		if (dp == length) {
		    this.value = Arrays.copyOfRange(bytes, offset, offset + length);
		    this.coder = LATIN1;
		    return;
		}
		// Decode with a stable copy, to be the result if the decoded length is the same
		byte[] latin1 = Arrays.copyOfRange(bytes, offset, offset + length);
		int sp = dp;            // first dp bytes are already in the copy
		while (sp < length) {
		    int b1 = latin1[sp++];
		    if (b1 >= 0) {
			latin1[dp++] = (byte)b1;
			continue;
		    }
		    if ((b1 & 0xfe) == 0xc2 && sp < length) { // b1 either 0xc2 or 0xc3
			int b2 = latin1[sp];
			if (b2 < -64) { // continuation bytes are always negative values in the range -128 to -65
			    latin1[dp++] = (byte)decode2(b1, b2);
			    sp++;
			    continue;
			}
		    }

And if you set tabs to appear as 2 spaces, a configuration i've seen commonly in web development contexts, it looks like this:

image

I would call that simply incorrect behavior. For the indentation to appear to go backwards when it should go forwards is a failure between the source code author and the editor configuration. The original authors (according to my story, which i've simulated in this discussion) were vigilant about formatting the indentation consistently; it wasn't sloppiness; it was a strategy that was prone to ambiguous rendering.

In countless modern scenarios, mostly in closed-source code authored by my colleagues at work, I've encountered a few stray tabs here and there where someone didn't notice that their editor was defaulting to inserting tabs where there were already spaces or visa versa. This typically happens in code bases without auto formatters, linters, or other modern comforts, such as in shell scripts for devops use cases. It looked correct with their editor configuration, which is often 4-space tab width, but then in the github PR review web interface, tabs are typically rendered 8-spaces wide. This causes incorrect indentation rendering, which is a failure between the author and the renderer configuration.

These types of failure are the main casualties in the tabs vs spaces war. There are many solutions, of course, and an auto formatter is involved in many of them. Zig has an autoformatter, and it's generally recommended to use it in order to avoid ambiguous and incorrect appearing indentation like these examples.

@PanSashko
Copy link

PanSashko commented Sep 30, 2024

@thejoshwolfe you described a "problem" that Zig still has too. It allows mixed indentation, so Zig sources with mixed indentation will look like your screenshot for some tab widths.

On other hand, zig fmt solves all of that for anyone who cares.
I don't see the point of breaking developer experience for those who use tabs and do not use zig fmt.

@IntegratedQuantum
Copy link

I just came across this because I wanted to comment out some code to take care of after testing the other changes that I made.
And I have to say, getting an error for commented code is a pretty bad user experience.

Sure there is a workaround for this: Only comment the code without the tabs in front of it.
This is quite easy to do with a multi-cursor capable editor, but it is non-obvious.

Do you really want to return to the old (stage1) days, where we (tab users) had to write an extra pre-processing step in the build.zig just to be able to conveniently use tabs in the source code?

richerfu pushed a commit to richerfu/zig that referenced this issue Oct 28, 2024
Closes ziglang#21358
Closes ziglang#21360

This commit modifies the `multiline_string_literal_line`, `doc_comment`,
and `container_doc_comment` tokens to no longer include the line ending
as part of the token. This makes it easier to handle line endings (which
may be LF, CRLF, or in edge cases possibly nonexistent) consistently.

In the two issues linked above, Autodoc was already assuming this for
doc comments, and yielding incorrect results when handling files with
CRLF line endings (both in Markdown parsing and source rendering).

Applying the same simplification for multiline string literals also
brings `zig fmt` into conformance with
ziglang/zig-spec#38 regarding formatting of
multiline strings with CRLF line endings: the spec says that `zig fmt`
should remove the CR from such line endings, but this was not previously
the case.
@jibal
Copy link

jibal commented Nov 9, 2024

"CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace."

But it's not unambiguously rendered ... there's no telling how it will be displayed, and on some displays it will hide code. If you're concerned about rendering, then you should ban CRs that don't immediately precede NL ... unlike tabs, there's no reason at all for anyone to want a stray CR. The reference manual gets it right:

Each LF may be immediately preceded by a single CR (byte value 0x0d, code point U+000d, '\r') to form a Windows style line ending, but this is discouraged. Note that in multiline strings, CRLF sequences will be encoded as LF when compiled into a zig program. A CR in any other context is not allowed.

BTW, the reference manual has the wrong (example) regex for line ends (how has no one noticed? Or is feedback about such things broken? I like that the D documentation has a link on every page to file an error report): /\r\n?|[\n\u0085\u2028\u2029]/

@jibal
Copy link

jibal commented Nov 9, 2024

Inside Line Comments [...]
Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.

But who cares? I suggest that inside of non-doc comments anything at all should be allowed, including ctrl chars and chars with the high bit set (so arbitrary unicode). Since non-doc comments are completely elided, it's none of the compiler's (or language designers') damn business what is in them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants