Parser does not preserve whitespaces when parsing nested code blocks. #177

bwplotka · 2020-12-27T11:14:49Z

Thank you for the amazing project! 🤗

I think I found a small bug, which is a bit annoying in our markdown formatting project. The particular problem is showcased in this draft PR.

What version of goldmark are you using? Checked v1.1.24 and latest 6c741ae251abd461bb7b5ce28e7df7a9306bd005
What version of Go are you using? go version go1.15 linux/amd64
What operating system and processor architecture are you using? go version go1.15 linux/amd64
What did you do?

Parsed nested code block (valid markdown):

* Some items with nested code with strict whitespaces.
  ```Makefile
  include .bingo/Variables.mk
  
  run:
  	@$(GOIMPORTS) <args>
  ```

(Note strict whitespace in above md, especially line <space><space>\t@$(GOIMPORTS) <args>)

What did you expect to see?

The n ast.Node passed to goldmark's renderer.Renderer.Render(...) method has the correct structure. However, the lines in ast.FencedCodeBlock have wrong whitespace (somehow the code block being fenced affects things).

Particularly: Lines in the node should have exactly the same bytes, so for example line 3 should be <space><space>\t@$(GOIMPORTS) <args>

See repro test below.

What did you see instead?

Lines 1 and 3 have some semi-random spaces instead of what was provided in the parsed markdown.

Particularly line 3 has <space><space>@$(GOIMPORTS) <args>. See repro test below.

Did you confirm your output is different from CommonMark online demo or other official renderer correspond with an extension?:
YES

Repro go test:

package markdown

import (
	"bytes"
	"fmt"
	"io"
	"testing"

	"github.com/yuin/goldmark"
	"github.com/yuin/goldmark/ast"
	"github.com/yuin/goldmark/renderer"
)

type testRenderer struct {
	t *testing.T
}

func (t testRenderer) AddOptions(...renderer.Option) { return }
func (t testRenderer) Render(_ io.Writer, source []byte, n ast.Node) error {
	fencedCodeBlock := n.FirstChild().FirstChild().FirstChild().NextSibling().(*ast.FencedCodeBlock)

	line := fencedCodeBlock.Lines().At(0)
	if val := line.Value(source); !bytes.Equal([]byte("include .bingo/Variables.mk\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val))
	}
	line = fencedCodeBlock.Lines().At(1)
	if val := line.Value(source); !bytes.Equal([]byte("\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val)) // BUG1: bug_test.go:28: not what we expected, got "  \n"
	}
	line = fencedCodeBlock.Lines().At(2)
	if val := line.Value(source); !bytes.Equal([]byte("run:\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val))
	}
	line = fencedCodeBlock.Lines().At(3)
	if val := line.Value(source); !bytes.Equal([]byte("\t@$(GOIMPORTS) <args>\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val)) // BUG 2: bug_test.go:36: not what we expected, got "  @$(GOIMPORTS) <args>\n"
	}
	return nil
}

func TestGoldmarkCodeBlockWhitespaces(t *testing.T) {
	var codeBlock = "```"
	mdContent := []byte(fmt.Sprintf(`* Some item with nested code with strict whitespaces.
  %sMakefile
  include .bingo/Variables.mk
  
  run:
  	@$(GOIMPORTS) <args>
  %s`, codeBlock, codeBlock))

	var buf bytes.Buffer
	if err := goldmark.New(goldmark.WithRenderer(&testRenderer{t: t})).Convert(mdContent, &buf); err != nil {
		t.Fatal(err)
	}
}

I am pretty sure it's somehow easy to fix (:


stale · 2021-01-26T17:37:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bwplotka · 2021-01-26T18:38:54Z

Still valid.


karelbilek · 2021-01-28T14:58:23Z

It will actually necessitate a huge refactor :(

Basically, it treats tabs as spaces: at one point the program counts the indentation, counting tabs as spaces, and at another point it puts that indentation back as spaces, incorrectly. If I get the flow right.

As far as I can tell from the code, when it reads the text in text/reader.go, it first detects the padding (via IndentPosition in util/util.go), and then it puts the padding "back" in Value in text/segment.go, where it adds Padding spaces on the left.

This breaks however in code blocks, where there is a difference between tabs and spaces.
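To make the padding mechanism concrete, here is a minimal, self-contained sketch. The `segment` type and its `value` method are hypothetical stand-ins written for illustration, not goldmark's actual code, and the `Start`/`Padding` numbers are assumed from the behavior reported in this issue: the tab is skipped and two columns are recorded as padding.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for a text segment: a byte range
// into the source plus a number of "virtual" padding columns.
type segment struct {
	Start, Stop int
	Padding     int
}

// value returns Padding spaces followed by the referenced source bytes.
// Whatever characters originally made up the indentation are lost here.
func (s segment) value(src []byte) []byte {
	out := make([]byte, 0, s.Padding+s.Stop-s.Start)
	for i := 0; i < s.Padding; i++ {
		out = append(out, ' ')
	}
	return append(out, src[s.Start:s.Stop]...)
}

func main() {
	src := []byte("  \t@$(GOIMPORTS) <args>\n")
	// Assumption: the segment starts after the tab (byte 3) and
	// remembers two columns of padding.
	seg := segment{Start: 3, Stop: len(src), Padding: 2}
	fmt.Printf("%q\n", seg.value(src)) // "  @$(GOIMPORTS) <args>\n"
}
```

The tab never reaches the output because only a column count survives, which matches the second bug in the repro test.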

maybe I will be able to convince IndentPosition to ignore the \t paddings? but then it will randomly break when someone uses tabs for indenting the items...

karelbilek · 2021-01-28T15:04:45Z

however I am a bit confused why that doesn't happen without the item list. IndentPosition never gets run in that case. 🤔

karelbilek · 2021-01-28T15:11:54Z

oh, the util... gets called just in listItemParser. OK

karelbilek · 2021-01-28T15:31:33Z

Yeah I got why 2 spaces instead of tab now

Goldmark treats tabs like 4 spaces, but normally in text (not in code block) <space><space><tab> is still treated "like four spaces", because the tab "finishes" the spaces

because normally if you have

<space><space><space><space>something
<space><space><tab>something

those are at same width.

so anyway when you have <space><space><tab> in the goimports stuff, the list item parser sees "hey let's treat this as 4 spaces, but 2 spaces are for me as the list, so the code parser gets the remaining two", and it gives the code parser "<space><space>" instead of tab.
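The "tab finishes the spaces" rule can be sketched as plain tab-stop arithmetic. The helper name `indentWidth` and the constant `tabStop` are made up for this example; this is not goldmark's implementation, only an illustration of why both lines above end up at the same column.

```go
package main

import "fmt"

// tabStop is the assumed tab width in columns.
const tabStop = 4

// indentWidth returns the column reached after the leading spaces and
// tabs of line. A tab advances to the next multiple of tabStop, so it
// can be worth anywhere from 1 to 4 columns depending on its position.
func indentWidth(line string) int {
	col := 0
	for _, r := range line {
		switch r {
		case ' ':
			col++
		case '\t':
			col += tabStop - col%tabStop
		default:
			return col
		}
	}
	return col
}

func main() {
	fmt.Println(indentWidth("    something")) // 4
	fmt.Println(indentWidth("  \tsomething")) // 4
}
```

In the second line the tab is worth only two columns, because two spaces already occupy the first half of the tab stop.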

Ugh.

No idea how to even approach this

karelbilek · 2021-01-28T15:44:34Z

I guess there can be a way for the code block parser to "cheat", and to look if the padding spaces are actually not spaces and they are tabs? We still have the buffer, we can look at the "padding positions" if they aren't actually tabs?

But still, that would mean creating a new method of Segment, something like func (t *Segment) CodeBlockValue(buffer []byte) []byte {, which can break someone that extends goldmark that implements its own Reader uh I am stupid no

karelbilek · 2021-01-28T15:59:47Z

And btw, what would happen if following happens:

for some reason, you have code block that begins at six spaces
then, you have line that is <tab><space><tab>foo

*.list
...*..sublist
......```Makefile
[tb].[t]text
......```

what even should be in the code block? .... I will look at commonmark, how it handles this tab/space mess

bwplotka · 2021-01-28T16:00:13Z

Thanks for debugging!

So: is there a way to hack markdownfmt to... hide this? 🤔

karelbilek · 2021-01-28T16:01:20Z

I don't even know what should the correct behavior be :D

bwplotka · 2021-01-28T16:04:08Z

Oh that's easy: markdownfmt has to be deterministic. It's kind of stupid if I reformat md with markdownfmt and it produces X, then I reformat again and it produces Y, and then again it's X 🤦🏽

karelbilek · 2021-01-28T16:05:33Z

> So: is there a way to hack markdownfmt to... hide this? 🤔

I think it will require change in goldmark, possibly not backwards compatible (requiring new version), I am not sure :D

karelbilek · 2021-01-28T16:07:17Z

Although I am not sure why it is not deterministic in markdownfmt, but I guess that's a different story xD

(sorry @yuin for all the spam)

Note that this is a breaking change and will require new goldmark major version.

I have tried to fix problem with leading tabs in fenced code blocks (and probably normal code blocks too).

Important note - tabs do not behave like "just 4 spaces". They "finish" 4 space columns. So tab can behave like anything between 1 space to 4 spaces, depending on position.

If you have MD like this (. represents space, [tb] , [t] or [] tabs)

*.some.text
..```
..foo
..[]foo
..```

you expect the tab to be kept in the code. This did not work properly in goldmark and I fixed that.

However, if you have a code like this

*.some.text
..```
..foo
.[t]foo
..```

what should happen? I decided that it should be two spaces, as the tab is not "completely" in the code block.

Similarly, what should happen in this case

*.some.text
..```
..foo
.[t][tb]foo
..```

I decided that it should be first three spaces and then tab. Not sure what even is the correct solution here...

The crux of the fix is - text segments don't have just padding, but also remember what chars is the padding and then print that, if they are called to do so in the code blocks. In other cases, the paddingChars are ignored.

This should fix yuin#177 .
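As a rough illustration of the first two cases described above, here is a small sketch that strips a list item's indentation by column while keeping any tab that lies entirely inside the code. The function `stripIndent` and the constant `tabStop` are hypothetical names invented for this example; it is not the code from the PR, and it uses plain tab-stop arithmetic rather than trying to reproduce the choices made there for the ambiguous cases.

```go
package main

import (
	"fmt"
	"strings"
)

// tabStop is the assumed tab width in columns.
const tabStop = 4

// stripIndent removes up to width columns of leading whitespace.
// A tab that starts at or after the removed columns is kept as a tab;
// a tab that is only partly consumed is replaced by spaces for the
// columns that remain.
func stripIndent(line string, width int) string {
	col, i := 0, 0
	for i < len(line) && col < width {
		switch line[i] {
		case ' ':
			col++
		case '\t':
			col += tabStop - col%tabStop
		default:
			// Less indentation than requested: nothing more to strip.
			return line[i:]
		}
		i++
	}
	if col > width {
		// The last tab overshot the indent; keep the leftover columns.
		return strings.Repeat(" ", col-width) + line[i:]
	}
	return line[i:]
}

func main() {
	fmt.Printf("%q\n", stripIndent("  \tfoo", 2)) // "\tfoo"
	fmt.Printf("%q\n", stripIndent(" \tfoo", 2))  // "  foo"
}
```

The first call keeps the tab because the two spaces alone cover the list indentation; the second yields two spaces because the tab straddles the boundary.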

yuin · 2021-01-29T07:42:57Z

I'm afraid to say, I'm up to my neck in work every day. So I can not have time for this project. I promise to see this issue in the future.

karelbilek · 2021-01-29T08:02:11Z

I fixed it in #187

However it is a breaking change because I needed to change Reader interface. So strictly speaking we should increase major version and also import path to /v2

edit: hm, seems I fixed only one of the two bugs.

edit2: opened a second PR for the second bug

yuin · 2021-02-07T10:49:49Z

@karelbilek @bwplotka
I've tried to fix this issue in 2ffadce in a way that is simpler than #187.

Could you confirm this issue is fixed?

karelbilek · 2021-02-07T10:53:25Z

I think this is still needed?

(see the testcases there)

#188

there were actually two independent issues. One is the code starting with tab which you fixed (thx!), other is the empty line which is fixed by #188 (which is simple)

yuin · 2021-02-07T11:33:03Z

I've tested test cases in #188 with current head(2ffadce). It passes all test cases.

karelbilek · 2021-02-07T11:36:36Z

OK then! thanks. Then we can close. (or maybe add those test cases too but as you wish). Thanks again.


yuin · 2021-02-07T11:37:56Z

@karelbilek, Thanks for your contribution!

bwplotka · 2021-02-07T15:12:29Z

Amazing!

karelbilek · 2021-02-08T08:08:18Z

@yuin that branch is still needed.

The tests were wrong, sorry. (It's hard to check as there is whitespace....)

Now I fixed the tests so they are actually testing the bug.

This was referenced Dec 27, 2020

Whitespace bug repro. bwplotka/markdownfmt#1

Draft

Wrong formatting of code blocks (wrong whitespaces) Kunde21/markdownfmt#20

Closed

stale bot added the stale label Jan 26, 2021

stale bot removed the stale label Jan 26, 2021

karelbilek mentioned this issue Jan 29, 2021

Fix leading tabs with codeblocks #187

Closed

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021

Fix empty line detection in markdown in list

428bd9e

Fix for independent issue in yuin#177 Not sure why is this needed, but it works :)

karelbilek mentioned this issue Jan 29, 2021

Fix empty line detection in markdown in list #188

Merged

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021

Fix empty line detection in markdown in list

a122c5d

Fix for independent issue in yuin#177 Not sure why is this needed, but it works :)

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021

Fix empty line detection in markdown in list

5827b8b

Fix for independent issue in yuin#177

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021

Fix empty line detection in markdown in list

965f08a

Fix for independent issue in yuin#177 The next parser should have the whitespace removed, even when it's blank

yuin closed this as completed in 2ffadce Feb 7, 2021

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Feb 8, 2021

Fix empty line detection in markdown in list

c53c1a4

Fix for independent issue in yuin#177 The next parser should have the whitespace removed, even when it's blank
