Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser does not preserve whitespaces when parsing nested code blocks. #177

Closed
bwplotka opened this issue Dec 27, 2020 · 22 comments
Closed

Comments

@bwplotka
Copy link

Thank you for the amazing project! 🤗

I think I found a small bug, which is a bit annoying in our markdown formatting project. Particular problem is showcased in this draft PR

  1. What version of goldmark are you using? Checked v1.1.24 and latest 6c741ae251abd461bb7b5ce28e7df7a9306bd005
  2. What version of Go are you using? go version go1.15 linux/amd64
  3. What operating system and processor architecture are you using? go version go1.15 linux/amd64
  4. What did you do?

Parsed nested code block (valid markdown):

* Some items with nested code with strict whitespaces.
  ```Makefile
  include .bingo/Variables.mk
  
  run:
  	@$(GOIMPORTS) <args>
  ```

(Note strict whitespace in above md, especially line <space><space>\t@$(GOIMPORTS) <args>)

  1. What did you expect to see?

goldmark renderer.Renderer.Render(...) method's n ast.Node has correct structure. However lines in ast.FencedCodeBlock has wrong whitespace (somehow codeblock being fenced affects things).

Particularly: Lines in the node should have exactly the same bytes, so for example line 3 should be <space><space>\t@$(GOIMPORTS) <args>

See repro test below.

  1. What did you see instead?

Line 1 and 3 has some semi-random spaces instead what provided in parsed markdown.

Particularly line 3 has <space><space>@$(GOIMPORTS) <args>. See repro test below.

  1. Did you confirm your output is different from CommonMark online demo or other official renderer correspond with an extension?:
    YES
    image

Repro go test:

package markdown

import (
	"bytes"
	"fmt"
	"io"
	"testing"

	"github.com/yuin/goldmark"
	"github.com/yuin/goldmark/ast"
	"github.com/yuin/goldmark/renderer"
)

type testRenderer struct {
	t *testing.T
}

func (t testRenderer) AddOptions(...renderer.Option) { return }
func (t testRenderer) Render(_ io.Writer, source []byte, n ast.Node) error {
	fencedCodeBlock := n.FirstChild().FirstChild().FirstChild().NextSibling().(*ast.FencedCodeBlock)

	line := fencedCodeBlock.Lines().At(0)
	if val := line.Value(source); !bytes.Equal([]byte("include .bingo/Variables.mk\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val))
	}
	line = fencedCodeBlock.Lines().At(1)
	if val := line.Value(source); !bytes.Equal([]byte("\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val)) // BUG1: bug_test.go:28: not what we expected, got "  \n"
	}
	line = fencedCodeBlock.Lines().At(2)
	if val := line.Value(source); !bytes.Equal([]byte("run:\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val))
	}
	line = fencedCodeBlock.Lines().At(3)
	if val := line.Value(source); !bytes.Equal([]byte("\t@$(GOIMPORTS) <args>\n"), val) {
		t.t.Errorf("not what we expected, got %q", string(val)) // BUG 2: bug_test.go:36: not what we expected, got "  @$(GOIMPORTS) <args>\n"	}
	}
	return nil
}

func TestGoldmarkCodeBlockWhitespaces(t *testing.T) {
	var codeBlock = "```"
	mdContent := []byte(fmt.Sprintf(`* Some item with nested code with strict whitespaces.
  %sMakefile
  include .bingo/Variables.mk
  
  run:
  	@$(GOIMPORTS) <args>
  %s`, codeBlock, codeBlock))

	var buf bytes.Buffer
	if err := goldmark.New(goldmark.WithRenderer(&testRenderer{t: t})).Convert(mdContent, &buf); err != nil {
		t.Fatal(err)
	}
}

I am pretty sure it's somehow easy to fix (:

@stale
Copy link

stale bot commented Jan 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 26, 2021
@bwplotka
Copy link
Author

bwplotka commented Jan 26, 2021 via email

@stale stale bot removed the stale label Jan 26, 2021
@karelbilek
Copy link
Contributor

It will actually necessitate huge refactor :(

Basically, it treats tabs as spaces, at one point in program it counts the number of spaces, takes tabs as spaces, and then on other point it puts the spaces back, incorrectly. If I get the flow right.

The way I look at the code, when it reads the text in text/reader.go, it first detects the padding (by util/util.go IndentPosition), and then it puts the padding "back" in text/segment.go Value, where it adds Padding number of spaces to left.

This breaks however in code blocks, where there is a difference between tabs and spaces.

maybe I will be able to convince IndentPosition to ignore the \t paddings? but then it will randomly break when someone uses tabs for indenting the items...

@karelbilek
Copy link
Contributor

however I am a bit confused why that doesn't happen without the item list. IndentPosition never gets run in that case. 🤔

@karelbilek
Copy link
Contributor

oh, the util... gets called just in listItemParser. OK

@karelbilek
Copy link
Contributor

karelbilek commented Jan 28, 2021

Yeah I got why 2 spaces instead of tab now

Goldmark treats tabs like 4 spaces, but normally in text (not in code block) <space><space><tab> is still treated "like four spaces", because the tab "finishes" the spaces

because normally if you have

<space><space><space><space>something
<space><space><tab>something

those are at same width.

so anyway when you have <space><space><tab> in the goimports stuff, the list item parser sees "hey let's treat this as 4 spaces, but 2 spaces are for me as the list, so the code parser gets the remaining two", and it gives the code parser "<space><space>" instead of tab.

Ugh.

No idea how to even approach this

@karelbilek
Copy link
Contributor

karelbilek commented Jan 28, 2021

I guess there can be a way for the code block parser to "cheat", and to look if the padding spaces are actually not spaces and they are tabs? We still have the buffer, we can look at the "padding positions" if they aren't actually tabs?

But still, that would mean creating a new method of Segment, something like func (t *Segment) CodeBlockValue(buffer []byte) []byte {, which can break someone that extends goldmark that implements its own Reader uh I am stupid no

@karelbilek
Copy link
Contributor

karelbilek commented Jan 28, 2021

And btw, what would happen if following happens:

  • for some reason, you have code block that begins at six spaces
  • then, you have line that is <tab><space><tab>foo
*.list
...*..sublist
......```Makefile
[tb].[t]text
......```

what even should be in the code block? .... I will look at commonmark, how it handles this tab/space mess

@bwplotka
Copy link
Author

Thanks for debugging!

So: There is a way to hack markdownfmt to.. hide this? 🤔

@karelbilek
Copy link
Contributor

I don't even know what should the correct behavior be :D

@bwplotka
Copy link
Author

Oh that's easy: markdownfmt has to deterministic. It's kind of stupid If I reformat md with markdownfmt it will produce X and then if reformat again it produces Y and then again, it's X 🤦🏽

@karelbilek
Copy link
Contributor

So: There is a way to hack markdownfmt to.. hide this? 🤔

I think it will require change in goldmark, possibly not backwards compatible (requiring new version), I am not sure :D

@karelbilek
Copy link
Contributor

karelbilek commented Jan 28, 2021

Although I am not sure why it is not deterministic in markdownfmt, but I guess that's a different story xD

(sorry @yuin for all the spam)

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Note that this is a breaking change and will require new goldmark major version.

I have tried to fix problem with leading tabs in fenced code blocks (and probably normal code blocks too).

Important note - tabs do not behave like "just 4 spaces". They "finish" 4 space columns. So tab can behave like anything between 1 space to 4 spaces, depending on position.

If you have MD like this (. represents space, [tb] , [t] or [] tabs)

```
*.some.text
..```
..foo
..[]foo
..```
```

you expect the tab to be kept in the code. This did not work properly in goldmark and I fixed that.

However, if you have a code like this

```
*.some.text
..```
..foo
.[t]foo
..```
```

what should happen? I decided that it should be two spaces, as the tab is not "completely" in the code block. Similarly, what should happen in this case

```
*.some.text
..```
..foo
.[t][tb]foo
..```
```

I decided that it should be first three spaces and then tab. Not sure what even is the correct solution here...

The crux of the fix is - text segments don't have just padding, but also remember what chars is the padding and then print that, if they are called to do so in the code blocks. In other cases, the paddingChars are ignored.

This should fix yuin#177 .
karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Note that this is a breaking change and will require new goldmark major version.

I have tried to fix problem with leading tabs in fenced code blocks (and probably normal code blocks too).

Important note - tabs do not behave like "just 4 spaces". They "finish" 4 space columns. So tab can behave like anything between 1 space to 4 spaces, depending on position.

If you have MD like this (. represents space, [tb] , [t] or [] tabs)

```
*.some.text
..```
..foo
..[]foo
..```
```

you expect the tab to be kept in the code. This did not work properly in goldmark and I fixed that.

However, if you have a code like this

```
*.some.text
..```
..foo
.[t]foo
..```
```

what should happen? I decided that it should be two spaces, as the tab is not "completely" in the code block. Similarly, what should happen in this case

```
*.some.text
..```
..foo
.[t][tb]foo
..```
```

I decided that it should be first three spaces and then tab. Not sure what even is the correct solution here...

The crux of the fix is - text segments don't have just padding, but also remember what chars is the padding and then print that, if they are called to do so in the code blocks. In other cases, the paddingChars are ignored.

This should fix yuin#177 .
@yuin
Copy link
Owner

yuin commented Jan 29, 2021

I'm afraid to say, I'm up to my neck in work every day. So I can not have time for this project. I promise to see this issue in the future.

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Note that this is a breaking change and will require new goldmark major version.

I have tried to fix problem with leading tabs in fenced code blocks (and probably normal code blocks too).

Important note - tabs do not behave like "just 4 spaces". They "finish" 4 space columns. So tab can behave like anything between 1 space to 4 spaces, depending on position.

If you have MD like this (. represents space, [tb] , [t] or [] tabs)

```
*.some.text
..```
..foo
..[]foo
..```
```

you expect the tab to be kept in the code. This did not work properly in goldmark and I fixed that.

However, if you have a code like this

```
*.some.text
..```
..foo
.[t]foo
..```
```

what should happen? I decided that it should be two spaces, as the tab is not "completely" in the code block. Similarly, what should happen in this case

```
*.some.text
..```
..foo
.[t][tb]foo
..```
```

I decided that it should be first three spaces and then tab. Not sure what even is the correct solution here...

The crux of the fix is - text segments don't have just padding, but also remember what chars is the padding and then print that, if they are called to do so in the code blocks. In other cases, the paddingChars are ignored.

This should fix yuin#177 .
karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Note that this is a breaking change and will require new goldmark major version.

I have tried to fix problem with leading tabs in fenced code blocks (and probably normal code blocks too).

Important note - tabs do not behave like "just 4 spaces". They "finish" 4 space columns. So tab can behave like anything between 1 space to 4 spaces, depending on position.

If you have MD like this (. represents space, [tb] , [t] or [] tabs)

```
*.some.text
..```
..foo
..[]foo
..```
```

you expect the tab to be kept in the code. This did not work properly in goldmark and I fixed that.

However, if you have a code like this

```
*.some.text
..```
..foo
.[t]foo
..```
```

what should happen? I decided that it should be two spaces, as the tab is not "completely" in the code block. Similarly, what should happen in this case

```
*.some.text
..```
..foo
.[t][tb]foo
..```
```

I decided that it should be first three spaces and then tab. Not sure what even is the correct solution here...

The crux of the fix is - text segments don't have just padding, but also remember what chars is the padding and then print that, if they are called to do so in the code blocks. In other cases, the paddingChars are ignored.

This should fix yuin#177 .
@karelbilek
Copy link
Contributor

karelbilek commented Jan 29, 2021

I fixed it in #187

However it is a breaking change because I needed to change Reader interface. So strictly speaking we should increase major version and also import path to /v2

edit: hm, seems I fixed only one of the two bugs.

edit2: opened a second PR for the second bug

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Fix for independent issue in yuin#177

Not sure why is this needed, but it works :)
karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Fix for independent issue in yuin#177

Not sure why is this needed, but it works :)
karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Jan 29, 2021
Fix for independent issue in yuin#177

The next parser should have the whitespace removed, even when it's blank
@yuin yuin closed this as completed in 2ffadce Feb 7, 2021
@yuin
Copy link
Owner

yuin commented Feb 7, 2021

@karelbilek @bwplotka
I've tried to fix this issue in 2ffadce with the way simpler than #187.

Could you confirm this issue is fixed?

@karelbilek
Copy link
Contributor

I think this is still needed?

(see the testcases there)

#188

there were actually two independent issues. One is the code starting with tab which you fixed (thx!), other is the empty line which is fixed by #188 (which is simple)

@yuin
Copy link
Owner

yuin commented Feb 7, 2021

I've tested test cases in #188 with current head(2ffadce). It passes all test cases.

@karelbilek
Copy link
Contributor

karelbilek commented Feb 7, 2021 via email

@yuin
Copy link
Owner

yuin commented Feb 7, 2021

@karelbilek, Thanks for your contribution!

@bwplotka
Copy link
Author

bwplotka commented Feb 7, 2021

Amazing!

karelbilek pushed a commit to karelbilek/goldmark that referenced this issue Feb 8, 2021
Fix for independent issue in yuin#177

The next parser should have the whitespace removed, even when it's blank
@karelbilek
Copy link
Contributor

@yuin that branch is still needed.

The tests were wrong, sorry. (It's hard to check as there is whitespace....)

Now I fixed the tests so they are actually testing the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants