encoding/json: add line number to SyntaxError #43513
Could you expand a bit on why the line number would help? Naively, an offset is enough to work out the line number if you were parsing a chunk of bytes, a buffer, or a file. Just count the number of newlines up to the offset. If you were parsing a stream of data, then I'm not convinced that a line number would actually help anyone. A line number is usually helpful for a human to go look at the input in the right place, but that is not possible if the stream isn't saved somewhere. Keeping track of extra information like line numbers also has an inherent cost, so I want to make sure the arguments in favor are solid.
I agree: it should be possible to count lines when you have the input at hand, and without the input it does not make much sense. The proposal comes from a use case where the parsing module does not have the input at hand but bubbles the error up: kubernetes/kubernetes#97651
encoding/json's syntax errors aren't very end-user-friendly, so it's entirely expected that a user-facing piece of software would wrap or decorate them to better show the user what's going on. Calculating the line number when there is a syntax error should take just a couple of lines of code, so I agree it's not worth adding straight to SyntaxError.
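As a concrete version of that "couple of lines of code", here is a minimal sketch of such a wrapper; the withLineInfo helper and its message format are invented here, and it assumes the caller still has the full input:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// withLineInfo rewraps a *json.SyntaxError with line:column details
// computed from the full input. The helper name and message format are
// invented for this sketch.
func withLineInfo(data []byte, err error) error {
	var se *json.SyntaxError
	if !errors.As(err, &se) {
		return err
	}
	line, col := 1, 1
	for _, b := range data[:se.Offset] {
		if b == '\n' {
			line++
			col = 1
		} else {
			col++
		}
	}
	return fmt.Errorf("line %d, column %d: %w", line, col, se)
}

func main() {
	data := []byte("{\n\t\"a\": 1,\n\t\"b\": ?\n}")
	var v map[string]interface{}
	if err := json.Unmarshal(data, &v); err != nil {
		fmt.Println(withLineInfo(data, err))
	}
}
```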
It would be nice to have an io.Reader layer that counts lines for this.
@networkimprov As far as I can tell, a line counter would not help in this case, due to the buffering here: go/src/encoding/json/stream.go, lines 146 to 169 at 3e1e13c. That is, the parsing error does not necessarily occur right after the last newline was read.
Indeed, most parsers/decoders read in big chunks, because doing one read for every single byte consumed would be incredibly expensive. So an io.Reader layer like @networkimprov suggests won't work.
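To make the failure mode concrete, here is a minimal runnable sketch (the naiveLineCounter type and test input are made up for illustration) showing that, because the Decoder buffers its reads, a naive line-counting io.Reader has typically counted newlines well past the error by the time Decode returns:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// naiveLineCounter counts every newline it passes through. Since the
// Decoder buffers reads, this count reflects how far the Decoder has
// *read*, not how far it has *parsed*.
type naiveLineCounter struct {
	r     io.Reader
	lines int
}

func (c *naiveLineCounter) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.lines += strings.Count(string(p[:n]), "\n")
	return n, err
}

func main() {
	// The syntax error is on line 2, but the whole input fits in one
	// buffered read, so the counter ends up at the last line.
	input := "{\n\"a\": ?,\n\"b\": 2,\n\"c\": 3\n}\n"
	c := &naiveLineCounter{r: strings.NewReader(input)}
	var v map[string]interface{}
	err := json.NewDecoder(c).Decode(&v)
	fmt.Printf("err: %v\nnaive line guess: %d (actual error line: 2)\n",
		err, c.lines+1)
}
```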
I don't think this needs the attention of the proposal review committee. They already have a backlog of hundreds of proposals as it is. We can make this decision amongst the encoding/json maintainers; CC @dsnet @rsc @bradfitz. My opinion is still that we should reject this idea. If anyone has thoughts on a helper API to "prettify" JSON errors for end users, perhaps we could consider that, but one could always write that in a few lines of simple Go code.
Well, two possible APIs are: …
I'd use this, as my apps import long (hundreds of lines) test sequences from JSON files.
Line counting without the input at hand may be implemented like:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"sort"
)

type (
	// lineCounter wraps the reader feeding a json.Decoder and records
	// the byte offset of every newline the Decoder has not yet parsed past.
	lineCounter struct {
		r       io.Reader
		d       *json.Decoder
		bytes   int64 // total bytes read so far
		lines   int   // total newlines read so far
		offsets []offset
	}
	offset struct {
		index  int   // how many newlines precede this one
		offset int64 // byte offset of this newline
	}
)

func (c *lineCounter) Read(data []byte) (n int, err error) {
	// Drop newline offsets the Decoder has already parsed past.
	i := c.search(c.d.InputOffset())
	c.offsets = c.offsets[i:]
	n, err = c.r.Read(data)
	for i, b := range data[:n] {
		if b == '\n' {
			c.offsets = append(c.offsets, offset{c.lines, c.bytes + int64(i)})
			c.lines++
		}
	}
	c.bytes += int64(n)
	return
}

// newlinesBefore reports how many newlines precede the given byte offset.
func (c *lineCounter) newlinesBefore(offset int64) int {
	if len(c.offsets) == 0 {
		return c.lines
	}
	i := c.search(offset)
	if i == len(c.offsets) {
		return c.lines
	}
	return c.offsets[i].index
}

func (c *lineCounter) search(offset int64) int {
	return sort.Search(len(c.offsets), func(i int) bool { return c.offsets[i].offset >= offset })
}

func main() {
	c := &lineCounter{r: os.Stdin}
	dec := json.NewDecoder(c)
	c.d = dec
	for {
		var m map[string]interface{}
		if err := dec.Decode(&m); err == io.EOF {
			break
		} else if se, ok := err.(*json.SyntaxError); ok {
			fmt.Printf("%#v\n", err)
			fmt.Printf("Line: %d\n", c.newlinesBefore(se.Offset)+1)
			break
		} else if err != nil {
			// Bail out on any other persistent error to avoid looping forever.
			fmt.Println(err)
			break
		}
	}
}
```

Not trivial, but it does not require a Decoder change.
I updated my previous comment to suggest encoding/json …
Line counting (given the entire input buffer as a string) is as simple as:

```go
line := 1 + strings.Count(jsonData[:offset], "\n")
column := 1 + offset - (strings.LastIndex(jsonData[:offset], "\n") + len("\n"))
```

Given its simplicity, it calls into question whether we must provide this feature. The main advantage of having … Stepping back, I personally have no problem with the concept of adding line:column information, and I even explored an implementation last month where that's provided [1]. However, as @mvdan mentioned earlier, the issue is the performance cost it imparts onto the …
I don't think another API endpoint just for more verbose errors justifies its addition. Also, it's not clear to me that this fully avoids any performance slowdown: I've been regularly surprised by how simply checking a flag for whether some branch should be taken itself incurs a slowdown. Given the strong desire to avoid slowing down decoding in the common case, I propose two possible approaches: …
[1] This is in reference to an entirely new …
When reading a file or network message, you almost always use json.Decoder, not json.Unmarshal. It would be crazy in many cases to save the whole stream to a buffer just in case of an error. I don't think a performance cost is unreasonable when the feature is triggered by an API: when you really can't afford the cost, you simply don't incur it. Note that I suggested an extensible API which can accommodate other decoding options. Maybe accept_comments? :-)
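No concrete options API was agreed on in this thread; purely as an illustration of "an extensible API which can accommodate other decoding options", a functional-options sketch might look like this (every name here is hypothetical, and nothing like it exists in encoding/json):

```go
package main

import "fmt"

// Hypothetical sketch only: neither DecodeOption nor these constructors
// exist in encoding/json; this just shows what an extensible,
// options-style API could look like.

type decodeConfig struct {
	trackPosition  bool // tabulate line:column for error reporting
	acceptComments bool // e.g. the accept_comments idea above
}

// A DecodeOption configures optional decoder behavior.
type DecodeOption func(*decodeConfig)

// WithPosition opts in to line:column tracking and its extra cost.
func WithPosition() DecodeOption {
	return func(c *decodeConfig) { c.trackPosition = true }
}

// WithComments opts in to allowing // and /* */ comments in the input.
func WithComments() DecodeOption {
	return func(c *decodeConfig) { c.acceptComments = true }
}

func newConfig(opts ...DecodeOption) decodeConfig {
	var cfg decodeConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := newConfig(WithPosition())
	fmt.Printf("%+v\n", cfg)
}
```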
That's not quite true. We still need to check a flag for whether the feature is enabled or not, and checking the flag itself is a non-zero cost.
This is outside the scope of this specific issue and has been explored quite a bit in other issues related to …
Worst case, you check the flag once per Decode(), to select a parsing function, no? Is there an issue proposing a decoding options type? If not, this could be turned into one. What other decoding options have been requested to date?
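A toy sketch of that once-per-Decode dispatch (hypothetical structure, not the actual encoding/json internals):

```go
package main

import "fmt"

// Hypothetical internals: the flag is consulted once per Decode call to
// pick a specialized parse function, so the per-byte hot loop itself
// carries no flag checks.

type decoder struct {
	trackPosition bool
}

func (d *decoder) Decode(v interface{}) error {
	if d.trackPosition {
		return d.decodeTrackingPosition(v) // same loop, plus newline counting
	}
	return d.decodeFast(v) // today's loop, unchanged
}

func (d *decoder) decodeFast(v interface{}) error {
	return fmt.Errorf("decodeFast: sketch only")
}

func (d *decoder) decodeTrackingPosition(v interface{}) error {
	return fmt.Errorf("decodeTrackingPosition: sketch only")
}

func main() {
	fmt.Println((&decoder{trackPosition: true}).Decode(nil))
}
```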
I agree with this solution, if we want a solution.
https://github.com/golang/go/issues?q=is%3Aissue+is%3Aopen+encoding+json+in%3Atitle+label%3AProposal is a good start.
Personally, I don't think a GitHub thread is a good place to bikeshed rethinking the encoding/json options API. I think this thread should stay on topic.
The only solution suitable for long JSON streams is the one Alexander suggested here: #43513 (comment). It's not very easy. Perhaps we should tag this as ProposalHold, to be considered when a DecodeOptions type emerges.
Providing line:column information for large inputs is a weak use case. The primary purpose of line:column position information is human consumption. By nature, line:column information is a poor mechanism for programmatically identifying the location of an error, since it fundamentally requires O(n) processing to count the number of preceding lines. In contrast, a byte offset is always O(1): it identifies exactly where in the input a problem occurred, it is suitable for machine consumption, and you can always accurately convert it to a line:column position. Thus, line:column information is generally only useful for inputs that are a few KiB in size. Even many text editors today start choking when trying to open a text file that is a few MiB in size.
That technique assumes that there is a specialized parsing function for each possible combination of features. Since the number of combinations is fundamentally O(2ⁿ) in the number of features, this incurs significant implementation complexity (and associated binary bloat). As someone who would potentially be responsible for maintaining …
I don't believe that is necessary. It presumes that an API just to opt into logic for line-number tabulation is the direction we want to head. Personally, I don't think an option solely to opt into line:column tabulation is justified.
I think it would be good to update json.Decoder to feature an InputPos method, bringing it in line with both xml.Decoder.InputPos and csv.Reader.FieldPos.

It's easy to argue that InputOffset is enough, because JSON is generally streamed and not read from a file. But I believe that ignores an important common usage pattern. The first thing one often does when decoding a large, complex JSON stream that fails somewhere in the middle is to capture/save it to disk as a test/debug case, open the file in a text editor (these are human-readable files, by design), and go look at the exact place where it has failed, to see what's what and understand what one's JSON-based decoder has fallen over on (whether the fault lies in the decoder or in whatever external tool is creating the JSON).

IMO, not having the decoder return line+column info simply adds friction to working with complex JSON in Go. All I want is to easily find out where exactly my JSON stream decoder is failing, i.e. what incoming tag combination it doesn't handle or handles incorrectly. Not having a simple API call for this, when other decoders (XML and CSV) already feature it, seems like a painful oversight. That goes double when the main argument against it is performance: it's not a fast parser anyhow, particularly given that most folks rely on built-in reflection to unmarshal/decode. Personally, I'd rather take a minor performance hit parsing every JSON file than jump through extra hoops when debugging a problem that should be much simpler to debug than it currently is.

Please bring json.Decoder in line with the other text-based decoders here: if adding such functionality to those decoders (ostensibly only for debugging/problem solving) at the cost of some performance was acceptable there, then it should also be perfectly acceptable here.

Related: #45628, where @rsc says: …
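For reference, the precedent this comment appeals to already exists: xml.Decoder.InputPos (added in Go 1.19) and csv.Reader.FieldPos (added in Go 1.17) both report line:column positions. A short usage sketch:

```go
package main

import (
	"encoding/csv"
	"encoding/xml"
	"fmt"
	"strings"
)

func main() {
	// xml.Decoder reports the line:column of its current input point.
	xd := xml.NewDecoder(strings.NewReader("<a>\n  <b/>\n</a>"))
	for {
		if _, err := xd.Token(); err != nil {
			break
		}
		line, col := xd.InputPos()
		fmt.Printf("xml: now at line %d, column %d\n", line, col)
	}

	// csv.Reader reports the position of any field of the last record.
	cr := csv.NewReader(strings.NewReader("a,b\nc,d\n"))
	for {
		rec, err := cr.Read()
		if err != nil {
			break
		}
		line, col := cr.FieldPos(0)
		fmt.Printf("csv: record %v starts at line %d, column %d\n", rec, line, col)
	}
}
```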
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes

What did you do?
At the moment, SyntaxError provides an offset of the character that caused the parsing error. The proposal is to add the line number as well. I could not find such an issue, therefore I opened this one; please close it if this was already resolved.
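For context, today a *json.SyntaxError carries only a byte offset (plus its message); a minimal demonstration of the current behavior:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var v interface{}
	err := json.Unmarshal([]byte("[1, 2,]"), &v)
	if se, ok := err.(*json.SyntaxError); ok {
		// Offset reports how many bytes had been read when the error
		// was hit; there is no line or column field today.
		fmt.Printf("%v (offset %d)\n", se, se.Offset)
	}
}
```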