proposal: encoding/json: garbage-free reading of tokens #40128
Comments
It's certainly the case that when I designed this API, everything but string fit into an interface without allocating.
Brad points out that my memory is bad: Token was added after we started making more garbage in interface conversions.
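As background for the two remarks above, here is a minimal standalone demonstration (mine, not from the thread) of which Token values allocate when boxed in an interface; exact counts depend on the compiler version.

package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

func main() {
	var sink json.Token
	str := fmt.Sprintf("hello %d", 42) // non-constant string
	num := float64(len(str))           // non-constant, non-zero float64

	report := func(name string, f func()) {
		fmt.Printf("%-6s %v allocs/op\n", name, testing.AllocsPerRun(1000, f))
	}

	// Delimiters and bools are small values: boxing them in the Token
	// interface does not allocate on current compilers.
	report("delim", func() { sink = json.Delim('{') })
	report("bool", func() { sink = true })

	// Non-constant strings and non-zero float64s do allocate when boxed.
	report("string", func() { sink = str })
	report("number", func() { sink = num })
	_ = sink
}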
It seems like people are generally in favor of this (if accepted, it would be for Go 1.17). Do I have that right?
I interpret the documented semantics of TokenBytes to mean that JSON strings are returned in their encoded form. However, the example use of it implements Token roughly as:

_, data, _ := dec.TokenBytes()
...
return string(data)

which suggests that the data returned is already in the decoded form (i.e., no quote delimiters and no escape sequences). Which format is returned?
My overall concern with this API is that misuse cannot be easily detected, especially since exactly which internal buffer is used depends on characteristics of the input JSON string and other buffering effects, which leads to misuse at scale. If we can find ways to more reliably cause misuse to fail, or to produce obviously incorrect data, I'd feel more comfortable, but perhaps that's just an implementation detail.
I wrote and maintain a package for encoding and decoding the logfmt logging format (github.com/go-logfmt/logfmt). It's a simple format, but it borrows its escaping rules from JSON, so it must deal with the same question that @dsnet raises above.
In the end, I came down on the side of convenience for the caller, at the cost of allocating in some cases. However, the API also separates parsing from data retrieval. Although the current implementation does not take advantage of it, this allows an implementation that avoids the cost of allocation and unescaping when the caller never asks for the value. Unfortunately, that strategy doesn't fit well with the existing Token API.
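A generic sketch of that "scan now, retrieve later" shape, for illustration only (this is not the actual logfmt API; every name here is made up): unescaping and allocation are deferred until the value is actually requested.

// Illustrative only: a token reader that separates advancing the parser
// from retrieving (and unescaping) the data. All names are hypothetical.
package sketch

type tokenReader struct {
	raw     []byte // raw token bytes, aliasing the input buffer
	escaped bool   // whether raw contains escape sequences
	scratch []byte // reusable buffer for unescaped output
}

// next advances to the following token. It only records where the raw
// bytes are; it never unescapes and never allocates.
func (r *tokenReader) next() bool {
	// ... scan the input, setting r.raw and r.escaped ...
	return false
}

// value returns the decoded token data. The unescaping work (and any
// growth of the scratch buffer) happens only if value is called.
func (r *tokenReader) value() []byte {
	if !r.escaped {
		return r.raw
	}
	r.scratch = appendUnescaped(r.scratch[:0], r.raw)
	return r.scratch
}

// appendUnescaped would resolve escape sequences from src into dst.
func appendUnescaped(dst, src []byte) []byte {
	// ... real unescaping elided ...
	return append(dst, src...)
}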
Sorry, that was ambiguous. I intended it to mean the unescaped form, which indeed implies that the returned bytes may not simply alias the input buffer. Requiring an additional step to unquote a string would not only be inconvenient but would also reintroduce the allocations this API is meant to avoid.
That's true, and similar to existing APIs such as bufio.Scanner.
OK, so returning the decoded string in a temporary buffer seems like a good solution to me. @dsnet, are you OK with that?
IIUC, we would always decode the string into a temporary buffer owned by the Decoder and return a slice of that buffer. Thus, subsequent calls to TokenBytes would overwrite any previously returned data. If so, this semantic sufficiently alleviates my misuse concern.
I wasn't thinking of copying the decoded string to the temporary buffer unless it needed to be (this call is about being performant after all, and copying data is a significant aspect of that). It might be interesting to have some flag that can be enabled to trigger the "always-copy/overwrite" behaviour though (when race detection is enabled, perhaps?); it might also be possible to have some vet rule that helps find likely incorrect uses.
Note that other APIs that provide a short-lived []byte slice (e.g. bufio.Scanner) don't necessarily make extra copies. Yes, it's potentially error-prone, but I think that's an inevitability with an API like this, sadly.
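For comparison, a small standalone example (mine) of the same aliasing contract in bufio.Scanner: the documentation for Bytes says the returned slice may be overwritten by the next call to Scan, so callers who want to keep the data must copy it.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	sc := bufio.NewScanner(strings.NewReader("first\nsecond\n"))

	sc.Scan()
	alias := sc.Bytes()                   // aliases the scanner's internal buffer
	kept := append([]byte(nil), alias...) // explicit copy

	sc.Scan() // may overwrite the buffer that alias points into

	fmt.Printf("alias: %q (may be stale)\n", alias)
	fmt.Printf("kept:  %q\n", kept) // always "first"
}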
It's one degree of unsafe to return an alias to the same internal buffer, which is what we see with other standard library APIs such as bufio.Scanner.
Stepping back a bit: I don't think we should do this until we've exhausted more of the optimization opportunities that require no API changes. I have been experimenting with a new implementation of the decoder; preliminary benchmarks of the experimental Token against the current one show a substantial speedup. Thus, even with the inefficiencies associated with the current Token API, a significantly faster implementation is possible without any API changes.

Implementation note: the current decoder is a state machine that performs approximately one function call for every byte of input (see scanner.go). Essentially, it processes one byte of input, sets the state to the function that will handle the next byte, and then calls that function. It is implemented this way because the decoder must handle fragmented data when reading from an io.Reader.

EDIT: Updated benchmarks with faster numbers from the experimental implementation.
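To illustrate the shape being described, a heavily simplified caricature (mine, not the real code) of a one-function-call-per-byte state machine and its driving loop:

// Caricature of the current scanner's shape, for illustration only
// (the real scanner.go has many more states and richer return codes).
package sketch

type scanner struct {
	step func(*scanner, byte) int // the state is "which function runs next"
}

func stateBeginValue(s *scanner, c byte) int {
	switch c {
	case '{':
		s.step = stateBeginObject
	case '"':
		s.step = stateInString
		// ... many more cases ...
	}
	return 0
}

func stateBeginObject(s *scanner, c byte) int { /* ... */ return 0 }
func stateInString(s *scanner, c byte) int    { /* ... */ return 0 }

// scan drives the machine: one indirect function call per input byte,
// which is the overhead described above. A faster scanner consumes whole
// runs of bytes (an entire string or number) in a tight loop, and deals
// with fragmented io.Reader input some other way.
func (s *scanner) scan(data []byte) {
	for _, c := range data {
		s.step(s, c)
	}
}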
This is similar to my previous estimates; I've often estimated that replacing the scanner could easily double its speed. Having said that, I think reducing the amount of garbage would be the next logical step once the scanner is fast. I agree that the speed work should come before saving allocations, but it doesn't seem like a substitute for it.
Are we dismissing the option of only doing the extra copying and safety measures when the race detector is enabled?
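Sketching what that could look like (purely illustrative; nothing here was agreed in the thread): the race detector's build tag can gate an "overwrite on next call" mode so that misuse fails loudly in -race test runs.

// Hypothetical sketch of the opt-in safety mode being asked about.
// The "race" build tag is set automatically by `go build -race`, so a
// constant like raceEnabled can be provided by two files guarded by
// "//go:build race" and "//go:build !race" respectively.
package sketch

const raceEnabled = false // true in the file built under the race tag

// scribble poisons the bytes of the previously returned token. Under the
// race detector this makes a caller that illegally retained the slice see
// obviously corrupt data, instead of silently reused memory.
func scribble(prev []byte) {
	if !raceEnabled {
		return
	}
	for i := range prev {
		prev[i] = 0xEE // arbitrary poison value
	}
}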
I agree. Even if this proposal were accepted today, I also agree with @dsnet that we should make the scanner fast before we think about making it garbage-free.
Putting on hold.
I tried fixing this without any API changes, but I'm fighting the Go inliner. I tried a technique called "function outlining": move the interface wrapping into a small, inlineable wrapper so that the caller can better decide whether the token needs to be heap allocated. For example:

func (d *Decoder) Token() (Token, error) {
	switch tok, num, str, err := d.token(); {
	case num != 0:
		return num, err
	case str != "":
		return str, err
	default:
		return tok, err
	}
}

// token returns the next Token but avoids heap allocating the Token.
// Token is used for Delims, bools, and zero value floats and strings
// (i.e., values that can be wrapped in a Token without allocating).
// float64 is non-zero if valid.
// string is non-zero if valid.
func (d *Decoder) token() (Token, float64, string, error)

If Token is inlined into its caller, the interface conversion happens in the caller's frame, where escape analysis can sometimes avoid the heap allocation. At present, I can get this technique to be inlineable only if I specialize for strings or for floats, but not both (otherwise I exceed the inline budget). In such situations, I do indeed observe that the allocations go away.
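For anyone reproducing this, a generic way to observe whether the outlining pays off (nothing here is specific to the patch above): count allocations per Token call on input whose tokens need no boxing garbage. Inlining and escape decisions can be inspected separately with `go build -gcflags=-m`.

package json_test

import (
	"encoding/json"
	"strings"
	"testing"
)

// TestTokenAllocs logs the average number of allocations per Token call
// on a stream whose tokens (delimiters, bools, null) can in principle be
// returned without allocating.
func TestTokenAllocs(t *testing.T) {
	input := strings.Repeat("[true, false, null]\n", 1000)
	dec := json.NewDecoder(strings.NewReader(input))
	allocs := testing.AllocsPerRun(100, func() {
		if _, err := dec.Token(); err != nil {
			t.Fatal(err)
		}
	})
	t.Logf("allocs per Token call: %v", allocs)
}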
Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.
As @bradfitz noted in the reviews of the original API, the Decoder.ReadToken API is a garbage factory. Although, as @rsc noted at the time, "a clean API ... is more important here. I expect people to use it to get to the position they want in the stream and then call Decode", the inefficiency is a problem in practice for anyone who wishes to use the encoding/json tokenizer as the basis for some other kind of decoder.

Dave Cheney's "Building a high performance JSON parser" details some of the issues involved. He comes to the conclusion that the interface-based nature of json.Token is a fundamental obstacle. I like the current interface-based API, but it does indeed make it impossible to return arbitrary tokens without creating garbage. Dave suggests a new Scanner API, somewhat more complex, that is also not backward compatible with the current API in encoding/json.

I propose instead that the following method be added to the encoding/json package:
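The method itself did not survive the copy here. Based on the fragment quoted earlier in the thread (`_, data, _ := dec.TokenBytes()`) and the discussion of which token kinds allocate, its shape is roughly the following; treat this as a reconstruction, not the verbatim proposal:

// TokenBytes is like Token, but avoids allocating for string and number
// tokens. For delimiters, bools, and null it returns the Token as usual
// with a nil data slice; for strings and numbers it returns a nil Token
// and the token's bytes in data. The data slice aliases an internal
// buffer and remains valid only until the next call on the Decoder.
func (dec *Decoder) TokenBytes() (tok Token, data []byte, err error)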
Token can be implemented in terms of TokenBytes as follows:
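The example implementation was likewise lost in the copy; from the fragment quoted earlier (`_, data, _ := dec.TokenBytes() ... return string(data)`), it plausibly looked something like this (a reconstruction; the number case is elided just as it was in the quoted fragment):

func (dec *Decoder) Token() (Token, error) {
	tok, data, err := dec.TokenBytes()
	if err != nil || data == nil {
		return tok, err // delimiter, bool, null, or an error
	}
	// ... a number token would be parsed into a float64 here ...
	return string(data), err // a string token
}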
Discussion
This proposal relies on the observation that the Decoder.Token API only generates garbage for two kinds of tokens: numbers and strings. For all other token types, no garbage need be generated, as small values (json.Delim and bool) do not incur an allocation when boxed in an interface.

It maintains the current API as-is. Users can opt in to the new API if they require efficiency, at some risk of incorrectness (the caller could hold onto the data slice after the next call to Decode). The cognitive overhead of TokenBytes is arguably low because of its similarity to the existing API.

If this proposal is accepted, an Encoder.EncodeTokenBytes could easily be added to provide garbage-free streaming JSON generation too.