Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/write: more regular usage protocol for ?buf parameters #132

Merged

Conversation

gasche
Copy link
Contributor

@gasche gasche commented Feb 10, 2022

The handling of ?buf parameters today makes little sense:

  • to_string and all stream_* functions clear the buffer before
    using it (so: they guarantee that they work in the same way no
    matter what the input state of the buffer is, and they make no
    guarantees on its output state)

  • but to_channel and to_output make weird claims about it in the
    documentation, that don't make sense to me, and appear to not clear
    the buffer and in fact include, in their output, the content of the
    buffer as it is passed.

I believe that the reason for this difference comes from the previous
codebase using Biniou, that was replaced by Buffer in
9310500

The change left the usage of ?buf in to_channel and to_output
rather incoherent (and I think broke the documentation somewhat), due
to different usage properties of Biniou_outbuf, who "owns" the
underlying output channel and implements its own buffering, and
Buffer, which is not related to an output channel: the previous API
may have made sense for Biniou, but it does not anymore.

What's not completely clear to me is whether some code in the wild
could rely on the previous, nonsensical behaviour. This may happen if
the code was written with Biniou in the past, and converted to use
Buffer in the same systematic (and slightly wrong) way as Yojson
itself.

(Note: there are two independent questions: (1) whether the buffer is
cleared on entry, and (2) whether the buffer is cleared on
exit. Currently for to_channel and to_output neither is done, and
at least for (2) this seems in direct contradiction with what the
documentation says -- at least with the natural interpretation of "buf
is flushed" as "the content is sent to the output channel, and the
buffer is cleared". So the lack of (2) is clearly a bug, and the lack
of (1) might not be.)

@c-cube
Copy link
Member

c-cube commented Feb 10, 2022

Intuitively I'd say that clearing on exit is redundant, because it's better to always clear on entry. But it's probably a good defensive practice anyway and is basically free.

@gasche
Copy link
Contributor Author

gasche commented Feb 10, 2022

I agree, but I sticked to "clear on exit" as this is what the documentation of to_string describes.

@gasche
Copy link
Contributor Author

gasche commented Feb 15, 2022

To be a bit more concrete, here is an example of behaviors that I think are bugs:

$ dune utop
(* build a buffer with some garbage inside *)
# let buf = Buffer.create 42;
# Buffer.add_string buf "Hello World\n";;

(* pass the buffer to Yojson.to_channel *)
# let oc = open_out "/tmp/foo.txt" in
Fun.protect ~finally:(fun () -> close_out oc) @@ fun () ->
Yojson.to_channel ~buf oc (`List [`String "foo"]);;

(* the garbage ends up included in the file! *)
# Sys.command "cat /tmp/foo.txt && echo";;
Hello World
["foo"]

(* write to a different file now, with the same buffer *)
# let oc = open_out "/tmp/bar.txt" in
Fun.protect ~finally:(fun () -> close_out oc) @@ fun () ->
Yojson.to_channel ~buf oc (`List [`String "bar"]);;

(* oh noes, even more garbage! *)
# Sys.command "cat /tmp/bar.txt && echo";;
Hello World
["foo"]["bar"]

@Leonidas-from-XIV
Copy link
Member

What's not completely clear to me is whether some code in the wild
could rely on the previous, nonsensical behaviour. This may happen if
the code was written with Biniou in the past, and converted to use
Buffer in the same systematic (and slightly wrong) way as Yojson
itself.

I doubt it given there was never a release that had Biniou replaced with Buffer so unless they pinned the git version it is quite unlikely, as such the behavior can be improved as we see fit.

@gasche
Copy link
Contributor Author

gasche commented Feb 15, 2022

Ah! I hadn't realized. Then I believe you should merge this PR before the next release :-)

@Leonidas-from-XIV Leonidas-from-XIV added this to the 2.0.0 milestone Feb 15, 2022
@Leonidas-from-XIV
Copy link
Member

I agree that we should be consistent. Though it would be good to know what we want it to behave like, e.g why there is a ?buf parameter.

The way I understand it it seems that what you consider a bug seems to be exactly the behavior that it is meant to have: you pass a buffer that you got from somewhere and want it to be written to the oc. If it is cleared before writing it out, there is very little reason to even pass it, because whatever you had in that buffer will be clobbered by to_channel and if you wanted to start from an empty buffer you could as well… not pass a buffer to begin with because then it will use its own buffer.

Though I have to say I have a hard time understanding what usecase passing such a buffer would have. If you want to mix different sources it seems to me that it would make more sense to just write to the out_channel beforehand, if you want to push something like newline separated JSON to a FIFO you can just keep the channel open and call to_channel in a loop so maybe the proper solution would be to remove ?buf altogether. Especially to_string clears it before and after so the only difference I could imagine is some kind of optimization where you avoid the buffer being GC'd and recreated every time?

@c-cube
Copy link
Member

c-cube commented Feb 15, 2022

Though I have to say I have a hard time understanding what usecase passing such a buffer would have.

The way I immediately interpreted this API, something IO related with an optional ?buf parameter, is that it's purely a performance optimization. You need a buffer for intermediate storage, and maybe you're going to write 10,000 json objects, so you can pre-allocate one large buffer and pass it to every call to write so that you don't allocate 10,000 intermediate buffers.

The content of the buffer is, and should be, totally irrelevant. If that's not the intended use case then the API and comments need to be super clear about it :)

@Leonidas-from-XIV
Copy link
Member

Yes, that's somewhat the most sensible explanation. In that case, yes, it is a bug and we should fix it. Also, I think we should document it as such an optimization and for that we only need to clear in the beginning.

(Sidenote: I'd be curious to see how much this optimization saves. Maybe I should add a benchmark entry for it)

@c-cube
Copy link
Member

c-cube commented Feb 15, 2022

Intuitively I think it must save a lot if you write many values. It also should impact favorably the memory consumption :-).

@gasche
Copy link
Contributor Author

gasche commented Feb 15, 2022

Indeed, I believe that the buffer is there to serve as intermediate storage. The specification should be: "we (Yojson) don't assume anything about the content of the buffer coming in, and you (the user) can't assume anything about the content after the call". But to_string has a stronger, pre-existing specification, that enforces this (that user content is ignored) by explicitly clearing the buffer before (and after) the call, so I followed that style.

@gasche
Copy link
Contributor Author

gasche commented Feb 15, 2022

Yet another way to see that the current behavior is broken:

# let oc = open_out "/tmp/out.txt";;
# let buf = Buffer.create 42;;
# (* printing "foo" three times *)
  for i = 1 to 3 do Yojson.to_channel ~buf oc (`String "foo"); done;;
# close_out oc;;
# Sys.command "cat /tmp/out.txt && echo";;

What output do you expect? Probably not the current one:

"foo""foo""foo""foo""foo""foo"

@tcoopman
Copy link

Another bug is that the docs say:

Write a newline-separated sequence of compact one-line JSON values to a string

There is no newline inserted at the moment. I don't really mind that this is not the case but maybe we should decide to either keep it or update the documentation

@c-cube
Copy link
Member

c-cube commented Feb 16, 2022

Write a newline-separated sequence of compact one-line JSON values to a string
There is no newline inserted at the moment.

When writing a sequence of json values, I think providing a ?sep:string argument can be very useful. Some people use "JSON separated by newline" as a kind of standard (since the actual standard doesn't cope with more than one toplevel value); some might use good streaming parsers that do not need the newlines at all.

@Leonidas-from-XIV
Copy link
Member

?sep:string sounds like a good idea indeed. As @c-cube states, ndjson (newline delimited JSON) is a sort of semi-standard for streaming JSON values, so it is kinda more flexible.

Though, should it be a separator? Or an end-marker (aka trailing ?sep: yes or no). To not complicate matters here, I suggest fixing the behaviour in another PR.

What I would suggest is make it consistent:

  • Always clear the buffer on entrance, not on exit. If the user wants it they can do it themselves. The case where there would be stuff left over might break some code but as we have discussed this is purely a performance optimization (Add benchmark for running to_channel with a preallocated buffer #134) so I would expect it to avoid doing things it doesn't need to.
  • Update the documentation of the behavior
  • Maybe have the newlines added as the documentation states and handle ?sep in a future PR?

@gasche
Copy link
Contributor Author

gasche commented Feb 17, 2022

Newlines are unrelated to the present PR, which is about the use of temporary buffers. That should go in its own PR. (@panglesd implemented 6696f55, so there's a natural candidate for more newline-related work :-)

I'm not convinced that not clearing at the end is a good idea. It would be a fine design if we wrote these functions from scratch, but they are adapted from previous functions using Biniou channels (you chose to adapt the functions instead of removing them completely), and then I think that clearing at the end makes more sense (because otherwise the data is duplicated, it is sent to the output channel and kept in the buffer). It is also what is documented (and has always been) for to_string, so using the same approach for all functions makes things simpler. And there is no performance difference, Buffer.clear is fast.

So: my vote would be the PR as it currently is.

@c-cube
Copy link
Member

c-cube commented Feb 17, 2022

@Leonidas-from-XIV good question, but in any case I think it's better to do just like ndjson :)

@Leonidas-from-XIV
Copy link
Member

@c-cube I looked it up in the (or a?) spec and it states

Each JSON text MUST conform to the [RFC7159] standard and MUST be written to the stream followed by the newline character \n (0x0A). The newline character MAY be preceded by a carriage return \r (0x0D).

So yes, trailing newlines.

But fair enough, let's make the ?buf handling consistent wrt. clearing first.

@Leonidas-from-XIV
Copy link
Member

@gasche Can you add a changelog entry?

The handling of `?buf` parameters today makes little sense:

- `to_string` and all `stream_*` functions clear the buffer before
  using it (so: they guarantee that they work in the same way no
  matter what the input state of the buffer is, and they make no
  guarantees on its output state)

- but `to_channel` and `to_output` make weird claims about it in the
  documentation, that don't make sense to me, and appear to *not* clear
  the buffer and in fact include, in their output, the content of the
  buffer as it is passed.

I believe that the reason for this difference comes from the previous
codebase using Biniou, that was replaced by Buffer in
  9310500

The change left the usage of ?buf in `to_channel` and `to_output`
rather incoherent (and I think broke the documentation somewhat), due
to different usage properties of Biniou_outbuf, who "owns" the
underlying output channel and implements its own buffering, and
Buffer, which is not related to an output channel: the previous API
may have made sense for Binio, but it does not anymore.

What's not completely clear to me is whether some code in the wild
could rely on the previous, nonsensical behaviour. This may happen if
the code was written with Biniou in the past, and converted to use
Buffer in the same systematic (and slightly wrong) way as Yojson
itself.

(Note: there are two independent questions: (1) whether the buffer is
cleared on entry, and (2) whether the buffer is cleared on
exit. Currently for `to_channel` and `to_output` neither is done, and
at least for (2) this seems in direct contradiction with what the
documentation says -- at least with the natural interpretation of "buf
is flushed" as "the content is send to the output channel, and the
buffer is cleared". So the lack of (2) is clearly a bug, and the lack
of (1) might not be.)
@gasche gasche force-pushed the write-buffer-cleanup branch from 66016ce to 29f56de Compare February 20, 2022 15:20
@gasche
Copy link
Contributor Author

gasche commented Feb 20, 2022

I amended the Changes entry corresponding to the removal of the Biniou dependency, because this PR is just a continuation of that work.

Copy link
Member

@Leonidas-from-XIV Leonidas-from-XIV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with merging this. The CI is broken for unrelated reasons (opam lint issues) but the code builds fine across the range.

@Leonidas-from-XIV
Copy link
Member

CI is fixed too, thanks OCaml-CI maintainers, the build is green now.

@Leonidas-from-XIV Leonidas-from-XIV merged commit 9be7916 into ocaml-community:master Feb 28, 2022
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 2, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 3, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Leonidas-from-XIV pushed a commit to panglesd/opam-repository that referenced this pull request Jun 9, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants