Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Remove internal parser API used by atdgen #151

Open
Leonidas-from-XIV opened this issue Jul 26, 2022 · 9 comments
Open

[RFC] Remove internal parser API used by atdgen #151

Leonidas-from-XIV opened this issue Jul 26, 2022 · 9 comments
Labels
rfc Request for Comments on future direction

Comments

@Leonidas-from-XIV
Copy link
Member

Current state & Rationale

Yojson currently has a semi-secret undocumented API that allows controlling the parser - when you know what the structure of the JSON is you can tell the parser what you expect and it will give you the values in a somewhat more streaming way.

The downside is that this exposes the internals of Yojson which is not great because this locks us into our current implementation. Another issue is that this code is under-used and most importantly under-tested, so improvements sometimes can break code and it doesn't get detected before the release.

Proposal

The proposal is to remove the internal code. Initially this can be just hidden from the interface and then we can shake down and rewrite the unused code.

A possible replacement for the API could be to use the Yojson.Safe.t API. Given there are a number of consumers that convert a predefined structure from JSON to OCaml (like atdgen, ppx_deriving_yojson, ppx_yojson_conv) we could consider adding some optimized API that avoids creating a full AST but rather allows libraries to get JSON to parse with lower overhead while not exposing the internals too much.

Impact

This is under-researched at the moment, but the major consumer of this API is atdgen. mjambon suggests that atdgen could be changed to use the Yojson.Safe.t AST.

Roadmap

This is another breaking change, so would need a dependency bump. Given a lot of code uses atdgen, a number of packages would be impacted, but once atdgen is adjusted it could generate new code that would Just Work(TM).

@Leonidas-from-XIV Leonidas-from-XIV added the rfc Request for Comments on future direction label Jul 26, 2022
@mjambon
Copy link
Member

mjambon commented Jul 26, 2022

We would first need to modify atdgen's code generation and see how it goes in practice. I'm rather optimistic. Then, we'd have to run benchmarks and poll the users to see if they can accept the new performance.

@Lupus
Copy link

Lupus commented Aug 18, 2022

XML world has a notion of Simple API for XML (SAX). SAX parsers are event-based, they parse the XML format and let the upper level know that we've encountered a node or an attribute, but without constructing any trees in memory. Possibly something similar could be done with Yojson, events could be lower level than Yojson.Safe.t, like emit each associative map field name and value as they are parsed and do not try to assemble them into `Assoc, and wrap that event sequence with events like "start of associative map" and "end of associative map". Yojson.Safe.t API could be built on top of this lower level events API to streamline the maintainability process - single code path is always parsing the source input and always produces just events, AST layer is only constructed out of events, existing tests for AST layer can transitively cover the events API.

@Leonidas-from-XIV
Copy link
Member Author

Leonidas-from-XIV commented Aug 18, 2022

@Lupus This is essentially jsonm, which is a streaming JSON codec, vaguely comparable to SAX in XML.

@Lupus
Copy link

Lupus commented Aug 18, 2022

Hm, if we have that already, why can't Yojson be built on top of jsonm codec? Atdgen could also leverage jsonm directly for performance reasons. This should satisfy perfectly valid @mjambon 's desire to spend more time on features than on maintenance (motivation for ahrefs/atd#313).

@Leonidas-from-XIV
Copy link
Member Author

  1. Yojson potentially predates jsonm, the code already existed when jsonm started
  2. Yojson builds an AST, so by itself has no need for a streaming mode
  3. Rewriting Yojson to use jsonm adds dependencies and work and removes freedom to affect the way JSON is parsed.
  4. atdgen could use jsonm if it wanted to I guess, but then atdgen users might face different parsing characteristics, different dependency trees with different up and downsides. I don't know what ATD authors think about it.

@Lupus
Copy link

Lupus commented Aug 18, 2022

Yojson builds an AST, so by itself has no need for a streaming mode

Yojson is pretty popular in the ecosystem, would be awesome to expose streaming mode and ast mode that is built on top of the streaming mode to offer full stack experience for JSON parsing/serialization. Some kind of streaming mode is already in place and is secretly used by atdgen, no?

atdgen could use jsonm if it wanted to I guess, but then atdgen users might face different parsing characteristics, different dependency trees with different up and downsides. I don't know what ATD authors think about it.

As atd user, I find it handy to have ast at times - when I want some opaque json data as part of my record to be process manually, or when I want to serialize atd type specifically to ast (currently this is not possible and we have to resort to hack like part json string to ast with yojson, acceptable in tests but still ugly). But maybe using jsonm for parsing/serialization for performance reasons does not block from getting ast representation when requested by the caller with associated performance penalty. @mjambon what do you think about this?

@Leonidas-from-XIV
Copy link
Member Author

Yojson is pretty popular in the ecosystem, would be awesome to expose streaming mode and ast mode that is built on top of the streaming mode to offer full stack experience for JSON parsing/serialization.

Yes, but if you need streaming mode, why not just use jsonm? It exists, it is tested, stable. An addition of a properly specified streaming mode to Yojson would be essentially copying what jsonm already does, again raising the question what the additional value of it would be?

Some kind of streaming mode is already in place and is secretly used by atdgen, no?

Yes, but it uses the internal representation of Yojson and thus would break were Yojson to change things in there. Which is not a good way to design an API that is supposed to be used by other programmers unless you want to perpetually have people fix their code or never change the implementation ever again. Note that the streaming mode of jsonm does not expose internals.

With regard to the AST representation: in ppx_deriving_yojson you specify your types and the deriver tries to unpack Yojson.Safe.t into your types but you can also specify some nodes as Yojson.Safe.t thus telling it "put whatever you find here in an unparsed form". Maybe such a thing is possible with ATD as well?

@Lupus
Copy link

Lupus commented Aug 18, 2022

Yes, but if you need streaming mode, why not just use jsonm?

Makes sense overall. Folowing this logic atdgen should not use Yojson for parsing in streaming mode, but should use jsonm, which also decreases maintenance cost. But Ast mode is still convenient and is actually supported by atd out of the box (you can have abstract type and it will be represented as Yojson.Safe.t in the OCaml type), resulting in necessity to depend on 2 json libraries, which gets weird :) Or drop streaming parsing and associated high performance level and stick to Ast only and Yojson. Or just copy the type declaration for Yojson.Safe.t and do not depend on the library itself, type should be quite stable in the long run.

Yes, but it uses the internal representation of Yojson and thus would break were Yojson to change things in there. Which is not a good way to design an API that is supposed to be used by other programmers unless you want to perpetually have people fix their code or never change the implementation ever again.

I fully agree. Just though that if that secret api is like 95% of somewhat usable public streaming-like API, why not bite the bullet :)

@mjambon
Copy link
Member

mjambon commented Nov 21, 2022

Hi. Sorry for chiming in a little late.

I started considering a replacement for atdgen called atdml. Check out the currently proposed list of features for details. The aspects that concern Yojson are:

  • JSON parsing would always go through the Yojson AST.
  • Once atdgen is fully retired, Yojson's secret parsing API used by atdgen could also go away.
  • We still need to gather input from current users of atdgen.
  • There is no timeline at this point. Creating an initial version of atdml would take me about two weeks full time, and I don't have this time right now. It would however save maintenance time later on. In particular, it would make it easier for people (including myself!) to contribute to the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc Request for Comments on future direction
Projects
None yet
Development

No branches or pull requests

3 participants