RFC: simplify OCaml code generation by using Yojson AST -> atdml #313

Open
mjambon opened this issue Jul 26, 2022 · 11 comments
Labels
design and planning Various discussions about the future

Comments

@mjambon
Collaborator

mjambon commented Jul 26, 2022

The goal is to simplify the maintenance of both yojson and atdgen. Yojson is the library used to parse JSON from the OCaml code generated by atdgen. Atdgen is the command-line tool that generates OCaml code implementing the mapping between raw JSON data and OCaml types. Currently, atdgen calls semi-secret parsing functions provided by yojson and constructs idiomatic OCaml data directly, without going through a JSON AST. It was done this way to achieve the best performance, but it turns out to be harder to maintain than going through an AST would be. Yojson already provides such an AST in the form of the type Yojson.Safe.t. See ocaml-community/yojson#151 for more context from the yojson perspective.

Proposal

Reimplement the code generation parts of atdgen for parsing JSON (files starting with oj_) so they rely only on the usual yojson AST type (Yojson.Safe.t) and on the few JSON parsing functions that ordinary users would use (from_string, from_file, from_channel, from_lexbuf).

To get a sense of the function calls that would get removed, run git grep Yojson.Safe.read_ on the atd repository.
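For illustration, here is a hypothetical sketch of the shape AST-based generated code could take for a record type. The function name t_of_json and the error handling are assumptions, not actual atdgen output; a local json type with the same shape as Yojson.Safe.t is used so the sketch compiles without the yojson package installed (with yojson, one would use Yojson.Safe.t and obtain the tree from Yojson.Safe.from_string).

```ocaml
(* Local stand-in with the same shape as Yojson.Safe.t *)
type json =
  [ `Null | `Bool of bool | `Int of int | `Float of float
  | `String of string | `Assoc of (string * json) list
  | `List of json list ]

type t = { x : int }

(* Hypothetical AST-based reader for [type t = { x : int }]:
   plain pattern matching instead of lexbuf primitives *)
let t_of_json (j : json) : t =
  match j with
  | `Assoc fields ->
      (match List.assoc_opt "x" fields with
       | Some (`Int x) -> { x }
       | Some _ -> failwith "field x: expected an int"
       | None -> failwith "missing field x")
  | _ -> failwith "expected a JSON object"
```

The point of the proposal is that this style of code is straightforward to generate, read, and review, at the cost of allocating the intermediate tree.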

Expected impact

  • easier maintenance, resulting in more work spent on new features
  • degraded performance of JSON parsing from OCaml. Benchmarks will tell us by how much. It could be 4x or worse.

Request for comments

Please tell us how this would affect you.

Relevant questions include:

  • will the performance degradation be acceptable?
  • who will implement this and when?
  • how much will it cost?
@Lupus

Lupus commented Aug 18, 2022

A 4x performance impact sounds scary, but some absolute numbers would be interesting to see, perhaps compared with JSON parsing performance in other garbage-collected languages such as Go or Java.

Currently, atdgen calls semi-secret parsing functions provided by yojson and constructs idiomatic OCaml data directly without going through a JSON AST

Is this interface something like the Simple API for XML (SAX)? SAX parsers exist in the XML world and are the only option for getting decent performance and memory footprint when parsing a large number of XML documents.

@mjambon
Collaborator Author

mjambon commented Aug 18, 2022

Looking at the generated ml code will give you an idea.

$ echo 'type t = { x : int }' > toto.atd
$ atdgen -j -j-std toto.atd
$ cat toto_j.ml
...
let read_t = (
  fun p lb ->
    Yojson.Safe.read_space p lb;
    Yojson.Safe.read_lcurl p lb;
    let field_x = ref (None) in
    try
      Yojson.Safe.read_space p lb;
      Yojson.Safe.read_object_end lb;
      Yojson.Safe.read_space p lb;
      let f =
        fun s pos len ->
          if pos < 0 || len < 0 || pos + len > String.length s then
            invalid_arg "out-of-bounds substring position or length";
          if len = 1 && String.unsafe_get s pos = 'x' then (
...

and it keeps going with large amounts of non-obvious code that was optimized for speed.

Somewhere in there, there's a call to Atdgen_runtime.Oj_run.read_int, which makes sense, but we have a bunch of other functions for reading syntactic elements (Yojson.Safe.read_space, Yojson.Safe.read_object_end, ...) and some that are more consequential, like Yojson.Safe.skip_json, which parses and ignores a whole JSON node and is nice for efficiently ignoring many fields in an object.
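By contrast, an AST-based reader needs no skip_json primitive at all: unknown fields are just entries in the assoc list that the generated code never inspects. A hypothetical sketch (again using a local stand-in type shaped like Yojson.Safe.t so it compiles on its own):

```ocaml
type json = [ `Int of int | `String of string | `Assoc of (string * json) list ]

(* Reader for [type t = { x : int }] that silently tolerates any
   extra fields: skipping them requires no generated code at all *)
let read_x (j : json) : int =
  match j with
  | `Assoc fields ->
      (match List.assoc_opt "x" fields with
       | Some (`Int x) -> x
       | _ -> failwith "field x: expected an int")
  | _ -> failwith "expected a JSON object"
```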

@mjambon
Collaborator Author

mjambon commented Nov 21, 2022

I've been working for 4 days on adding support for imports in atdgen and I'm still not done. The implementation is a giant mess for which I'm responsible. The same task for atdpy took me a few hours.

Things that make it hard to work with the current atdgen implementation include:

  • No documentation indicating what to expect from various modules, types and functions. Note that the original code from 10+ years ago (written by me) was never reviewed by anyone, which didn't help.
  • Multiple output files: foo.atd translates to foo_t.ml for types, foo_j.ml for JSON, foo_b.ml for biniou, foo_v.ml for validators, etc. This adds complexity when translating foo.bar into OCaml because, depending on the context, it can be Foo_t.bar, Foo_j.bar, Foo_b.bar, etc. While it was a nice goal to have a shared OCaml type that could be exported to either JSON or Biniou, it is rarely useful in practice. The priority should have been to keep the common case simple, which is to produce OCaml types and JSON conversions.
  • Many layers of translation and confusion about what each of them does. Notably, the ('a, 'b) Mapping.t type doesn't exist in the other implementations (atdj, atdpy, etc.) and everything is much simpler that way. Perhaps this mapping thing could have been fine if it had been thoroughly documented and properly structured, which wasn't the case.
  • Various bells and whistles that were hastily designed and need to be maintained even though they're deprecated and have better replacements.
  • Other bells and whistles that nobody really needs, such as producing ocamldoc comments from special <doc> annotations. Reading the plain ATD file is fine. Alternatively, an atddoc tool that would benefit all the target languages could be developed and would be much more valuable.
  • Trying to be too smart, like when dealing with optional record fields. The atdgen implementation for { ?foo: bar; } will look up whether bar resolves to some thing option that can be inlined. Doing this lookup is unnecessarily complex. The other backends simply require that the record definition be written directly as { ?foo: thing option; }, which works fine in practice.
  • Optimized code generation. This was important at the time, in particular for the Biniou backend. This isn't something I touched recently but it's a lot of code that can be hard to follow as it tries as much as possible to avoid memory allocation, for example by not creating closures.
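To make the optional-field point above concrete, here is an illustrative sketch of the simple rule the other backends use, together with the kind of record-creation function with labeled and optional arguments that atdgen generates. All names here are illustrative, not actual generated code.

```ocaml
(* Illustrative OCaml expansion of the ATD record
     { ?foo : string option; ~items : int list } *)
type t = {
  foo : string option;  (* ?foo: a missing JSON field becomes None *)
  items : int list;     (* ~items: a missing JSON field becomes [] *)
}

(* creation function with labeled/optional arguments *)
let create_t ?foo ?(items = []) () : t = { foo; items }
```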

My recent experience with writing atdpy (Python JSON support) and atdts (TypeScript JSON support) has been great because the focus was on basic functionality rather than on features that are complicated to implement. So it would be really neat, from a maintainer's perspective but also from a user's, to have a simpler atdml tool based on the same model.

Let me recapitulate the features that I think are essential:

  • Simplicity of maintenance.
  • Simplicity of use: generate a single .ml file from a .atd file.
  • JSON support.
  • Easy support for external type definitions whose converters could be written by hand if necessary.

Features to avoid include:

  • Any new feature that requires extra work for each target language.
  • Performance optimizations that consist of generating unidiomatic code. As a rule of thumb, the generated code should be
    • easy to review by a human and
    • about as fast as if it were written by hand

@aryx

aryx commented Nov 21, 2022

Let's do it!

@mjambon
Collaborator Author

mjambon commented Nov 21, 2022

I'd like to hear what features people rely on, what works well for them, what doesn't work so well, etc. Before rushing into an implementation, let's come up with a list of supported and unsupported features that we want in atdml. I'll start one here based on my experience and known mistakes.

Supported legacy features

  • ✅ JSON as a data format
  • ✅ generates OCaml module interfaces (.mli files in addition to .ml files)
  • ✅ ATD modules and imports: ATD type names like foo.bar translating to OCaml type Foo.bar.
  • ✅ polymorphic ATD types aka generics
  • ✅ polymorphic types in OCaml interfaces
  • ✅ artificial wrap construct as the mechanism for converting nodes into a target-specific construct. A typical example is to convert a date-time field represented as a string using whatever date-time library the target language uses. Sample ATD code looks like type datetime = string wrap <ocaml module="ATD_wrappers.Datetime">.
  • ✅ custom default field values { ~foo: bar.buz <ocaml default="Bar.default_buz">; }
  • ✅ generates OCaml functions with labeled and optional arguments for creating records
  • ✅ generates ordinary OCaml code. No C code or Obj.magic coming with security and maintenance concerns.
  • ✅ support for JSON5 and YAML. This was already possible with atdgen by using JSON as an intermediate. Now instead of using raw JSON bytes as an intermediate, we'll use a Yojson AST.
  • ✅ compatibility with ppx derivers. The generated OCaml types will include the ppx annotations specified in the ATD source.
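As an illustration of the wrap construct listed above: the module referenced by <ocaml module="..."> is expected to provide a type t plus wrap and unwrap functions. The sketch below is a placeholder for the ATD_wrappers.Datetime module from the sample; a real version would parse the string with an actual date-time library.

```ocaml
(* Hypothetical module matching the sample annotation
   [type datetime = string wrap <ocaml module="ATD_wrappers.Datetime">].
   Here wrap/unwrap just store the raw string instead of parsing it. *)
module Datetime = struct
  type t = { raw : string }  (* stand-in for a real date-time type *)
  let wrap (s : string) : t = { raw = s }
  let unwrap (x : t) : string = x.raw
end
```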

New features

  • 🆕 one ATD file = one OCaml module (one pair of ml/mli files possibly with submodules)
  • 🆕 always generates OCaml functions with labeled and optional arguments for creating records. These will be generated alongside type definitions (no more atdgen -v foo.atd resulting in separate files foo_v.mli and foo_v.ml)
  • 🆕 defaults to classic variants instead of polymorphic variants. Type-directed disambiguation in modern OCaml makes it easy to reuse constructor names.
  • 🆕 new command-line interface with some simplifications and more popular defaults

Legacy features that won't be supported by design

  • ❌ better performance allowed by not going through an AST
  • ❌ better performance allowed by not creating closures
  • ❌ type foo = abstract <...>. This was already mostly superseded by wrap and is completely superseded by ATD modules and imports.
  • ❌ { ?foo: bar; } is now a syntax error even if bar is an alias for an option type. Use { ?foo: baz option; } instead.
  • ❌ { ~foo: bar; } is now a syntax error even if bar is an alias for a built-in type with a standard default. Use { ~foo: baz list; }, { ~foo: int; }, etc. instead.
  • ❌ no support for ATD-specific JSON syntax extensions. These include the <Foo: "bar"> syntax for JSON variants.

Features remaining unsupported by design

  • ❌ type error messages containing source locations (file name and line/column position in the JSON file). This would require the JSON parser to track source locations, which adds runtime costs that can be significant for some applications. We could optionally use a JSON or YAML parser that reports syntax errors accurately (at a cost), but invalid node kinds, such as finding an int where a string was expected, won't be reported with a precise location. The best we can do is print the AST node that has the unexpected structure, as we do in atdgen today.

Unsupported legacy features due only to budget limits

  • ❓ Biniou as a data format
  • ❓ Custom validation with user-provided functions
  • ❓ Inheritance of record fields from a base record type
  • ❓ Inheritance of variant or enum cases from a base type

@mjambon
Collaborator Author

mjambon commented Nov 21, 2022

@Lupus wrote:

4x performance impact sounds scary.

The atdgen manual gives 3x. I believe it is an accurate average of read and write performance. I don't think it's that big a deal unless all your application does is read JSON data, inspect a single field, and write it back. We had this pattern in a big 25-step map-reduce pipeline running on Hadoop back in the day. For that, we mostly used Biniou rather than JSON because it is again something like 4x faster. Taking care of such optimizations matters in a large-scale context, so it could make sense for a big data company to invest in this to reduce energy and/or hardware costs. It doesn't make sense at the moment to me or to my current employer (r2c), so I'd rather leave optimizations possible but secondary.

@cyberhuman
Contributor

 Inheritance of record fields from a base record type
 Inheritance of variant or enum cases from a base type

We use inheritance A LOT.

@Khady
Contributor

Khady commented Nov 23, 2022

defaults to classic variants instead of polymorphic variants. Type-directed disambiguation in modern OCaml makes it easy to reuse constructor names.

Isn't this going to be a giant pain when updating a large codebase? Hundreds of types and thousands of usages to update.

@mjambon
Collaborator Author

mjambon commented Nov 23, 2022

 Inheritance of record fields from a base record type
 Inheritance of variant or enum cases from a base type

We use inheritance A LOT.

Good to know. It shouldn't be difficult to add without adding complexity.
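For readers unfamiliar with the feature, ATD record inheritance flattens a base record's fields into the derived record. The OCaml expansion below is illustrative, not actual generated code:

```ocaml
(* The ATD source
     type base = { id : int }
     type extended = { inherit base; name : string }
   would expand to flat OCaml records along these lines: *)
type base = { id : int }
type extended = { id : int; name : string }

(* converting back to the base record is a plain field copy *)
let widen (e : extended) : base = { id = e.id }
```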

@mjambon
Collaborator Author

mjambon commented Nov 23, 2022

defaults to classic variants instead of polymorphic variants. Type-directed disambiguation in modern OCaml makes it easy to reuse constructor names.

Isn't this going to be a giant pain when updating a large codebase? Hundreds of types and thousands of usages to update.

Users would need to add the annotation <ocaml repr="poly"> after the ATD type definitions to use polymorphic variants instead of classic variants, similarly to what must be done today to select classic variants. This would require no change in OCaml source code. Here's an example:

type fruit = [
| Apple
| Orange
] <ocaml repr="poly">

If ATD files tend to contain many different variant type definitions, I think a good solution would be to support a global setting in the head part of the file:

(* head: global options *)
<ocaml default_variant_repr="poly">

(* imports *)
import thing

(* body: type definitions *)
type ... = ...
...
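For clarity, here is what the two representations look like on the OCaml side for the fruit example above; the fruit_poly and citrus names are illustrative, and the second classic type shows the constructor-name reuse that type-directed disambiguation allows:

```ocaml
(* default: classic variants *)
type fruit = Apple | Orange

(* with <ocaml repr="poly">: polymorphic variants *)
type fruit_poly = [ `Apple | `Orange ]

(* type-directed disambiguation: another classic type may reuse the
   Orange constructor; the expected type selects the right one *)
type citrus = Orange | Lemon

let is_orange (f : fruit) : bool =
  match f with Orange -> true | Apple -> false
```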

@aryx

aryx commented Nov 23, 2022

Global annotations would also be useful in atdpy, to avoid all those dataclass annotations on every type.
