Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix out-of-bounds error in atdgen parsers #150

Merged
merged 6 commits into from
Jul 25, 2022
Merged

Conversation

mjambon
Copy link
Member

@mjambon mjambon commented Jul 20, 2022

When running make test in atd, which just switched to yojson 2.0.x, we get an out-of-bounds error on two unit tests:

$ make test
...
> [FAIL]        atdgen         34   test ambiguous record with json adapters.
  [FAIL]        atdgen         35   test wrapping of polymorphic types.
...
┌──────────────────────────────────────────────────────────────────────────────┐
│ [FAIL]        atdgen         34   test ambiguous record with json adapters.  │
└──────────────────────────────────────────────────────────────────────────────┘
[invalid] out-of-bounds substring position or length: string = "ambiguous", requested position = 1, requested length = 9
          Raised at Stdlib.invalid_arg in file "stdlib.ml", line 30, characters 20-45
          Called from Dune__exe__Test_ambiguous_record_j.read_ambiguous.f in file "atdgen/test/test_ambiguous_record_j.ml", line 203, characters 12-156
          Called from Yojson.Safe.map_ident in file "lib/read.ml" (inlined), line 1977, characters 3-43
          Called from Dune__exe__Test_ambiguous_record_j.read_ambiguous in file "atdgen/test/test_ambiguous_record_j.ml", line 225, characters 14-42
          Called from Dune__exe__Test_atdgen_main.test_ambiguous_record in file "atdgen/test/test_atdgen_main.ml", line 579, characters 4-55
          Called from Alcotest_engine__Core.Make.protect_test.(fun) in file "src/alcotest-engine/core.ml", line 180, characters 17-23
          Called from Alcotest_engine__Monad.Identity.catch in file "src/alcotest-engine/monad.ml", line 24, characters 31-35

The problem occurs when parsing unquoted JSON fields, which are supported as an extension of JSON. I believe I fixed the bug, but didn't add a unit test yet because I'm not sure yet where to put it.

To get the detailed error message above, check out ahrefs/atd#310 and run make test.

parses an unquoted JSON field.
TODO: write a unit test.
@Leonidas-from-XIV
Copy link
Member

I'll take a look tomorrow to see if some kind of unit test can be built to exercise this, given it is quite subtle and has not been detected by anything but atdgen. Clearly that part is under-tested.

@Leonidas-from-XIV
Copy link
Member

While figuring out the situation it seems like I broke this in #108 following #85, so I've added a test making sure that the f function gets the right range passed.

@Leonidas-from-XIV
Copy link
Member

I've been wondering whether I can write a test for {foo: null, bar: "hello"} using map_string on "hello", but when I reach the read_space after the bar: applying map_string calls the mapping function with "" 0 0 since it assumes that the opening " of the string is the ending " of the string, so returns the buffer that hasn't accumulated anything.

I am not sure how this API is supposed to be used. The only function I see for reading this is to call start_any_variant to discard the opening " but it feels a bit… wrong. What does atdgen do to map_string?

According to this autogenerated code it does seem to start_any_variant and expect the double-quote constructor?

@Leonidas-from-XIV
Copy link
Member

Ok, I added tests for map_string as well, at least in as far as I understand the API.

Looks like if we remove the variants as suggested in #104 we'd either need to break the API for atdgen or live with somewhat confusing function names but that can be revisited at that time.

Copy link
Collaborator

@panglesd panglesd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the fix!
I guess the fact that map_ident and map_lexeme are values that are not actually used by yojson itself, but by atdgen, made them escape the tests...
This is now fixed, thanks @Leonidas-from-XIV for the added tests.

test/test_read.ml Outdated Show resolved Hide resolved
test/test_read.ml Outdated Show resolved Hide resolved
test/test_read.ml Outdated Show resolved Hide resolved
@Leonidas-from-XIV Leonidas-from-XIV merged commit 5966bcc into master Jul 25, 2022
@Leonidas-from-XIV Leonidas-from-XIV deleted the mj-out-of-bounds branch July 25, 2022 13:37
@mjambon
Copy link
Member Author

mjambon commented Jul 25, 2022

What does atdgen do to map_string?

It calls it via map_ident. This used to be an optimization that prevented allocating a string for the JSON field name. This is one of the reasons why the Buffer module from the standard library wasn't suitable. See for example this generated code which does the equivalent of

match String.sub s pos len with
| "sid" -> 2
...

(returning an int rather than constructing an expression is itself part of an optimization that prevents the creation of a closure)

Now that we use the Buffer module and are forced to allocate a fresh string to hold the substring, the optimization that saved the allocation of the substring is gone and we could simplify the generated code above.

Note that there are other optimizations that make atdgen's code generation more complicated than it could be. I estimate the speedup provided by these optimizations to be about 4x compared to naive implementation. It's something but it's not huge. I think these optimizations are only valuable when dealing with massive amounts of JSON data. I'm open to getting rid of them. Atdgen parsers could work on the Yojson.Safe.t AST instead of calling a bunch of partial parsing functions. This is what we do for the other languages supported by atd (Java, Python, Scala, TypeScript) and used to do with atdgen's predecessor based on camlp4, json-static. It would reduce the maintenance burden of atdgen but also of yojson.

@Leonidas-from-XIV
Copy link
Member

I was just thinking about it the other day - there is a lot of support-code for atdgen that's dealing with the internals of Yojson, which forces us to either keep the implementation fixed to not break atdgen or break atdgen all the time, neither of these options is particularly exciting (that said, there is also not much of a need to change things willy-nilly)

So making atdgen use the Yojson.Safe.t AST might solve this and allow us to concentrate on improving one surface API. But I realize that this might create issues for atdgen users. Unfortunately the slowdown is hard to quantify without... essentially doing the work on the atdgen side. I'll create an RFC to keep track of this and maybe we can come up with another fast API that makes sense for consumers that already know what structure they expect (not only atdgen but also ppx_deriving_yojson and ppx_yojson_conv come to mind).

Leonidas-from-XIV added a commit to Leonidas-from-XIV/opam-repository that referenced this pull request Aug 9, 2022
CHANGES:

*2022-08-09*

### Added

- Expanded documentation of exceptions (@sim642, ocaml-community/yojson#148)

### Removed

- Removed undocumented and unused functions `write_float_fast` and
  `write_std_float_fast` from `Yojson`, `Yojson.Basic` and `Yojson.Safe`
  (@sim642, ocaml-community/yojson#149)

### Fixed

- Fix out-of-bounds error occurring when parsing object field names
  with atdgen parsers using `map_ident` or `map_lexeme` (@mjambon, ocaml-community/yojson#150)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants