Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the yaml.js library with another YAML parser #165

Closed
generalmimon opened this issue Feb 4, 2024 · 3 comments · Fixed by #166
Closed

Replace the yaml.js library with another YAML parser #165

generalmimon opened this issue Feb 4, 2024 · 3 comments · Fixed by #166

Comments

@generalmimon
Copy link
Member

So far we've been using jeremyfa/yaml.js to parse YAML in the Web IDE. However, it is more or less abandoned. Last release 0.3.0 was published on 2017-06-24 and there's a note at the top of its README that it is unlikely to receive new features or bugfixes (https://github.com/jeremyfa/yaml.js/tree/efe8ce18704ae43383e4177aa7de1a2619bd4e67#readme):

Development of this library has slowed-down

(...) I don't have much bandwidth to actively provide support to the posted issues asking for new features or bugfixes that don't affect my own use cases of the library. If this situation is an issue for you, I suggest you use js-yaml which is a great and pretty feature-complete yaml parser and dumper for javascript.

But more importantly, we've encountered a number of real problems with jeremyfa/yaml.js in the Web IDE over time (more on that in following comments). So it has long been clear that it's time to switch to another library.

As described at https://philna.sh/blog/2023/02/02/yaml-document-from-hell-javascript-edition/#yaml-in-javascript, there are 3 JavaScript libraries available for YAML parsing:

  1. https://github.com/jeremyfa/yaml.js
  2. https://github.com/nodeca/js-yaml
  3. https://github.com/eemeli/yaml

I wasn't sure whether option 2 or 3 would be a better choice, so I wanted to do some tests. Also, @GreyCat mentioned in https://github.com/kaitai-io/kaitai_struct_webide/pull/84/files#r276498978 that he'd want to make sure that the new YAML parsing library addresses the issues of the old one:

This is actually a pretty big deal, as replacement of this library means that we'll need to re-evaluate all the complaints about YAML parsing - like #63

Please at least put a note about this library change - what exactly is changing and why - into PR info.

So I wrote a few scripts for testing and comparing the parsing results and put them in a repo: https://github.com/generalmimon/js-yaml-parsers-test

Both libraries seem to be pretty solid, definitely better than jeremyfa/yaml.js that we've been using. They both solve many pain points of jeremyfa/yaml.js - I'll list below the issues solved by switching to either library.

Legend to all diffs below:

--- 1/results/yamljs.txt
+++ 2/results/js-yaml.txt
@@ ... @@
  1. No more broken hex literal parsing

  2. No more parsing YAML 1.1 binary literals (0b...) or YAML 1.2 octal literals (0o...) as 0

  3. Flow-style multi-line strings don't trigger a parse error and are parsed correctly as per the spec (i.e. single newlines should not translate to \n in the output string, a double newline is needed for this, see https://yaml-multiline.info/#flow-scalars-plain)

  4. A mapping key-value pair of a mapping indented more than the previous one is considered an error, not included in the value of the previous key-value pair

  5. The notorious colon : as part of the ternary operator in an unquoted string is no longer allowed, which is consistent with the SnakeYAML library used in the JVM compiler (and it is in fact correct behavior according to the YAML spec, see https://matrix.yaml.info/details/ZCZ6.html and https://matrix.yaml.info/details/ZL4Z.html)

  6. Duplicate mapping keys are rejected with an error (this is the default behavior in both nodeca/js-yaml and eemeli/yaml), not silently allowed as in jeremyfa/yaml.js, where only the first entry is kept (i.e. resulting in a data loss, which tends to be surprising for many people that don't know this property of YAML)

    Details
    meta:
      id: yaml_dup_keys
    seq:
      - id: foo
        type: u1
    seq:
      - id: bar
        type: u1
    -{ meta: { id: 'yaml_dup_keys' }, seq: [ { id: 'foo', type: 'u1' } ] }
    +ERROR:
    +> YAMLException: duplicated mapping key in "input.ksy" (6:1)
    +>
    +>  3 | seq:
    +>  4 |   - id: foo
    +>  5 |     type: u1
    +>  6 | seq:
    +> -----^
    +>  7 |   - id: bar
    +>  8 |     type: u1

    (see https://github.com/generalmimon/js-yaml-parsers-test/blob/662870c1c092b8092f458f8ad94c23b3ad62f93c/results-diffs/yamljs_vs_js-yaml.diff#L151-L170)

    Note that this behavior has been suggested several times before:

    In 0.9, the feature of treating duplicate keys as errors was enabled in SnakeYAML used in JVM compiler builds: Multiple type declarations behaviour kaitai_struct#641 (comment)

  7. Comment lines do not suspend counting of line numbers displayed in error messages like they do in jeremyfa/yaml.js

    • Fixes Invalid line number while parsing YAML if comments are used #62

      Details
      # Line 1
      # Line 2
      # Line 3
      # Line 4
      types:
        animal:
            doc: An animal species
          seq:  # line 8, but yaml.js says line 4!
            - id: species
              type: s4
      -> <ParseException> Indentation problem. (line 4: '  seq:  # line 8, but yaml.js says line 4!')
      +> YAMLException: bad indentation of a mapping entry in "input.ksy" (8:5)
      +>
      +>  5 | types:
      +>  6 |   animal:
      +>  7 |       doc: An animal species
      +>  8 |     seq:  # line 8, but yaml.js says ...
      +> ---------^
      +>  9 |       - id: species
      +>  10 |         type: s4
  8. JSON is accepted

@generalmimon
Copy link
Member Author

generalmimon commented Feb 4, 2024

So basically both nodeca/js-yaml and eemeli/yaml look great. However, there are a few practical differences after all.

YAML 1.1 vs. YAML 1.2

eemeli/yaml

More focused on the compliance with YAML specs. It claims to pass all of tests in the YAML test suite (https://github.com/eemeli/yaml/blob/b7696fc001837a2e9d66ad78d7c04f47943daeca/README.md?plain=1#L7):

It allows choosing between YAML 1.1 and YAML 1.2 mode. The differences are listed at https://yaml.org/spec/1.2.2/ext/changes/#changes-in-version-12-revision-120-2009-07-21. These are relevant to our purposes (I'll add numbering for easier reference):

  1. Only true and false strings are parsed as booleans (including True and TRUE); y, yes, on, and their negative counterparts are parsed as strings.
  2. Underlines _ cannot be used within numerical values.
  3. The binary and sexagesimal integer formats have been dropped.

Basically, we want number 1 quite badly, because otherwise seq: [{id: x}, {id: y}] turns to {"seq": [{"id": "x"}, {"id": true}]} (this would obviously affect .ksy files that contain id: y with y unquoted, which is... all of them: https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_formats+%22id%3A+y%22&type=code).

(BTW, SnakeYAML used in KSC builds for JVM targets YAML 1.1, so it parses yes and on as booleans, but apparently makes an exception for y such that y is parsed as a string, not as a boolean.)

However, at the moment, KSC code implicitly relies on the YAML parser to behave according to YAML 1.1 in points 2 and 3 (i.e. recognize underscores _ in numbers and support binary notation). Otherwise, you couldn't use numbers with underscores and binary notation at places where KSC expects to get a number from the YAML parser, i.e.:

  1. enum values

    Details
    • underscores _

      enums:
        fruit:
          '0x12_34': apple
      enum_test.ksy: /enums/fruit:
              error: unable to parse `0x12_34` as int
      
    • binary notation 0b...

      enums:
        fruit:
          '0b11': banana
      enum_test.ksy: /enums/fruit:
              error: unable to parse `0b11` as int
      
  2. terminator and pad-right (though this probably wouldn't be a big deal)

This is quite easy to fix on the KSC side, though. Even if the YAML library parsed all integers as strings (this would be the case if we'd ever configure the YAML parser to use failsafe schema, BTW), it wouldn't be a huge problem, but it would require some additional code in KSC.

nodeca/js-yaml

Although it claims to follow YAML 1.2 (https://github.com/nodeca/js-yaml/blob/2cef47bebf60da141b78b085f3dea3b5733dcc12/README.md?plain=1#L1):

JS-YAML - YAML 1.2 parser / writer for JavaScript

... in practice it still supports some features of YAML 1.1 removed in YAML 1.2. In particular underscores _ in numeric literals and binary notation 0b.... This is convenient for us at the moment, as it means these features will be supported in enum values (and terminator, pad-right) without any changes to KSC.

Error messages

I personally find error messages of nodeca/js-yaml nicer and more understandable, even though eemeli/yaml's errors are somewhat more descriptive instead of just "bad indentation" (but that's probably enough in many cases).

Details

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L43-L52

--- 1/results/yaml-1.2.txt
+++ 2/results/js-yaml.txt
@@ ... @@
-> YAMLParseError: Nested mappings are not allowed in compact mappings at line 3, column 12:
->
->     value: condition ? 4 : 8
->            ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (3:26)
 >
+>  1 | instances:
+>  2 |   foo:
+>  3 |     value: condition ? 4 : 8
+> ------------------------------^

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L136-L145

-> YAMLParseError: Nested mappings are not allowed in compact mappings at line 2, column 7:
->
->   id: yaml_1
->       ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (3:8)
 >
+>  1 | meta:
+>  2 |   id: yaml_1
+>  3 |     seq:
+> ------------^

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L195-L208

-> YAMLParseError: All mapping items must start at the same column at line 8, column 1:
->
->       doc: An animal species
->     seq:  # line 8, but yaml.js says line 4!
-> ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (8:5)
 >
+>  5 | types:
+>  6 |   animal:
+>  7 |       doc: An animal species
+>  8 |     seq:  # line 8, but yaml.js says ...
+> ---------^
+>  9 |       - id: species
+>  10 |         type: s4

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L157-L170

-> YAMLParseError: Map keys must be unique at line 6, column 1:
->
->     type: u1
-> seq:
-> ^
+> YAMLException: duplicated mapping key in "input.ksy" (6:1)
 >
+>  3 | seq:
+>  4 |   - id: foo
+>  5 |     type: u1
+>  6 | seq:
+> -----^
+>  7 |   - id: bar
+>  8 |     type: u1

nodeca/js-yaml also seems to have better detection of duplicate keys - it's able to handle even this case (this is rather an edge case, but still):

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L172-L181

2: "bar"
"2": "foo"
--- 1/results/yaml-1.2.txt
+++ 2/results/js-yaml.txt
@@ ... @@
-{ '2': 'foo' }
+ERROR:
+> YAMLException: duplicated mapping key in "input.ksy" (2:1)
+>
+>  1 | 2: "bar"
+>  2 | "2": "foo"
+> -----^

Ease of integration

nodeca/js-yaml is easier to integrate in the Web IDE, because there's already a pre-built minified JS file: https://github.com/nodeca/js-yaml/blob/4.1.0/dist/js-yaml.min.js

eemeli/yaml doesn't provide a single packaged minified file, see eemeli/yaml#480 (comment). When you install it from npm, there's just the node_modules/yaml/browser folder with 75 unminified files, which are 268 KiB in total:

pp@DESKTOP-89OPGF3 MINGW64 /c/temp/js-yaml-parsers-test/node_modules/yaml/browser (master)
$ find . -type f -exec wc -c {} +
...
274427 total

In comparison, nodeca/js-yaml has a .min.js file, which is 38.5 KiB in size:

pp@DESKTOP-89OPGF3 MINGW64 /c/temp/js-yaml-parsers-test/node_modules/js-yaml/dist (master)
$ wc -c js-yaml.min.js
39430 js-yaml.min.js

That's almost 7 times smaller, and it's only 1 file instead of 75 individual ones, meaning the browser has to send only 1 request instead of 75, so it should all be faster. (Yes, technically it's not really fair to compare unminified files to a minified one, but unfortunately no proper minification/packaging is part of our existing Web IDE infrastructure, and I don't want to spend time on it at this point.)

Conclusion

In the end I chose nodeca/js-yaml for the reasons I stated above.

It wouldn't be a big problem to switch to eemeli/yaml if we ever wanted to, but it would require some changes to KSC and the Web IDE infrastructure (add minification step etc.). For now, nodeca/js-yaml is easier to switch to, and I suppose it will work a bit better for us than eemeli/yaml would.

Obviously, it would be best to use a YAML parser in Scala that works in all environments (kaitai-io/kaitai_struct#229), but until that becomes a reality, I believe switching to nodeca/js-yaml is much better than staying with jeremyfa/yaml.js.

@GreyCat
Copy link
Member

GreyCat commented Feb 5, 2024

Hey, @generalmimon, as always, this is very very thorough and I can only thank you for doing all this work. This indeed looks like a magnificent improvement and fixes quite a lot of bugs that have pestered us and our users for years!

I agree with all your assessments. For YAML 1.1 vs YAML 1.2 — I believe ultimately it we'll have to settle on certain pattern of treatment these values holistically — e.g.:

  • either go "all strict everywhere", demanding all values coming in to be exactly strings (i.e. no foo: true or foo: 123 allowed, it must be always foo: "true" or foo: "123")
  • or go "we never demand a specific type when we need value, we're ok to be flexible to accept certain numbers and booleans as values, and we're ok for YAML parser to do it, otherwise our own parse will take it".

So far, looks like most of the places when we had it leans towards the latter (it makes it very useful for authoring YAML by hand).

@generalmimon
Copy link
Member Author

generalmimon commented Feb 7, 2024

@GreyCat Thanks for your comment!

  • either go "all strict everywhere", demanding all values coming in to be exactly strings (i.e. no foo: true or foo: 123 allowed, it must be always foo: "true" or foo: "123")
  • or go "we never demand a specific type when we need value, we're ok to be flexible to accept certain numbers and booleans as values, and we're ok for YAML parser to do it, otherwise our own parse will take it".

So far, looks like most of the places when we had it leans towards the latter (it makes it very useful for authoring YAML by hand).

Yeah, the latter option sounds more reasonable of the two. I don't see a reason for option one, that sounds too radical and unnecessary (but at least using two pairs of quotes to express a string literal in a KS expression as in https://doc.kaitai.io/user_guide.html#_switching_over_strings wouldn't feel weird anymore 🙂).

Instead, as you've already mentioned in kaitai-io/kaitai_struct#229 (comment):

What we're generally ok to drop

Some YAML compatibility, i.e. we're generally ok to implement a smaller YAML subset. For example, we don't need:

... we can eventually configure our YAML parsers to use failsafe schema, which basically disables scalar interpretation done by the YAML parser. In this mode, there are only 3 data types the YAML parser recognizes: mapping, sequence and string. This means that it no longer knows anything about null, booleans, integers or floats - you'll get any representation of these types (that would normally be mapped to the appropriate type) as a plain string.

For KSC purposes, I suppose this would be pretty much perfect and all we really need. In our case, having the YAML parser produce all these different data types is more annoying than helpful.

For example, right now if you want to have a value instance in the Web IDE with an integer literal that requires more than 53 bits of precision to store exactly, you have to wrap it into quotes like value: '9007199254740993' so that the number is only parsed from string by our expression parser in KSC (which uses BigInts), rather than by the JavaScript YAML parser (as with unquoted value: 9007199254740993), which uses the native JS number type (a.k.a. double, i.e. a 64-bit IEEE 754 floating-point number). However, this inevitably leads to losing precision, since double can only represent 9007199254740992 or 9007199254740994, but nothing in between.

But even on the JVM with SnakeYAML that reads big integers as java.math.BigInteger, the KSC code then converts it back to string, so the string-integer conversion done by SnakeYAML isn't much appreciated either :) (see https://github.com/kaitai-io/kaitai_struct_compiler/blob/3af7a08e0f67212f42a4f8b97a0b25a72395a2c4/jvm/src/main/scala/io/kaitai/struct/formats/JavaKSYParser.scala#L99-L100):

      case _: java.math.BigInteger =>
        src.toString

But at least there's no loss of precision.


Looking at https://doc.kaitai.io/ksy_diagram.html, most of YAML keys that we recognize in .ksy specs are either pure strings (e.g. id, doc) or expressions (if, size), where it's perfectly fine to always receive them as strings as well.

There are a few pure boolean keys (see also https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_compiler%20getOptValueBool&type=code):

  • meta/ks-debug
  • meta/ks-opaque-types
  • size-eos
  • consume
  • include
  • eos-error

and pure integer keys (see https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_compiler+getOptValueInt&type=code):

  • terminator
  • pad-right

And as already mentioned, the integer keys of enum entries would have to be parsed too.

But it's not a big problem to adapt these in KSC to accept strings instead and do the appropriate parsing.

Perhaps the only potential problem is that this change will create some slight incompatibilities with other standalone tools that consume .ksy files and don't use failsafe schema (most likely some situations that were allowed with a YAML parser using implicit typing won't be allowed anymore and vice versa). This is because with failsafe schema, we simply always get strings, but we have no way of knowing whether a scalar was unquoted (meaning that it may be subject to implicit typing in YAML parsers that don't use failsafe schema) or not. For example, size-eos: 'true' or terminator: '0x00' would now be allowed, even though they weren't before. On the other hand, size-eos: yes or size-eos: on would now be probably disallowed (unless we decided to allow these YAML 1.1 representations of true, but I'd rather not to - just literal true and false are enough IMO), although they worked before. But this is probably not a big deal.

Both nodeca/js-yaml and eemeli/yaml that I've talked about in this issue support failsafe schema. SnakeYAML we're currently using doesn't seem to, but SnakeYAML Engine claiming to be a YAML 1.2 processor (as opposed to SnakeYAML, which targets YAML 1.1) apparently supports it (see https://bitbucket.org/snakeyaml/snakeyaml-engine/src/f4b8cdb1e846d5354fb4791916ba3c5932496770/src/test/java/org/snakeyaml/engine/schema/FailsafeTest.java).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants