UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

GlenAppleton · 2020-09-16T18:18:38Z

I've been running some tests using your yajsv (windows 64 version) utility, and I've found that both UTF-8 (BOM) and UTF-16 LE encoded files throw the following errors:

UTF-8 with BOM : error: validate: invalid character 'ï' looking for beginning of value
UTF-16 LE : error: validate: invalid character 'ÿ' looking for beginning of value

For my purpose implementing in a PowerShell script, it's not a big deal since I can re-encode the files to standard UTF-8 and it works, but I thought you'd like to know about the encoding limitations.

Great utility, BTW!

neilpa · 2020-09-17T01:33:29Z

I had forgotten about the prevalence of UTF-16 on Windows although I can't say I've ever seen UTF-8 files with a BOM. I suspect this would be relatively easy to detect and support via the golang.org/x/text package.

Out of curiosity, do you happen to know what utilities generated the UTF-8 w/ BOM files?

neilpa · 2020-09-17T02:22:06Z

Turns out it's invalid to include a BOM in a JSON document. I'll make this case fail with a better error message but it doesn't make sense for a JSON validator to accept invalid JSON 😉. As for the other UTF-* encodings, that post offers a simple approach to detect them based on the first few bytes.

neilpa · 2020-09-17T02:42:41Z

Further complicating matters, YAML does support BOMs and are required for UTF-16 detection.

GlenAppleton · 2020-09-17T12:17:47Z

@neilpa - There are / were certain situations where older PowerShell versions will encode UTF-8 with BOM when "piping" the output of a text value to a file. As for the UTF-16 LE encoding, I can't say where those are coming from except to say that some of our developers are using Macs, so maybe the default encoding for them is different. It could also just be a matter of editor choice by the developers as well.

GlenAppleton · 2020-09-17T12:53:34Z

FYI: The RFC cited was obsoleted by this one: https://tools.ietf.org/html/rfc8259

And, in both the following text is present:

Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text. In the interests of
interoperability, implementations that parse JSON texts MAY ignore
the presence of a byte order mark rather than treating it as an
error.

So, it's really up to you as to whether or not you want your schema validation utility to be able to parse what might be considered "invalid" JSON data. I understand if you do, but you could also consider it a warning condition and maybe add a switch to ignore warnings, leaving the default to treat them as errors.

neilpa · 2020-09-17T19:28:11Z

Thanks for looking more closely at the RFCs. Obviously I didn't read those carefully enough last night. Given that second sentence it seems reasonable to allow it.

you could also consider it a warning condition and maybe add a switch to ignore warnings, leaving the default to treat them as errors.

Right now files are bucketed to a pass, fail, or error, the last being invalid JSON. I'd say these should still default to an "error" but with a better message about the BOM. The rationale being the moment they are served as-is over the internet the become invalid.

As for the switch, I'm thinking along the lines of -allow-bom to opt-out of that behavior. Although that would diverge from the current single character flags. Might be time to rethink/break given the feedback in #12

GlenAppleton · 2020-09-18T15:24:04Z

@neilpa - As a potential future feature request: Have you considered adding a "debug" switch like -d, where you would send useful information to standard out, such as the text encoding? It could work with the switch you suggested -allow-bom to help those of us who are implementing your utility in a pipeline to figure out potential issues.

Your thoughts?

neilpa · 2020-10-06T23:52:24Z

@GlenAppleton I posted a v1.4.0 release with support for UTF-16 files. I also added the -b flag for allowing BOMs in JSON files where it would otherwise be an error.

Let me know if you have any issues. I haven't tried this manually on Windows but the large test matrix I added passes the Github CI build on Windows.

Have you considered adding a "debug" switch like -d, where you would send useful information to standard out, such as the text encoding?

I decided to not do this for now but made a note about the possibility in #12

GlenAppleton · 2020-10-07T20:08:13Z

@neilpa - Thanks! I'll give it a try once I finish my current project task.

neilpa mentioned this issue Sep 18, 2020

UTF-16 support for JSON and option to skip BOMs #14

Merged

2 tasks

neilpa closed this as completed in #14 Oct 6, 2020

neilpa mentioned this issue Oct 6, 2020

Add flag for zero-output/exit code only #12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

GlenAppleton commented Sep 16, 2020

neilpa commented Sep 17, 2020

neilpa commented Sep 17, 2020

neilpa commented Sep 17, 2020 •

edited

Loading

GlenAppleton commented Sep 17, 2020

GlenAppleton commented Sep 17, 2020

neilpa commented Sep 17, 2020 •

edited

Loading

GlenAppleton commented Sep 18, 2020 •

edited

Loading

neilpa commented Oct 6, 2020

GlenAppleton commented Oct 7, 2020

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

Comments

GlenAppleton commented Sep 16, 2020

neilpa commented Sep 17, 2020

neilpa commented Sep 17, 2020

neilpa commented Sep 17, 2020 • edited Loading

GlenAppleton commented Sep 17, 2020

GlenAppleton commented Sep 17, 2020

neilpa commented Sep 17, 2020 • edited Loading

GlenAppleton commented Sep 18, 2020 • edited Loading

neilpa commented Oct 6, 2020

GlenAppleton commented Oct 7, 2020

neilpa commented Sep 17, 2020 •

edited

Loading

neilpa commented Sep 17, 2020 •

edited

Loading

GlenAppleton commented Sep 18, 2020 •

edited

Loading