Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

Closed
GlenAppleton opened this issue Sep 16, 2020 · 9 comments · Fixed by #14
Closed

UTF-8 with BOM and UTF-16 LE encoded files produce read error #13

GlenAppleton opened this issue Sep 16, 2020 · 9 comments · Fixed by #14

Comments

@GlenAppleton
Copy link

I've been running some tests using your yajsv (windows 64 version) utility, and I've found that both UTF-8 (BOM) and UTF-16 LE encoded files throw the following errors:

UTF-8 with BOM : error: validate: invalid character 'ï' looking for beginning of value
UTF-16 LE : error: validate: invalid character 'ÿ' looking for beginning of value

For my purpose implementing in a PowerShell script, it's not a big deal since I can re-encode the files to standard UTF-8 and it works, but I thought you'd like to know about the encoding limitations.

Great utility, BTW!

@neilpa
Copy link
Owner

neilpa commented Sep 17, 2020

I had forgotten about the prevalence of UTF-16 on Windows although I can't say I've ever seen UTF-8 files with a BOM. I suspect this would be relatively easy to detect and support via the golang.org/x/text package.

Out of curiosity, do you happen to know what utilities generated the UTF-8 w/ BOM files?

@neilpa
Copy link
Owner

neilpa commented Sep 17, 2020

Turns out it's invalid to include a BOM in a JSON document. I'll make this case fail with a better error message but it doesn't make sense for a JSON validator to accept invalid JSON 😉. As for the other UTF-* encodings, that post offers a simple approach to detect them based on the first few bytes.

@neilpa
Copy link
Owner

neilpa commented Sep 17, 2020

Further complicating matters, YAML does support BOMs and are required for UTF-16 detection.

@GlenAppleton
Copy link
Author

@neilpa - There are / were certain situations where older PowerShell versions will encode UTF-8 with BOM when "piping" the output of a text value to a file. As for the UTF-16 LE encoding, I can't say where those are coming from except to say that some of our developers are using Macs, so maybe the default encoding for them is different. It could also just be a matter of editor choice by the developers as well.

@GlenAppleton
Copy link
Author

FYI: The RFC cited was obsoleted by this one: https://tools.ietf.org/html/rfc8259

And, in both the following text is present:

Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text. In the interests of
interoperability, implementations that parse JSON texts MAY ignore
the presence of a byte order mark rather than treating it as an
error.

So, it's really up to you as to whether or not you want your schema validation utility to be able to parse what might be considered "invalid" JSON data. I understand if you do, but you could also consider it a warning condition and maybe add a switch to ignore warnings, leaving the default to treat them as errors.

@neilpa
Copy link
Owner

neilpa commented Sep 17, 2020

Thanks for looking more closely at the RFCs. Obviously I didn't read those carefully enough last night. Given that second sentence it seems reasonable to allow it.

you could also consider it a warning condition and maybe add a switch to ignore warnings, leaving the default to treat them as errors.

Right now files are bucketed to a pass, fail, or error, the last being invalid JSON. I'd say these should still default to an "error" but with a better message about the BOM. The rationale being the moment they are served as-is over the internet the become invalid.

As for the switch, I'm thinking along the lines of -allow-bom to opt-out of that behavior. Although that would diverge from the current single character flags. Might be time to rethink/break given the feedback in #12

@GlenAppleton
Copy link
Author

GlenAppleton commented Sep 18, 2020

@neilpa - As a potential future feature request: Have you considered adding a "debug" switch like -d, where you would send useful information to standard out, such as the text encoding? It could work with the switch you suggested -allow-bom to help those of us who are implementing your utility in a pipeline to figure out potential issues.

Your thoughts?

@neilpa
Copy link
Owner

neilpa commented Oct 6, 2020

@GlenAppleton I posted a v1.4.0 release with support for UTF-16 files. I also added the -b flag for allowing BOMs in JSON files where it would otherwise be an error.

Let me know if you have any issues. I haven't tried this manually on Windows but the large test matrix I added passes the Github CI build on Windows.

Have you considered adding a "debug" switch like -d, where you would send useful information to standard out, such as the text encoding?

I decided to not do this for now but made a note about the possibility in #12

@GlenAppleton
Copy link
Author

@neilpa - Thanks! I'll give it a try once I finish my current project task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants