Charset terribleness breaks Electron apps #6011

emackey · 2017-11-29T21:36:46Z

This is going to be a strange one.

Quick refresher: UTF-8 is, sadly, not the default charset, and must be manually specified. On the web, this is done with HTTP headers (or in HTML, a <meta> tag that emulates the correct HTTP header). On the desktop however, this is done with a BOM (Byte Order Mark, about 3 bytes at the start of a text file). The BOM is tolerated by most browsers, but is not intended to be transmitted over the web, as the web is supposed to be using the charset encoding HTTP headers.

The built, combined/minified Cesium.js file now contains some UTF-8 range characters, as of this month's 1.39 release, due to Arabic and Hebrew chars added to a RegExp in #5931. Cesium's minifier dutifully makes this smaller by taking out the \u codes and replacing them with actual UTF-8 chars.

On the web, this is no problem. Cesium.js does not start with a BOM, since the web does not want a BOM. Most web servers are configured to serve the JS file as UTF-8, and so they interpret these characters correctly.

In a desktop Electron environment though, we're hosed. Electron does not automatically pick this up as UTF-8. The entire Cesium.js file fails to parse, and breaks the entire app even if Cesium is not being used.

One possible solution would be to somehow prevent the minifier from converting these \u codes to actual UTF-8. It would be nice to keep Cesium.js free of live UTF-8 chars for the sake of transmission through unorthodox mechanisms that can't handle the extended character set. The end client's own JS implementation should be able to recover the needed chars from their \u codes that can survive an ASCII transmission. Thoughts?

/cc @mramato

The text was updated successfully, but these errors were encountered:

shunter · 2017-11-29T21:43:08Z

looks like you want the ascii_only option passed to uglify2 by the r.js optimizer.

https://github.com/mishoo/UglifyJS2

https://github.com/requirejs/r.js/blob/master/build/example.build.js#L179

shunter · 2017-11-29T21:48:30Z

also electron/electron#678 suggests that <meta charset="UTF-8"> affects referenced script files, so that might be an easier fix.

emackey · 2017-11-29T21:59:00Z

Thanks @shunter! I do have the <meta> tag in there already, but my viewer is inside a sandboxed iframe and it's not picking up the charset the way it should.

I tried another thing that appears to work: Double slashes. I believe this leaves it to the client JS parser to interpret the codes, instead of our build process. Would this be OK? Or would it be better to add the ascii_only option?

    var hebrew = '\\u05D0-\\u05EA';
    var arabic = '\\u0600-\\u06FF\\u0750–\\u077F\\u08A0–\\u08FF';
    var rtlChars = new RegExp('[' + hebrew + arabic + ']');

emackey · 2017-11-29T22:43:05Z

@shunter I tried to enable this but the build has no diffs, so it didn't take effect.

shunter · 2017-11-29T22:49:50Z

Looks like ascii_only is an "output" option so try in an output sub-object. Compare with the r.js example linked above.

mramato · 2017-11-30T00:10:50Z

@emackey I'm confused as to why this is a problem for you. The latest Cesium release works fine in Electron for me. Here's the HTML (index.js just creates the browser window and loads this HTML page and that's the entire app).

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no">
    <title>Cesium</title>
    <style>
        @import url(./node_modules/cesium/Build/Cesium/Widgets/widgets.css);

        html,
        body,
        #cesiumContainer {
            width: 100%;
            height: 100%;
            margin: 0;
            padding: 0;
            overflow: hidden;
        }
    </style>
</head>

<body>
    <div id="cesiumContainer"></div>
    <script type="text/javascript" src="./node_modules/cesium/Build/Cesium/Cesium.js"></script>
    <script>
        window.viewer = new Cesium.CesiumWidget('cesiumContainer');
    </script>
</body>

</html>

emackey · 2017-11-30T00:17:57Z

It breaks in VSCode's HTML preview window. Scott's fix is going to work. Also I found some minus signs that weren't ASCII minus signs. PR soon.

mramato · 2017-11-30T00:21:11Z

This seems like a bug in VSCode though, doesn't it? Even if we "fix" it in Cesium, sounds like we should opena a VSCode issue for it. I have no idea if other build systems, like webpack, have ascii only options.

emackey · 2017-11-30T00:32:30Z

It used to work though, in VSCode. The hebrew/arabic stuff is what broke it specifically. And oddly, that's nowhere near the only place in Cesium where we've been using UTF-8 codes. I don't yet fully understand what's different about the new RTL regex from all the other old places we encode non-ascii things.

mramato · 2017-11-30T00:36:17Z

Even more of a reason to open an issue in VSCode, they might be able to pinpoint the problem a lot faster since it works in browsers and in Electron and from our understanding we are doing nothing incorrect.

I just want to make sure we full understand the problem before we implement a solution.

emackey · 2017-11-30T00:42:40Z

Well. New clue: All of the existing non-ascii chars in Cesium are wrong in VSCode. But I'm only using the Scene class, and it doesn't depend on most of that, so you can't see they're wrong. The RegExp however contains a sanity check, so when you do something like [8-2], it throws a parse error saying the range is in the wrong order. When the UTF-8 chars are misinterpreted, it looks like an invalid RegExp range, and Cesium fails to parse.

emackey · 2017-11-30T00:44:13Z

Also I confirmed that I can manually add a BOM to the top of Cesium.js to fix VSCode. Naturally I don't want to do that, as it's a manual edit with each Cesium upgrade, and would prevent using the npm package eventually.

emackey · 2017-11-30T00:48:25Z

Switching to an ASCII build adds 1330 bytes to the 2MB Cesium.js. It looks like there really are that many non-ascii chars in play.

emackey · 2017-11-30T15:23:40Z

Well, dang. Almost too easy:

<script type="text/javascript" charset="UTF-8" src="Cesium.js"></script>

mramato · 2017-11-30T15:30:44Z

Cool, just to close the loop on this, the charset specified at the HTML level should normally take care of this, but for whatever reason in VSCode you need to specify the charset directly on the script?

emackey · 2017-11-30T15:46:31Z

I blame the file:/// scheme (it has been the source of a lot of weirdness lately so it's an easy scapegoat). The file scheme has no HTTP headers, so needs its own manual charset specification (in lieu of an actual BOM, which it does respect, but Cesium won't supply).

And if you think about it, the charset specified on the HTML technically should only apply to the HTML itself, not any included files. The trick of applying it to included JS files that you mention above is basically just an assumption on the part of the loader, that two separate files "probably" use the same charset. I would expect the real answer to come from HTTP headers on the JS itself, but, there's no HTTP in my environment here.

Also, one more loop needs closing: Two of your minus signs aren't real minus signs. Do these chars act as actual RegExp range characters, or are they just loose UTF-8 things mucking up your language detection? We should write a unit test to test a char in the middle of each affected range.

shunter · 2017-11-30T15:56:49Z

I was curious so I tested it out. The not-real minus sign is actually an en dash, \u2013. As @emackey predicts, this does not actually work in defining a range of characters in a regex.

new RegExp('a\u2013z').test('b')

=> false

// using the unicode character
new RegExp('a–z').test('b')

=> false

So, yes, the current code does not actually work. Looks like a unit test is missing.

emackey · 2017-11-30T16:21:45Z

@shunter Your sample above is missing square brackets, but the findings are correct.

// en dash
new RegExp('[a–z]').test('b')
false

// minus
new RegExp('[a-z]').test('b')
true

emackey added the category - tooling label Nov 29, 2017

emackey mentioned this issue Nov 30, 2017

Use ASCII chars in minified output #6013

Closed

emackey closed this as completed Nov 30, 2017

emackey mentioned this issue Nov 30, 2017

UTF-8 scripts AnalyticalGraphicsInc/gltf-vscode#70

Merged

emackey mentioned this issue Nov 30, 2017

Real minus signs #6016

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Charset terribleness breaks Electron apps #6011

Charset terribleness breaks Electron apps #6011

emackey commented Nov 29, 2017

shunter commented Nov 29, 2017

shunter commented Nov 29, 2017

emackey commented Nov 29, 2017

emackey commented Nov 29, 2017

shunter commented Nov 29, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017 •

edited

Loading

emackey commented Nov 30, 2017 •

edited

Loading

emackey commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

shunter commented Nov 30, 2017

emackey commented Nov 30, 2017

Charset terribleness breaks Electron apps #6011

Charset terribleness breaks Electron apps #6011

Comments

emackey commented Nov 29, 2017

shunter commented Nov 29, 2017

shunter commented Nov 29, 2017

emackey commented Nov 29, 2017

emackey commented Nov 29, 2017

shunter commented Nov 29, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017 • edited Loading

emackey commented Nov 30, 2017 • edited Loading

emackey commented Nov 30, 2017

emackey commented Nov 30, 2017

mramato commented Nov 30, 2017

emackey commented Nov 30, 2017

shunter commented Nov 30, 2017

emackey commented Nov 30, 2017

emackey commented Nov 30, 2017 •

edited

Loading

emackey commented Nov 30, 2017 •

edited

Loading