Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Charset terribleness breaks Electron apps #6011

Closed
emackey opened this issue Nov 29, 2017 · 18 comments
Closed

Charset terribleness breaks Electron apps #6011

emackey opened this issue Nov 29, 2017 · 18 comments

Comments

@emackey
Copy link
Contributor

emackey commented Nov 29, 2017

This is going to be a strange one.

Quick refresher: UTF-8 is, sadly, not the default charset, and must be manually specified. On the web, this is done with HTTP headers (or in HTML, a <meta> tag that emulates the correct HTTP header). On the desktop however, this is done with a BOM (Byte Order Mark, about 3 bytes at the start of a text file). The BOM is tolerated by most browsers, but is not intended to be transmitted over the web, as the web is supposed to be using the charset encoding HTTP headers.

The built, combined/minified Cesium.js file now contains some UTF-8 range characters, as of this month's 1.39 release, due to Arabic and Hebrew chars added to a RegExp in #5931. Cesium's minifier dutifully makes this smaller by taking out the \u codes and replacing them with actual UTF-8 chars.

On the web, this is no problem. Cesium.js does not start with a BOM, since the web does not want a BOM. Most web servers are configured to serve the JS file as UTF-8, and so they interpret these characters correctly.

In a desktop Electron environment though, we're hosed. Electron does not automatically pick this up as UTF-8. The entire Cesium.js file fails to parse, and breaks the entire app even if Cesium is not being used.

One possible solution would be to somehow prevent the minifier from converting these \u codes to actual UTF-8. It would be nice to keep Cesium.js free of live UTF-8 chars for the sake of transmission through unorthodox mechanisms that can't handle the extended character set. The end client's own JS implementation should be able to recover the needed chars from their \u codes that can survive an ASCII transmission. Thoughts?

/cc @mramato

@shunter
Copy link
Contributor

shunter commented Nov 29, 2017

looks like you want the ascii_only option passed to uglify2 by the r.js optimizer.

https://github.com/mishoo/UglifyJS2

https://github.com/requirejs/r.js/blob/master/build/example.build.js#L179

@shunter
Copy link
Contributor

shunter commented Nov 29, 2017

also electron/electron#678 suggests that <meta charset="UTF-8"> affects referenced script files, so that might be an easier fix.

@emackey
Copy link
Contributor Author

emackey commented Nov 29, 2017

Thanks @shunter! I do have the <meta> tag in there already, but my viewer is inside a sandboxed iframe and it's not picking up the charset the way it should.

I tried another thing that appears to work: Double slashes. I believe this leaves it to the client JS parser to interpret the codes, instead of our build process. Would this be OK? Or would it be better to add the ascii_only option?

    var hebrew = '\\u05D0-\\u05EA';
    var arabic = '\\u0600-\\u06FF\\u0750–\\u077F\\u08A0–\\u08FF';
    var rtlChars = new RegExp('[' + hebrew + arabic + ']');

@emackey
Copy link
Contributor Author

emackey commented Nov 29, 2017

@shunter I tried to enable this but the build has no diffs, so it didn't take effect.

@shunter
Copy link
Contributor

shunter commented Nov 29, 2017

Looks like ascii_only is an "output" option so try in an output sub-object. Compare with the r.js example linked above.

@mramato
Copy link
Contributor

mramato commented Nov 30, 2017

@emackey I'm confused as to why this is a problem for you. The latest Cesium release works fine in Electron for me. Here's the HTML (index.js just creates the browser window and loads this HTML page and that's the entire app).

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no">
    <title>Cesium</title>
    <style>
        @import url(./node_modules/cesium/Build/Cesium/Widgets/widgets.css);

        html,
        body,
        #cesiumContainer {
            width: 100%;
            height: 100%;
            margin: 0;
            padding: 0;
            overflow: hidden;
        }
    </style>
</head>

<body>
    <div id="cesiumContainer"></div>
    <script type="text/javascript" src="./node_modules/cesium/Build/Cesium/Cesium.js"></script>
    <script>
        window.viewer = new Cesium.CesiumWidget('cesiumContainer');
    </script>
</body>

</html>

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

It breaks in VSCode's HTML preview window. Scott's fix is going to work. Also I found some minus signs that weren't ASCII minus signs. PR soon.

@mramato
Copy link
Contributor

mramato commented Nov 30, 2017

This seems like a bug in VSCode though, doesn't it? Even if we "fix" it in Cesium, sounds like we should opena a VSCode issue for it. I have no idea if other build systems, like webpack, have ascii only options.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

It used to work though, in VSCode. The hebrew/arabic stuff is what broke it specifically. And oddly, that's nowhere near the only place in Cesium where we've been using UTF-8 codes. I don't yet fully understand what's different about the new RTL regex from all the other old places we encode non-ascii things.

@mramato
Copy link
Contributor

mramato commented Nov 30, 2017

Even more of a reason to open an issue in VSCode, they might be able to pinpoint the problem a lot faster since it works in browsers and in Electron and from our understanding we are doing nothing incorrect.

I just want to make sure we full understand the problem before we implement a solution.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

Well. New clue: All of the existing non-ascii chars in Cesium are wrong in VSCode. But I'm only using the Scene class, and it doesn't depend on most of that, so you can't see they're wrong. The RegExp however contains a sanity check, so when you do something like [8-2], it throws a parse error saying the range is in the wrong order. When the UTF-8 chars are misinterpreted, it looks like an invalid RegExp range, and Cesium fails to parse.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

Also I confirmed that I can manually add a BOM to the top of Cesium.js to fix VSCode. Naturally I don't want to do that, as it's a manual edit with each Cesium upgrade, and would prevent using the npm package eventually.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

Switching to an ASCII build adds 1330 bytes to the 2MB Cesium.js. It looks like there really are that many non-ascii chars in play.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

Well, dang. Almost too easy:

<script type="text/javascript" charset="UTF-8" src="Cesium.js"></script>

@emackey emackey closed this as completed Nov 30, 2017
@mramato
Copy link
Contributor

mramato commented Nov 30, 2017

Cool, just to close the loop on this, the charset specified at the HTML level should normally take care of this, but for whatever reason in VSCode you need to specify the charset directly on the script?

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

I blame the file:/// scheme (it has been the source of a lot of weirdness lately so it's an easy scapegoat). The file scheme has no HTTP headers, so needs its own manual charset specification (in lieu of an actual BOM, which it does respect, but Cesium won't supply).

And if you think about it, the charset specified on the HTML technically should only apply to the HTML itself, not any included files. The trick of applying it to included JS files that you mention above is basically just an assumption on the part of the loader, that two separate files "probably" use the same charset. I would expect the real answer to come from HTTP headers on the JS itself, but, there's no HTTP in my environment here.

Also, one more loop needs closing: Two of your minus signs aren't real minus signs. Do these chars act as actual RegExp range characters, or are they just loose UTF-8 things mucking up your language detection? We should write a unit test to test a char in the middle of each affected range.

@shunter
Copy link
Contributor

shunter commented Nov 30, 2017

I was curious so I tested it out. The not-real minus sign is actually an en dash, \u2013. As @emackey predicts, this does not actually work in defining a range of characters in a regex.

new RegExp('a\u2013z').test('b')

=> false

// using the unicode character
new RegExp('a–z').test('b')

=> false

So, yes, the current code does not actually work. Looks like a unit test is missing.

@emackey
Copy link
Contributor Author

emackey commented Nov 30, 2017

@shunter Your sample above is missing square brackets, but the findings are correct.

// en dash
new RegExp('[a–z]').test('b')
false

// minus
new RegExp('[a-z]').test('b')
true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants