JSON Resultset UTF-8 encoding issues when escaped with \u #303

Closed
fgiasson opened this issue Jan 19, 2015 · 5 comments

@fgiasson

Hi,

It appears that characters outside the BMP in SPARQL JSON result sets are escaped with an 8-digit \U sequence, which is not valid JSON, instead of with \u.

Here is a DBpedia query that fails:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+%3Fo+%3Falt%0D%0Awhere%0D%0A%7B%0D%0A++%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FwikiPageRedirects%3E+%3Fo+.%0D%0A%0D%0A++%7B%0D%0A++++%3Fs+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label%3E+%3Falt+.%0D%0A++%7D%0D%0A%7D%0D%0Alimit+10000%0D%0Aoffset+10000&format=application%2Fsparql-results%2Bjson&timeout=30000&debug=on#

Escaped characters such as "\U0001B000" should probably be encoded as "\uD82C\uDC00" instead.

@knoan

knoan commented May 14, 2015

Spot on… JSON only supports 4-digit Unicode escape sequences. Unicode characters outside the BMP must be emitted directly as a UTF-8 sequence (allowed by JSON production char) or encoded as surrogate pairs.
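For reference, the surrogate-pair arithmetic could be sketched roughly as follows. The helper name `toJsonEscape` is made up for illustration and is not part of Virtuoso or any library; it also assumes the input is a valid Unicode code point.

```javascript
// Hypothetical helper: format a Unicode code point as a JSON-legal \u escape.
// Assumes 0 <= codePoint <= 0x10FFFF.
function toJsonEscape(codePoint) {
    // Four upper-case hex digits, zero-padded.
    function hex4(n) {
        return ('0000' + n.toString(16).toUpperCase()).slice(-4);
    }
    if (codePoint <= 0xFFFF) {
        // BMP characters fit in a single 4-digit escape.
        return '\\u' + hex4(codePoint);
    }
    // Supplementary characters: split the 20-bit offset into two 10-bit halves.
    var c = codePoint - 0x10000;
    var high = 0xD800 + (c >> 10);    // lead surrogate
    var low = 0xDC00 + (c & 0x3FF);   // trail surrogate
    return '\\u' + hex4(high) + '\\u' + hex4(low);
}

console.log(toJsonEscape(0x1B000)); // \uD82C\uDC00
```

This matches the "\uD82C\uDC00" suggested in the original report for "\U0001B000".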

This is a serious bug, as the browser-provided JSON.parse() doesn't support lenient parsing and throws a SyntaxError on illegal escape sequences, as in

JSON.parse('"\\U0001B000"')

May be reproduced by the following query on the DBpedia endpoint:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select * {
   <http://dbpedia.org/resource/Ancient_Carthage> rdfs:comment ?c filter (lang(?c) = 'en')
}

@knoan

knoan commented May 14, 2015

The following should work as a stopgap measure:

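    // text is the raw response body. Each 8-digit \U escape is replaced with the
    // equivalent UTF-16 surrogate pair. Assumes the code point is above U+FFFF.
    JSON.parse(text.replace(/\\U([0-9A-Fa-f]{8})/g, function ($0, $1) {

        var c = parseInt($1, 16) - 0x010000;
        var h = (c >> 10) + 0xD800;     // high (lead) surrogate
        var l = (c & 0x3FF) + 0xDC00;   // low (trail) surrogate

        return String.fromCharCode(h, l);

    }));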
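A slightly more defensive variant is sketched below, in case the response can also contain \U escapes for code points at or below U+FFFF, or an escaped backslash that happens to be followed by a "U". The name `parseLenient` is hypothetical, and whether Virtuoso ever emits such input is an assumption; the sketch rewrites to \u escapes rather than raw characters so that a converted quote or control character cannot break the surrounding string.

```javascript
// Hypothetical helper, not part of Virtuoso or any library.
// Rewrites 8-digit \U escapes into JSON-legal \u escapes, then parses.
function parseLenient(text) {
    function u(n) {
        return '\\u' + ('0000' + n.toString(16)).slice(-4);
    }
    // Match every backslash escape so that "\\" is consumed as a unit and a
    // following "U" is not mistaken for the start of a \U escape.
    var fixed = text.replace(/\\(?:U([0-9A-Fa-f]{8})|[\s\S])/g, function (match, hex) {
        if (hex === undefined) {
            return match; // some other escape (\\, \n, \u....): keep unchanged
        }
        var cp = parseInt(hex, 16);
        if (cp > 0x10FFFF) {
            return match; // not a Unicode code point: let JSON.parse reject it
        }
        if (cp <= 0xFFFF) {
            return u(cp); // BMP: a single 4-digit escape is enough
        }
        var c = cp - 0x10000;
        return u(0xD800 + (c >> 10)) + u(0xDC00 + (c & 0x3FF));
    });
    return JSON.parse(fixed);
}

var result = parseLenient('{"c":"\\U0001B000"}');
console.log(result.c.length); // 2 (one surrogate pair)
```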

@HughWilliams
Collaborator

This issue was fixed a few days ago and will be making its way to the commercial and open source archives, DBpedia included, in the coming days.

@HughWilliams
Collaborator

The fix for this issue has been pushed to the open source develop/7 branch:

http://sourceforge.net/p/virtuoso/virtuoso-opensource/ci/e0f65ec67f980251579fbd614be1fb0ac6b18786

@fgiasson
Author

Thanks!
