
How to assign encoding of response content? #26

Open
winglight opened this issue Aug 4, 2016 · 6 comments

@winglight

I get response content in the wrong charset when crawling a web page that is not UTF-8 encoded. Here is an example URL: http://www.cartoomad.com/comic/276400012051002.html

@amoilanen
Owner

Hi, thank you for filing the issue, I will take a look. Normally we would read the encoding from the HTTP headers, but maybe in this case it does not quite work and we can think of alternatives.
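
For reference, reading the encoding from the HTTP headers usually means parsing the charset parameter of the Content-Type header. The following is only a minimal sketch of that idea, not the crawler's actual code; the function name charsetFromContentType is hypothetical.

```javascript
// Hypothetical helper, not part of the crawler's API.
// Returns the lower-cased charset label from a Content-Type header value,
// or null when the header is missing or carries no charset parameter.
function charsetFromContentType(headerValue) {
  if (typeof headerValue !== 'string') return null;
  // Matches charset=big5, charset="utf-8", Charset = 'ISO-8859-1', etc.
  const match = /;\s*charset\s*=\s*["']?([^"';\s]+)/i.exec(headerValue);
  return match ? match[1].toLowerCase() : null;
}
```

A null result is the situation described in this issue: the server sends Content-Type without a charset, so the encoding has to come from somewhere else.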

amoilanen self-assigned this Aug 10, 2016

@winglight
Author

I checked the response for this URL: its headers do not specify an encoding, so the current code cannot determine the correct one. An alternative might be to check the meta values in the response body, such as:
<meta http-equiv="Content-Type" content="text/html; charset=big5">

@tibetty
Contributor

tibetty commented Nov 11, 2016

@winglight in this case, you can use the indexOf function (and other search functions) of Buffer to extract the encoding from the body. Please note that Node.js supports only a few character encodings by default, and big5 is not among them, so you may need to find a decoder/transcoder before processing big5-encoded content, given that your code is most likely working with UTF-8.
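
A rough sketch of the Buffer.indexOf approach might look like the following. The function name sniffCharset and the 1024-byte scan limit are assumptions made for illustration, and the search is deliberately simple: it is case-sensitive and does not parse HTML.

```javascript
// Hypothetical helper: looks for "charset=" in the first bytes of the raw body.
// Limitation: Buffer.indexOf is case-sensitive, so "CHARSET=" is not found.
function sniffCharset(body, limit = 1024) {
  const head = body.subarray(0, limit);
  const marker = Buffer.from('charset=', 'ascii');
  const start = head.indexOf(marker);
  if (start === -1) return null;
  let pos = start + marker.length;
  // Skip an optional opening quote, as in <meta charset="big5">.
  if (head[pos] === 0x22 || head[pos] === 0x27) pos++;
  let end = pos;
  // Read on while the bytes look like part of a charset label.
  while (end < head.length && /[A-Za-z0-9._:-]/.test(String.fromCharCode(head[end]))) end++;
  return end > pos ? head.toString('ascii', pos, end).toLowerCase() : null;
}
```

Working on the Buffer directly avoids decoding the whole body with a guessed encoding first; the charset declaration itself is plain ASCII in the encodings discussed here.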

@ngouy

ngouy commented Oct 17, 2017

Same problem here with a page that declares charset=iso-8859-1.

@aidik

aidik commented Nov 18, 2018

+1

@tibetty
Contributor

tibetty commented Nov 19, 2018

When you don't set the encoding, the crawler will not do any decoding for you (Node.js itself natively supports only a few text encodings, such as UTF-8, UTF-16LE, latin1 and ASCII, so there is not much choice). In this case, the received body can be treated as a Buffer that contains the raw bytes in the page's encoding. You can use a third-party decoding tool like node-iconv or iconv-lite to convert it to a Unicode string, which is what JavaScript supports, and then process the converted string in the way you are used to.
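
As an illustration of the conversion step, here is a minimal sketch that needs no extra packages. Note that it swaps in Node's built-in TextDecoder rather than node-iconv or iconv-lite, so it is an alternative to the tools named above, not their API. The function name decodeBody is hypothetical, and decoding labels such as big5 this way assumes a Node.js build with full ICU data.

```javascript
// Charset labels that Buffer can decode by itself, mapped to Buffer encoding names.
const NATIVE_ENCODINGS = new Map([
  ['utf-8', 'utf8'],
  ['utf8', 'utf8'],
  ['utf-16le', 'utf16le'],
  ['iso-8859-1', 'latin1'],
  ['latin1', 'latin1'],
]);

// Hypothetical helper: turns a raw body Buffer into a JavaScript string.
function decodeBody(body, charset) {
  const label = String(charset || 'utf-8').trim().toLowerCase();
  const native = NATIVE_ENCODINGS.get(label);
  if (native) return body.toString(native);
  // Assumption: other labels (for example big5) are only available when
  // Node.js was built with full ICU; otherwise this throws a RangeError.
  return new TextDecoder(label).decode(body);
}
```

With a helper like this, a call such as decodeBody(rawBody, 'big5') would go through TextDecoder, while iso-8859-1 pages are handled by Buffer alone. If the Node.js build lacks the required ICU data, iconv-lite remains the more portable option.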
