Chinese .msg HTML converted from RTF is garbled. Missing Chinese (DBCS) encoding support. #3
Comments
Do you have the source .msg? |
Yes, I have this .msg. Any message that contains Chinese comes out garbled, so you could also write a Chinese reference email yourself. Here is one. link:<link removed> |
Is this displayed correctly? Just checking whether my Outlook shows it correctly:
It seems the message is encoded with an unknown encoding. If the above display is correct, then I'm not sure how Outlook detects it. I'm trying to find out. Actually, I think Outlook doesn't use the HTML body at all. I think that is the RTF being displayed. Still looking into it...
In fact the HTML content is only 56 bytes, so the source .msg itself is messed up. This library just reads it in, while Outlook ignores it and displays the RTF instead.
@whaozl, how did you get that HTML output from your first post? Can you show me the code you used? |
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Demo04_outlook_message_parser {
    // Loads a .msg from the classpath (unused below, kept for reference)
    private static OutlookMessage parseMsgFile(String msgPath) throws IOException {
        InputStream resourceAsStream = OutlookMessageParser.class.getClassLoader().getResourceAsStream(msgPath);
        return new OutlookMessageParser().parseMsg(resourceAsStream);
    }

    public static void main(String[] args) {
        String path = "D:\\Anjos\\01_project\\005_201808OverseaMail\\test\\example.msg";
        // try-with-resources closes the file stream even if parsing fails
        try (InputStream in = new FileInputStream(path)) {
            //OutlookMessage msg = parseMsgFile(path);
            OutlookMessage msg = new OutlookMessageParser().parseMsg(in);
            // Print the HTML that the library converted from the message body
            System.out.println(msg.getConvertedBodyHTML());
            System.out.println(path);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output I get from outlook-message-parser is garbled, but the same message is fine in MsgViewer, which I know uses msgparser internally. MsgViewer produces the Chinese characters in GBK, so I had to convert GBK to UTF-8. However, msgparser and outlook-message-parser are mostly the same, and when I use either of them on its own the output is garbled. I do not know why.
I am studying Chinese natural language processing and now need to handle Outlook .msg files. I need to convert .msg to HTML because I want to extract each table as a whole, since a table is one complete semantic unit; plain text can't do that. I am very grateful for your help, thank you very much.
The following output was obtained through Aspose. It is also correct, but it is encoded as GBK/GB2312.
public static void main(String[] args) {
    String path = "D:\\Anjos\\01_project\\005_201808OverseaMail\\test\\example.msg";
    MapiMessage message = MapiMessage.fromFile(path);
    // Display sender's name
    System.out.println("Sender Name : " + message.getSenderName());
    // Display subject (the original post printed getBodyHtml() here by mistake)
    System.out.println("Subject : " + message.getSubject());
    // Display plain-text body
    System.out.println("Body : " + message.getBody());
    // Display HTML body
    System.out.println("HTML:" + message.getBodyHtml());
    // Save the message as an HTML file
    message.save("D:\\down\\output.html", SaveOptions.getDefaultHtml());
}
|
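For reference, the GBK to UTF-8 conversion mentioned above can be sketched in plain Java. This is a minimal illustration, not code from MsgViewer or Aspose: the class name `GbkToUtf8Sketch` and the helper `transcode` are hypothetical, and it assumes the input bytes really are GBK and that the JDK ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GbkToUtf8Sketch {
    /** Decodes the bytes with the source charset, then re-encodes them with the target charset. */
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // 0xD6 0xD0 is one Chinese character in GBK (assumed sample input)
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0};
        byte[] utf8Bytes = transcode(gbkBytes, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

The key point is that the bytes must be decoded with the charset they were actually written in; decoding GBK bytes with any other charset is what produces garbled text.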
Ok, so first thing: that .msg has a native HTML body, but it is corrupted. The method you used ( I can't fix the native HTML body (that's a problem in the .msg itself), but I can fix the |
Yes, the native HTML cannot be trusted; the HTML is generally produced from the RTF. As far as I know, MsgViewer and Aspose handle this correctly, although they did not display the Chinese characters normally. Best regards, and thank you for your help. |
I have found the problem and almost have a fix ready. |
…ge as defined in the RTF source. This will enable support for chinese and all other character sets
That was quite an adventure. I had included the RTF -> HTML support from another library a long time ago, but now I had to dive into the RTF spec and double-byte encoding technology 😐 I ran into more problems, as RTF should actually be parsed using the provided charset (in your .msg Chinese, cp936) for visible text, but the Windows charset (cp1250) for control characters. Since the old library was implemented with regular expressions, this posed quite a challenge. I managed to solve everything with some hacking work, but in the future I would like to switch to a token-based lexer/parser. Released to Maven as version 1.1.17. Please verify the solution. |
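To illustrate the charset point, here is a minimal sketch of decoding RTF `\'hh` hex escapes with the code page declared by `\ansicpg`. It is not the library's actual implementation: the class `RtfHexEscapeSketch` and its methods are hypothetical, mapping the code page number to a `windows-N` charset name is an assumption that holds for 936 and 1252 on a standard JDK, and everything other than hex escapes (control words, groups) is passed through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RtfHexEscapeSketch {
    // Matches the \ansicpgN control word that declares the document code page
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    /** Returns the charset named by \ansicpgN, or the fallback if the control word is absent. */
    public static Charset detectCharset(String rtf, Charset fallback) {
        Matcher m = ANSICPG.matcher(rtf);
        // Assumption: code page N is available under the name "windows-N"
        return m.find() ? Charset.forName("windows-" + m.group(1)) : fallback;
    }

    /** Replaces runs of \'hh escapes with the text obtained by decoding their bytes together. */
    public static String decodeHexEscapes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\'", i) && i + 4 <= text.length()) {
                // Collect the byte; a double-byte character spans two consecutive escapes
                pending.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, charset);
                out.append(text.charAt(i));
                i++;
            }
        }
        flush(pending, out, charset);
        return out.toString();
    }

    // Decodes all collected bytes in one go so multi-byte sequences stay intact
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\ansicpg936 \\'d6\\'d0\\'ce\\'c4}";
        Charset declared = detectCharset(rtf, StandardCharsets.ISO_8859_1);
        String escaped = "\\'d6\\'d0\\'ce\\'c4";
        // Decoded with the declared code page: two Chinese characters
        System.out.println(decodeHexEscapes(escaped, declared));
        // Decoded with a single-byte Western code page: four unrelated accented letters
        System.out.println(decodeHexEscapes(escaped, Charset.forName("windows-1252")));
    }
}
```

Buffering consecutive escapes before decoding matters because a double-byte character is written as two separate `\'hh` escapes; decoding each byte on its own would garble the text even with the right charset.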
Yes, thank you so much for your help. I have tested the solution and it works. You are very good. |
Thank you for the (detailed) bug report, @whaozl! |
Chinese .msg files come out garbled; this library cannot handle Chinese Outlook messages.
I want to convert .msg to HTML.