Chinese .msg HTML converted from RTF is garbled. Missing chinese (DBCS) encoding support. #3

Closed
whaozl opened this issue Aug 16, 2018 · 10 comments
@whaozl

whaozl commented Aug 16, 2018

o???á?<span lang=EN-US>   <o:p></o:p></span>  </span>  </span>    </p></td>  <span style='mso-bookmark:_MailAutoSig'>  </span>  <td width=142 nowrap style='width:106.3pt;border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0cm 5.4pt 0cm 5.4pt;height:14.25pt'>  <p class=MsoNormal align=left style='text-align:left'>   <span style='mso-bookmark:_MailAutoSig'>  <span style='font-size:11.0pt;line-height:110%;color:black'>  è?ìì<span lang=EN-US>   <o:p></o:p></span>  </span>  </span>    </p></td>  <span style='mso-bookmark:_MailAutoSig'>  </span>  </tr></table>  <p class=MsoNormal style='line-height:normal;mso-pagination:none'>  <span style='mso-bookmark:_MailAutoSig'>  <span lang=EN-US style='mso-bidi-font-size:10.5pt;font-family:"?¢èí??oú",sans-serif;mso-no-proof:yes'>    <o:p>  </o:p></span>  </span>  </p><p class=MsoNormal style='line-height:normal;mso-pagination:none'>  <span style='mso-bookmark:_MailAutoSig'>  <span lang=EN-US style='mso-bidi-font-size:10.5pt;font-family:"?¢èí??oú",sans-serif;mso-no-proof:yes'>    <o:p>  </o:p></span>  </span>  </p><p class=MsoNormal style='line-height:normal;mso-pagination:none'>  <span style='mso-bookmark:_MailAutoSig'>  <span lang=EN-US style='mso-bidi-font-size:10.5pt;font-family:"?¢èí??oú",sans-serif;mso-no-proof:yes'>    Best Regards<o:p></o:p></span>  </span>  </p><p class=MsoNormal style='line-height:normal;mso-pagination:none'>  <span style='mso-bookmark:_MailAutoSig'>  <span lang=EN-US style='mso-bidi-font-size:10.5pt;font-family:"?¢èí??oú",sans-serif;mso-no-proof:yes'>

The Chinese text in the .msg comes out garbled; this library cannot handle Chinese Outlook messages.

I want to convert msg to html.

@bbottema
Owner

Do you have the source .msg?

@bbottema bbottema added the bug label Aug 16, 2018
@bbottema bbottema changed the title chinese msg is garbled, this can not handler chinese msg of outlook. Chinese .msg is garbled when converted (missing UTF-8 support) Aug 16, 2018
@whaozl
Author

whaozl commented Aug 17, 2018

Yes, I have the .msg. Any message that contains Chinese comes out garbled, so you could also write a Chinese test email yourself. Here is one: link:<link removed>

@bbottema
Owner

bbottema commented Aug 17, 2018

Is this displayed correctly? Just checking whether my Outlook shows it correctly:

image

It seems the message is encoded with an unknown encoding. If the above display is correct, then I'm not sure how Outlook detects it. I'm trying to find out.

Actually, I think Outlook doesn't use the HTML body at all. I think that is the RTF being displayed. Still looking into it...

In fact the HTML content is only 56 bytes, so actually the source .msg is messed up. This library is just reading it in, while Outlook ignores it and displays the RTF instead.

@whaozl, how did you get that HTML output from your first post? Can you show me the code you used?

@whaozl
Author

whaozl commented Aug 19, 2018

@bbottema

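import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.simplejavamail.outlookmessageparser.OutlookMessageParser;
import org.simplejavamail.outlookmessageparser.model.OutlookMessage;

public class Demo04_outlook_message_parser {

    // Alternative loader: reads the .msg from the classpath instead of the file system
    private static OutlookMessage parseMsgFile(String msgPath) throws IOException {
        try (InputStream in = OutlookMessageParser.class.getClassLoader().getResourceAsStream(msgPath)) {
            return new OutlookMessageParser().parseMsg(in);
        }
    }

    public static void main(String[] args) {
        String path = "D:\\Anjos\\01_project\\005_201808OverseaMail\\test\\example.msg";
        // try-with-resources closes the stream even if parsing fails
        try (InputStream in = new FileInputStream(path)) {
            OutlookMessage msg = new OutlookMessageParser().parseMsg(in);
            // HTML converted from the RTF body; this is the output that is garbled
            System.out.println(msg.getConvertedBodyHTML());
            System.out.println(path);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}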

With outlook-message-parser the output is garbled.

With MsgViewer it displays correctly.

I know MsgViewer uses msgparser internally. It emits the Chinese characters as GBK, though, so I had to convert from GBK to UTF-8.
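
A GBK to UTF-8 conversion like the one mentioned above can be sketched with the standard library alone. This is only an illustration: `GbkToUtf8` and `transcode` are hypothetical names, not part of msgparser or MsgViewer, and it assumes the JDK in use ships the `GBK` charset (standard full JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of msgparser or MsgViewer
final class GbkToUtf8 {

    /** Decodes GBK bytes into a String, then re-encodes that String as UTF-8. */
    static byte[] transcode(byte[] gbkBytes) {
        String text = new String(gbkBytes, Charset.forName("GBK"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two characters U+4F60 U+597D encoded as GBK: two bytes each
        byte[] gbk = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        byte[] utf8 = transcode(gbk);
        System.out.println(utf8.length); // 6: three UTF-8 bytes per character
    }
}
```

The important point is that the bytes are decoded with the charset they were written in before being re-encoded; converting directly from a wrongly decoded String cannot recover the text.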

However, msgparser and outlook-message-parser are mostly the same, yet the output is garbled when I use either of them on its own.

I do not know why this happens.

I am studying Chinese natural language processing and now need to handle Outlook .msg files. I need to convert .msg to HTML because I want to extract each table as a whole: a table is one complete semantic unit, and the plain-text body cannot preserve that.

I am very grateful for your help, thank you very much.

The following was obtained through Aspose. Its output is also correct, but it is encoded as GB2312.

    // Needs com.aspose.email.MapiMessage and com.aspose.email.SaveOptions
    public static void main(String[] args) {
        String path = "D:\\Anjos\\01_project\\005_201808OverseaMail\\test\\example.msg";
        MapiMessage message = MapiMessage.fromFile(path);
        // Display sender's name
        System.out.println("Sender Name : " + message.getSenderName());
        // Display subject (the original printed the HTML body here by mistake)
        System.out.println("Subject : " + message.getSubject());
        // Display plain-text body
        System.out.println("Body : " + message.getBody());
        // Display HTML body
        System.out.println("HTML:" + message.getBodyHtml());
        // Save the whole message as an HTML file
        message.save("D:\\down\\output.html", SaveOptions.getDefaultHtml());
    }

    <dependencies>
        <dependency>
            <groupId>org.simplejavamail</groupId>
            <artifactId>outlook-message-parser</artifactId>
            <version>1.1.16</version>
        </dependency>

        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-email</artifactId>
            <version>18.6</version>
            <classifier>jdk16</classifier>
        </dependency>

        <dependency>
            <groupId>javax.media.jai</groupId>
            <artifactId>com.springsource.javax.media.jai.core</artifactId>
            <version>1.1.3</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>AsposeJavaAPI</id>
            <name>Aspose Java API</name>
            <url>https://artifact.aspose.com/repo/</url>
        </repository>
    </repositories>

@bbottema
Owner

bbottema commented Aug 19, 2018

Ok, so first thing: that .msg has a native HTML body, but it is corrupted. The method you used (getConvertedBodyHTML()) actually returns the RTF body converted to HTML. The text body and the RTF body are encoded properly, but the converted HTML is not, which gives you the garbled text.

I can't fix the native HTML body (that's a problem in the .msg itself), but I can fix the getConvertedBodyHTML. I will work on it.
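
To illustrate the kind of garbling described here, the following minimal sketch decodes the same double-byte (GBK, code page 936) bytes once with a single-byte charset and once with the correct one. It is not the library's code: `MojibakeDemo` and `decode` are hypothetical names, and the sample bytes are an assumption chosen for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, unrelated to outlook-message-parser internals
final class MojibakeDemo {

    /** Decodes raw bytes with the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The characters U+4F60 U+597D encoded as GBK: two bytes per character
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};

        // Wrong: a single-byte charset turns every byte into its own character
        String garbled = decode(gbkBytes, StandardCharsets.ISO_8859_1);
        // Right: the double-byte charset pairs the bytes back up
        String correct = decode(gbkBytes, Charset.forName("GBK"));

        System.out.println(garbled.length()); // 4
        System.out.println(correct.length()); // 2
        System.out.println(correct.equals("\u4f60\u597d")); // true
    }
}
```

Decoding with the single-byte charset yields four accented Latin characters instead of two Chinese ones, which is the same pattern as the garbled output in the first post.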

@whaozl
Author

whaozl commented Aug 19, 2018

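Yes, the native HTML cannot be trusted; the HTML is generally produced from the RTF. As far as I know, MsgViewer and Aspose both convert it correctly, although their output did not display the Chinese characters correctly as-is (it had to be converted from GBK first).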

Best Regards. Thank you for your help.

@bbottema
Owner

I have found the problem and almost have a fix ready.

bbottema added a commit that referenced this issue Aug 19, 2018
…ge as defined in the RTF source. This will enable support for chinese and all other character sets
@bbottema bbottema changed the title Chinese .msg is garbled when converted (missing UTF-8 support) Chinese .msg HTML converted from RTF is garbled when converted. Missing chinese (DBCS) encoding support. Aug 19, 2018
@bbottema bbottema changed the title Chinese .msg HTML converted from RTF is garbled when converted. Missing chinese (DBCS) encoding support. Chinese .msg HTML converted from RTF is garbled. Missing chinese (DBCS) encoding support. Aug 19, 2018
@bbottema
Owner

That was quite an adventure.

I had included the RTF -> HTML support from another library a long time ago, but now I had to dive into the RTF spec and double-byte encoding technology 😐

I ran into more problems, as RTF should actually be parsed using the provided charset (in your .msg Chinese, cp936) for visible text, but the Windows charset (cp1250) for control characters. Since the old library was implemented with regular expressions, this posed quite a challenge.

I managed to solve everything with some hacking work, but in the future I would like to switch to a token-based lexer/parser.
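
As a rough illustration of the idea (not the library's actual implementation), the sketch below reads the code page declared by `\ansicpgN` in the RTF header and uses it to decode runs of `\'hh` hex escapes in the visible text. The class and method names are hypothetical, the `"Cp" + N` charset lookup and the fallback to code page 1252 are assumptions, and the sketch ignores cases a real parser must handle, such as trail bytes of double-byte characters that appear as literal ASCII rather than as hex escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and fallback behaviour are assumptions
final class RtfHexEscapeSketch {

    private static final Pattern ANSI_CODE_PAGE = Pattern.compile("\\\\ansicpg(\\d+)");
    private static final Pattern HEX_RUN = Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    /** Reads the code page from \ansicpgN; falls back to 1252 if none is declared (assumption). */
    static Charset declaredCharset(String rtf) {
        Matcher m = ANSI_CODE_PAGE.matcher(rtf);
        String codePage = m.find() ? m.group(1) : "1252";
        return Charset.forName("Cp" + codePage);
    }

    /** Replaces each run of \'hh escapes with the text it encodes in the declared charset. */
    static String decodeHexEscapes(String rtf) {
        Charset charset = declaredCharset(rtf);
        Matcher m = HEX_RUN.matcher(rtf);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is 4 chars: backslash, apostrophe, two hex digits
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            // Decode the whole run at once so double-byte pairs stay together
            String decoded = new String(bytes.toByteArray(), charset);
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\ansicpg936 \\'c4\\'e3\\'ba\\'c3}";
        System.out.println(decodeHexEscapes(rtf));
    }
}
```

Collecting the bytes of a whole run before decoding matters: decoding each `\'hh` escape on its own would split every double-byte character in two and reproduce the garbled output.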

Released to Maven as version 1.1.17. Please verify the solution.

@whaozl
Author

whaozl commented Aug 20, 2018

Yes, thank you so much for your help. I have tested the solution and it works. Great job.

@bbottema
Owner

Thank you for the (detailed) bug report, @whaozl!

bbottema pushed a commit to bbottema/simple-java-mail that referenced this issue Aug 20, 2018