Odd Unicode characters instead of real letters are now used to render texts #10205

Closed
wojtekmaj opened this issue Nov 1, 2018 · 11 comments

@wojtekmaj
Contributor

wojtekmaj commented Nov 1, 2018

Hello,
in v2.0.550 text rendered in SVG rendering mode used normal letters:

<tspan x="0 16.308 33.264 61.74 80.46 88.416 114.48 133.2 151.956 167.256 185.976 214.452 232.236 250.74" y="0" font-family="g_d0_f2" font-size="36px" fill="rgb(46,83,149)">
  Sampledocument
</tspan>

The same node in v2.0.943 (after #9192) looks like so:

<tspan x="0 16.308 33.264 61.74 80.46 88.416 114.48 133.2 151.956 167.256 185.976 214.452 232.236 250.74" y="0" font-family="g_d1_f2" font-size="36px" fill="rgb(47,84,150)">
  
</tspan>

I don't see how losing the ability to read the source would benefit anyone. Is there a way to get the old behavior back?

@wojtekmaj
Contributor Author

wojtekmaj commented Nov 1, 2018

It looks like the items in the glyphs array now have a broken fontChar property; unicode is fine.

Going further, to the charToGlyph function, we will notice that fontCharCode used to be equal to charcode in most cases. In the version that works properly, I can see the line:

      var unicode = this.toUnicode.get(charcode) || charcode;

while in the version not working properly:

      var unicode = this.toUnicode.get(charcode) || this.fallbackToUnicode.get(charcode) || charcode;

There are no other notable differences in that function, so I assume it's this line that causes the problems.
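
To make the difference between the two expressions concrete, here is a minimal sketch. `CharMap`, `resolveOld` and `resolveNew` are hypothetical stand-ins written for illustration, not PDF.js classes or functions, and the sample entries are invented. The sketch only shows how the extra `||` term changes the resolved `unicode` value; it does not establish whether that line is responsible for the changed `fontChar`.

```javascript
// Hypothetical stand-in for a ToUnicode-style lookup table (not PDF.js code).
class CharMap {
  constructor(entries) {
    this.map = new Map(entries);
  }
  get(charcode) {
    return this.map.get(charcode); // undefined when there is no entry
  }
}

// Older expression: fall straight back to the raw char code.
function resolveOld(toUnicode, charcode) {
  return toUnicode.get(charcode) || charcode;
}

// Newer expression: try a second table before falling back to the char code.
function resolveNew(toUnicode, fallbackToUnicode, charcode) {
  return (
    toUnicode.get(charcode) || fallbackToUnicode.get(charcode) || charcode
  );
}
```

With `toUnicode` holding only `83 -> "S"` and the fallback table holding `97 -> "a"`, both functions agree for char code 83, but for 97 the older expression returns the number 97 while the newer one returns "a". Either way, only the `unicode` value is affected.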

@timvandermeij
Contributor

From the information here I assume this is an SVG back-end specific issue, is that correct?

@brendandahl @Snuffleupagus Do you perhaps know more about what can cause this?

@wojtekmaj
Contributor Author

wojtekmaj commented Nov 2, 2018

It's not back-end specific. The consequences are just easiest to see when using SVG rendering (which can also be done on the front end). Prior to 2.0.943, SVG used a sane textContent for the text to be rendered. Have a look here:

http://projects.wojtekmaj.pl/react-pdf/test/ (this is version based on older PDF.js)

  1. "Use imported file"
  2. Choose SVG render mode
  3. Inspect text in rendered SVG, e.g. "Sample document"

On this version, you'll find "Sampledocument" textContent. Alright, PDFs doing their PDF-y thingies, that's close enough to me.

Now do the same steps using
http://projects.wojtekmaj.pl/react-pdf/test/beta (this is version based on 2.0.943)

In this version you will see that while the letters appear correct, the HTML rendered is garbage.

This has two serious consequences:

  1. I'm unable to programmatically create a text layer which would match the text and font of the original PDF file
  2. Copying the text from the SVG results in complete garbage.

@wojtekmaj changed the title from "SVG rendering now uses odd Unicode characters instead of real letters" to "Odd Unicode characters instead of real letters are now used to render texts" on Nov 2, 2018
@brendandahl
Contributor

I commented in the other bug, this is by design because of #9340. All the char codes are moved into the private use area unicode range. To properly do text selection w/ svg we should do something like the canvas backend and create a text layer from the unicode mappings.
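
As a rough illustration of the idea described above, the sketch below moves char codes into the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF) for drawing and keeps a separate Unicode mapping for text extraction. All function names and the sequential assignment strategy are assumptions made for this example; the actual remapping in PDF.js is described in #9340 and may differ considerably.

```javascript
// Basic Multilingual Plane Private Use Area: U+E000 through U+F8FF.
const PUA_START = 0xe000;
const PUA_END = 0xf8ff;

// Hypothetical helper: gives every char code a PUA code point for drawing
// and records the readable Unicode string separately for text extraction.
function remapToPrivateUseArea(charCodeToUnicode) {
  const toFontChar = new Map(); // char code -> PUA code point used to draw
  const toUnicode = new Map(); // PUA code point -> readable text
  let next = PUA_START;
  for (const [charCode, unicode] of charCodeToUnicode) {
    if (next > PUA_END) {
      throw new RangeError("Private Use Area exhausted");
    }
    toFontChar.set(charCode, next);
    toUnicode.set(next, unicode);
    next++;
  }
  return { toFontChar, toUnicode };
}

// What ends up in the rendered node: a string of PUA characters.
function renderString(charCodes, toFontChar) {
  return String.fromCodePoint(...charCodes.map((c) => toFontChar.get(c)));
}

// What a text layer would show: the readable text behind the PUA characters.
function extractText(rendered, toUnicode) {
  return Array.from(rendered, (ch) => toUnicode.get(ch.codePointAt(0))).join("");
}
```

In this sketch the rendered string is unreadable on its own, which matches what is reported for the SVG output, while the readable text stays recoverable through the separate mapping. That is why a text layer built from the Unicode mappings is suggested for selection and copying.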

@Snuffleupagus
Collaborator

Snuffleupagus commented Nov 2, 2018

> To properly do text selection w/ svg we should do something like the canvas backend and create a text layer from the unicode mappings.

Note that this is already done in the default viewer, when the renderer preference is set to svg.

> I'm unable to programmatically create a text layer which would match the text and font of the original PDF file

Keep in mind that that will never be a complete solution for text-selection/copying/searching purposes, since the PDF format distinguishes between rendering/text-extraction; hence why e.g. ToUnicode exists.
In particular, consider the case of ligatures (e.g. fi, ff, ...) which PDF viewers generally will expand to their separate characters. Since there's no guarantee that a font will contain data for the separate characters of a ligature, attempting to use the original font for text-selection purposes will never be a complete solution.
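
The ligature point can be sketched with standard JavaScript string normalization. `expandLigatures` and `missingCodePoints` are hypothetical helpers written for this example, and the font is modelled simply as a set of code points that have glyphs. Note that NFKC normalization also rewrites other compatibility characters, so it is used here only to demonstrate the expansion.

```javascript
// U+FB01 is the single-code-point "fi" ligature; U+FB00 is "ff".
function expandLigatures(text) {
  // NFKC applies compatibility decomposition followed by canonical
  // composition, which splits Latin ligatures such as U+FB01 into "f" + "i".
  return text.normalize("NFKC");
}

// Hypothetical check: which code points of `text` have no glyph in a font
// whose available code points are given as a Set of numbers.
function missingCodePoints(text, availableCodePoints) {
  return Array.from(text)
    .map((ch) => ch.codePointAt(0))
    .filter((cp) => !availableCodePoints.has(cp));
}
```

For a subsetted font that contains glyphs only for U+FB01, "n", "a" and "l", the string "\uFB01nal" can be drawn in full, but its expanded form "final" needs "f" and "i", which that font lacks. This is the gap between rendering and text extraction described above.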

Also, please keep in mind that the status of the SVG back-end is probably, as far as I know, best described as "experimental" and that it's thus not officially supported; #9211 (comment) is probably relevant here as well.

@wojtekmaj
Contributor Author

It's not only applicable to SVG though. It's especially harmful for SVGs for the reasons I pointed out, like copying the original text, but that can be worked around using the same text layer that's being used for canvas rendering.

I'm using the original fonts to create a text layer over the canvas in my implementation, and using the same font as the original source gave me much more accurate results in the vast majority of cases than using some default font. Moving all the char codes into the Private Use Area Unicode range without also leaving them in their default positions made the fonts completely unusable.

@wojtekmaj
Contributor Author

Is there anything I could do to resolve this issue? Perhaps it could be an option, like disablePrivateUnicodeArea on page.render?

@Snuffleupagus
Collaborator

Snuffleupagus commented Nov 13, 2018

> [...] and using the same font as the original source in vast majority of cases gave me much more accurate results than using some default font.

A text-selection implementation that by design breaks a relatively common feature, such as ligatures, should probably not be described as a "good solution" in general; but I digress.

> Perhaps it could be an option, like disablePrivateUnicodeArea on page.render?

If glyphs are left in their original positions, and are not being re-mapped to a PUA, that is guaranteed to completely break font rendering in a very large number of PDF files; refer to PR #9340 for additional details.
Honestly, it really makes no sense whatsoever to add an option (and related code) that will knowingly break font rendering in this way.

Perhaps it may be slightly more acceptable to add an option, false by default of course (to not unnecessarily bloat toFontChar), that would leave glyphs in their original position in addition to re-mapping them to a PUA. Edit: D'oh, but obviously that won't work, and you'd need an additional array (e.g. originalToFontChar, naming things is hard) to hold this data.
However, before anyone attempts to implement something, it's advisable to wait for Brendan to comment.

@brendandahl
Contributor

If we really just want to improve text selection, there are some other things we could try. One option would be to generate a font that has the same width glyphs as the original font, but each glyph would just draw a square or line and it would be assigned to the unicode value.

@Snuffleupagus
Collaborator

> One option would be to generate a font that has the same width glyphs as the original font, but each glyph would just draw a square or line and it would be assigned to the unicode value.

In this case, it seems that this issue could just be marked as a duplicate of #1914.

@timvandermeij
Contributor

Yes, let's close this as such and track the issue there.
