Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Copy/Paste #5783

Closed

Conversation

speedplane
Copy link

The Problem

Copy/paste for many PDFs is broken. Even in the TraceMonkey example, try copying and pasting the area highlighted in the image below, and you will get the following text:

 informationis available

image
Notice that when you paste, there is no space between information and is. That is, there is no space between new lines.

But there are examples that are much worse than the above. In the PDF at the link here, the PDF structure itself has no spaces. That is, whatever software created this PDF, only created PDF objects for the text itself, and did not consider spacing. Thus, when one copy/pastes anything from this document using pdf.js, they get no spaces in their pasted text.

It's not just copy/paste that's a problem, but CTRL+F find also does not work. If you CTRL+F for information is in the TraceMonkey example, you will not get any results.

Other Viewers

Other PDF software gets this right, they figure out that there is a space between words or lines and automatically insert the proper space when the user copies text. Adobe Reader, for example, will not only detect a space and insert it, but it will also detect when a line ends with a hyphen and automatically remove it when doing copy/paste. For example, if you copied the word comp-ile in the image above, Reader would actually copy compile. That is probably overkill, but it is interesting that others gave this signifigant amount of thought.

Other software that also gets this right includes pdf miner, a project that goes to great lengths to properly extract text from PDFs. The Chrome viewer (Foxit) also gets this right (it also does hyphens right too).

The Solution

This commit aims to resolve these issues by automatically detecting when we need spaces, and inserting them into the text chunks where appropriate. The insertion occurs within the worker, so not only will copy/paste work, but getTextContent() will now also return text with proper spacing. Accordingly, any headless text extraction will also benefit from this PR.

Unfortunately, for PDFs that do not have the wordSpacing text state set, this feature must resort to using a bit of a heuristic, but I do not see anyway around that.

I also added a test which hooks into the getTextContent function.

@speedplane speedplane force-pushed the text-extract-with-spaces branch from 6fb6dbe to 03fae4b Compare March 4, 2015 12:40
speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 4, 2015
@speedplane speedplane force-pushed the text-extract-with-spaces branch from 03fae4b to c6da1ee Compare March 4, 2015 12:59
speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 4, 2015
@existentialism
Copy link
Contributor

Nice work! Early run through looks great, will testing more thoroughly and will report anything odd. I had started something similar a while back to try and improve copy/paste, glad to see I was on the right path 👍

@speedplane
Copy link
Author

@timvandermeij Hi Tim - can you make a build of this when you get a chance?

@timvandermeij
Copy link
Contributor

Hm, I see no change at all for the Tracemonkey paper. I tried copying the part in your image, but it still copies without a space at the newline for me.

@speedplane
Copy link
Author

@timvandermeij Hi Tim... I just tried it with the viewer you built, and it works fine.

@speedplane speedplane force-pushed the text-extract-with-spaces branch from c6da1ee to 6e56084 Compare March 5, 2015 15:14
speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 5, 2015
@speedplane
Copy link
Author

@timvandermeij I just updated the commit, adding a test specifically for the tracemonkey paper. It's passing for me.

@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

pdfjsbot commented Mar 5, 2015

From: Bot.io (Linux)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/7b23cd1eedc95f6/output.txt

@timvandermeij
Copy link
Contributor

Still no luck for me. If I copy the text "information is available" from the first paragraph, I get:

information
is available

with the current master and

information
is available

with the build above, i.e. both are the same. Also if I copy both texts and paste them in for example the Firefox URL bar, both appear as "informationis available", i.e. without a space before "is". You cannot reproduce that?

@speedplane
Copy link
Author

@timvandermeij Very strange. First off, where are you testing master? I assumed that the demo page was master, but is that incorrect?

I don't get a line break in either of them.

Here is my result copied from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html

information is available

And here is the result copied from http://mozilla.github.io/pdf.js/web/viewer.html

informationis available

I copied both into Notepad++. I'm using Chrome on Windows. Could it be a browser thing?

@speedplane
Copy link
Author

Also, if you have time, could you take a look at the DOM in both and see what you get?

This is the DOM from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html:
image

And here is the DOM from http://mozilla.github.io/pdf.js/web/viewer.html:
image

Notice there is an extra space in the first.

@timvandermeij
Copy link
Contributor

I do get the space in the DOM, but for some reason it doesn't copy it for both links you posted. I'm really confused about that: is it a browser bug perhaps? I'm using Firefox 36.0.1 on Windows 7 x64.

@timvandermeij
Copy link
Contributor

It does appear to be something with the browser. Take http://www.sersc.org/journals/IJSIP/vol4_no3/5.pdf for example. At the bottom of the first page it says "established by ISO". If you select the text carefully, you can also select a space character behind ISO. Copy/pasting that does give "ISO " with a space, but if you also take the next line the space disappears when you copy/paste. Really weird...

(Note that for this PDF I have tested with the current add-on.)

@speedplane
Copy link
Author

Maybe it isn't a browser bug. Maybe Firefox is actually smarter than the others. As you pointed out, in the TraceMonkey paper, Firefox is inserting line-breaks between lines. There are no line-break characters within the DOM itself, so Firefox must be doing something to detect it and insert it. Other browsers do not insert line-breaks.

I would bet that as part of that logic, when it detects a line break in selected text, it strips the end of the line of any whitespace.

@timvandermeij
Copy link
Contributor

Good point! That would mean that this patch would not have much effect in Firefox, but it will help for other browsers.

@speedplane
Copy link
Author

Try copying and pasting text in the following jsfiddle: https://jsfiddle.net/4wwd654y/

In Chrome, if you try copying and pasting, the entire thing, you get:

Line OneLine Two

I put money on the fact that in Firefox you would get:

Line One
Line Two

@speedplane
Copy link
Author

Regarding your second point: this patch would not have an effect on the Firefox Tracemonkey paper. However, it would still fix messed up PDFs like this one in firefox.

@timvandermeij
Copy link
Contributor

You're right, that is what is happening in Firefox. However, for the PDF you linked in the comment just above this one, I also see no difference in Firefox. Both https://mozilla.github.io/pdf.js/web/viewer.html and http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html copy any text without spaces for me. The DOM shows the spaces in each div, but since they are separate divs, Firefox strips those spaces too. I think we'll have to look into this later, but nevertheless this seems like a great patch for other browsers and for when we figure out what exactly is different in Firefox.

fontAscent = (1 + font.descent) * fontAscent;
}
return {
x : (angle === 0) ? tx[4] :(tx[4] + (fontAscent * Math.sin(angle))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: change this line to x: (angle === 0 ? tx[4] : tx[4] + (fontAscent * Math.sin(angle))), to get the spaces and parentheses right

addSpace = newPosisition.x >= lastPosition.x + lastChunk.width +
wordSpacing;
} else {
// Right to left. Add space if next is before sart.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typ: sart -> start ?

@timvandermeij
Copy link
Contributor

@speedplane Probably it is trying to spawn a process/application that it cannot find. Run npm update to get all dependencies. Probably it is https://github.com/mozilla/pdf.js/blob/master/test/webbrowser.js#L25 that is failing.

@speedplane speedplane force-pushed the text-extract-with-spaces branch from 8519c6b to ee75402 Compare March 8, 2015 15:08
speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 8, 2015
@speedplane speedplane force-pushed the text-extract-with-spaces branch from ee75402 to e9ad0b4 Compare March 8, 2015 15:18
@jazzy-em
Copy link
Contributor

@speedplane, great work! IMHO, you resolve very important problem!

@jazzy-em
Copy link
Contributor

I found some bugs:

  1. Small difference in highlighting. There is small shift to the left.
    For example, search for 'up' in the TraceMonkey example document
    Original:
    original
    Your:
    your

@jazzy-em
Copy link
Contributor

  1. In this document (http://www.selab.isti.cnr.it/ws-mate/example.pdf) try to search "an example paper".
    There will no matches, because there is no space between 'example' and 'paper'.
    ex1

    ex2

@timvandermeij
Copy link
Contributor

@speedplane This also needs a rebase onto the current master.

jeffsack pushed a commit to jeffsack/pdf.js that referenced this pull request Mar 16, 2015
@danez
Copy link

danez commented Jun 30, 2015

What is the status of this? Any news?

Edit: Is this PR only fixing the missing space for line breaks? Or also PDFs that completely miss spaces. In our case we have a pdf that even with this PR applied is not copying with spaces.
https://www.researchgate.net/publication/257019040_Shortcomings_of_classical_phenological_forcing_models_and_a_way_to_overcome_them

@speedplane
Copy link
Author

@danez been very busy with other projects. I want to clean up this PR (and
others that I submitted), but won't realistically get to them until late
August.

On Tue, Jun 30, 2015 at 11:18 AM, Daniel Tschinder <[email protected]

wrote:

What is the status of this? Any news?


Reply to this email directly or view it on GitHub
#5783 (comment).

@danez
Copy link

danez commented Jul 1, 2015

In the case of this pdf there are chunks between each word that only contain the space itself.

The text-layer/viewer is not rendering the text-divs as the width of a space is calculated as 0. I first tried to change something there and render these divs with a single space, but then all the scale-calculation cannot be done as of width being zero.

So I ended up adding spaces to the surrounding words if there are space-chunks in-between to fix that. What do you think about that?

-        // If the last chunk ends with a space it does not need one.
+
         var lastChunk = bidiTexts[bidiTexts.length - 1];
+
+        // If the last chunk equals a single space concat the space to chunk before the last chunk
+        if (lastChunk.str === ' ' && bidiTexts.length - 2 >= 0) {
+            var beforeLastChunk = bidiTexts[bidiTexts.length - 2];
+            if (beforeLastChunk.str.length === 0) {
+                return;
+            }
+            var lastChar = beforeLastChunk.str[beforeLastChunk.str.length - 1];
+            if (lastChar !== ' ') {
+                beforeLastChunk.str += ' ';
+            }
+            return;
+        }
+
         if (lastChunk.str.length === 0) {
           return;
         }
-        var lastChar = lastChunk.str[lastChunk.str.length - 1];
+
+        // If the last chunk ends with a space it does not need one.
+        lastChar = lastChunk.str[lastChunk.str.length - 1];
         if (lastChar === ' ' || lastChar === '-') {
           return;
         }

@danez
Copy link

danez commented Jul 1, 2015

I also noticed that the selection is now off in the web-viewer, because the space is added to the div, but the scale is wrong.
You see the selection ends inside the character. (can also be seen in the image from jazzy-em above)
bildschirmfoto 2015-07-01 um 12 09 52

This of course gets worse with my patch from above.

@Vad1mo
Copy link

Vad1mo commented Oct 12, 2015

hello,
what is the status of the pull request. As I see it even if it doesn't solve the problem as desired it is already a big step forward! Can we not merge it into master?

@timvandermeij
Copy link
Contributor

It needs to be rebased onto the current master and reviewed before we can land this.

@fbender
Copy link

fbender commented Nov 16, 2015

@speedplane any idea when you will get to this?

@timvandermeij
Copy link
Contributor

Closing since this was never completely finished and it's not in a mergeable state. However, the associated issue will remain open with a reference to this pull request so the work can be continued later on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants