-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Copy/Paste #5783
Improve Copy/Paste #5783
Conversation
6fb6dbe
to
03fae4b
Compare
…appropriate. Add test re same. PR mozilla#5783.
03fae4b
to
c6da1ee
Compare
…appropriate. Add test re same. PR mozilla#5783.
Nice work! Early run through looks great, will testing more thoroughly and will report anything odd. I had started something similar a while back to try and improve copy/paste, glad to see I was on the right path 👍 |
@timvandermeij Hi Tim - can you make a build of this when you get a chance? |
Hm, I see no change at all for the Tracemonkey paper. I tried copying the part in your image, but it still copies without a space at the newline for me. |
@timvandermeij Hi Tim... I just tried it with the viewer you built, and it works fine. |
c6da1ee
to
6e56084
Compare
…appropriate. Add test re same. PR mozilla#5783.
@timvandermeij I just updated the commit, adding a test specifically for the tracemonkey paper. It's passing for me. |
/botio-linux preview |
From: Bot.io (Linux)ReceivedCommand cmd_preview from @timvandermeij received. Current queue size: 0 Live output at: http://107.21.233.14:8877/7b23cd1eedc95f6/output.txt |
From: Bot.io (Linux)SuccessFull output at http://107.21.233.14:8877/7b23cd1eedc95f6/output.txt Total script time: 0.79 mins Published |
Still no luck for me. If I copy the text "information is available" from the first paragraph, I get: information is available with the current master and information is available with the build above, i.e. both are the same. Also if I copy both texts and paste them in for example the Firefox URL bar, both appear as "informationis available", i.e. without a space before "is". You cannot reproduce that? |
@timvandermeij Very strange. First off, where are you testing master? I assumed that the demo page was master, but is that incorrect? I don't get a line break in either of them. Here is my result copied from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html
And here is the result copied from http://mozilla.github.io/pdf.js/web/viewer.html
I copied both into Notepad++. I'm using Chrome on Windows. Could it be a browser thing? |
Also, if you have time, could you take a look at the DOM in both and see what you get? This is the DOM from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html: And here is the DOM from http://mozilla.github.io/pdf.js/web/viewer.html: Notice there is an extra space in the first. |
I do get the space in the DOM, but for some reason it doesn't copy it for both links you posted. I'm really confused about that: is it a browser bug perhaps? I'm using Firefox 36.0.1 on Windows 7 x64. |
It does appear to be something with the browser. Take http://www.sersc.org/journals/IJSIP/vol4_no3/5.pdf for example. At the bottom of the first page it says "established by ISO". If you select the text carefully, you can also select a space character behind ISO. Copy/pasting that does give "ISO " with a space, but if you also take the next line the space disappears when you copy/paste. Really weird... (Note that for this PDF I have tested with the current add-on.) |
Maybe it isn't a browser bug. Maybe Firefox is actually smarter than the others. As you pointed out, in the TraceMonkey paper, Firefox is inserting line-breaks between lines. There are no line-break characters within the DOM itself, so Firefox must be doing something to detect it and insert it. Other browsers do not insert line-breaks. I would bet that as part of that logic, when it detects a line break in selected text, it strips the end of the line of any whitespace. |
Good point! That would mean that this patch would not have much effect in Firefox, but it will help for other browsers. |
Try copying and pasting text in the following jsfiddle: https://jsfiddle.net/4wwd654y/ In Chrome, if you try copying and pasting, the entire thing, you get:
I put money on the fact that in Firefox you would get:
|
Regarding your second point: this patch would not have an effect on the Firefox Tracemonkey paper. However, it would still fix messed up PDFs like this one in firefox. |
You're right, that is what is happening in Firefox. However, for the PDF you linked in the comment just above this one, I also see no difference in Firefox. Both https://mozilla.github.io/pdf.js/web/viewer.html and http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html copy any text without spaces for me. The DOM shows the spaces in each |
fontAscent = (1 + font.descent) * fontAscent; | ||
} | ||
return { | ||
x : (angle === 0) ? tx[4] :(tx[4] + (fontAscent * Math.sin(angle))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: change this line to x: (angle === 0 ? tx[4] : tx[4] + (fontAscent * Math.sin(angle))),
to get the spaces and parentheses right
addSpace = newPosisition.x >= lastPosition.x + lastChunk.width + | ||
wordSpacing; | ||
} else { | ||
// Right to left. Add space if next is before sart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typ: sart
-> start
?
@speedplane Probably it is trying to spawn a process/application that it cannot find. Run |
8519c6b
to
ee75402
Compare
…appropriate. Add test re same. PR mozilla#5783.
…appropriate. Add test re same. PR mozilla#5783.
ee75402
to
e9ad0b4
Compare
@speedplane, great work! IMHO, you resolve very important problem! |
|
@speedplane This also needs a rebase onto the current master. |
…appropriate. Add test re same. PR mozilla#5783.
What is the status of this? Any news? Edit: Is this PR only fixing the missing space for line breaks? Or also PDFs that completely miss spaces. In our case we have a pdf that even with this PR applied is not copying with spaces. |
@danez been very busy with other projects. I want to clean up this PR (and On Tue, Jun 30, 2015 at 11:18 AM, Daniel Tschinder <[email protected]
|
In the case of this pdf there are chunks between each word that only contain the space itself. The text-layer/viewer is not rendering the text-divs as the width of a space is calculated as 0. I first tried to change something there and render these divs with a single space, but then all the scale-calculation cannot be done as of width being zero. So I ended up adding spaces to the surrounding words if there are space-chunks in-between to fix that. What do you think about that? - // If the last chunk ends with a space it does not need one.
+
var lastChunk = bidiTexts[bidiTexts.length - 1];
+
+ // If the last chunk equals a single space concat the space to chunk before the last chunk
+ if (lastChunk.str === ' ' && bidiTexts.length - 2 >= 0) {
+ var beforeLastChunk = bidiTexts[bidiTexts.length - 2];
+ if (beforeLastChunk.str.length === 0) {
+ return;
+ }
+ var lastChar = beforeLastChunk.str[beforeLastChunk.str.length - 1];
+ if (lastChar !== ' ') {
+ beforeLastChunk.str += ' ';
+ }
+ return;
+ }
+
if (lastChunk.str.length === 0) {
return;
}
- var lastChar = lastChunk.str[lastChunk.str.length - 1];
+
+ // If the last chunk ends with a space it does not need one.
+ lastChar = lastChunk.str[lastChunk.str.length - 1];
if (lastChar === ' ' || lastChar === '-') {
return;
} |
hello, |
It needs to be rebased onto the current master and reviewed before we can land this. |
@speedplane any idea when you will get to this? |
Closing since this was never completely finished and it's not in a mergeable state. However, the associated issue will remain open with a reference to this pull request so the work can be continued later on. |
The Problem
Copy/paste for many PDFs is broken. Even in the TraceMonkey example, try copying and pasting the area highlighted in the image below, and you will get the following text:
Notice that when you paste, there is no space between
information
andis
. That is, there is no space between new lines.But there are examples that are much worse than the above. In the PDF at the link here, the PDF structure itself has no spaces. That is, whatever software created this PDF, only created PDF objects for the text itself, and did not consider spacing. Thus, when one copy/pastes anything from this document using pdf.js, they get no spaces in their pasted text.
It's not just copy/paste that's a problem, but CTRL+F find also does not work. If you CTRL+F for
information is
in the TraceMonkey example, you will not get any results.Other Viewers
Other PDF software gets this right, they figure out that there is a space between words or lines and automatically insert the proper space when the user copies text. Adobe Reader, for example, will not only detect a space and insert it, but it will also detect when a line ends with a hyphen and automatically remove it when doing copy/paste. For example, if you copied the word
comp-ile
in the image above, Reader would actually copycompile
. That is probably overkill, but it is interesting that others gave this signifigant amount of thought.Other software that also gets this right includes pdf miner, a project that goes to great lengths to properly extract text from PDFs. The Chrome viewer (Foxit) also gets this right (it also does hyphens right too).
The Solution
This commit aims to resolve these issues by automatically detecting when we need spaces, and inserting them into the text chunks where appropriate. The insertion occurs within the worker, so not only will copy/paste work, but
getTextContent()
will now also return text with proper spacing. Accordingly, any headless text extraction will also benefit from this PR.Unfortunately, for PDFs that do not have the
wordSpacing
text state set, this feature must resort to using a bit of a heuristic, but I do not see anyway around that.I also added a test which hooks into the
getTextContent
function.