Use pdf "/ActualText" feature #494

Open
mnjames opened this issue Nov 20, 2017 · 9 comments
Labels
enhancement Software improvement or feature request

Comments

@mnjames

mnjames commented Nov 20, 2017

The PDF standard includes a property called /ActualText, which lets you store the Unicode text alongside the glyphs that are normally all a PDF contains. This is wonderful for Arabic and other non-Latin scripts, which have never been reliably copy-pastable out of PDFs.

XeTeX added the setting "\XeTeXgenerateactualtext=1" a year or so ago so that PDFs produced with it include the ActualText data.

Is it possible to add a similar feature to SILE?
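
For context, here is a minimal Python sketch of the kind of marked-content sequence that carries /ActualText in a PDF content stream. The function name `span_with_actual_text`, the sample `Tj` operator, and the glyph ID `00C0` are hypothetical and only illustrate the shape of the output; this is not XeTeX's or SILE's actual code.

```python
def span_with_actual_text(text: str, glyph_ops: str) -> str:
    """Wrap content-stream operators in a /Span that carries /ActualText.

    The replacement text is written as a PDF hex string: the UTF-16BE
    byte order mark (FEFF) followed by the UTF-16BE code units of `text`.
    """
    hex_text = "FEFF" + text.encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# Hypothetical example: a single ligature glyph (made-up glyph ID 00C0)
# that should be copied and searched as the two characters "fi".
print(span_with_actual_text("fi", "<00C0> Tj"))
```

A viewer that understands /ActualText can use the text in the hex string for copy and search instead of guessing from the glyphs between `BDC` and `EMC`; a viewer that does not is expected to skip the property.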

@alerque
Member

alerque commented Nov 21, 2017

Never mind Arabic, I can't reliably copy/paste out of a PDF in Latin alphabet based languages!

I've heard of this feature in PDFs before but never played around with it. How widespread is reader support? Do you happen to know of a chart somewhere that shows what readers do or don't support PDF features like this?

@mnjames
Author

mnjames commented Nov 21, 2017 via email

@simoncozens
Member

There is some support for this through the pdfstructure package. (Linking to #110) Unfortunately I didn't document it and can't remember what it does. But I think if you include pdfstructure, it should automatically generate ActualText.

@neoh4x0r

neoh4x0r commented Nov 14, 2022

I haven’t been able to find much documentation on it. From myself and one other user I can currently report:
evince (linux) – doesn’t work

Edit: evince supports this now.
I'm not sure about qpdfview (I couldn't figure out how to copy text)

Edit 2: The only problem with enabling \XeTeXgenerateactualtext is that
when you select text (to copy), it turns invisible and only shows some squares,
possibly indicating missing characters.


I know I'm posting this 5 years later....but....

I was writing a game list (PDF) with LaTeX (using a script to find the games
and generate a table).

Without specifying \XeTeXgenerateactualtext=1 in the tex file, any text
containing a plain dash would show in the PDF but would not be present when
copied and pasted elsewhere.

After generating the PDF with the setting active, evince (as of now) has actual
dashes in the text that can be copied and pasted as one would expect.

PS: I see no reason why a feature like this shouldn't be turned on by default --
if a reader doesn't support the feature then it should, IMHO, simply ignore it and
display whatever it would have shown previously.

Long story short:

  1. \XeTeXgenerateactualtext=1 could solve an issue with unicode text copy/paste,
  2. It might make the text invisible when selected (happened in evince)

For my use-case -- plain dashes were not being copied and I didn't like the text turning
invisible when selected.

So I ultimately used the ascii package and replaced all dashes with \textascii{\char"2D}

@leorosa
Contributor

leorosa commented Nov 14, 2022

I'm not sure about qpdfview (I couldn't figure out how to copy text)

In qpdfview, you can press Ctrl+C, select the area containing the text with the mouse, and then choose "copy text".

@neoh4x0r

neoh4x0r commented Nov 14, 2022

I'm not sure about qpdfview (I couldn't figure out how to copy text)

In qpdfview, you can press Ctrl+C, select the area containing the text with the mouse, and then choose "copy text".

The text was copied correctly in qpdfview both with and without \XeTeXgenerateactualtext=1

So it does look like this is purely a PDF-viewer issue (very similar to the old question of which CSS features a given browser supports) and not related to LaTeX, SILE, XeLaTeX, etc.

@Omikhleia
Member

See somewhat related discussion #1927

@Omikhleia
Member

Omikhleia commented Sep 13, 2024

For the record, I experimented with emitting /ActualText directly in the libtexpdf outputter around text boxes, as I suggested in a discussion some time ago: #1927 (reply in thread)

Then, search (and copy) work well in Evince (before, it would fail on the fi ligature...):

image

But when selecting the text, it shows ugly things...

image

It might be an Evince-only problem (using v46.0) -- Okular (using v24.05.2) doesn't have this problem (= it also failed to find/copy the fi ligature, but with the suggested code change everything seems fine)

image

So I'm unsure whether it's a PDF-viewer problem or whether there's some deeper issue with this naive /ActualText approach.

@Omikhleia
Member

N.B. The "naive" patch:

diff --git a/outputters/libtexpdf.lua b/outputters/libtexpdf.lua
index c7f7d42b..cf7c8c60 100644
--- a/outputters/libtexpdf.lua
+++ b/outputters/libtexpdf.lua
@@ -132,6 +132,8 @@ function outputter:drawHbox (value, width)
    if not value.glyphString then
       return
    end
+   local txt = SU.utf8_to_utf16be_hexencoded(value.text)
+   pdf.add_content("/Span << /ActualText <" .. txt .. "> >>\nBDC\n")
    -- Nodes which require kerning or have offsets to the glyph
    -- position should be output a glyph at a time. We pass the
    -- glyph advance from the htmx table, so that libtexpdf knows
@@ -155,6 +157,7 @@ function outputter:drawHbox (value, width)
       buf = table.concat(buf, "")
       self:_drawString(buf, width, 0, 0)
    end
+   pdf.add_content("\nEMC")
 end
 
 function outputter:_withDebugFont (callback)
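
To make the encoding step in the patch concrete, here is a small Python sketch of what a helper named like `SU.utf8_to_utf16be_hexencoded` would be expected to produce, plus a decoder showing what a viewer would recover. The function names `utf16be_hex` and `decode_actual_text` are hypothetical, and the leading `feff` byte order mark is an assumption about the helper's output rather than something shown in the diff.

```python
def utf16be_hex(text: str) -> str:
    """Hex-encode text as UTF-16BE, prefixed with the FEFF byte order mark.

    Assumption: the BOM prefix marks the PDF string as UTF-16BE;
    it is not visible in the patch itself.
    """
    return "feff" + text.encode("utf-16-be").hex()


def decode_actual_text(hex_string: str) -> str:
    """Recover the text a viewer would copy from an /ActualText hex string."""
    raw = bytes.fromhex(hex_string)
    if raw[:2] == b"\xfe\xff":  # strip the byte order mark if present
        raw = raw[2:]
    return raw.decode("utf-16-be")


# Round trip for a word containing the "fi" ligature pair.
encoded = utf16be_hex("office")
print(encoded)
print(decode_actual_text(encoded))
```

Characters outside the Basic Multilingual Plane come out as surrogate pairs under this encoding, which Python's `utf-16-be` codec handles automatically.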

6 participants