[Cross-posted from the Forum/Suggestion] Implement a way to integrate (original image file, detected text) →searchable PDF #660

Wikinaut · 2017-01-13T09:17:27Z

https://groups.google.com/forum/#!topic/tesseract-ocr/vvMldrkcuOQ has asked:

I have a pdf (scanned) and now i make a searchable pdf from this.
First i generate a black/white multipage tif, and with tesseract i can make a searchable pdf.
But is it somehow possible to integrate the original pdf images?
because the generated tif does not have the same quality as the original (maybe the scanned image is in color).

How to reproduce:

Assume one page with a colored background in.pdf, converted to in.ppm image
preprocess unpaper in.ppm in-cleaned.ppm
process with (example) tesseract in-cleaned.ppm out -l deu+eng --oem 2 pdf txt
tesseract mixed output file out.pdf now has a blotchy background (from the unpaper step above)

Is there any way to "feed-in" the original in.ppm as image, so that this is used instead of in-cleaned.ppm when creating the out.pdf ?

So what is wanted is original input image plus ocr layer, so that output looks like


jbarlow83 · 2017-01-14T01:02:07Z

This is a complicated way of asking for an option to send one image through OCR and insert a different image in the output PDF.

tesseract --pdf-image original.png cleaned.png -l eng --oem 2 pdf  # not implemented, could work like this

I know this was requested before and I believe @jbreiden said it would be added to the PDF renderer at some point.

jbreiden · 2017-01-14T02:33:31Z

I'm very reluctant to make Tesseract PDF generation fancy. I wonder if we can do an image swap like this outside of Tesseract, using one of the PDF manipulation toolkits.

jbarlow83 · 2017-01-14T02:38:57Z

Sounds reasonable. It is fairly simple to swap an image using qpdf's C++ API.


Wikinaut · 2017-01-14T02:42:23Z

@jbreiden It's the last really missing issue.
The new algorithm is already a boost in quality. I reach here up to 100% OCR quality (for --oem 2 -l deu+eng) including these beasty "Umlauts" äöüÄÖÜß....

If this helps, I will donate some mBTC for implementing it just right now. Just post your receiving address.

Wikinaut · 2017-01-14T02:52:04Z

@jbarlow83 background info. As you know, I recently wanted to try your OCRmyPDF because I found the interesting --clean option (source: https://media.readthedocs.org/pdf/ocrmypdf/latest/ocrmypdf.pdf ), which would have solved my problem because it "does not alter the final output":

--clean
uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.

--clean-final
uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

but unfortunately this does not work with tesseract 4 at present.

So I looked for bug reports, if tesseract could pass the original input image to the output; and filed the present issue.

jbreiden · 2017-01-14T05:22:58Z

Really? That's interesting, qpdf is very well written. Maybe the right thing to do is allow Tesseract to produce a multi-page PDF with invisible symbolic text PDF only, no images. Then another tool (perhaps an enhanced qpdf tool) would merge and composite two PDFs together. One being the original image-only PDF, and the other an invisible-text-only PDF. What do you think, @jbarlow83? Please point me at the relevant qpdf API calls if you happen to know them.

jbarlow83 · 2017-01-15T00:30:06Z

I think invisible-text-only output would be far more useful for developers who integrate tesseract, or for anyone who wants to do something fancy. It would still make sense to keep the existing OCR-with-image option, of course. As a plus, it should be easier to suppress the image than to add a different one.

OCRmyPDF (which I maintain) uses Ghostscript to rasterize and then runs one of its two PDF renderers. One uses Tesseract hOCR and provides more features but is not as good at producing the OCR text layer as Tesseract PDF, so I also provide Tesseract PDF. If Tesseract could produce an invisible-text-only PDF, I could offer all the features for both and work towards phasing out the hOCR renderer. When possible I already graft the text layer onto the existing PDF instead of constructing a new one.

In addition to OCRmyPDF, pdftk multibackground could merge an OCR layer onto an existing PDF (by "watermarking"), so there is at least one other supported tool out there that should work out of the box. There are some other tools that wrap tesseract for use with PDFs as well.

In writing this I've made a case for not using qpdf, because other tools should be able to do the job with an invisible-text PDF. For interest's sake, though, here is example code that inverts black and white for all images; clearly this is close to how one would replace an image outright.

jbreiden · 2017-01-15T03:51:10Z

This sounds reasonable to me. I'll try to find time over this coming week to make an experimental invisible-text-only PDF that we can play with. All the other pieces of the puzzle are there; for example, Leptonica already ships with an images->pdf tool that avoids transcoding for PNG, JP2K, and JPEG. It would be cool to use qpdf for the merge step because it is already so useful for linearizing. But it's great that there are more options. The qpdf author is extremely friendly in my experience, in case we eventually chat with him. Oh, I now vaguely remember that PDFBox had something for merging as well, but I've never tried it and can't find it at the moment.

jbreiden · 2017-01-17T23:37:52Z

Here's an experimental PDF pair, image-only and text-only. Let the merging begin!

images.pdf
text.pdf

jbreiden · 2017-01-18T05:46:59Z

This works brilliantly. I will implement for real if someone promises that they will use it. Also, what do we call the configuration option? My best idea so far to describe a PDF that has invisible text only is 'naked'. I'm sure someone has a better idea.

$ time pdftk text.pdf multibackground images.pdf output full.pdf
real	0m0.253s

Actually this works better the other way around, for preserving the bookmarks and things like that.

 pdftk  images.pdf multibackground text.pdf output full.pdf
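For scripting the merge, here is a minimal Python sketch of the second form above (images first, text as the background). The helper names `pdftk_merge_command` and `merge_with_pdftk` are made up for illustration, and the sketch assumes pdftk is installed and on PATH.

```python
import shutil
import subprocess


def pdftk_merge_command(images_pdf, text_pdf, output_pdf):
    """Build the pdftk call from the thread: images first, text as background."""
    return ["pdftk", images_pdf, "multibackground", text_pdf,
            "output", output_pdf]


def merge_with_pdftk(images_pdf, text_pdf, output_pdf):
    """Run the merge; raises if pdftk is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise FileNotFoundError("pdftk not found on PATH")
    subprocess.run(pdftk_merge_command(images_pdf, text_pdf, output_pdf),
                   check=True)
```

Calling `merge_with_pdftk("images.pdf", "text.pdf", "full.pdf")` would run the same command as typed above.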

jbreiden · 2017-01-18T20:23:10Z

Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.

tesseract -c naked_pdf=true HelloWorld.png HelloWorld pdf

jbarlow83 · 2017-01-18T23:01:10Z

Looks really good @jbreiden.

Works great in pdftk. No display issues and PDF syntax looks fine.

PyPDF2 is also capable of merging. It does not have an equivalent of "multibackground", but pages can be merged manually. Here is merging one page:

In [1]: import PyPDF2 as pypdf

In [4]: pdf_text = pypdf.PdfFileReader(open('text.pdf', 'rb'))

In [5]: pdf_image = pypdf.PdfFileReader(open('images.pdf', 'rb'))

In [6]: page_text = pdf_text.pages[1]

In [7]: page_image = pdf_image.pages[1]

In [8]: page_text.mergeRotatedScaledTranslatedPage(page_image, 0, 1.0, 0, 0, expand=False)

In [9]: out = pypdf.PdfFileWriter()

In [10]: out.addPage(page_text)

In [11]: with open('pypdfmerge.pdf','wb') as o:
    ...:     out.write(o)
    ...:
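Extending the session above from one page to a whole document could look roughly like the sketch below. It is not what OCRmyPDF does; `merge_text_layers` is a hypothetical helper, and it assumes page objects that expose a `merge_page(other)` method, which is the spelling used by current pypdf releases (the session above uses the older PyPDF2 camelCase names). Unlike the session, it draws the text page onto the image page; since the text is invisible, either order should display the same.

```python
def merge_text_layers(image_pages, text_pages):
    """Overlay each text-only page onto the matching image page, in place."""
    image_pages = list(image_pages)
    text_pages = list(text_pages)
    if len(image_pages) != len(text_pages):
        raise ValueError("image and text PDFs must have the same page count")
    for image_page, text_page in zip(image_pages, text_pages):
        # merge_page appends the other page's content stream to this page.
        image_page.merge_page(text_page)
    return image_pages


# Possible usage with pypdf (third-party, assumed installed):
#
#   from pypdf import PdfReader, PdfWriter
#   writer = PdfWriter()
#   for page in merge_text_layers(PdfReader("images.pdf").pages,
#                                 PdfReader("text.pdf").pages):
#       writer.add_page(page)
#   with open("merged.pdf", "wb") as fh:
#       writer.write(fh)
```

Checking the page counts up front avoids silently dropping pages when the two files got out of sync.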

For reference, pdfbox did not work out of the box. As far as I can tell the closest command in pdfbox is

java -jar pdfbox-app-2.0.2.jar OverlayPDF images.pdf text.pdf pdfboxoverlay.pdf

However, pdfbox takes the unusual approach of rasterizing the overlay PDF as a bitmap and drawing it on top of the base page, making it useless regardless of image/text order. (I suppose when you go to the trouble of implementing a full PDF renderer in Java you feel compelled to use it even when it's not strictly needed.)

jbarlow83 · 2017-01-18T23:12:06Z

I don't know about calling it a naked PDF because there's nothing exciting to see in it. It's more of a phantom or spectral apparition PDF, having form without substance.

ocr_text_only would do, or suppress_images? Not nearly as fun, but practical.

jbreiden · 2017-01-18T23:19:39Z

Spectral writing. Perhaps a kind of ghost script, if you will.

Shreeshrii · 2017-01-19T09:35:27Z

How about text_only_pdf ?

@jbreiden is it also possible to use a .pdf file as input to tesseract directly?

amitdo · 2017-01-19T13:53:49Z

pdf_invisible_text_layer_only
+
a config file pdfinvisible (or maybe pdf0)

jbarlow83 · 2017-01-19T19:52:41Z

@Shreeshrii PDF is a very complex vector-based file format. Tesseract works only on images. It is much easier to write PDFs that use a limited set of PDF features than read arbitrary PDFs. Have a look at OCRmyPDF (which I develop) - it addresses the details of using tesseract to apply OCR to PDFs.

Wikinaut · 2017-01-19T21:41:41Z

@jbreiden @jbarlow83 @amitdo info: I just built the whole toolchain from their git repos (tesseract, ocrmypdf, unpaper), and have ghostscript version 9.20 ready in a dedicated debian 9 "OCR VM" on my Qubes OS system.

Please let me know what (if anything) you want me to test - I have time to test and want to help you.

jbreiden · 2017-01-19T22:37:20Z

Hmmm, an invisible text layer, invisible text, let's see ... iText? Anyway, I'll pick something. There is zero chance that a PDF rasterizer will ever be part of Tesseract or Leptonica. In theory one could write a PDF image extractor for Leptonica, but there isn't really enough motivation to do so.

jbreiden · 2017-01-20T02:22:41Z

Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.

--- api/pdfrenderer.cpp	2016-12-13 14:43:24.000000000 -0800
+++ api/pdfrenderer.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -178,10 +178,12 @@
  * PDF Renderer interface implementation
  **********************************************************************/
 
-TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
+TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
+                                 bool textonly)
     : TessResultRenderer(outputbase, "pdf") {
   obj_  = 0;
   datadir_ = datadir;
+  textonly_ = textonly;
   offsets_.push_back(0);
 }
 
@@ -326,7 +328,11 @@
   pdf_str.add_str_double("", prec(width));
   pdf_str += " 0 0 ";
   pdf_str.add_str_double("", prec(height));
-  pdf_str += " 0 0 cm /Im1 Do Q\n";
+  pdf_str += " 0 0 cm";
+  if (!textonly_) {
+    pdf_str += " /Im1 Do";
+  }
+  pdf_str += " Q\n";
 
   int line_x1 = 0;
   int line_y1 = 0;
@@ -832,6 +838,7 @@
 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   size_t n;
   char buf[kBasicBufSize];
+  char buf2[kBasicBufSize];
   Pix *pix = api->GetInputImage();
   char *filename = (char *)api->GetInputName();
   int ppi = api->GetSourceYResolution();
@@ -840,6 +847,9 @@
   double width = pixGetWidth(pix) * 72.0 / ppi;
   double height = pixGetHeight(pix) * 72.0 / ppi;
 
+  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
+  const char *xobject = (textonly_) ? "" : buf2;
+
   // PAGE
   n = snprintf(buf, sizeof(buf),
                "%ld 0 obj\n"
@@ -850,19 +860,18 @@
                "  /Contents %ld 0 R\n"
                "  /Resources\n"
                "  <<\n"
-               "    /XObject << /Im1 %ld 0 R >>\n"
+               "    %s"
                "    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
                "    /Font << /f-0-0 %ld 0 R >>\n"
                "  >>\n"
                ">>\n"
                "endobj\n",
                obj_,
-               2L,            // Pages object
-               width,
-               height,
-               obj_ + 1,      // Contents object
-               obj_ + 2,      // Image object
-               3L);           // Type0 Font
+               2L,  // Pages object
+               width, height,
+               obj_ + 1,  // Contents object
+               xobject,   // Image object
+               3L);       // Type0 Font
   if (n >= sizeof(buf)) return false;
   pages_.push_back(obj_);
   AppendPDFObject(buf);
@@ -899,13 +908,15 @@
   objsize += strlen(b2);
   AppendPDFObjectDIY(objsize);
 
-  char *pdf_object;
-  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
-    return false;
+  if (!textonly_) {
+    char *pdf_object = nullptr;
+    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
+      return false;
+    }
+    AppendData(pdf_object, objsize);
+    AppendPDFObjectDIY(objsize);
+    delete[] pdf_object;
   }
-  AppendData(pdf_object, objsize);
-  AppendPDFObjectDIY(objsize);
-  delete[] pdf_object;
   return true;
 }
 

--- api/renderer.h	2016-11-07 07:44:03.000000000 -0800
+++ api/renderer.h	2017-01-19 14:50:56.000000000 -0800
@@ -186,7 +186,7 @@
  public:
   // datadir is the location of the TESSDATA. We need it because
   // we load a custom PDF font from this location.
-  TessPDFRenderer(const char *outputbase, const char *datadir);
+  TessPDFRenderer(const char* outputbase, const char* datadir, bool textonly);
 
  protected:
   virtual bool BeginDocumentHandler();
@@ -196,20 +196,20 @@
  private:
   // We don't want to have every image in memory at once,
   // so we store some metadata as we go along producing
-  // PDFs one page at a time. At the end that metadata is
+  // PDFs one page at a time. At the end, that metadata is
   // used to make everything that isn't easily handled in a
   // streaming fashion.
   long int obj_;                     // counter for PDF objects
   GenericVector<long int> offsets_;  // offset of every PDF object in bytes
   GenericVector<long int> pages_;    // object number for every /Page object
   const char *datadir_;              // where to find the custom font
+  bool textonly_;                    // skip images if set
   // Bookkeeping only. DIY = Do It Yourself.
   void AppendPDFObjectDIY(size_t objectsize);
   // Bookkeeping + emit data.
   void AppendPDFObject(const char *data);
   // Create the /Contents object for an entire page.
-  static char* GetPDFTextObjects(TessBaseAPI* api,
-                                 double width, double height);
+  char* GetPDFTextObjects(TessBaseAPI* api, double width, double height);
   // Turn an image into a PDF object. Only transcode if we have to.
   static bool imageToPDFObj(Pix *pix, char *filename, long int objnum,
                           char **pdf_object, long int *pdf_object_size);

--- api/tesseractmain.cpp	2016-12-15 15:28:37.000000000 -0800
+++ api/tesseractmain.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -337,8 +337,10 @@
 
     api->GetBoolVariable("tessedit_create_pdf", &b);
     if (b) {
-      renderers->push_back(
-          new tesseract::TessPDFRenderer(outputbase, api->GetDatapath()));
+      bool textonly;
+      api->GetBoolVariable("textonly_pdf", &textonly);
+      renderers->push_back(new tesseract::TessPDFRenderer(
+          outputbase, api->GetDatapath(), textonly));
     }
 
     api->GetBoolVariable("tessedit_write_unlv", &b);

--- ccmain/tesseractclass.cpp	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.cpp	2017-01-19 18:15:57.000000000 -0800
@@ -391,6 +391,8 @@
                   this->params()),
       BOOL_MEMBER(tessedit_create_pdf, false, "Write .pdf output file",
                   this->params()),
+      BOOL_MEMBER(textonly_pdf, false, "Invisible text only for PDF",
+                  this->params()),
       STRING_MEMBER(unrecognised_char, "|",
                     "Output char for unidentified blobs", this->params()),
       INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),

--- ccmain/tesseractclass.h	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.h	2017-01-19 16:31:04.000000000 -0800
@@ -1027,6 +1027,7 @@
   BOOL_VAR_H(tessedit_create_hocr, false, "Write .html hOCR output file");
   BOOL_VAR_H(tessedit_create_tsv, false, "Write .tsv output file");
   BOOL_VAR_H(tessedit_create_pdf, false, "Write .pdf output file");
+  BOOL_VAR_H(textonly_pdf, false, "Invisible text only for PDF");
   STRING_VAR_H(unrecognised_char, "|",
                "Output char for unidentified blobs");
   INT_VAR_H(suspect_level, 99, "Suspect marker level");
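Read as Python, the renderer part of the patch comes down to two conditionals: skip the `/Im1 Do` paint operator in the page content, and skip the `/XObject` entry in the page resources. The sketch below only illustrates that logic and is not Tesseract code. The function names are hypothetical, the leading `q` operator is assumed (it is outside the diff context shown), and `:g` formatting stands in for the renderer's `prec()` helper.

```python
def image_placement_ops(width_pt, height_pt, textonly):
    """Content-stream snippet that scales the unit square to the page size.

    With textonly set, the matrix is still written but nothing is painted.
    """
    ops = f"q {width_pt:g} 0 0 {height_pt:g} 0 0 cm"
    if not textonly:
        ops += " /Im1 Do"  # paint the page image
    return ops + " Q\n"


def xobject_resource(image_obj_num, textonly):
    """Resource dictionary entry for the image; empty in text-only mode."""
    if textonly:
        return ""
    return f"/XObject << /Im1 {image_obj_num} 0 R >>\n"


print(image_placement_ops(612.0, 792.0, textonly=False), end="")
print(image_placement_ops(612.0, 792.0, textonly=True), end="")
```

The text objects that follow in the content stream are untouched, which is why the text-only page keeps the same geometry as the full one.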

RNCTX · 2017-01-20T02:51:59Z

@Shreeshrii http://kiirani.com/2013/03/22/tesseract-pdf.html

The PDF/invisible text output you guys are implementing works quite well for me using OSX 'Preview' but for a little jerkiness depending on scaling, of course.

This is quite a big deal, in my opinion, as it will allow those who have, for instance... legal documents containing notary stamps in color, or in my use-case aviation emergency manuals with color-coded pages, to keep their original copies unmodified from their scanners, but modify them in a clean way into searchable documents. Thanks for this.

Shreeshrii · 2017-01-20T04:12:15Z

Thanks for info on pdf to images conversion for use with tesseract.

I usually use ghostscript for the purpose e.g.

gs -dNOPAUSE -dBATCH  -r300x300 -sDEVICE=tiffg4  -dFirstPage=168  -dLastPage=174 -sOutputFile=sample%03d.tif ./sample.pdf

I will give the other suggestions a try (including a new one suggested by zdenop in the forum- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/vvMldrkcuOQ/xLES3_ZoEwAJ )

@jbreiden Thanks, Jeff, for this invisible text output pdf which can be merged with the original pdf.

jbreiden · 2017-01-20T04:24:10Z

pdfimages from poppler-utils will do image extraction as well. And pdfium offers API calls for image extraction. I am sure there are many others. Have fun.

amitdo · 2017-01-20T10:25:32Z

Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.

I suggest merging this into master now. Ray can modify it later if needed.

zdenop · 2017-01-20T20:28:38Z

merged to master.
@Wikinaut: try master now.

Wikinaut · 2017-01-20T20:41:42Z

@zdenop effa574 does not work: breaks tesseract [UPDATE:] ~~and creates broken files~~. Who has tested that patch, and how ?

Wikinaut · 2017-01-21T00:32:10Z

why again *.jpg (step 1) ? Never ever use jpg with text files.
Please don't tell the mass about jpeg. Use png, ppm, or tif...

Wikinaut · 2017-01-21T00:33:24Z

I already developed code for this using -c textonly_pdf=1, thanks

jbreiden · 2017-01-21T00:33:47Z

An image-only PDF file is a bag of images. If the bag is holding a bunch of JPEG images, extract them as-is. Don't convert. Don't recompress. Just empty the PDF bag and get your images out. If it is holding JPEG2000, then just get those out. Same with PNG.

Wikinaut · 2017-01-21T00:35:22Z

Yes and no, why can't tesseract do this (pass-through the "bunch of input images") ?

jbreiden · 2017-01-21T00:37:12Z

Let's shift this discussion back to the forum. Please re-ask your most recent question there; I don't follow exactly what you are asking.

Wikinaut · 2017-01-21T00:38:44Z

Pls. elaborate your step
"Extract the images from the PDF file (don't render!). For this example, we'll assume jpeg."

I use
pdftoppm -aa yes -r 400 -scale-to-x 2000 -scale-to-y 2800 in.pdf image

zdenop · 2017-01-21T06:46:31Z

C-API should be fixed now. Thanks for finding this wikinaut.
@jbreiden capi.cpp and capi.h are C-API for tesseract that is used for tesseract wrappers (python etc.)
@Wikinaut as pointed out by Jeff, please move this discussion back to the tesseract user forum.

Jmuccigr · 2017-05-04T01:12:43Z

Was there a final resolution to this request for putting back in the original images? @Wikinaut?

jbreiden · 2017-05-04T01:52:19Z

Yes. The final solution was to implement tesseract -c textonly_pdf=1
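For anyone scripting this, a minimal Python sketch of assembling that call follows. The helper name `textonly_pdf_command` is hypothetical; the argument order mirrors the commands used earlier in this thread, and the sketch only builds the argument list (it assumes a tesseract build that knows `textonly_pdf`).

```python
def textonly_pdf_command(image, outputbase, languages=None):
    """Build a tesseract call that writes an invisible-text-only PDF."""
    cmd = ["tesseract", "-c", "textonly_pdf=1", image, outputbase]
    if languages:
        # Languages are joined with '+', e.g. deu+eng.
        cmd += ["-l", "+".join(languages)]
    cmd.append("pdf")  # the pdf config selects the PDF renderer
    return cmd


print(" ".join(textonly_pdf_command("in-cleaned.ppm", "out", ["deu", "eng"])))
```

The list can be passed to `subprocess.run(cmd, check=True)`; the resulting out.pdf is then merged with an image-only PDF of the original scan using one of the tools discussed above.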

Jmuccigr · 2017-05-04T04:51:06Z

Yeah, that doesn't work for me: Could not set option: textonly_pdf=1

I'm using version 3.05.00 installed via homebrew.

Wikinaut · 2017-05-04T08:50:48Z

@Jmuccigr I am definitely not happy with the current implementation, and decided some months ago to stay silent and let other users come back with the issue (hoping that my original proposal - passing the original input image through without transcoding it - will be implemented in forthcoming versions).

amitdo · 2017-05-04T09:49:23Z

I'm using version 3.05.00

The textonly_pdf parameter is only available on the HEAD (4.00)

Jmuccigr · 2017-05-04T15:06:12Z

@Wikinaut, yeah, my workflow at some point involves adding OCR'ed text to an optimized PDF. Having the OCR step degrade the quality of that PDF kind of spoils it.

Shreeshrii · 2017-05-04T15:18:01Z

The textonly_pdf parameter is only available on the HEAD (4.00)

@zdenop Please backport for 3.05. Thanks!

zdenop · 2017-05-05T16:57:01Z

done.

Shreeshrii · 2017-05-05T17:12:38Z

Thanks, @zdenop. Please also make a 3.05.01 release with the latest commit in 3.05 branch so that all these enhancements are easily accessible.

Jmuccigr · 2017-06-05T17:54:01Z

Just getting back to this now that 3.05.01 has hit homebrew and wanted to say that it seems to be working.

I've tested it out by running text-only tesseract on a 2x version of an image - which tends to give better results if the original dpi is too low - and then combining that text-only PDF with a PDF made from the original image, which keeps the file size down.
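One thing to watch with this approach, sketched below in Python: the patch posted earlier in this thread computes the page size as pixels * 72 / ppi, so the text-only page presumably only lines up with the original if the 2x image also carries twice the resolution. The function names and the example numbers are illustrative assumptions, not measurements.

```python
def page_size_points(width_px, height_px, ppi):
    """Page size in PDF points, using the formula from the renderer patch."""
    return (width_px * 72.0 / ppi, height_px * 72.0 / ppi)


def same_page_size(a, b, tolerance=0.5):
    """True if two (width, height) sizes agree within `tolerance` points."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


original = page_size_points(1000, 1500, 150)   # original scan at 150 ppi
upscaled = page_size_points(2000, 3000, 300)   # 2x pixels, 2x ppi
mismatch = page_size_points(2000, 3000, 150)   # 2x pixels, ppi left unchanged

print(original, upscaled, same_page_size(original, upscaled))
print(mismatch, same_page_size(original, mismatch))
```

If the sizes disagree, the invisible text will likely land in the wrong place after merging, so it seems worth checking before combining the two PDFs.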

gsauthof · 2018-05-01T13:21:19Z

FWIW, I created a small command line utility pdfmerge as a frontend to the merge functionality (equivalent to the pdftk multibackground command) in the Python packages PyPDF2 and pdfrw.

wrznr · 2019-10-24T16:22:39Z

From version 8.4.0 on, qpdf has the options --overlay/--underlay for easy merging of image-only and text PDFs. E.g.,

$ qpdf image.pdf --underlay text.pdf -- image_txt.pdf
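Since the option only exists from 8.4.0 on, a script may want to check the installed version first. Here is a rough Python sketch; the helper names are hypothetical, and it assumes `qpdf --version` prints a line containing a dotted version such as "qpdf version 8.4.0".

```python
import re


def parse_qpdf_version(version_output):
    """Extract (major, minor, patch) from the text of `qpdf --version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in qpdf output")
    return tuple(int(part) for part in match.groups())


def supports_underlay(version_output):
    """--overlay/--underlay are available from qpdf 8.4.0 on."""
    return parse_qpdf_version(version_output) >= (8, 4, 0)


def qpdf_underlay_command(image_pdf, text_pdf, output_pdf):
    """Build the merge call shown above."""
    return ["qpdf", image_pdf, "--underlay", text_pdf, "--", output_pdf]
```

The version text could come from `subprocess.run(["qpdf", "--version"], capture_output=True, text=True).stdout`.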

amitdo · 2019-10-24T17:06:23Z

@wrznr, thanks for the info.

Jeankree · 2023-02-27T17:16:20Z

Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.
tesseract -c naked_pdf=true HelloWorld.png HelloWorld pdf

Hello,
I tried this today (tesseract v5.3.0.20221214) but I was not able to run it… I always have this error:
Could not set option: naked_pdf=true
My command:
tesseract -c naked_pdf=true Duerer_Image.jpg Duerer_wText pdf
Was this option disabled in version 5 ?

My first goal was to try to understand how it works and what it does exactly... (for merging image and textonly pdf files).

Thank you!

amitdo · 2023-02-27T17:43:09Z

You didn't read the whole thread. The parameter name was changed to textonly_pdf.

Jeankree · 2023-02-27T19:56:33Z

You didn't read the whole thread. The parameter name was changed to textonly_pdf .

Oh sorry! I thought it was another parameter! (I already know textonly_pdf)
And: of course, I did read the whole thread, and now I read it again to be sure; I did not see any mention about changing the name of this parameter. This is why I thought it could have been different and have another use, but now, I understand that naked_pdf=true was only a temporary name…
Thank you for the clarification!

amitdo · 2023-02-28T06:44:25Z

I should have written: "Did you read the whole thread?" or just omit the sentence.

Wikinaut mentioned this issue Jan 17, 2017

Output PDFs have decreased quality ocrmypdf/OCRmyPDF#125

Closed

jbreiden added the PDF label Jan 18, 2017

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017

patch for tesseract-ocr#660 textonly_pdf

3609964

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017

patch for tesseract-ocr#660 textonly_pdf

5e80891

zdenop closed this as completed Jan 21, 2017

Jmuccigr mentioned this issue May 3, 2017

Images unnecessarily compressed? ocrmypdf/OCRmyPDF#163

Closed

jbreiden mentioned this issue Dec 17, 2018

OCR Resolution and PDF output #2108

Closed

ariefcfa mentioned this issue Mar 27, 2019

Create a PDF with multiple pages? #1268

Closed

treysp mentioned this issue Mar 27, 2019

Question: overlaying OCR'd text in package scope? ropensci/qpdf#2

Closed

amitdo added the feature request label Apr 27, 2022
