Commit

support ocrmypdf#32 (#39) (#40)
* First working version with OCRmyPDF #32

* Fix variable assignment

* Use ProcessingFileAccessor to prevent infinite loop

* Update README for OCRmyPDF

* docs: update TOC

* Update README + app compliance

* Code compliance

* Apply suggestions from code review

Co-authored-by: Manuel Bentele <[email protected]>

Co-authored-by: R0Wi <[email protected]>
Co-authored-by: Manuel Bentele <[email protected]>

Co-authored-by: Robin Windey <[email protected]>
Co-authored-by: R0Wi <[email protected]>
Co-authored-by: Manuel Bentele <[email protected]>
4 people authored Nov 30, 2020
1 parent 8aa6e6f commit 75ad9f9
Showing 28 changed files with 834 additions and 1,152 deletions.
58 changes: 12 additions & 46 deletions README.md
@@ -14,8 +14,6 @@
- [App installation](#app-installation)
- [Nextcloud background jobs](#nextcloud-background-jobs)
- [Backend](#backend)
- [Imagick](#imagick)
- [Tesseract](#tesseract)
- [Usage](#usage)
- [How it works](#how-it-works)
- [General](#general)
@@ -44,46 +42,23 @@ Since the actual processing of the files is done asynchronously via Nextcloud's


### Backend
#### Imagick
Make sure `Imagick` is installed (the command below is for Debian-based Linux systems; it might be different on your system).
```bash
sudo apt-get install php-imagick
```
> :warning: Since `v1.20.1` you'll have to install `OCRmyPDF`.
Make sure `Imagick` is properly configured so that it can access PDF files. On Debian-based systems, edit the configuration file `/etc/ImageMagick-6/policy.xml` (the path might be different on your system). It has to contain at least this line:
```xml
<policymap>
<!-- [...] -->
<policy domain="coder" rights="read" pattern="PDF" />
<!-- [...] -->
</policymap>

```
If you use **any other background job setting than [`cron`](https://docs.nextcloud.com/server/latest/admin_manual/configuration_server/background_jobs_configuration.html#cron)** you'll have to restart your PHP environment for the above changes to be applied. Depending on your system, this is usually done by restarting your `php-fpm` daemon or web server, for example:
In the backend, [`OCRmyPDF`](https://github.com/jbarlow83/OCRmyPDF) is used for processing PDF files. Make sure you have this command-line tool installed.

```bash
# Restart php-fpm
sudo systemctl restart php7.3-fpm.service

# Restart Apache webserver
sudo systemctl restart apache2
```

You can find additional information about `Imagick` [here](https://www.php.net/manual/en/imagick.setup.php).
apt-get install ocrmypdf
```

> :warning: **Note that `Imagick` requires [Ghostscript](https://www.ghostscript.com) to properly read PDF files. You can find more details in the section [Supported Image Formats](https://imagemagick.org/script/formats.php#supported) of `Imagick`'s documentation.**
Also, if you want to use specific language settings, please install the corresponding `tesseract` language packages.

#### Tesseract
For the OCR part, the command-line tool `tesseract` is used. Make sure you have the library and the appropriate languages installed. I recommend installing the packages from the [PPA](https://github.com/tesseract-ocr/tessdoc/blob/master/Home.md) because they're newer than the official package sources (I tested with `tesseract 4.1.1`). On Ubuntu 18.04 you might type the following for the languages English and German:
```bash
# Install PPA
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
# English
apt-get install tesseract-ocr-eng

# Install Tesseract and language-files
sudo apt-get install tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng
# German
apt-get install tesseract-ocr-deu
```
You can read more about the installation of `tesseract` [here](https://github.com/tesseract-ocr/tesseract/wiki).
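
As a minimal sketch of how the installed language packs relate to OCRmyPDF's language option: OCRmyPDF accepts several Tesseract language codes joined with `+` via `-l` (for example `-l eng+deu`), and `tesseract --list-langs` prints the installed codes one per line. The helper names `join_langs` and `missing_langs` below are hypothetical and not part of this app; they only illustrate checking for missing language packs before running OCR.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the app), written for a POSIX shell.

# Join language codes into the "eng+deu" form accepted by `ocrmypdf -l`.
join_langs() (
  IFS='+'
  printf '%s\n' "$*"
)

# Print every requested language code that is absent from a list of
# installed languages.
# $1: newline-separated list, e.g. the output of `tesseract --list-langs`
# remaining arguments: requested language codes
missing_langs() (
  installed=$1
  shift
  for lang in "$@"; do
    # -x matches whole lines only, -F treats the code as a fixed string
    printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
  done
)

# Example usage (requires tesseract and ocrmypdf to be installed):
#   installed="$(tesseract --list-langs 2>/dev/null)"
#   missing_langs "$installed" eng deu        # prints codes that still need a package
#   ocrmypdf -l "$(join_langs eng deu)" input.pdf output.pdf
```

Any code printed by `missing_langs` would correspond to a `tesseract-ocr-<code>` package that still needs to be installed on Debian-based systems.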

## Usage
You can configure the OCR processing via Nextcloud's workflow engine. To do so, configure a new flow via `Settings -> Flow -> Add new flow` (if you don't see `OCR file` here, the app isn't installed properly or you forgot to activate it).
@@ -113,11 +88,7 @@ To **test** if your file gets processed properly you can do the following steps:
</p>

### PDF
<p align="center">
<img width="100%" src="doc/diagramms/pdf.svg" alt="PDF diagramm">
</p>

**Note on PDF processing:** since the processing algorithm for PDF files makes heavy use of splitting and recombining the single PDF pages, it could damage certain PDF files or alter their content.
For processing PDF files, the external command line tool [`OCRmyPDF`](https://github.com/jbarlow83/OCRmyPDF) is used. The tool is invoked with the [`--redo-ocr`](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped) parameter so that it will perform a detailed text analysis. The detailed analysis masks out visible text and sends the image of each page to the OCR processor. After processing, additional text is inserted as OCR, whereas existing text in a mixed file document (images embedded into text pages) is not disrupted.
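
To illustrate the invocation described above, here is a rough sketch of a shell wrapper that calls `ocrmypdf` with `--redo-ocr`. This is not the app's actual code (the app calls the tool from PHP); the function name `ocr_pdf`, the `DRY_RUN` switch, and the default language `eng` are assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical wrapper, not the app's implementation.
# Usage: ocr_pdf INPUT.pdf OUTPUT.pdf [LANGS]
# LANGS is a '+'-separated list of Tesseract language codes (assumed default: eng).
ocr_pdf() (
  input=$1
  output=$2
  langs=${3:-eng}

  # Assemble the command line: --redo-ocr asks OCRmyPDF to analyse existing
  # text and OCR only the remaining image content of each page.
  set -- ocrmypdf --redo-ocr -l "$langs" "$input" "$output"

  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Dry run: print the command instead of executing it
    printf '%s\n' "$*"
  else
    "$@"
  fi
)

# Example usage (the second call requires ocrmypdf to be installed):
#   DRY_RUN=1 ocr_pdf scan.pdf scan_ocr.pdf eng+deu
#   ocr_pdf scan.pdf scan_ocr.pdf eng+deu
```

The dry-run mode is only there so the command line can be inspected without OCRmyPDF being installed.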

## Development
### Dev setup
@@ -232,11 +203,6 @@ That's all. If you now create a new workflow based on your added mimetype, your
## Used libraries & components
| Name | Version | Link |
|---|---|---|
| tesseract_ocr | >= 2.9 | https://github.com/thiagoalessio/tesseract-ocr-for-php |
| tesseract (commandline) | >= 4.0 | https://github.com/tesseract-ocr/tesseract |
| pdfparser | >= 0.15.0 | https://www.pdfparser.org/ |
| fpdi | >= 2.3 | https://www.setasign.com/products/fpdi/about/ |
| fpdf | >= 1.8 | http://www.fpdf.org/ |
| imagick php extension | >= 2 | https://www.php.net/manual/de/book.imagick.php |
| Ghostscript | >= 9.0 | https://www.ghostscript.com/ |
| OCRmyPDF (commandline) | >= 9.6.0 | https://github.com/jbarlow83/OCRmyPDF |
| php-shellcommand | >= 1.6 | https://github.com/mikehaertl/php-shellcommand |
| PHPUnit | >= 8.0 | https://phpunit.de/ |
2 changes: 1 addition & 1 deletion appinfo/info.xml
@@ -6,7 +6,7 @@
<summary>OCR processing via workflow</summary>
<description>This app makes it possible to process various files via OCR algorithms.
The processing is done via workflow-engine and can therefore easily be customized.</description>
<version>1.20.0</version>
<version>1.20.1</version>
<licence>agpl</licence>
<author mail="[email protected]">Robin Windey</author>
<namespace>WorkflowOcr</namespace>
5 changes: 1 addition & 4 deletions composer.json
@@ -1,9 +1,6 @@
{
"require": {
"thiagoalessio/tesseract_ocr": "^2.9",
"smalot/pdfparser": "^0.15.0",
"setasign/fpdi": "^2.3",
"setasign/fpdf": "^1.8"
"mikehaertl/php-shellcommand": "^1.6"
},
"require-dev": {
"phpunit/phpunit": "^8.0",