ingest-attachment PDF-processing should be configurable #36890

janLo · 2018-12-20T12:21:17Z

Describe the feature:

The ingest-attachment plugin uses Apache Tika for document content extraction. Tika's PDF parser has several configuration options, which can be set through a properties file or programmatically (see http://tika.apache.org/1.19.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html).

The plugin always uses the default values and offers no way to change them: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L82

This causes problems with certain documents where the PDF-producing software (e.g. the Scanbot app) tries to place each character of the OCRed text exactly behind the corresponding character in the scanned image. With the default setup, the parser then inserts many unexpected spaces into the extracted text. This is especially surprising because these spaces do not appear when the text is copied from a PDF reader. In the end, this renders the plugin useless for such documents, since a document can only be found if the user knows where the spaces were inserted.

An example of this problem is described here: nextcloud/files_fulltextsearch#29

The situation might be resolved if the plugin allowed the user to set properties like AverageCharTolerance and SpacingTolerance via a configuration mechanism.
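To illustrate why a tolerance setting matters here, the following is a toy model of tolerance-based space insertion. It is not Tika's or PDFBox's actual algorithm, and it is written in Python only for brevity. The names `Glyph`, `join_glyphs` and `gap_tolerance` are hypothetical, and the single threshold is only a stand-in for the role that settings such as AverageCharTolerance and SpacingTolerance are assumed to play.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # horizontal extent of the glyph


def join_glyphs(glyphs, gap_tolerance):
    """Join glyphs left to right, inserting a space whenever the gap between
    two consecutive glyphs exceeds gap_tolerance * average glyph width.

    Hypothetical simplification: a real extractor uses more signals than this.
    """
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_tolerance * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# OCR-style layout: every glyph is pinned to the position of the scanned
# character, so the gaps inside a single word are uneven.
word = [
    Glyph("s", 0.0, 5.0),
    Glyph("c", 5.5, 5.0),   # gap of 0.5 after "s"
    Glyph("a", 13.0, 5.0),  # gap of 2.5 after "c"
    Glyph("n", 18.2, 5.0),  # gap of about 0.2 after "a"
]

strict = join_glyphs(word, gap_tolerance=0.3)   # "sc an"
lenient = join_glyphs(word, gap_tolerance=0.6)  # "scan"
```

With the stricter threshold, the word is split and a search for "scan" would miss the document. With the more lenient one, the word stays intact. This is the kind of adjustment that a configurable tolerance would make possible.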

elasticmachine · 2018-12-20T12:46:57Z

Pinging @elastic/es-core-features

clawoflight · 2019-09-30T06:49:56Z

Please look into this, it's deal-breaking for the primary use case of file indexing!

dakrone · 2024-05-17T20:22:28Z

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

janLo mentioned this issue Dec 20, 2018

PDF text extraction not very reliable nextcloud/files_fulltextsearch#29 (Open)

colings86 added the :Data Management/Ingest Node label Dec 20, 2018

martijnvg mentioned this issue Nov 12, 2019

Improve ingest node usability #48999 (Closed, 12 tasks)

rjernst added the Team:Data Management label May 4, 2020

dakrone closed this as not planned May 17, 2024
