ingest-attachment PDF processing should be configurable #36890
Labels
:Data Management/Ingest Node
Execution or management of Ingest Pipelines including GeoIP
Team:Data Management
Meta label for data/management team
Describe the feature:
The ingest-attachment plugin uses Apache Tika for document content extraction. Tika's PDF parser has several configuration options. These can be set using a properties file or programmatically (see http://tika.apache.org/1.19.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html).
The plugin always uses the default values and offers no way to change them: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L82
This causes problems with certain documents where the PDF-producing software (e.g. the Scanbot app) tries to place each character of the OCRed text exactly behind the corresponding character in the scanned image. With the default settings, the parser then inserts many unexpected spaces into the extracted text. This is especially surprising because these spaces do not appear when the text is copied from a PDF reader. As a result, the plugin is useless for such documents: a document can only be found if the user knows where the spaces were inserted.
An example of this problem is described here: nextcloud/files_fulltextsearch#29
The situation might be resolved if the plugin allowed the user to set properties like AverageCharTolerance and SpacingTolerance via a configuration mechanism.
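To make the request concrete, here is a rough sketch of how such options might be read from a processor-style configuration map. Everything in it is hypothetical: the class name `PdfToleranceOptions` and the keys `average_char_tolerance` and `spacing_tolerance` do not exist in the plugin today. A value left unset stays `null`, so the parser's own default would remain in effect. On the Tika side, I would expect the parsed values to be passed to `PDFParserConfig.setAverageCharTolerance` and `PDFParserConfig.setSpacingTolerance` and the config to be registered on the `ParseContext`, but that wiring is not shown here.

```java
import java.util.Map;

// Hypothetical sketch: neither this class nor the option keys exist in ingest-attachment.
final class PdfToleranceOptions {
    // null means "not configured", i.e. keep whatever default the PDF parser uses.
    final Float averageCharTolerance;
    final Float spacingTolerance;

    private PdfToleranceOptions(Float averageCharTolerance, Float spacingTolerance) {
        this.averageCharTolerance = averageCharTolerance;
        this.spacingTolerance = spacingTolerance;
    }

    // Reads the two (assumed) option keys from a configuration map.
    static PdfToleranceOptions fromConfig(Map<String, Object> config) {
        return new PdfToleranceOptions(
            readTolerance(config, "average_char_tolerance"),
            readTolerance(config, "spacing_tolerance"));
    }

    // Accepts numbers or numeric strings; rejects anything that is not a finite number.
    private static Float readTolerance(Map<String, Object> config, String key) {
        Object raw = config.get(key);
        if (raw == null) {
            return null;
        }
        float value;
        if (raw instanceof Number) {
            value = ((Number) raw).floatValue();
        } else {
            try {
                value = Float.parseFloat(raw.toString().trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                    "[" + key + "] must be a number, got [" + raw + "]", e);
            }
        }
        if (Float.isNaN(value) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("[" + key + "] must be a finite number");
        }
        return value;
    }
}
```

Keeping unset values as `null` rather than hard-coding defaults means existing pipelines would behave exactly as they do now unless a user opts in.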