[Backport 2.x] Upgrading ingest-attachment dependencies #3279

Merged: 1 commit, May 10, 2022

opensearch-trigger-bot[bot]
Contributor

Backport fc0f446 from #3111

* Upgrading Tika from 1.24.1 to 2.1.0 and bumping xmlbeans version

This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser. Also, LanguageIdentifier has been deprecated and replaced by LanguageDetector.

This change also bumps the xmlbeans version from 3.0.1 to 3.1.0.

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrade Tika libraries from 2.1.0 to 2.2.0

This also requires an update of Apache Commons-IO from 2.7 to 2.11.0.

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrade Tika libraries from 2.2.0 to 2.2.1

Also update PDFBox to 2.0.25 as per Tika release notes

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgraded Tika and xmlbeans libraries

Tika libraries have been upgraded from 2.2.1 to 2.3.0. xmlbeans is now a subproject of POI, so POI was upgraded from 4.1.2 to 5.2.2. With POI 5.x the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed. Finally, xmlbeans has been updated from 3.1.0 to 5.0.2

Signed-off-by: Kartik Ganesh <[email protected]>

* (In progress) Added tika-langdetect

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrading tika libraries to 2.4.0

Signed-off-by: Kartik Ganesh <[email protected]>

* Switched from tika-langdetect to tika-langdetect-optimaize

To fix the license check, the mapping regex was expanded to tika-.*
This now means the tika-core LICENSE and NOTICE files are no longer needed.

Signed-off-by: Kartik Ganesh <[email protected]>
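
A minimal sketch of what expanding the mapping regex to `tika-.*` implies, using only `java.util.regex`. The artifact names are taken from this PR; the class and the helper `coveredBySharedMapping` are hypothetical and are not part of the OpenSearch build, which performs this check in its own license tooling.

```java
import java.util.List;
import java.util.regex.Pattern;

class LicenseMappingSketch {
    // The regex named in the commit message above.
    static final Pattern TIKA_MAPPING = Pattern.compile("tika-.*");

    // Hypothetical helper: true when an artifact name is matched by the
    // shared Tika mapping, so it would not need its own LICENSE/NOTICE pair.
    static boolean coveredBySharedMapping(String artifact) {
        return TIKA_MAPPING.matcher(artifact).matches();
    }

    public static void main(String[] args) {
        List<String> artifacts = List.of(
            "tika-core",
            "tika-parsers-standard-package",
            "tika-langdetect-optimaize",
            "xmlbeans"
        );
        for (String artifact : artifacts) {
            System.out.println(artifact + " covered: " + coveredBySharedMapping(artifact));
        }
    }
}
```

Under this reading, `tika-core` is matched by the same pattern as the other Tika artifacts, which is consistent with the commit's note that separate tika-core LICENSE and NOTICE files are no longer needed.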

* (Work in progress) Switching AttachmentProcessor to use OptimaizeLangDetector

This is a concrete implementation of LanguageDetector. Using this requires bringing in the optimaize dependency.

Signed-off-by: Kartik Ganesh <[email protected]>

* Manually added LICENSE and NOTICE files for Optimaize language-detector

Signed-off-by: Kartik Ganesh <[email protected]>

* Move Optimaize dependency to runtimeOnly

Also bring in transitive Guava dependency. This requires manual addition of LICENSE and NOTICE files as with other plugins.

Signed-off-by: Kartik Ganesh <[email protected]>

* Fix Optimaize langDetector to load models first before detecting

Signed-off-by: Kartik Ganesh <[email protected]>
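
The ordering fix in the commit above (load models, then detect) can be illustrated with a toy stand-in. Everything here is hypothetical: `ToyDetector` is not Tika's `LanguageDetector` or `OptimaizeLangDetector`, its word table is invented, and the exception it throws when models are missing is a choice made for this sketch, since the PR does not state how the real detector behaves in that case.

```java
import java.util.Locale;
import java.util.Map;

class LoadBeforeDetectSketch {
    // Hypothetical stand-in for a detector whose models must be loaded first.
    static final class ToyDetector {
        private Map<String, String> models; // null until loadModels() runs

        ToyDetector loadModels() {
            // Invented two-word "model": token -> language code.
            models = Map.of("the", "en", "der", "de");
            return this;
        }

        String detect(String text) {
            if (models == null) {
                // Sketch-only failure mode, chosen to make the ordering bug visible.
                throw new IllegalStateException("models not loaded");
            }
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                String lang = models.get(token);
                if (lang != null) {
                    return lang;
                }
            }
            return "unknown";
        }
    }

    // Load once, then reuse the same instance for every detection call.
    static final ToyDetector DETECTOR = new ToyDetector().loadModels();

    public static void main(String[] args) {
        System.out.println(DETECTOR.detect("The quick fox"));
    }
}
```

Loading once into a shared instance keeps the model-loading cost out of the per-document path; whether AttachmentProcessor does exactly this is not shown in this PR.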

* Fallback logic, and test updates

Following the Tika library upgrade, some fallback logic is necessary:
1. "Author" is deprecated for MSOffice document parsing. It is recommended to use CREATOR from Tika Core Properties instead.
2. EPUB parsing no longer automatically extracts keywords. The convention to fall back to SUBJECT is now manually implemented in AttachmentProcessor

Finally, unit tests have been updated to account for non-deterministic language results across library upgrades.

Signed-off-by: Kartik Ganesh <[email protected]>
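
A rough sketch of the two fallbacks listed in the commit above, using a plain `Map` in place of Tika's `Metadata` class so that it runs without any dependencies. The key strings are illustrative stand-ins for the Tika properties, the lookup order for the author field (legacy "Author" first, then CREATOR) is an assumption, and `firstNonEmpty` is a hypothetical helper rather than code from AttachmentProcessor.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataFallbackSketch {
    // Illustrative stand-ins for the metadata property names.
    static final String AUTHOR = "Author";
    static final String CREATOR = "dc:creator";
    static final String KEYWORDS = "meta:keyword";
    static final String SUBJECT = "dc:subject";

    // Hypothetical helper: first non-empty value among the keys, or null.
    static String firstNonEmpty(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    // Assumed order: deprecated "Author" first, then CREATOR.
    static String author(Map<String, String> metadata) {
        return firstNonEmpty(metadata, AUTHOR, CREATOR);
    }

    // Keywords first, then SUBJECT, per the EPUB convention in the commit.
    static String keywords(Map<String, String> metadata) {
        return firstNonEmpty(metadata, KEYWORDS, SUBJECT);
    }

    public static void main(String[] args) {
        // Simulates a document that exposes only CREATOR and SUBJECT.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CREATOR, "Jane Doe");
        metadata.put(SUBJECT, "search, ingest");
        System.out.println("author: " + author(metadata));
        System.out.println("keywords: " + keywords(metadata));
    }
}
```

Routing both fields through one helper keeps the fallback order in a single place, which makes it easier to adjust if a later Tika release changes which properties a parser populates.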

* Drop Guava version from 31.1 to 18.0

This is the version that Optimaize 0.6 depends on, and it allows for a smaller ignoreViolations list.

Signed-off-by: Kartik Ganesh <[email protected]>

* Fix ingest-attachment integration test to assert correct language

Signed-off-by: Kartik Ganesh <[email protected]>
(cherry picked from commit fc0f446)
@opensearch-ci-bot
Collaborator

✅   Gradle Check success 1e57c71
Log 5215

Reports 5215

@kartg kartg merged commit 84fc77e into 2.x May 10, 2022
@github-actions github-actions bot deleted the backport/backport-3111-to-2.x branch May 10, 2022 22:04