Upgrading ingest-attachment dependencies #3111

Merged
merged 14 commits into from
May 4, 2022

Conversation

kartg
Member

@kartg kartg commented Apr 29, 2022

Signed-off-by: Kartik Ganesh [email protected]

Description

Multiple dependencies under ingest-attachment have been upgraded:

  • Tika libraries upgraded from 1.24.1 to 2.4.0
    • The major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser.
    • This upgrade also required an update of Apache Commons-IO from 2.7 to 2.11.0, and PDFBox to 2.0.25 as per Tika release notes
    • Also, LanguageIdentifier has been deprecated. This must be replaced by a concrete implementation of LanguageDetector. Tika publishes an implementation based on Optimaize via tika-langdetect-optimaize
      • This in turn brings in dependencies on Optimaize's language-detector and Google Guava (set to version 18 since that is what Optimaize uses, and to minimize the list of ignored violations)
      • Language-detector and Guava do not supply LICENSE and NOTICE files in the right format, so these have been manually added
  • xmlbeans libraries updated from 3.0.1 to 5.0.2
    • xmlbeans is now a subproject of Apache POI, so the POI libraries were upgraded from 4.1.2 to 5.2.2
    • With POI 5.x the ooxml-schemas library has been moved to ooxml-lite / ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed.

Alongside these version upgrades, code changes have been made to use the updated dependencies:

  • OptimaizeLangDetector is now used in place of the deprecated LanguageIdentifier
  • The new library versions have removed processing of certain fields so fallback logic has been added to AttachmentProcessor
  • Attachment Processor unit tests have been updated to accommodate non-deterministic results across library upgrades
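
The fallback logic mentioned in this list could look roughly like the sketch below. It is not the code from AttachmentProcessor: a plain `Map` stands in for Tika's metadata object, and the class name, the helper names, and the key strings are all assumptions made for illustration. The pattern is simply "try the legacy field first, then fall back to the newer one".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for Tika's metadata container.
final class MetadataFallbackSketch {
    // Illustrative key names (assumptions, not verified against Tika constants).
    static final String AUTHOR = "Author";
    static final String CREATOR = "dc:creator";
    static final String KEYWORDS = "Keywords";
    static final String SUBJECT = "dc:subject";

    /** Returns the first non-blank value found under the given keys, or null if none is set. */
    static String firstNonBlank(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value;
            }
        }
        return null;
    }

    /** Author lookup: legacy field first, then the creator field. */
    static String author(Map<String, String> metadata) {
        return firstNonBlank(metadata, AUTHOR, CREATOR);
    }

    /** Keywords lookup: keywords field first, then the subject field. */
    static String keywords(Map<String, String> metadata) {
        return firstNonBlank(metadata, KEYWORDS, SUBJECT);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CREATOR, "kartg");
        metadata.put(SUBJECT, "search");
        // Neither legacy field is set, so both lookups use the fallback keys.
        System.out.println(author(metadata));
        System.out.println(keywords(metadata));
    }
}
```

Keeping the lookup order in one helper means a further field rename in a later library upgrade would only need a change to the key list.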

Issues Resolved

Once this is merged, the dependabot PR #2138 can be closed

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
Log 4870

Reports 4870

@kartg
Member Author

kartg commented Apr 29, 2022

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
Log 4873

Reports 4873

@peterzhuamazon
Member

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
Log 4874

Reports 4874

Member

@dblock dblock left a comment

LGTM assuming you can get it to green

@dblock
Member

dblock commented May 2, 2022

Test failures look legit.


```
org.opensearch.ingest.attachment.AttachmentProcessorTests > testEnglishTextDocument FAILED
    java.lang.IllegalStateException: No language detectors available
        at __randomizedtesting.SeedInfo.seed([346CD47C4FE8213C:CE265966D8557F82]:0)
        at org.apache.tika.language.detect.LanguageDetector.getDefaultLanguageDetector(LanguageDetector.java:67)
        at org.opensearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:138)
        at org.opensearch.ingest.attachment.AttachmentProcessorTests.parseDocument(AttachmentProcessorTests.java:333)
        at org.opensearch.ingest.attachment.AttachmentProcessorTests.parseDocument(AttachmentProcessorTests.java:323)
        at org.opensearch.ingest.attachment.AttachmentProcessorTests.testEnglishTextDocument(AttachmentProcessorTests.java:85)
```

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure abe7e74b3b77c64fcdc3d9151cc6e616b03e448f
Log 4913

Reports 4913

@kartg kartg marked this pull request as draft May 2, 2022 23:00
@kartg
Member Author

kartg commented May 2, 2022

Test failures look legit.

Yup, need to figure out how to configure the language detectors. Flipping to draft PR.
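
For context on the `No language detectors available` failure: it is the kind of error a discovery-based lookup raises when no implementation is present on the classpath. The sketch below reproduces that behaviour by analogy, using only the JDK's `java.util.ServiceLoader` and a hypothetical `Detector` interface. It is not Tika's code, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Analogy only: a hypothetical Detector interface looked up through the JDK ServiceLoader.
final class DetectorDiscoverySketch {
    interface Detector {
        String detect(String text);
    }

    /** Discovery-based lookup: fails when no provider is registered on the classpath. */
    static Detector defaultDetector() {
        List<Detector> found = new ArrayList<>();
        for (Detector detector : ServiceLoader.load(Detector.class)) {
            found.add(detector);
        }
        if (found.isEmpty()) {
            throw new IllegalStateException("No language detectors available");
        }
        return found.get(0);
    }

    /** Explicit construction: does not depend on what discovery finds. */
    static Detector explicitDetector() {
        // Dummy implementation that always answers "en"; a real detector would inspect the text.
        return text -> "en";
    }

    public static void main(String[] args) {
        try {
            defaultDetector();
        } catch (IllegalStateException e) {
            System.out.println("Discovery failed: " + e.getMessage());
        }
        System.out.println(explicitDetector().detect("Test opensearch"));
    }
}
```

Instantiating a concrete implementation directly, as the later commits in this PR do with OptimaizeLangDetector, avoids relying on discovery.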

kartg added 5 commits May 3, 2022 10:33
This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser. Also, LanguageIdentifier has been deprecated and replaced by LanguageDetector.

This change includes a bump in xmlbeans version from 3.0.1 to 3.1.0

Signed-off-by: Kartik Ganesh <[email protected]>
This also requires an update of Apache Commons-IO from 2.7 to 2.11.0

Signed-off-by: Kartik Ganesh <[email protected]>
Also update PDFBox to 2.0.25 as per Tika release notes

Signed-off-by: Kartik Ganesh <[email protected]>
Tika libraries have been upgraded from 2.2.1 to 2.3.0. xmlbeans is now a subproject of POI, so POI was upgraded from 4.1.2 to 5.2.2. With POI 5.x the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed. Finally, xmlbeans has been updated from 3.1.0 to 5.0.2

Signed-off-by: Kartik Ganesh <[email protected]>
@kartg kartg force-pushed the xmlBeansUpdate branch from abe7e74 to d992f12 Compare May 3, 2022 17:52
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure d992f12
Log 4950

Reports 4950

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure b2fc07a
Log 4951

Reports 4951

kartg added 3 commits May 3, 2022 12:08
To fix the license check, the mapping regex was expanded to tika-.*
This now means the tika-core LICENSE and NOTICE files are no longer needed.

Signed-off-by: Kartik Ganesh <[email protected]>
…Detector

This is a concrete implementation of LanguageDetector. Using this requires bringing in the optimaize dependency.

Signed-off-by: Kartik Ganesh <[email protected]>
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 777f012
Log 4958

Reports 4958

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure d35aede
Log 4960

Reports 4960

kartg added 3 commits May 3, 2022 15:13
Also bring in transitive Guava dependency. This requires manual addition of LICENSE and NOTICE files as with other plugins.

Signed-off-by: Kartik Ganesh <[email protected]>
Following the Tika library upgrade, some fallback logic is necessary:
1. "Author" is deprecated for MSOffice document parsing. It is recommended to use CREATOR from Tika Core Properties instead.
2. EPUB parsing no longer automatically extracts keywords. The convention to fall back to SUBJECT is now manually implemented in AttachmentProcessor

Finally, unit tests have been upgraded to account for non-deterministic language results across library upgrades.

Signed-off-by: Kartik Ganesh <[email protected]>
This is the version that Optimaize 0.6 depends on, and it allows for a smaller ignoreViolations list

Signed-off-by: Kartik Ganesh <[email protected]>
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure eb74e94
Log 4971

Reports 4971

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 5765ed2
Log 4976

Reports 4976

@kartg
Member Author

kartg commented May 4, 2022

More failures around non-determinism of language recognition - this time in IngestAttachmentClientYamlTestSuiteIT. Repro commands:

./gradlew ':plugins:ingest-attachment:yamlRestTest' --tests "org.opensearch.ingest.attachment.IngestAttachmentClientYamlTestSuiteIT" -Dtests.method="test {yaml=ingest_attachment/30_files_supported/Test ingest attachment processor with .doc file}" -Dtests.seed=1E07D222039A022B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-ES -Dtests.timezone=CNT -Druntime.java=17

./gradlew ':plugins:ingest-attachment:yamlRestTest' --tests "org.opensearch.ingest.attachment.IngestAttachmentClientYamlTestSuiteIT" -Dtests.method="test {yaml=ingest_attachment/30_files_supported/Test ingest attachment processor with .docx file}" -Dtests.seed=1E07D222039A022B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-ES -Dtests.timezone=CNT -Druntime.java=17

Looks like the failures are limited to doc/docx file processing, and the assertion failure is the same in all cases:

expected String [pl] but was String [en]

@kartg
Member Author

kartg commented May 4, 2022

This is an error with the test case, and the new output is more accurate.

The two files are represented as Base64 encoded strings in the yml test file. Decoding them and opening as Word documents shows the contents to be:

Test opensearch

Previously, language detection incorrectly identified this as Polish (pl), likely because the phrase is so short. With the upgraded libraries, the result has changed to en, which is accurate.
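
The decoding step described in this comment can be sketched with the JDK's `java.util.Base64`. The class and method names are hypothetical. The real fixture strings from the yml file are not reproduced here, so the example encodes the plain text `Test opensearch` as a placeholder payload, then decodes it and writes it to a temporary file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for inspecting a Base64-encoded test fixture.
final class FixtureDecodeSketch {
    /** Decodes a Base64 string into the raw bytes of the document. */
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    /** Writes the decoded bytes to a temporary file so it can be opened in another tool. */
    static Path writeToTempFile(String base64, String suffix) throws IOException {
        Path out = Files.createTempFile("fixture-", suffix);
        Files.write(out, decode(base64));
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder payload: plain text, not the actual .doc/.docx fixture.
        String placeholder = Base64.getEncoder()
            .encodeToString("Test opensearch".getBytes(StandardCharsets.UTF_8));
        Path file = writeToTempFile(placeholder, ".txt");
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

With a real fixture, the suffix would be `.doc` or `.docx` so the file opens in a word processor.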

@opensearch-ci-bot
Collaborator

✅   Gradle Check success 59f6b92
Log 4980

Reports 4980

@kartg kartg marked this pull request as ready for review May 4, 2022 01:26
@kartg kartg merged commit fc0f446 into opensearch-project:main May 4, 2022
// TODO: stop using LanguageIdentifier...
LanguageIdentifier identifier = new LanguageIdentifier(parsedContent);
String language = identifier.getLanguage();
OptimaizeLangDetector langDetector = new OptimaizeLangDetector();
Collaborator

It looks like initialization could be an expensive operation, should OptimaizeLangDetector + loadModels() be done only once during AttachmentProcessor initialization?
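
One way to do what this comment suggests is the initialization-on-demand holder idiom, sketched below with only the JDK. The `Detector` interface and `loadExpensiveDetector()` are hypothetical stand-ins for the real detector and its model loading, and the counter exists only to show that loading happens once.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "load once, reuse": not the code that was merged.
final class SharedDetectorSketch {
    interface Detector {
        String detect(String text);
    }

    // Counts how many times the expensive load ran (for demonstration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Stand-in for constructing a detector and loading its models. */
    static Detector loadExpensiveDetector() {
        LOADS.incrementAndGet();
        // Dummy logic; a real detector would analyse the text.
        return text -> text.isEmpty() ? "unknown" : "en";
    }

    // The JVM initializes this nested class lazily and exactly once, on first access.
    private static final class Holder {
        static final Detector INSTANCE = loadExpensiveDetector();
    }

    /** Every call reuses the single shared instance. */
    static String detectLanguage(String text) {
        return Holder.INSTANCE.detect(text);
    }

    public static void main(String[] args) {
        detectLanguage("Test opensearch");
        detectLanguage("another document");
        System.out.println("loads = " + LOADS.get());
    }
}
```

Whether a single shared instance is safe depends on the detector: if it keeps per-call state, calls would need synchronization or one instance per thread. That is not established in this thread and would need checking against the library.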

@kartg kartg deleted the xmlBeansUpdate branch May 10, 2022 21:07
@owaiskazi19
Member

@kartg should we backport this to 2.x?

@kartg
Member Author

kartg commented May 10, 2022

@owaiskazi19 Yes, looks like it is blocking this backport.

I'll dig into @reta 's comment above and make sure any fixes for that get backported too.

@kartg kartg added the backport 2.x Backport to 2.x branch label May 10, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 10, 2022
* Upgrading Tika from 1.24.1 to 2.1.0 and bumping xmlbeans version

This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser. Also, LanguageIdentifier has been deprecated and replaced by LanguageDetector.

This change includes a bump in xmlbeans version from 3.0.1 to 3.1.0

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrade Tika libraries from 2.1.0 to 2.2.0

This also requires an update of Apache Commons-IO from 2.7 to 2.11.0

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrade Tika libraries from 2.2.0 to 2.2.1

Also update PDFBox to 2.0.25 as per Tika release notes

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgraded Tika and xmlbeans libraries

Tika libraries have been upgraded from 2.2.1 to 2.3.0. xmlbeans is now a subproject of POI, so POI was upgraded from 4.1.2 to 5.2.2. With POI 5.x the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed. Finally, xmlbeans has been updated from 3.1.0 to 5.0.2

Signed-off-by: Kartik Ganesh <[email protected]>

* (In progress) Added tika-langdetect

Signed-off-by: Kartik Ganesh <[email protected]>

* Upgrading tika libraries to 2.4.0

Signed-off-by: Kartik Ganesh <[email protected]>

* Switched from tika-langdetect to tika-langdetect-optimaize

To fix the license check, the mapping regex was expanded to tika-.*
This now means the tika-core LICENSE and NOTICE files are no longer needed.

Signed-off-by: Kartik Ganesh <[email protected]>

* (Work in progress) Switching AttachmentProcessor to use OptimaizeLangDetector

This is a concrete implementation of LanguageDetector. Using this requires bringing in the optimaize dependency.

Signed-off-by: Kartik Ganesh <[email protected]>

* Manually added LICENSE and NOTICE files for Optimaize language-detector

Signed-off-by: Kartik Ganesh <[email protected]>

* Move Optimaize dependency to runtimeOnly

Also bring in transitive Guava dependency. This requires manual addition of LICENSE and NOTICE files as with other plugins.

Signed-off-by: Kartik Ganesh <[email protected]>

* Fix Optimaize langDetector to load models first before detecting

Signed-off-by: Kartik Ganesh <[email protected]>

* Fallback logic, and test updates

Following the Tika library upgrade, some fallback logic is necessary:
1. "Author" is deprecated for MSOffice document parsing. It is recommended to use CREATOR from Tika Core Properties instead.
2. EPUB parsing no longer automatically extracts keywords. The convention to fall back to SUBJECT is now manually implemented in AttachmentProcessor

Finally, unit tests have been upgraded to account for non-deterministic language results across library upgrades.

Signed-off-by: Kartik Ganesh <[email protected]>

* Drop Guava version from 31.1 to 18.0

This is the version that Optimaize 0.6 depends on, and it allows for a smaller ignoreViolations list

Signed-off-by: Kartik Ganesh <[email protected]>

* Fix ingest-attachment integration test to assert correct language

Signed-off-by: Kartik Ganesh <[email protected]>
(cherry picked from commit fc0f446)
kartg added a commit that referenced this pull request May 10, 2022
(cherry picked from commit fc0f446)

Co-authored-by: Kartik Ganesh <[email protected]>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 5, 2022
(cherry picked from commit fc0f446)
@mch2 mch2 added the backport 1.3 Backport to 1.3 branch label Jul 7, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 7, 2022
(cherry picked from commit fc0f446)
mch2 pushed a commit that referenced this pull request Jul 7, 2022
(cherry picked from commit fc0f446)

Co-authored-by: Kartik Ganesh <[email protected]>
Labels
backport 1.x backport 1.3 Backport to 1.3 branch backport 2.x Backport to 2.x branch