Upgrading ingest-attachment dependencies #3111
❌ Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
start gradle check
❌ Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
start gradle check
❌ Gradle Check failure c654d875d2a7f5990181ac7bc462d09ccd27c228
LGTM assuming you can get it to green
Test failures look legit.
❌ Gradle Check failure abe7e74b3b77c64fcdc3d9151cc6e616b03e448f
Yup, need to figure out how to configure the language detectors. Flipping to draft PR.
This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser. Also, LanguageIdentifier has been deprecated and replaced by LanguageDetector. This change includes a bump in xmlbeans version from 3.0.1 to 3.1.0. Signed-off-by: Kartik Ganesh <[email protected]>
This also requires an update of Apache Commons-IO from 2.7 to 2.11.0. Signed-off-by: Kartik Ganesh <[email protected]>
Also update PDFBox to 2.0.25 as per Tika release notes. Signed-off-by: Kartik Ganesh <[email protected]>
Tika libraries have been upgraded from 2.2.1 to 2.3.0. xmlbeans is now a subproject of POI, so POI was upgraded from 4.1.2 to 5.2.2. With POI 5.x the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed. Finally, xmlbeans has been updated from 3.1.0 to 5.0.2. Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
To fix the license check, the mapping regex was expanded to tika-.*, which means the tika-core LICENSE and NOTICE files are no longer needed. Signed-off-by: Kartik Ganesh <[email protected]>
…Detector This is a concrete implementation of LanguageDetector. Using this requires bringing in the optimaize dependency. Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
Also bring in transitive Guava dependency. This requires manual addition of LICENSE and NOTICE files as with other plugins. Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
Following the Tika library upgrade, some fallback logic is necessary: 1. "Author" is deprecated for MSOffice document parsing. It is recommended to use CREATOR from Tika Core Properties instead. 2. EPUB parsing no longer automatically extracts keywords. The convention to fall back to SUBJECT is now manually implemented in AttachmentProcessor. Finally, unit tests have been upgraded to account for non-deterministic language results across library upgrades. Signed-off-by: Kartik Ganesh <[email protected]>
This is the version that Optimaize 0.6 depends on, and it allows for a smaller ignoreViolations list. Signed-off-by: Kartik Ganesh <[email protected]>
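The fallback logic described in the commits above could look roughly like the sketch below. It is an illustration only, not the actual AttachmentProcessor code: a plain Map stands in for Tika's Metadata object, the keys ("author", "creator", "keywords", "subject") are hypothetical placeholders for the real Tika properties, and the order of preference (legacy key first, then the fallback) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Map stands in for Tika's Metadata and the keys are hypothetical.
final class MetadataFallback {
    private MetadataFallback() {}

    // Returns the first value that is present and not blank, or null if there is none.
    static String firstNonBlank(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value;
            }
        }
        return null;
    }

    // Assumed order: try the legacy "author" key, then fall back to "creator".
    static String author(Map<String, String> metadata) {
        return firstNonBlank(metadata, "author", "creator");
    }

    // Assumed order: try "keywords", then fall back to "subject".
    static String keywords(Map<String, String> metadata) {
        return firstNonBlank(metadata, "keywords", "subject");
    }

    public static void main(String[] args) {
        Map<String, String> epub = new HashMap<>();
        epub.put("subject", "search, ingestion");
        // No "keywords" entry exists, so the "subject" value is used instead.
        System.out.println(keywords(epub));
    }
}
```

Keeping the lookup order in one helper means a further fallback is a one-argument change rather than another nested conditional.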
More failures around non-determinism of language recognition - this time in
Looks like the failures are limited to doc/docx file processing, and the assertion failure is the same in all cases:
Signed-off-by: Kartik Ganesh <[email protected]>
This is an error with the test case, and the new output is more accurate. The two files are represented as Base64 encoded strings in the yml test file. Decoding them and opening as Word documents shows the contents to be:
Previously, language detection was incorrectly identifying this as Polish (
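For anyone who wants to repeat that check, a Base64 fixture can be decoded and its leading bytes inspected before opening it in a word processor. The sketch below is an assumption-laden helper, not part of this PR: the class name FixtureDecoder and the sample payload are made up. It relies only on two format facts: .docx files are ZIP archives (they start with "PK"), and legacy .doc files use the OLE2 container (they start with D0 CF 11 E0).

```java
import java.util.Base64;

// Hypothetical helper for inspecting Base64-encoded test fixtures.
final class FixtureDecoder {
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] OLE2_MAGIC = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    private FixtureDecoder() {}

    // The MIME decoder ignores line breaks, which multi-line YAML strings may contain.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    // Classifies the container format from the first four bytes.
    static String guessContainer(byte[] data) {
        if (startsWith(data, ZIP_MAGIC)) {
            return "zip";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up payload: only the four ZIP magic bytes, encoded for the demo.
        String sample = Base64.getEncoder().encodeToString(ZIP_MAGIC);
        System.out.println(guessContainer(decode(sample)));
    }
}
```

Writing the decoded bytes to disk with java.nio.file.Files.write then lets the file be opened directly.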
// TODO: stop using LanguageIdentifier...
LanguageIdentifier identifier = new LanguageIdentifier(parsedContent);
String language = identifier.getLanguage();
OptimaizeLangDetector langDetector = new OptimaizeLangDetector();
It looks like initialization could be an expensive operation. Should OptimaizeLangDetector + loadModels() be done only once during AttachmentProcessor initialization?
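One way to act on this suggestion, sketched under assumptions, is to keep a single detector whose expensive setup runs at most once and reuse it for every document. The Detector interface and DetectorHolder class below are hypothetical stand-ins so the example runs without Tika on the classpath; in the plugin, the supplier would be the place to construct OptimaizeLangDetector and call loadModels(). Whether the real detector is safe to share across threads is not established here and would need to be verified.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a language detector whose setup is expensive.
interface Detector {
    String detect(String text);
}

// Caches the detector so that the expensive factory runs at most once.
final class DetectorHolder {
    private final Supplier<Detector> factory;
    private volatile Detector instance;

    DetectorHolder(Supplier<Detector> factory) {
        this.factory = factory;
    }

    // Double-checked locking: only the first caller pays the setup cost.
    Detector get() {
        Detector local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        DetectorHolder holder = new DetectorHolder(() -> {
            loads.incrementAndGet(); // stands in for the costly model loading
            return text -> "en";     // dummy detector that always answers "en"
        });
        holder.get().detect("first document");
        holder.get().detect("second document");
        System.out.println("model loads: " + loads.get());
    }
}
```

Eager construction when the processor is created would also work; the lazy holder only defers the cost until the first document actually needs language detection.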
@kartg should we backport this to 2.x?
@owaiskazi19 Yes, looks like it is blocking this backport. I'll dig into @reta's comment above and make sure any fixes for that get backported too.
* Upgrading Tika from 1.24.1 to 2.1.0 and bumping xmlbeans version
  This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser. Also, LanguageIdentifier has been deprecated and replaced by LanguageDetector. This change includes a bump in xmlbeans version from 3.0.1 to 3.1.0.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Upgrade Tika libraries from 2.1.0 to 2.2.0
  This also requires an update of Apache Commons-IO from 2.7 to 2.11.0.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Upgrade Tika libraries from 2.2.0 to 2.2.1
  Also update PDFBox to 2.0.25 as per Tika release notes.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Upgraded Tika and xmlbeans libraries
  Tika libraries have been upgraded from 2.2.1 to 2.3.0. xmlbeans is now a subproject of POI, so POI was upgraded from 4.1.2 to 5.2.2. With POI 5.x the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed. Finally, xmlbeans has been updated from 3.1.0 to 5.0.2.
  Signed-off-by: Kartik Ganesh <[email protected]>
* (In progress) Added tika-langdetect
  Signed-off-by: Kartik Ganesh <[email protected]>
* Upgrading tika libraries to 2.4.0
  Signed-off-by: Kartik Ganesh <[email protected]>
* Switched from tika-langdetect to tika-langdetect-optimaize
  To fix the license check, the mapping regex was expanded to tika-.*, which means the tika-core LICENSE and NOTICE files are no longer needed.
  Signed-off-by: Kartik Ganesh <[email protected]>
* (Work in progress) Switching AttachmentProcessor to use OptimaizeLangDetector
  This is a concrete implementation of LanguageDetector. Using this requires bringing in the optimaize dependency.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Manually added LICENSE and NOTICE files for Optimaize language-detector
  Signed-off-by: Kartik Ganesh <[email protected]>
* Move Optimaize dependency to runtimeOnly
  Also bring in transitive Guava dependency. This requires manual addition of LICENSE and NOTICE files as with other plugins.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Fix Optimaize langDetector to load models first before detecting
  Signed-off-by: Kartik Ganesh <[email protected]>
* Fallback logic, and test updates
  Following the Tika library upgrade, some fallback logic is necessary: 1. "Author" is deprecated for MSOffice document parsing. It is recommended to use CREATOR from Tika Core Properties instead. 2. EPUB parsing no longer automatically extracts keywords. The convention to fall back to SUBJECT is now manually implemented in AttachmentProcessor. Finally, unit tests have been upgraded to account for non-deterministic language results across library upgrades.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Drop Guava version from 31.1 to 18.0
  This is the version that Optimaize 0.6 depends on, and it allows for a smaller ignoreViolations list.
  Signed-off-by: Kartik Ganesh <[email protected]>
* Fix ingest-attachment integration test to assert correct language
  Signed-off-by: Kartik Ganesh <[email protected]>
(cherry picked from commit fc0f446)
Signed-off-by: Kartik Ganesh [email protected]
Description
Multiple dependencies under ingest-attachment have been upgraded:
* Tika has been upgraded from 1.24.1 to 2.4.0. This major version upgrade requires an explicit dependency on tika-parsers-standard-package to import the parser implementations, and an update to the namespace of RTFParser.
* LanguageIdentifier has been deprecated. This must be replaced by a concrete implementation of LanguageDetector. Tika publishes an implementation based on Optimaize via tika-langdetect-optimaize.
* Using this implementation requires the Optimaize language-detector and Google Guava (set to version 18 since that is what Optimaize uses, and to minimize the list of ignored violations).
* With POI 5.x, the ooxml-schemas library has been moved to ooxml-lite/ooxml-full. Since ooxml-schemas no longer exists, the LICENSE and NOTICE files in the licenses/ directory have been removed.
Alongside these version upgrades, code changes have been made to use the updated dependencies:
* OptimaizeLangDetector is now used in place of the deprecated LanguageIdentifier.
* Fallback logic for the deprecated "Author" property and for EPUB keywords has been added to AttachmentProcessor.
Issues Resolved
Once this is merged, the dependabot PR #2138 can be closed
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.