FITS runs Tika, which runs Tesseract, which is very slow #25
Comments
@jcoyne - thanks for the heads-up.
Does anyone know what you lose by not running Tika, and whether there are any alternatives?
Ah, Tesseract is OCR, I think! I wonder if you can configure Tika-via-FITS to skip OCR but still return the other metadata it extracts.
FWIW, Hydra::Derivatives does rely on Tika for full-text extraction, but it does not rely on FITS' usage of Tika. It would be interesting to test a FITS config with Tika disabled in both this gem and Hyrax to see if it breaks any tests. If not, perhaps we should disable FITS' usage of Tika by default (folks are always free to tweak the FITS config if they need it).
I'd still want to get the other metadata extraction/validation that I think Tika is doing. Where does the FITS config actually live in a Hyrax app or another app using hydra-derivatives?
From what I read at https://wiki.apache.org/tika/TikaOCR, it doesn't sound like the OCR step will run if Tesseract is not installed. If it is installed, then disabling Tika for only certain file types might be a better option. If FITS is only doing a limited job for hydra-file_characterization, we could look to publish a fits.xml configured so that only certain tools run, and use the exclude options in fits.xml to make it run more efficiently by running only one tool on a given file type, like EXIF for all images and so on.
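A rough sketch of the per-file-type idea, in Python. It assumes the tools in fits.xml are declared as `<tool class="..." exclude-exts="..."/>` elements with no XML namespace and that `exclude-exts` takes a comma-separated list of extensions; check this against the fits.xml shipped with your FITS version. The sample config and the `exclude_extensions` helper are made up for illustration, not copied from FITS.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a real fits.xml (assumption:
# real files follow this general shape, without an XML namespace).
SAMPLE_CONFIG = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool"/>
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" exclude-exts="jpg"/>
  </tools>
</fits_configuration>"""


def exclude_extensions(xml_text, tool_suffix, extensions):
    """Add extensions to the exclude-exts list of every matching tool.

    A tool matches when its class attribute ends with tool_suffix.
    Extensions already present are not duplicated.
    """
    root = ET.fromstring(xml_text)
    for tool in root.iter("tool"):
        if tool.get("class", "").endswith(tool_suffix):
            current = [e for e in tool.get("exclude-exts", "").split(",") if e]
            merged = current + [e for e in extensions if e not in current]
            tool.set("exclude-exts", ",".join(merged))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Keep Tika for other formats, but skip it for TIFF images.
    print(exclude_extensions(SAMPLE_CONFIG, "TikaTool", ["tif", "tiff"]))
```

Note that `ET.fromstring` drops XML comments, so for a real fits.xml you may prefer to edit the attribute by hand and use something like this only to check the result.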
I don't know if it's possible to disable Tesseract only for some file types, but you can disable it globally. We do this in our Ansible scripts (see https://github.com/pulibrary/princeton_ansible/pull/2/files). In our experience, it dramatically sped up FITS/Tika times (Tesseract was the majority of FITS processing time).
Related to #18: we found that 75% or more of the time to run FITS on our 100MB TIFF files was spent running Tesseract (run by Tika). We disabled Tika by commenting out the TikaTool line in the /path/to/fits/xml/fits.xml configuration file, and saw dramatically faster FITS execution times (20 seconds per file instead of 90+). We updated our Ansible playbook to comment out the Tika line when we install FITS: ucsdlib/ansible-role-fits#2
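For anyone not using Ansible, here is a minimal Python sketch of the same edit. It assumes the Tika entry sits on a single line that starts with `<tool` and mentions `TikaTool`; the sample XML and the `comment_out_tool` helper are illustrative only and are not taken from the playbook linked above.

```python
# Hypothetical stand-in for the tools section of fits.xml (assumption:
# each tool is declared on one line, as a self-closing <tool .../> element).
SAMPLE_FITS_XML = """<tools>
  <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" />
  <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" />
</tools>
"""


def comment_out_tool(xml_text, tool_name="TikaTool"):
    """Wrap every active single-line <tool> entry naming tool_name in an XML comment.

    Lines that are already commented out start with "<!--" rather than
    "<tool", so running the function twice changes nothing further.
    """
    out = []
    for line in xml_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("<tool") and tool_name in stripped:
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            out.append(f"{indent}<!-- {stripped} -->{newline}")
        else:
            out.append(line)
    return "".join(out)


if __name__ == "__main__":
    print(comment_out_tool(SAMPLE_FITS_XML))
```

To apply it to a real install, read the fits.xml text from disk, pass it through `comment_out_tool`, and write the result back, keeping a backup of the original file first.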