RFC: Add initial support for traineddata files in compressed archive formats (don't merge) #911
Open questions:
|
Yes, there have been requests for more compact/compressed traineddata files. Another question:
|
|
Why not use a different compression library that is available on other operating systems as well as on older Ubuntu versions? |
On my Debian system I find these libraries: |
Zip format reduces
|
Please move discussion to tesseract-dev forum. This is significant change. |
What about lz4? btw, libarchive handles all formats. |
See this discussion in the forum. I added a link to GitHub there. |
It is also supported by current Linux distributions and would be interesting if compressed tar instead of zip is preferred. I added it to my previous post. |
libarchive has been supported since very early Ubuntu versions and is available in almost all other Linux distributions. Personally, I'm using libarchive in cppan. The code for working with any of the formats is very simple, see: Of course, this is pack/unpack archive code, but streaming code should be pretty similar and simple too. |
It does not compress very well (see result added to the list above). |
Actually I wanted to say lzma, which is .xz/.7z extensions. |
Rebased and added support for |
What libraries are currently in use in your PR? |
This is experimental code, as there is still no decision whether compressed archives should be supported at all, if yes with which format and which library. The current code uses |
As you can see here, the implementations for the two currently supported libraries are very similar. |
The latest code also supports |
As
The file i/o from disk did not play a role in this test because of the Linux file cache and the SSD of my computer.
|
Please also try the test with a different language, maybe the one with the largest traineddata size, to see whether file size has any impact on the relative speeds.
Thanks.
|
Test results with
|
So lzma compresses more slowly but better? Or does it also decompress more slowly? |
lzma created the xz files. 7zip and lzma gave the best compression ratios, but both also need some time for the decompression (which is relevant for Tesseract): they need about 1.9 s more time (but still are faster than bz2). Please note that the current code for all formats reads all parts of the |
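For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using only the Python standard library. It is not the benchmark used in this thread: the payload is a synthetic placeholder rather than a real traineddata file, and the `measure` helper is hypothetical, so the absolute numbers will differ from the results reported here. It only shows how compressed size and decompression time can be measured side by side.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, payload):
    """Return (name, compressed size in bytes, seconds needed to decompress)."""
    packed = compress(payload)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == payload  # the round trip must be lossless
    return name, len(packed), elapsed


# Hypothetical stand-in for the contents of a traineddata file.
payload = b"tesseract traineddata placeholder " * 20000

results = [
    measure("zlib (zip/gzip)", zlib.compress, zlib.decompress, payload),
    measure("bz2", bz2.compress, bz2.decompress, payload),
    measure("lzma (xz)", lzma.compress, lzma.decompress, payload),
]

for name, size, seconds in results:
    print(f"{name:16} {size:8d} bytes  {seconds * 1000:8.2f} ms to decompress")
```

Only the decompression step is timed, because that is the part that matters when Tesseract loads a model.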
@theraysmith wrote on 4/18/14
and on 4/20/14
@stweil Do all the methods you tested support @theraysmith Is there a particular reason for |
The current Tesseract code reads the whole |
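The comments above are cut off, but they suggest that the archive contents are read up front rather than on demand. Assuming that reading is right, a rough Python sketch of such a read-everything-at-once loader could look like this. It is not the code from this pull request: `load_components` is a hypothetical helper, and the member names are placeholders chosen for illustration.

```python
import io
import zipfile


def load_components(archive_bytes):
    """Read every member of a zip archive into a dict of name -> bytes."""
    components = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no data
            components[info.filename] = archive.read(info)
    return components


# Build a tiny in-memory archive with placeholder member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("eng.unicharset", b"placeholder unicharset")
    archive.writestr("eng.lstm", b"placeholder model" * 100)

components = load_components(buffer.getvalue())
print(sorted(components))
```

With this approach every member is decompressed even if it is never used, which is why the decompression speed of the chosen format shows up directly in the load time.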
@stweil @amitdo @egorpugin have you tested zstd compression? I have tested it, and it's very fast. Also, if you add a dictionary, the compression ratio should be even better; I think it's a game changer. ZSTD compression with a dictionary:
|
I have not tested it yet, but it looks like we get Zstandard support with libarchive. Pull request libarchive/libarchive#905 added Zstandard there. |
AFAIR, the intention was to use libraries that are already in use, i.e. not to add dependencies. |
With next libarchive release I'll add zstd dependency into it in cppan, so tesseract will get it automatically. |
Tesseract only needs to add a dependency on |
So if I understand it right, if we compress datafiles with Zstandard, users on all platforms will need to compile libarchive + Zstandard... |
That's correct. Therefore I would still distribute the datafiles in zip format, which hopefully has good support on all platforms. Users who need maximum performance could then repack the datafiles they need with a different compression format. |
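Repacking as described above could be done with any archive tool. As an illustration only, here is a small Python sketch that copies every member of a zip archive into an xz-compressed tar archive. The function name `repack_zip_as_tar_xz` and the member names are hypothetical, and whether a given Tesseract build can read the result depends on which archive library and formats it was compiled with.

```python
import io
import tarfile
import zipfile


def repack_zip_as_tar_xz(zip_bytes):
    """Copy every member of a zip archive into an xz-compressed tar archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as source, \
            tarfile.open(fileobj=out, mode="w:xz") as target:
        for info in source.infolist():
            if info.is_dir():
                continue  # only regular members are copied
            data = source.read(info)
            member = tarfile.TarInfo(name=info.filename)
            member.size = len(data)
            target.addfile(member, io.BytesIO(data))
    return out.getvalue()


# Build a small zip archive with placeholder member names, then repack it.
source_buffer = io.BytesIO()
with zipfile.ZipFile(source_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("eng.unicharset", b"placeholder unicharset")
    archive.writestr("eng.lstm", b"placeholder model " * 1000)

repacked = repack_zip_as_tar_xz(source_buffer.getvalue())

# Verify that the repacked archive contains the same members.
with tarfile.open(fileobj=io.BytesIO(repacked), mode="r:xz") as check:
    names = check.getnames()
print(names)
```

The same pattern works for other targets by changing the `mode` string, for example `"w:gz"` or `"w:bz2"`.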
A fast compressor/decompressor |
Milestone is set to 4.1.0. Is it time to merge it? There have not been a lot of changes here in the last months... |
Signed-off-by: Stefan Weil <[email protected]>
This requires libarchive-dev, libzip-dev or libminizip-dev. Up to now, little endian tesseract works with the new format. More work is needed for training tools and big endian support. Signed-off-by: Stefan Weil <[email protected]>
Pull request #2290 now includes the implementation with |
This requires libminizip-dev, so expect failures from CI.
Up to now, little endian tesseract works with the new zip format.
More work is needed for training tools and big endian support and also to maintain
compatibility with the current proprietary format.
Signed-off-by: Stefan Weil [email protected]