Implement multithreading in zimdump #69
Comments
@lidel Thank you for this quality ticket, I'm supportive. Will do my best to get this done in January. |
Thanks @kelson42!! Curious how things are evolving now that we're in the new year - is this still on your agenda this month? |
@momack2 I would like to, but we are a bit short on C++ resources currently. It has been postponed to February for the moment. If you can recommend someone, please tell us! |
I don't know of any C++ devs with bandwidth, but @jnthnvctr might be able to suggest other routes to get this work increased attention. We'd really love to update our distributed wikipedia mirror with snapshots more recent than 2017... ;) |
@momack2 It is just a "small" delay and we are already working to find someone. Maybe you can retweet https://twitter.com/KiwixOffline/status/1214826834417860609 |
I didn't know about I agree with this ticket, |
@mgautierfr Considering that this ticket will need to get the articles in the order they are in the files (to save cluster decompression and that this is as well needed by (at least) |
Most important part of the improvement will be achieved by implementing openzim/libzim#300 |
@lidel The zimdump speed has been improved a lot (around 15x), but without adding multithreading. The reason is that we need to revamp our libzim cache strategy to really benefit from it in zimdump. For the moment this is on hold, and I will move it out of the IPFS project as I believe the speed is acceptable now. |
@kelson42 I can confirm, it is now within the same order of magnitude as the rust library, and can be used for practical purposes.
Still slower, but that will change with multithreading. |
@lidel Thx for sharing the benchmark. |
@mgautierfr @veloman-yunkan @MiguelRocha AFAIK the libzim cache has been improved in the latest version, 6.2.0, so that zimcheck and zimdump can get the full potential of multithreading. Would this not be a good time to reconsider this ticket? |
@Kelson I can work on this |
@veloman-yunkan Good for me, but @MiguelRocha had started something which I believe is available at https://github.com/openzim/zim-tools/tree/speed-up-zimdump. @MiguelRocha Do you remember what the code status was here? |
@kelson42 Yes, at that time I created a thread pool to handle the decompression of the clusters in a multi-threaded way. Since then zimdump has changed quite a bit, so it's just a matter of rebasing the branch onto master and fixing potential conflicts. |
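As a rough illustration of the approach described above, here is a minimal sketch of spreading cluster decompression over several worker threads. It is not the code from the speed-up-zimdump branch and does not use the libzim API: `decompress_all`, the `Decompress` callback, and the use of plain strings for clusters are all hypothetical stand-ins. A fixed set of workers pulls cluster indices from a shared atomic counter, which is a simplified substitute for a full thread pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for real cluster decompression (not a libzim call).
using Decompress = std::function<std::string(const std::string&)>;

// Decompresses every cluster using `worker_count` threads.
// Results are returned in the same order as the input clusters.
// Assumption: `decompress` is thread-safe and does not throw.
std::vector<std::string> decompress_all(
    const std::vector<std::string>& clusters,
    const Decompress& decompress,
    unsigned worker_count = std::thread::hardware_concurrency()) {
    assert(static_cast<bool>(decompress));

    std::vector<std::string> results(clusters.size());
    std::atomic<std::size_t> next{0};

    // hardware_concurrency() may return 0 when the value is unknown.
    worker_count = std::max(1u, worker_count);

    auto worker = [&]() {
        for (;;) {
            // Each index is claimed by exactly one worker.
            const std::size_t i = next.fetch_add(1);
            if (i >= clusters.size()) return;
            // Workers write to distinct elements, so no lock is needed.
            results[i] = decompress(clusters[i]);
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(worker_count);
    for (unsigned t = 0; t < worker_count; ++t) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
    return results;
}
```

Keeping the results in input order matters if entries are later written out in the order they appear in the file; whether this actually helps zimdump depends on the libzim cache behaviour discussed in this thread.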
@MiguelRocha Thank you for the update |
@veloman-yunkan I’m unsure about the status here. Could you enlighten me please? |
@kelson42 I am going to do the zimcheck part over this weekend. |
@veloman-yunkan Any news on this front? |
@kelson42 A prototype implementation was ready, but then the |
@veloman-yunkan I don't think there is any plan for now to make big changes again. @mgautierfr has to implement openzim/libzim#397, but (besides a few other details) this is going to be the only thing to change before the release in mid-January. I think you can continue on this ticket. |
So, we need that now for the |
zimdump feels slower than it could be. Below are some notes from my tests and ideas on how to improve its performance.
Single thread? Lack of buffer in front of disk writes?
I have an SSD, but my disk I/O remains pretty slow (iotop shows disk writes at <400 K/s!).
The tool seems to be limited by the CPU: a single core is used and is constantly at 100%, while the remaining 7 cores stay unused. It looks like it is single-threaded and perhaps flushing after each write to disk?
Benchmarks
Unpacking wikipedia_en_top_mini_2019-09.zim (250M) took nearly 30 minutes:
This is very slow compared to the Rust-based multicore extract_zim from dignifiedquire/zim. It produces some errors and skips some files (the tool is not maintained anymore), but is able to extract most of it in under 10 seconds(!):
Things to try
Applying some/all optimizations from dignifiedquire/zim should make zimdump much, much faster:
(in Rust this is provided by BufWriter)
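For the buffering idea, a rough C++ counterpart to Rust's BufWriter could look like the sketch below. It is only an illustration, not zimdump code: `write_buffered` and its 1 MiB default buffer size are hypothetical. Note that `std::ofstream` is already buffered by default, so the main points are to avoid flushing after every write (for example via `std::endl`) and, optionally, to supply a larger buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper, not part of zimdump: writes all chunks to `path`
// through a large user-space buffer and flushes once, when the file closes.
bool write_buffered(const std::string& path,
                    const std::vector<std::string>& chunks,
                    std::size_t buffer_size = 1 << 20) {
    assert(buffer_size > 0);

    // Declared before the stream so it outlives the stream.
    std::vector<char> buffer(buffer_size);
    std::ofstream out;

    // pubsetbuf is called before open(); whether and how the buffer is
    // used is implementation-defined, so treat this as a hint.
    out.rdbuf()->pubsetbuf(buffer.data(),
                           static_cast<std::streamsize>(buffer.size()));
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    for (const auto& chunk : chunks) {
        // No std::flush or std::endl here: data accumulates in the buffer
        // and reaches the OS in large writes.
        out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    }

    out.close();  // Flushes whatever is still buffered.
    return !out.fail();
}
```

Whether a larger buffer helps in practice would need measuring; if the extraction is CPU-bound, as the 100% core usage suggests, the gain from buffering alone may be small.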