Implement multithreading in zimdump #69

lidel · 2019-11-08T20:53:03Z

Issue extracted from ipfs/distributed-wikipedia-mirror#66

zimdump feels slower than it could be.
Below are some notes from my tests and ideas on how to improve its performance.

Single thread? Lack of buffer in front of disk writes?

I have an SSD, but disk I/O stays slow (iotop shows disk writes at <400 K/s!).
The tool seems to be CPU-bound: a single core is pinned at 100% while the remaining 7 cores sit idle. It looks like it is single-threaded and perhaps flushes after each write to disk?

Benchmarks

Unpacking wikipedia_en_top_mini_2019-09.zim (250M) took nearly 30 minutes:

$ zimdump -V
1.0.5
$ time zimdump -D out-zimdump wikipedia_en_top_mini_2019-09.zim
1573.26s user 47.65s system 99% cpu 27:14.82 total

$ du -sh out-zimdump
758M    out-zimdump

This is very slow compared to the Rust-based, multicore extract_zim from dignifiedquire/zim. extract_zim produces some errors and skips some files (the tool is no longer maintained), but it extracts most of the archive in under 10 seconds(!):

$ time extract_zim --skip-link wikipedia_en_top_mini_2019-09.zim --out out-extract_zim
Extracting file: wikipedia_en_top_mini_2019-09.zim to out-extract_zim

  Creating map
  Extracting entries: 243
  Spawning 243 threads
...
5.91s user 1.87s system 635% cpu 1.223 total

$ du -sh out-extract_zim
726M    out-extract_zim
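
As a rough reading of the two timings above, the wall-clock figures can be converted to seconds and compared. This is only arithmetic on the numbers quoted in this comment; the helper name `to_seconds` is made up for illustration, and `du -sh` reports disk usage rather than bytes written, so the write rate is an approximation.

```python
def to_seconds(wall):
    """Convert a wall-clock string such as '27:14.82' or '1.223' to seconds."""
    seconds = 0.0
    for part in wall.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


zimdump_s = to_seconds("27:14.82")  # "total" reported for zimdump 1.0.5
extract_s = to_seconds("1.223")     # "total" reported for extract_zim

# du -sh reported 758M for the zimdump output directory
print(f"zimdump wall time:  {zimdump_s:.2f} s")
print(f"zimdump write rate: {758 / zimdump_s:.2f} M/s")
print(f"wall-time ratio:    {zimdump_s / extract_s:.0f}x")
```

The average write rate comes out at well under 1 M/s, the same order of magnitude as the rate iotop showed, which fits the picture of a CPU-bound rather than disk-bound run.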

Things to try

Applying some/all optimizations from dignifiedquire/zim should make zimdump much, much faster:

builds a map of the file first, identifying individual clusters
creates a pool of 16 workers that process clusters in parallel
all writes to disk are buffered in memory and periodically flushed
(in Rust this is provided by BufWriter)
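
The three ideas above can be sketched in a few lines. This is a toy illustration in Python, not the code of dignifiedquire/zim or zimdump: the cluster layout, the function names (`pack_cluster`, `extract_cluster`, `extract_all`) and the use of zlib in place of the archive's real compression are all assumptions made for the example.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

WORKERS = 16  # pool size quoted in the list above; tune for the machine


def pack_cluster(files):
    """Build a toy cluster: one zlib stream plus a (path, offset, length) map.

    Hypothetical layout, used only to have something to extract.
    """
    index, payload, offset = [], b"", 0
    for path, content in files.items():
        index.append((path, offset, len(content)))
        payload += content
        offset += len(content)
    return {"index": index, "data": zlib.compress(payload)}


def extract_cluster(cluster, out_dir):
    """Decompress one cluster once, then write each blob via a buffered file."""
    raw = zlib.decompress(cluster["data"])
    for path, offset, length in cluster["index"]:
        target = os.path.join(out_dir, path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # A 1 MiB buffer: writes accumulate in memory and are flushed in bulk.
        with open(target, "wb", buffering=1 << 20) as fh:
            fh.write(raw[offset:offset + length])
    return len(cluster["index"])


def extract_all(clusters, out_dir, workers=WORKERS):
    """Process clusters in parallel; returns the number of files written."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: extract_cluster(c, out_dir), clusters))
```

Each worker owns a whole cluster, so a cluster is decompressed exactly once and no two workers write the same file.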


kelson42 · 2019-11-09T15:31:05Z

@lidel Thank you for this quality ticket, I'm supportive. Will do my best to get this done in January.

momack2 · 2020-01-07T01:02:57Z

Thanks @kelson42!! Curious how things are evolving now that we're in the new year - is this still on your agenda this month?

kelson42 · 2020-01-07T10:50:49Z

@momack2 I would like to, but we are a bit short on C++ resources currently. It has been postponed to February for the moment. If you can recommend someone, please tell us!

momack2 · 2020-01-07T20:40:37Z

I don't know of any C++ devs with bandwidth, but @jnthnvctr might be able to suggest other routes to get this work increased attention. We'd really love to update our distributed wikipedia mirror with snapshots more recent than 2017... ;)

kelson42 · 2020-01-08T08:32:02Z

@momack2 It is just a "small" delay, and we are already working on finding someone. Maybe you can retweet https://twitter.com/KiwixOffline/status/1214826834417860609

mgautierfr · 2020-01-08T10:42:34Z

I didn't know about the extract_zim tool. It's nice to see some Rust around zim (even if it is not maintained anymore).

I agree with this ticket, zim_tools is a small set of tools and we can improve it a lot.
Looping over articles in cluster index order instead of URL order is already done in the zimrecreate tool. It should not be too difficult to reuse it.
At the very least, we would use the libzim cache system and avoid decompressing the same cluster several times.
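
To make the ordering argument concrete, here is a small sketch that counts cluster decompressions for the two iteration orders. It is not libzim code: the `Entry` tuple, the sample URLs and the one-cluster cache are simplifying assumptions (the real libzim cache can hold more than one cluster).

```python
from collections import namedtuple

# Hypothetical directory entry: where an article's blob lives in the archive.
Entry = namedtuple("Entry", "url cluster blob")


def count_decompressions(entries):
    """Count cluster decompressions with a cache that holds a single cluster."""
    cached, count = None, 0
    for entry in entries:
        if entry.cluster != cached:  # cache miss: this cluster must be decompressed
            cached, count = entry.cluster, count + 1
    return count


def by_url(entries):
    return sorted(entries, key=lambda e: e.url)


def by_cluster(entries):
    return sorted(entries, key=lambda e: (e.cluster, e.blob))


# Made-up sample: consecutive URLs alternate between two clusters.
entries = [
    Entry("A/Ankara", 0, 0),
    Entry("A/Berlin", 1, 0),
    Entry("A/Cairo", 0, 1),
    Entry("A/Delhi", 1, 1),
    Entry("A/Essen", 0, 2),
    Entry("A/Fes", 1, 2),
]

print(count_decompressions(by_url(entries)))      # 6
print(count_decompressions(by_cluster(entries)))  # 2
```

In URL order every entry lands in a different cluster from the previous one, so each access is a miss; in cluster order each cluster is decompressed once.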

kelson42 · 2020-02-05T11:29:37Z

@mgautierfr Considering that this ticket needs to get the articles in the order they are stored in the file (to save cluster decompression), and that this is also needed by (at least) zimrecreate and zimcheck, I think it would be smart to provide this kind of iterator within libzim. Please confirm.

kelson42 · 2020-03-04T15:00:22Z

The most important part of the improvement will be achieved by implementing openzim/libzim#300

kelson42 · 2020-05-14T13:06:32Z

@lidel The zimdump speed has been improved a lot (around 15x) but without adding multithreading. The reason is that we need to revamp our libzim cache strategy to really benefit from it in zimdump. For the moment this is on hold and I will move it out of the IPFS project as I believe the speed is acceptable now.

lidel · 2020-05-14T14:22:09Z

@kelson42 I can confirm: it is now within the same order of magnitude as the Rust library and can be used for practical purposes.

wikipedia_tr_all_maxi_2020-04.zim (4.4G):

dignifiedquire/zim: 1m13s
zimdump v1.1.1: 2m8s

Still slower, but that will change with multithreading.

kelson42 · 2020-05-14T18:31:56Z

@lidel Thx for sharing the benchmark.

kelson42 · 2020-08-28T11:12:18Z

@mgautierfr @veloman-yunkan @MiguelRocha AFAIK the libzim cache has been improved in the latest version, 6.2.0, so that zimcheck and zimdump can get the full potential of multithreading. Would this not be a good time to reconsider this ticket?

veloman-yunkan · 2020-08-28T13:43:26Z

@Kelson I can work on this

kelson42 · 2020-08-28T14:09:54Z

@veloman-yunkan Fine with me, but @MiguelRocha had started something which I believe is available at https://github.com/openzim/zim-tools/tree/speed-up-zimdump. @MiguelRocha Do you remember what the status of the code was?

MiguelRocha · 2020-08-30T09:55:39Z

@kelson42 Yes, at that time I created a thread pool to handle the decompression of the clusters in parallel. Since then zimdump has changed quite a bit, so it's just a matter of rebasing the branch onto master and fixing any conflicts.

kelson42 · 2020-08-30T10:02:33Z

@MiguelRocha Thank you for the update
@veloman-yunkan The same multi-threading approach should be used as well for zimcheck.

kelson42 · 2020-11-20T05:40:16Z

@veloman-yunkan I’m unsure about the status here. Could you enlighten me, please?

veloman-yunkan · 2020-11-20T06:11:30Z

@kelson42 I am going to do the zimcheck part over this weekend.

kelson42 · 2020-12-20T11:43:57Z

@veloman-yunkan Any news on this front?

veloman-yunkan · 2020-12-21T12:36:04Z

> @veloman-yunkan Any news on this front?

@kelson42 A prototype implementation was ready, but then the libzim_next branch was merged and I had to rebase my branch and resolve a lot of large conflicts. I was waiting for any ripple effects from the merged libzim_next branch to fade out before continuing/finishing the work on this.

kelson42 · 2020-12-21T12:38:35Z

@veloman-yunkan I don't think there is any plan for now to make big changes again. @mgautierfr has to implement openzim/libzim#397 but (besides a few other details) this is going to be the only thing to change before the release in mid-January. I think you can continue on this ticket.

kelson42 · 2021-08-10T15:29:17Z

So, we still need this for the zimdump binary (it's done for zimcheck), but it is less urgent. We can keep it for the future.

lidel mentioned this issue Nov 8, 2019

Switch to zimdump from zim-tools ipfs/distributed-wikipedia-mirror#66

Closed

6 tasks

kelson42 added the enhancement label Nov 9, 2019

kelson42 changed the title ~~zimdump performance~~ Speed-up zimdump -D (fs dumping feature) Nov 9, 2019

kelson42 added the IPFS Necessary for proper IPFS version of Wikipedia label Nov 9, 2019

kelson42 assigned mgautierfr Nov 9, 2019

kelson42 pinned this issue Nov 9, 2019

mgautierfr added the zimdump label Jan 14, 2020

mgautierfr mentioned this issue Jan 14, 2020

Rewrite API of zimdump. #70

Closed

kelson42 assigned MiguelRocha and unassigned mgautierfr Feb 7, 2020

MiguelRocha linked a pull request Mar 4, 2020 that will close this issue

WIP Process files in order #72

Closed

kelson42 mentioned this issue Apr 3, 2020

Have zimdump to break down the files into multiple folders #81

Closed

kelson42 changed the title ~~Speed-up zimdump -D (fs dumping feature)~~ Implement multithreading in zimdump May 14, 2020

kelson42 removed the IPFS Necessary for proper IPFS version of Wikipedia label May 14, 2020

kelson42 unpinned this issue Jul 11, 2020

kelson42 assigned veloman-yunkan Aug 30, 2020

kelson42 added the zimcheck label Aug 30, 2020

kelson42 changed the title ~~Implement multithreading in zimdump~~ Implement multithreading in zimdump & zimcheck Aug 30, 2020

kelson42 unassigned MiguelRocha Sep 2, 2020

kelson42 mentioned this issue Sep 23, 2020

libzim cache size should be programmatically mutable openzim/libzim#311

Closed

veloman-yunkan mentioned this issue Nov 21, 2020

Multithreaded zimcheck #194

Merged

kelson42 mentioned this issue Dec 9, 2020

Support for opening a ZIM file by file descriptor openzim/libzim#449

Merged

leanhdung1994 mentioned this issue Apr 17, 2021

zimDump takes nearly 2 days to dump a 6GB zim file #233

Closed

kelson42 mentioned this issue Mar 18, 2022

feat: Move mirror creation to the cloud ipfs/distributed-wikipedia-mirror#123

Closed

1 task

kelson42 changed the title ~~Implement multithreading in zimdump & zimcheck~~ Implement multithreading in zimdump ~~~& zimcheck~~~ Apr 13, 2024

kelson42 changed the title ~~Implement multithreading in zimdump ~~~& zimcheck~~~~~ Implement multithreading in zimdump ~~& zimcheck~~ Apr 13, 2024

kelson42 changed the title ~~Implement multithreading in zimdump ~~& zimcheck~~~~ Implement multithreading in zimdump & zimcheck Apr 13, 2024

kelson42 changed the title ~~Implement multithreading in zimdump & zimcheck~~ Implement multithreading in zimdump <s>& zimcheck</s> Apr 13, 2024

kelson42 changed the title ~~Implement multithreading in zimdump <s>& zimcheck</s>~~ Implement multithreading in zimdump Apr 13, 2024
