Switch to zimdump from zim-tools #66
Labels
need/analysis
Needs further analysis before proceeding
P1
High: Likely tackled by core team if no one steps up
snapshots
issues related to snapshot creation and updates
Motivation
Creating a new snapshot requires unpacking data from a ZIM archive.
The legacy process relied on a customized extract_zim tool, which unfortunately is no longer able to unpack the latest snapshots (#60 (comment)).
Good news: we now have upstream openzim/zim-tools, which not only unpacks archives without a problem but also removes a maintenance burden from the mirror project.
Prerequisites
@kelson42 I took a look at the output of zimdump v1.0.5 and believe we could switch to this tool once the issues below are addressed:
Filenames should match article URLs (openzim/zim-tools#24, "zimdump dumping filename are based on the article titles")
(Replace spaces with underscores and add the .html suffix, so we can load them via the HTTP gateway as-is)
Example: https://en.wikipedia-on-ipfs.org/wiki/Vincent_van_Gogh.html
Unescape paths before creating assets (openzim/zim-tools#68, "zimdump should unescape paths before creating assets")
Redirects via an HTML file with <meta http-equiv="refresh" (openzim/zim-tools#23 (comment), "zimdump does not dump redirects properly (empty files)")
Performance (openzim/zim-tools#69, "Implement multithreading in zimdump")
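To make the first three prerequisites concrete, here is a rough Python sketch of the output we would expect. It is not zimdump code. The helper names (article_filename, unescape_path, redirect_page) are hypothetical, and the assumption that asset paths are percent-encoded is mine.

```python
import html
from urllib.parse import unquote


def article_filename(title: str) -> str:
    """Map an article title to the file name the gateway should serve.

    Hypothetical helper: spaces become underscores and ".html" is appended.
    """
    return title.replace(" ", "_") + ".html"


def unescape_path(path: str) -> str:
    """Percent-decode an asset path before the file is created on disk.

    Assumes the path is percent-encoded UTF-8.
    """
    return unquote(path)


def redirect_page(target: str) -> str:
    """Build a minimal HTML stub that redirects to `target` via meta refresh."""
    safe = html.escape(target, quote=True)  # keep the attribute value well-formed
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f'<meta http-equiv="refresh" content="0;url={safe}">\n'
        f'<link rel="canonical" href="{safe}">\n'
        "</head><body></body></html>\n"
    )
```

For example, article_filename("Vincent van Gogh") gives the file name used in the example URL above, and redirect_page(...) gives the non-empty file a redirect entry should be dumped as.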
Nice-to-haves
Not blockers, but things to consider in the future:
Prebuilt binaries for platforms other than 64-bit Linux
While folks on Windows and macOS could run this in a VM, it would be really nice if the pipeline for building a new snapshot worked on all three platforms.
Ability to skip processing of Xapian index
(Not a hard blocker, but it could perhaps speed up the build even further, as we don't use it at the moment.)