-
-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to using JS WACZ #505
Conversation
Also noticing that |
TODO:
|
pages: this.pagesDir, | ||
detectPages: false, | ||
indexFromWARCs: false, | ||
logDirectory: this.logDir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like logDirectory
may not be working as expected, not seeing log files in the resulting WACZs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it's not working because this isn't supported in js-wacz yet!
Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures. |
I'm wondering if the solution here isn't just to change the tests to expect a number. Looking at the CDXJ specification, it looks like examples also use an int for status code, e.g.: https://specs.webrecorder.net/cdxj/0.1.0/#example I would assume ReplayWeb.page can handle input as a string or number, since our spec has said one thing while the crawler has been doing another? Of course important to verify. |
Closing in favor of #673 (WACZ generation approach has been changed, as documented in #674) Also worth noting that as of webrecorder/warcio.js#75, CDXJ created by warcio.js now uses strings consistently for |
Fixes #674 This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <[email protected]>
Replaces dependencies on py-wacz with importing js-wacz natively.
Writes pages to either pages.jsonl (if seed) or extraPages.jsonl (if non-seed)
Uses streams for writing pages
Replaces --generateCDX with just moving tmp-cdx -> indexes
Removes any dependencies on python
Fixes #484
Pending more testing and js-wacz release, using @tw4l branch for now!