Switch to using JS WACZ #505

ikreymer · 2024-03-22T02:58:37Z

Replaces dependencies on py-wacz with importing js-wacz natively.
Writes pages to either pages.jsonl (if seed) or extraPages.jsonl (if non-seed)
Uses streams for writing pages
Replaces --generateCDX with just moving tmp-cdx -> indexes
Removes any dependencies on python
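
For illustration only, the seed / non-seed split for page entries could look roughly like the sketch below. It is not the code in this PR: `PageEntry`, `LineSink`, `MemorySink` and `writePage` are hypothetical names, and the in-memory sink stands in for real write streams opened on pages.jsonl and extraPages.jsonl.

```typescript
// Minimal sink shape; a Node write stream would also satisfy it.
interface LineSink {
  write(chunk: string): unknown;
}

// Hypothetical page entry; field names are assumptions for illustration.
interface PageEntry {
  url: string;
  title?: string;
  seed: boolean;
}

// Append one JSON line to the seed sink or the non-seed sink.
function writePage(page: PageEntry, seedPages: LineSink, extraPages: LineSink): void {
  const line = JSON.stringify(page) + "\n";
  (page.seed ? seedPages : extraPages).write(line);
}

// In-memory stand-in for the two output streams, used only for the demo.
class MemorySink implements LineSink {
  lines: string[] = [];
  write(chunk: string): boolean {
    this.lines.push(chunk);
    return true;
  }
}

const pagesSink = new MemorySink(); // would be pages.jsonl
const extraSink = new MemorySink(); // would be extraPages.jsonl

writePage({ url: "https://example.com/", seed: true }, pagesSink, extraSink);
writePage({ url: "https://example.com/about", seed: false }, pagesSink, extraSink);
```

Because each entry is written as soon as it is produced, nothing needs to be buffered per crawl, which is the point of using streams here.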

Fixes #484

Pending more testing and js-wacz release, using @tw4l branch for now!

remove python dependencies

src/crawler.ts

tw4l · 2024-03-22T13:32:33Z

Also noticing that js-wacz is logging strings to stdout, which breaks our logging format. Might want to see what we can do about that. I suppose if we call it as a subprocess via the cli we could capture the stdout and write it into the details of a crawler log line...
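
A rough sketch of that idea, assuming the stdout text has already been captured from the subprocess: each non-empty line is wrapped into a JSON log line with the raw text under `details`. The field names (`timestamp`, `logLevel`, `context`, `message`, `details`) and the helper `wrapToolOutput` are assumptions for illustration, not the crawler's actual logger API, and the sample strings are placeholders rather than real js-wacz output.

```typescript
// Assumed shape of a structured log line; the real logger format may differ.
interface LogLine {
  timestamp: string;
  logLevel: string;
  context: string;
  message: string;
  details: Record<string, unknown>;
}

// Turn captured stdout into one JSON log line per non-empty output line.
function wrapToolOutput(stdout: string, context: string, timestamp: string): string[] {
  return stdout
    .split(/\r?\n/)
    .filter((text) => text.trim().length > 0)
    .map((text) => {
      const entry: LogLine = {
        timestamp,
        logLevel: "info",
        context,
        message: "js-wacz output",
        details: { stdout: text },
      };
      return JSON.stringify(entry);
    });
}

// Placeholder text standing in for whatever the tool printed.
const wrappedLines = wrapToolOutput(
  "first message\n\nsecond message\n",
  "wacz",
  "2024-03-22T13:32:33.000Z",
);
```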

src/crawler.ts

tw4l · 2024-03-22T20:51:50Z

TODO:

Add WACZ validation (not yet supported in js-wacz)
Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls)
Possibly move CDXJ line handling in js-wacz from bin/cli.js into WACZ class

…rrect offsets

tw4l · 2024-07-01T19:19:53Z

src/crawler.ts

+      pages: this.pagesDir,
+      detectPages: false,
+      indexFromWARCs: false,
+      logDirectory: this.logDir,


Seems like logDirectory may not be working as expected, not seeing log files in the resulting WACZs.

Ah, it's not working because this isn't supported in js-wacz yet!

ikreymer · 2024-07-03T23:02:37Z

Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

tw4l · 2024-08-14T15:44:04Z

> Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

I'm wondering if the solution here isn't just to change the tests to expect a number. Looking at the CDXJ specification, it looks like examples also use an int for status code, e.g.: https://specs.webrecorder.net/cdxj/0.1.0/#example

I would assume ReplayWeb.page can handle input as a string or number, since our spec has said one thing while the crawler has been doing another? Of course important to verify.
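
As a sketch of what tolerant handling could look like on the reading side (not ReplayWeb.page's actual code): split a CDXJ line into key, timestamp and JSON block, then accept `status` as either a number or a string. `parseCdxjLine` and `statusAsString` are hypothetical helpers, and the sample lines are made up for the demo.

```typescript
interface CdxjEntry {
  key: string;
  timestamp: string;
  fields: Record<string, unknown>;
}

// A CDXJ line is "<searchable key> <timestamp> <JSON object>".
function parseCdxjLine(line: string): CdxjEntry {
  const first = line.indexOf(" ");
  const second = line.indexOf(" ", first + 1);
  if (first < 0 || second < 0) {
    throw new Error("not a CDXJ line: " + line);
  }
  return {
    key: line.slice(0, first),
    timestamp: line.slice(first + 1, second),
    fields: JSON.parse(line.slice(second + 1)) as Record<string, unknown>,
  };
}

// Accept status as number or string and always hand back a string.
function statusAsString(entry: CdxjEntry): string | undefined {
  const value = entry.fields["status"];
  if (typeof value === "number") return String(value);
  if (typeof value === "string") return value;
  return undefined;
}

const fromNumber = statusAsString(
  parseCdxjLine('com,example)/ 20240703230237 {"url": "https://example.com/", "status": 200}'),
);
const fromString = statusAsString(
  parseCdxjLine('com,example)/ 20240703230237 {"url": "https://example.com/", "status": "200"}'),
);
```

A test could compare normalized values like these instead of asserting on one representation.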

tw4l · 2024-08-26T21:47:02Z

Closing in favor of #673 (WACZ generation approach has been changed, as documented in #674)

Also worth noting that as of webrecorder/warcio.js#75, CDXJ created by warcio.js now uses strings consistently for status, offset, and length.

Fixes #674

This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <[email protected]>
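
To make the merge step described above concrete, here is a small in-memory sketch. It is a stand-in only: #673 is described as sorting on disk with the 'sort' command, whereas this version sorts an array, which would not scale to large crawls. `mergeCdxj` and `shouldUseZipNum` are hypothetical names, and since the description says only ">50K" without a unit, the size measure and threshold are left as caller-supplied parameters rather than guessed.

```typescript
// Merge per-WARC CDXJ line lists into one list ordered by plain string
// comparison (for ASCII keys this matches byte-wise ordering).
function mergeCdxj(perWarc: string[][]): string[] {
  const all = ([] as string[]).concat(...perWarc);
  return all.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// Compress to ZipNum when the index exceeds the threshold, or always when
// the generateCDX option is set. The unit of `size` is up to the caller.
function shouldUseZipNum(size: number, threshold: number, generateCDX: boolean): boolean {
  return generateCDX || size > threshold;
}

// Made-up CDXJ lines for the demo.
const mergedIndex = mergeCdxj([
  ["org,example)/ 20240101000000 {}", "com,example)/b 20240101000000 {}"],
  ["com,example)/a 20240101000000 {}"],
]);
```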

ikreymer added 3 commits March 21, 2024 19:43

switch to using js-wacz natively for wacz creation!

a457a5e

remove python dependencies

replace generateCDX with just moves files from tmp-cdx

c6723b0

fix tests?

1595b35

tw4l reviewed Mar 22, 2024

tw4l force-pushed the use-js-wacz branch from 24cf4a2 to 1595b35 Compare March 22, 2024 13:11

Wait until after WACZ generation to delete tmp-cdx

952cd75

tw4l reviewed Mar 22, 2024

tw4l added 7 commits March 22, 2024 16:35

Add WACZLogger class for use with js-wacz

c68d117

Temporarily comment out validation tests using py-wacz

13b6385

Fix extra hops test to account for extraPages

97b1069

Fix custom driver test to account for extraPages

84c1ef2

Fix extra hops test

118ffb0

Fix typo

82169fe

Generate CDX with warcio CDXIndexer

a5d36ce

tw4l force-pushed the use-js-wacz branch from 8fa4d34 to a5d36ce Compare March 22, 2024 20:36

Switch js-wacz dependency to ^0.1.0

d5e5976

ikreymer added 12 commits March 22, 2024 18:04

Merge branch 'main' into use-js-wacz

3e76568

Merge branch 'main' into use-js-wacz

9bbce0c

reenable temp-cdx

280c0c4

use tempCdxDir

1f02102

test: clear saved state test dir for reentrancy

4eee5a7

tests: fix test to account for extraPages.jsonl

58014e6

Merge branch 'main' into use-js-wacz

dd92629

Merge branch 'main' into use-js-wacz

e7de7a0

Merge branch 'main' into use-js-wacz

22ddc92

Merge branch 'main' into use-js-wacz

c29b1b2

Merge branch 'main' into use-js-wacz

40a22fd

Merge branch 'main' into use-js-wacz

2168dbd

ikreymer added 14 commits June 10, 2024 01:00

ensure the warcinfo record is also indexed via writeCDX, to ensure co…

d4fd9e7

…rrect offsets

Merge branch 'index-warcinfo' into use-js-wacz

8e89fa2

remove unneeded await

6e4a401

undo removal, fix tests

de40884

Merge branch 'main' into use-js-wacz

eb2a3ab

don't try to index warcinfo, just add offset

ad44858

Merge branch 'main' into use-js-wacz

13bb461

Merge branch 'main' into use-js-wacz

95ce882

prepend 'bearer ' to signing token opt as its passed directly

3d9f267

Merge branch 'main' into use-js-wacz

0586a8d

Merge branch 'main' into use-js-wacz

770f136

Merge branch 'main' into use-js-wacz

ef51e09

Merge branch 'main' into use-js-wacz

8aee09c

Merge branch 'main' into use-js-wacz

cf004e4

tw4l reviewed Jul 1, 2024

Merge branch 'main' into use-js-wacz

5ff9a26

ikreymer added 2 commits July 11, 2024 20:05

Merge branch 'main' into use-js-wacz

c3783ba

Merge branch 'main' into use-js-wacz

e464ce0

Merge branch 'main' into use-js-wacz

033f848

ikreymer mentioned this pull request Aug 26, 2024

Streaming in-place WACZ creation + CDXJ indexing #673

Merged

tw4l closed this Aug 26, 2024
