Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to using JS WACZ #505

Closed
wants to merge 42 commits into from
Closed

Switch to using JS WACZ #505

wants to merge 42 commits into from

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Mar 22, 2024

Replaces dependencies on py-wacz with importing js-wacz natively.
Writes pages to either pages.jsonl (if seed) or extraPages.jsonl (if non-seed)
Uses streams for writing pages
Replaces --generateCDX with just moving tmp-cdx -> indexes
Removes any dependencies on python

Fixes #484

Pending more testing and js-wacz release, using @tw4l branch for now!

@tw4l
Copy link
Member

tw4l commented Mar 22, 2024

Also noticing that js-wacz is logging strings to stdout, which breaks our logging format. Might want to see what we can do about that. I suppose if we call it as a subprocess via the cli we could capture the stdout and write it into the details of a crawler log line...

@tw4l
Copy link
Member

tw4l commented Mar 22, 2024

TODO:

  • Add WACZ validation (not yet supported in js-wacz)
  • Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls)
  • Possibly move CDXJ line handling in js-wacz from bin/cli.js into WACZ class

pages: this.pagesDir,
detectPages: false,
indexFromWARCs: false,
logDirectory: this.logDir,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like logDirectory may not be working as expected, not seeing log files in the resulting WACZs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's not working because this isn't supported in js-wacz yet!

@ikreymer
Copy link
Member Author

ikreymer commented Jul 3, 2024

Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

@tw4l
Copy link
Member

tw4l commented Aug 14, 2024

Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

I'm wondering if the solution here isn't just to change the tests to expect a number. Looking at the CDXJ specification, it looks like examples also use an int for status code, e.g.: https://specs.webrecorder.net/cdxj/0.1.0/#example

I would assume ReplayWeb.page can handle input as a string or number, since our spec has said one thing while the crawler has been doing another? Of course important to verify.

@tw4l
Copy link
Member

tw4l commented Aug 26, 2024

Closing in favor of #673 (WACZ generation approach has been changed, as documented in #674)

Also worth noting that as of webrecorder/warcio.js#75, CDXJ created by warcio.js now uses strings consistently for status, offset, and length

@tw4l tw4l closed this Aug 26, 2024
ikreymer added a commit that referenced this pull request Aug 29, 2024
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use js-wacz to create WACZ files
2 participants