Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming in-place WACZ creation + CDXJ indexing #673

Merged
merged 76 commits into from
Aug 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
76 commits
Select commit Hold shift + click to select a range
a457a5e
switch to using js-wacz natively for wacz creation!
ikreymer Mar 22, 2024
c6723b0
replace generateCDX with just moves files from tmp-cdx
ikreymer Mar 22, 2024
1595b35
fix tests?
ikreymer Mar 22, 2024
952cd75
Wait until after WACZ generation to delete tmp-cdx
tw4l Mar 22, 2024
c68d117
Add WACZLogger class for use with js-wacz
tw4l Mar 22, 2024
13b6385
Temporariy comment out validation tests using py-wacz
tw4l Mar 22, 2024
97b1069
Fix extra hops test to account for extraPages
tw4l Mar 22, 2024
84c1ef2
Fix custom driver test to account for extraPages
tw4l Mar 22, 2024
118ffb0
Fix extra hops test
tw4l Mar 22, 2024
82169fe
Fix typo
tw4l Mar 22, 2024
a5d36ce
Generate CDX with warcio CDXIndexer
tw4l Mar 22, 2024
d5e5976
Switch js-wacz dependency to ^0.1.0
tw4l Mar 22, 2024
3e76568
Merge branch 'main' into use-js-wacz
ikreymer Mar 23, 2024
9bbce0c
Merge branch 'main' into use-js-wacz
ikreymer Mar 26, 2024
280c0c4
reenable temp-cdx
ikreymer Mar 26, 2024
1f02102
use tempCdxDir
ikreymer Mar 26, 2024
4eee5a7
test: clear saved state test dir for reentrancy
ikreymer Mar 26, 2024
58014e6
tests: fix test to account for extraPages.jsonl
ikreymer Mar 27, 2024
dd92629
Merge branch 'main' into use-js-wacz
ikreymer Mar 28, 2024
e7de7a0
Merge branch 'main' into use-js-wacz
ikreymer Apr 11, 2024
22ddc92
Merge branch 'main' into use-js-wacz
ikreymer Apr 19, 2024
c29b1b2
Merge branch 'main' into use-js-wacz
ikreymer May 30, 2024
40a22fd
Merge branch 'main' into use-js-wacz
ikreymer Jun 6, 2024
2168dbd
Merge branch 'main' into use-js-wacz
ikreymer Jun 6, 2024
d4fd9e7
ensure the warcinfo record is also indexed via writeCDX, to ensure co…
ikreymer Jun 10, 2024
8e89fa2
Merge branch 'index-warcinfo' into use-js-wacz
ikreymer Jun 10, 2024
6e4a401
remove unneeded await
ikreymer Jun 10, 2024
de40884
undo removal, fix tests
ikreymer Jun 10, 2024
eb2a3ab
Merge branch 'main' into use-js-wacz
ikreymer Jun 11, 2024
ad44858
don't try to index warcinfo, just add offset
ikreymer Jun 11, 2024
13bb461
Merge branch 'main' into use-js-wacz
ikreymer Jun 14, 2024
95ce882
Merge branch 'main' into use-js-wacz
ikreymer Jun 15, 2024
3d9f267
prepend 'bearer ' to signing token opt as its passed directly
ikreymer Jun 21, 2024
0586a8d
Merge branch 'main' into use-js-wacz
ikreymer Jun 21, 2024
770f136
Merge branch 'main' into use-js-wacz
ikreymer Jun 22, 2024
ef51e09
Merge branch 'main' into use-js-wacz
ikreymer Jun 26, 2024
8aee09c
Merge branch 'main' into use-js-wacz
ikreymer Jun 26, 2024
cf004e4
Merge branch 'main' into use-js-wacz
ikreymer Jun 26, 2024
5ff9a26
Merge branch 'main' into use-js-wacz
ikreymer Jul 3, 2024
c3783ba
Merge branch 'main' into use-js-wacz
ikreymer Jul 12, 2024
e464ce0
Merge branch 'main' into use-js-wacz
ikreymer Jul 26, 2024
033f848
Merge branch 'main' into use-js-wacz
ikreymer Aug 17, 2024
d26e8d0
initial pass at generating streaming wacz file:
ikreymer Aug 22, 2024
7e370c4
support streaming WACZ upload with --generateWACZStream
ikreymer Aug 22, 2024
705e5ee
handle in single pass, instead of two passes, stream directly as file…
ikreymer Aug 22, 2024
1a75247
support zipnum compressed cdx, if tmp cdx dir is >50K
ikreymer Aug 22, 2024
9d959e7
use mergeCDX for --generateCDX as well, run for both generateCDX / ge…
ikreymer Aug 22, 2024
c9e03a0
try old warcio.js
ikreymer Aug 22, 2024
8ab31e8
test fix?
ikreymer Aug 22, 2024
12f44db
try new warcio again
ikreymer Aug 22, 2024
3dc3182
update yarn.lock
ikreymer Aug 23, 2024
ad0cdf6
tests: try concurrency 1
ikreymer Aug 23, 2024
e7d6547
remove unneeded async
ikreymer Aug 23, 2024
8b0bcc7
remove maxConcurrency
ikreymer Aug 23, 2024
e9e4cab
closeLog: cleanup, avoid dupe call
ikreymer Aug 23, 2024
ba6f6bf
update
ikreymer Aug 23, 2024
98b5075
ensure readline processed immediately
ikreymer Aug 23, 2024
eabacde
merge cdx: for compatibility, always generate uncompressed with --gen…
ikreymer Aug 23, 2024
887beb1
update to warcio.js 2.3.0-beta.0
ikreymer Aug 23, 2024
3b3e9f5
fix package version for warcio
ikreymer Aug 24, 2024
6338f9d
compress CDXJ for --generateWACZ but not for --generateCDX
ikreymer Aug 24, 2024
dabe7fc
drop microseconds from WACZ 'created' field
ikreymer Aug 24, 2024
c98e195
disk utilization: don't add wacz or warc storage when uploading to re…
ikreymer Aug 24, 2024
97e5c61
convert readline directly to async iter
ikreymer Aug 24, 2024
5fc2c69
simplify dependencies
ikreymer Aug 24, 2024
0304b57
cleanup, remove unused
ikreymer Aug 24, 2024
838ef4a
Update src/util/wacz.ts
ikreymer Aug 28, 2024
36c02a8
rename StartMarker -> CurrZipFileMarker, EndMarker -> EndOfZipFileMar…
ikreymer Aug 28, 2024
c427658
tests: validate wacz with py-wacz if '-validate' flag is passed in
ikreymer Aug 28, 2024
d346685
install py-wacz as root
ikreymer Aug 28, 2024
a12a08b
deps: update to warcio 2.3.0 + wombat 3.8.0
ikreymer Aug 28, 2024
d2c1720
add 'cdxj-gzip-1.0' header to idx files, to match py-wacz/cdxj-indexer
ikreymer Aug 28, 2024
2588bda
Merge branch 'main' into streaming-wacz
ikreymer Aug 29, 2024
fb427ea
rename tempCdxDir -> warcCdxDir and use 'warc-cdx' instead of 'tmp-cdx'
ikreymer Aug 29, 2024
67893db
Update src/util/wacz.ts
ikreymer Aug 29, 2024
2cef026
don't attempt merge if no CDXJ files to merge
ikreymer Aug 29, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .github/workflows/ci.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -60,8 +60,11 @@ jobs:
- name: add http-server for tests
run: yarn add -D http-server

- name: install py-wacz as root for tests
run: sudo pip install wacz

- name: run all tests as root
run: sudo DOCKER_HOST_NAME=172.17.0.1 yarn test
run: sudo DOCKER_HOST_NAME=172.17.0.1 yarn test -validate

- name: run saved state + qa compare test as non-root - with volume owned by current user
run: |
Expand Down
7 changes: 0 additions & 7 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,6 @@ EXPOSE 9222 9223 6080

WORKDIR /app

ADD requirements.txt /app/
RUN python3 -m venv /app/python-venv && \
/app/python-venv/bin/pip install -U setuptools && \
/app/python-venv/bin/pip install -r requirements.txt && \
ln -s /app/python-venv/bin/wacz /usr/bin/wacz && \
ln -s /app/python-venv/bin/cdxj-indexer /usr/bin/cdxj-indexer

ADD package.json yarn.lock /app/

# to allow forcing rebuilds from this stage
Expand Down
10 changes: 7 additions & 3 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,9 @@
},
"dependencies": {
"@novnc/novnc": "^1.4.0",
"@types/sax": "^1.2.7",
"@webrecorder/wabac": "^2.19.7",
"@webrecorder/wabac": "^2.19.8",
"browsertrix-behaviors": "^0.6.4",
"client-zip": "^2.4.5",
"fetch-socks": "^1.3.0",
"get-folder-size": "^4.0.0",
"husky": "^8.0.3",
Expand All @@ -36,7 +36,7 @@
"tsc": "^2.0.4",
"undici": "^6.18.2",
"uuid": "8.3.2",
"warcio": "^2.2.1",
"warcio": "^2.3.0",
"ws": "^7.4.4",
"yargs": "^17.7.2"
},
Expand All @@ -46,6 +46,7 @@
"@types/node": "^20.8.7",
"@types/pixelmatch": "^5.2.6",
"@types/pngjs": "^6.0.4",
"@types/sax": "^1.2.7",
"@types/uuid": "^9.0.6",
"@types/ws": "^8.5.8",
"@typescript-eslint/eslint-plugin": "^6.10.0",
Expand All @@ -62,5 +63,8 @@
"jest": {
"transform": {},
"testTimeout": 90000
},
"resolutions": {
"wrap-ansi": "7.0.0"
}
}
196 changes: 75 additions & 121 deletions src/crawler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@ import { parseArgs } from "./util/argParser.js";

import yaml from "js-yaml";

import { WACZ, WACZInitOpts, mergeCDXJ } from "./util/wacz.js";

import { HealthChecker } from "./util/healthcheck.js";
import { TextExtractViaSnapshot } from "./util/textextract.js";
import {
Expand Down Expand Up @@ -62,7 +64,12 @@ import {
import { Recorder } from "./util/recorder.js";
import { SitemapReader } from "./util/sitemapper.js";
import { ScopedSeed } from "./util/seeds.js";
import { WARCWriter, createWARCInfo, setWARCInfo } from "./util/warcwriter.js";
import {
WARCWriter,
createWARCInfo,
setWARCInfo,
streamFinish,
} from "./util/warcwriter.js";
import { isHTMLMime, isRedirectStatus } from "./util/reqresp.js";
import { initProxy } from "./util/proxy.js";

Expand Down Expand Up @@ -117,7 +124,7 @@ export class Crawler {

pagesFH?: WriteStream | null = null;
extraPagesFH?: WriteStream | null = null;
logFH!: WriteStream;
logFH: WriteStream | null = null;

crawlId: string;

Expand Down Expand Up @@ -150,7 +157,8 @@ export class Crawler {

archivesDir: string;
tempdir: string;
tempCdxDir: string;
warcCdxDir: string;
indexesDir: string;

screenshotWriter: WARCWriter | null;
textWriter: WARCWriter | null;
Expand Down Expand Up @@ -288,7 +296,10 @@ export class Crawler {
// archives dir
this.archivesDir = path.join(this.collDir, "archive");
this.tempdir = path.join(os.tmpdir(), "tmp-dl");
this.tempCdxDir = path.join(this.collDir, "tmp-cdx");

// indexes dirs
this.warcCdxDir = path.join(this.collDir, "warc-cdx");
this.indexesDir = path.join(this.collDir, "indexes");

this.screenshotWriter = null;
this.textWriter = null;
Expand Down Expand Up @@ -470,7 +481,7 @@ export class Crawler {
if (!this.params.dryRun) {
await fsp.mkdir(this.archivesDir, { recursive: true });
await fsp.mkdir(this.tempdir, { recursive: true });
await fsp.mkdir(this.tempCdxDir, { recursive: true });
await fsp.mkdir(this.warcCdxDir, { recursive: true });
}

this.logFH = fs.createWriteStream(this.logFilename, { flags: "a" });
Expand Down Expand Up @@ -1478,36 +1489,24 @@ self.__bx_behaviors.selectMainBehavior();
await this.combineWARC();
}

if (this.params.generateCDX && !this.params.dryRun) {
logger.info("Generating CDX");
await fsp.mkdir(path.join(this.collDir, "indexes"), { recursive: true });
await this.crawlState.setStatus("generate-cdx");
logger.info("Crawling done");

const warcList = await fsp.readdir(this.archivesDir);
const warcListFull = warcList.map((filename) =>
path.join(this.archivesDir, filename),
if (
(this.params.generateCDX || this.params.generateWACZ) &&
!this.params.dryRun
) {
logger.info("Merging CDX");
await this.crawlState.setStatus(
this.params.generateWACZ ? "generate-wacz" : "generate-cdx",
);

//const indexResult = await this.awaitProcess(child_process.spawn("wb-manager", ["reindex", this.params.collection], {cwd: this.params.cwd}));
const params = [
"-o",
path.join(this.collDir, "indexes", "index.cdxj"),
...warcListFull,
];
const indexResult = await this.awaitProcess(
child_process.spawn("cdxj-indexer", params, { cwd: this.params.cwd }),
await mergeCDXJ(
this.warcCdxDir,
this.indexesDir,
this.params.generateWACZ ? null : false,
);
if (indexResult === 0) {
logger.debug("Indexing complete, CDX successfully created");
} else {
logger.error("Error indexing and generating CDX", {
"status code": indexResult,
});
}
}

logger.info("Crawling done");

if (
this.params.generateWACZ &&
!this.params.dryRun &&
Expand Down Expand Up @@ -1543,11 +1542,9 @@ self.__bx_behaviors.selectMainBehavior();
if (!this.logFH) {
return;
}
try {
await new Promise<void>((resolve) => this.logFH.close(() => resolve()));
} catch (e) {
// ignore
}
const logFH = this.logFH;
this.logFH = null;
await streamFinish(logFH);
}

async generateWACZ() {
Expand Down Expand Up @@ -1577,110 +1574,67 @@ self.__bx_behaviors.selectMainBehavior();
logger.fatal("No WARC Files, assuming crawl failed");
}

logger.debug("End of log file, storing logs in WACZ");
const waczPath = path.join(this.collDir, this.params.collection + ".wacz");

// Build the argument list to pass to the wacz create command
const waczFilename = this.params.collection.concat(".wacz");
const waczPath = path.join(this.collDir, waczFilename);
const streaming = !!this.storage;

const createArgs = [
"create",
"-o",
waczPath,
"--pages",
this.seedPagesFile,
"--extra-pages",
this.otherPagesFile,
"--copy-pages",
"--log-directory",
this.logDir,
];
if (!streaming) {
logger.debug("WACZ will be written to disk", { path: waczPath }, "wacz");
} else {
logger.debug("WACZ will be stream uploaded to remote storage");
}

logger.debug("End of log file in WACZ, storing logs to WACZ file");

await this.closeLog();

const waczOpts: WACZInitOpts = {
input: warcFileList.map((x) => path.join(this.archivesDir, x)),
output: waczPath,
pages: this.pagesDir,
logDirectory: this.logDir,
warcCdxDir: this.warcCdxDir,
indexesDir: this.indexesDir,
softwareString: this.infoString,
};

if (process.env.WACZ_SIGN_URL) {
createArgs.push("--signing-url");
createArgs.push(process.env.WACZ_SIGN_URL);
waczOpts.signingUrl = process.env.WACZ_SIGN_URL;
if (process.env.WACZ_SIGN_TOKEN) {
createArgs.push("--signing-token");
createArgs.push(process.env.WACZ_SIGN_TOKEN);
waczOpts.signingToken = "bearer " + process.env.WACZ_SIGN_TOKEN;
}
}

if (this.params.title) {
createArgs.push("--title");
createArgs.push(this.params.title);
waczOpts.title = this.params.title;
}

if (this.params.description) {
createArgs.push("--desc");
createArgs.push(this.params.description);
}

createArgs.push("-f");

warcFileList.forEach((val) =>
createArgs.push(path.join(this.archivesDir, val)),
);

// create WACZ
const waczResult = await this.awaitProcess(
child_process.spawn("wacz", createArgs, { detached: RUN_DETACHED }),
);

if (waczResult !== 0) {
logger.error("Error creating WACZ", { "status code": waczResult });
logger.fatal("Unable to write WACZ successfully");
waczOpts.description = this.params.description;
}

logger.debug(`WACZ successfully generated and saved to: ${waczPath}`);

// Verify WACZ
/*
const validateArgs = ["validate"];
validateArgs.push("-f");
validateArgs.push(waczPath);
try {
const wacz = new WACZ(waczOpts, this.collDir);
if (!streaming) {
await wacz.generateToFile(waczPath);
}

const waczVerifyResult = await this.awaitProcess(child_process.spawn("wacz", validateArgs));
if (this.storage) {
await this.crawlState.setStatus("uploading-wacz");
const filename = process.env.STORE_FILENAME || "@[email protected]";
const targetFilename = interpolateFilename(filename, this.crawlId);

if (waczVerifyResult !== 0) {
console.log("validate", waczVerifyResult);
logger.fatal("Unable to verify WACZ created successfully");
}
*/
if (this.storage) {
await this.crawlState.setStatus("uploading-wacz");
const filename = process.env.STORE_FILENAME || "@[email protected]";
const targetFilename = interpolateFilename(filename, this.crawlId);
await this.storage.uploadCollWACZ(wacz, targetFilename, isFinished);
return true;
}

await this.storage.uploadCollWACZ(waczPath, targetFilename, isFinished);
return true;
return false;
} catch (e) {
logger.error("Error creating WACZ", e);
if (!streaming) {
logger.fatal("Unable to write WACZ successfully");
}
}

return false;
}

awaitProcess(proc: ChildProcess) {
const stdout: string[] = [];
const stderr: string[] = [];

proc.stdout!.on("data", (data) => {
stdout.push(data.toString());
});

proc.stderr!.on("data", (data) => {
stderr.push(data.toString());
});

return new Promise((resolve) => {
proc.on("close", (code) => {
if (stdout.length) {
logger.debug(stdout.join("\n"));
}
if (stderr.length && this.params.logging.includes("debug")) {
logger.debug(stderr.join("\n"));
}
resolve(code);
});
});
}

logMemory() {
Expand Down Expand Up @@ -2604,7 +2558,7 @@ self.__bx_behaviors.selectMainBehavior();

return new WARCWriter({
archivesDir: this.archivesDir,
tempCdxDir: this.tempCdxDir,
warcCdxDir: this.warcCdxDir,
filenameTemplate,
rolloverSize: this.params.rolloverSize,
gzip,
Expand Down
2 changes: 1 addition & 1 deletion src/util/argParser.ts
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,7 @@ class ArgParser {

generateWACZ: {
alias: ["generatewacz", "generateWacz"],
describe: "If set, generate wacz",
describe: "If set, generate WACZ on disk",
type: "boolean",
default: false,
},
Expand Down
1 change: 1 addition & 0 deletions src/util/logger.ts
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ export const LOG_CONTEXT_TYPES = [
"crawlStatus",
"links",
"sitemap",
"wacz",
"replay",
"proxy",
] as const;
Expand Down
Loading
Loading