Save user uploads as WACZs #3679

Draft
wants to merge 1 commit into
base: develop

@bensteinberg (Contributor) commented Dec 16, 2024

This is a first cut at preserving user uploads as WACZ files rather than WARCs. I'm not sure which Linear ticket is the best one to link here. I'm making this a draft for now.

This works, in the sense that it produces a valid WACZ that replayweb.page can play back, but it does not yet play back in Perma. The error message, from wabac.js, is, for example:

Archived Page Not Found

Sorry, this page was not found in this archive:

file:///8MXD-LZ6V/upload.jpg?version=040666551871161394
...

(That page is in fact in the archive.)

I'm guessing I need a slightly different set of options to py-wacz.

@bensteinberg marked this pull request as draft December 16, 2024 15:16

@bensteinberg (Contributor, Author) commented

Apart from getting this to work, the main question is what the metadata should look like.

@bensteinberg (Contributor, Author) commented

I think the problem here is that the CDX index is broken: the entries for file:/// URIs are being rewritten so that consecutive slashes are reduced to one. I think this is down to cdxj_indexer just using surt on the URL, rather than doing what warcio does with getSurt() in the Scoop context, which returns non-HTTP(S) URLs unchanged:

    if (!url.startsWith("https:") && !url.startsWith("http:")) {
      return url;
    }

I'm going to see if I can demonstrate this. If this is the problem, I'm not sure where to make a fix.
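
A minimal Python sketch of how the suspected failure mode, and the guard quoted above, might be demonstrated. Everything here is an assumption for illustration: `collapse_slashes` is a hypothetical stand-in that only imitates the symptom (it is not the surt library and makes no claim about what surt actually does), and `index_key` is a hypothetical helper that mirrors the http/https check from the snippet.

```python
import re
from typing import Callable


def collapse_slashes(url: str) -> str:
    """Hypothetical stand-in for the suspected rewrite: collapse runs of '/' into one.

    This is NOT the surt library; it only imitates the symptom described above.
    """
    return re.sub(r"/{2,}", "/", url)


def index_key(url: str, canonicalize: Callable[[str], str]) -> str:
    """Hypothetical helper: canonicalize only http(s) URLs, pass others through."""
    if not url.startswith(("http:", "https:")):
        return url
    return canonicalize(url)


if __name__ == "__main__":
    url = "file:///8MXD-LZ6V/upload.jpg"
    # Without the guard, the file:/// prefix is damaged.
    print(collapse_slashes(url))
    # With the guard, the URL is used as the key unchanged.
    print(index_key(url, collapse_slashes))
```

If the real index does show the damaged form of the key, a lookup for the original file:/// URL would miss, which would be consistent with the "Archived Page Not Found" error.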
