Suggestions for /webdata endpoint, support for 'open' warcs #3

Open
ikreymer opened this issue Dec 12, 2016 · 3 comments

Comments

@ikreymer

I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder.

One issue I see is how long the file list returned from /webdata should be considered valid, and what happens if files change between a request to /webdata and a request to retrieve a WARC file.

It might be useful to include at least a timestamp in the response to indicate when the /webdata listing was retrieved.

One possible workaround is to also include a validUntil timestamp that should guarantee that the WARCs listed are available through that time, and that after this time, a user should not trust the /webdata listing. For example, if a user did not retrieve all the WARCs by that time, they should query /webdata again to get a more updated listing.

However, this may not be possible to guarantee in a general crawler-based system: for example, a new WARC could be added to the collection a few seconds after /webdata was called, making the file listing out of date anyway.
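To make the timestamp and validUntil idea concrete, here is a minimal client-side sketch. It assumes the listing carries a top-level validUntil field in ISO 8601 UTC form (the field placement and the timestamp format are assumptions, nothing here is part of an agreed API), and the helper names are made up for illustration.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as 2016-12-12T20:00:00Z (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def listing_is_fresh(listing, now=None):
    """Return True while the /webdata listing may still be trusted."""
    now = now or datetime.now(timezone.utc)
    valid_until = listing.get("validUntil")  # hypothetical top-level field
    if valid_until is None:
        # No guarantee was offered, so the caller should query /webdata again.
        return False
    return now <= parse_ts(valid_until)
```

A client that has not finished downloading by the time listing_is_fresh returns False would query /webdata again rather than continue with the old listing.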

Another idea is to have an 'open' WARC type, something like:

    "files": [
        {
            "content-type": "application/warc",
            "filename": "2016-08-30-blah.warc.gz",
            "type": "open",
            "lastModified": "...",
            "size": 2000,
            "locations": [
                "http://webrecorder.io/api/wasapi/v0/...blah.warc.gz"
            ]
        }
    ]

By adding "type": "open", the system indicates that this WARC is still being written to, and may change between the time of the /webdata call and the time it is retrieved. Since the WARC may be changing, the checksum is not included here, but the size is, and the downloaded file should be at least the specified size. A lastModified field is included to indicate when this WARC file was last updated (this could be useful to add to all WARC files).
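A client could apply the "at least the specified size" rule roughly as follows. This is only a sketch: the checksum field name and the choice of a hex SHA-256 digest are assumptions for illustration, since the thread does not say how checksums are represented.

```python
import hashlib


def verify_download(entry, data):
    """Check downloaded bytes against a /webdata file entry."""
    if entry.get("type") == "open":
        # An open WARC may have grown since the listing, so the listed
        # size is only a lower bound and no checksum is expected.
        return len(data) >= entry["size"]
    # Closed WARC: the size must match exactly.
    if len(data) != entry["size"]:
        return False
    expected = entry.get("checksum")  # hypothetical field: hex SHA-256 digest
    if expected is not None:
        return hashlib.sha256(data).hexdigest() == expected
    return True
```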

This would address the Webrecorder use case where users may be actively recording when the /webdata call is made, and therefore the exact size and checksum may change between this query and the actual download. This would be useful to any system that allows live updating of the archive.

A more difficult issue is how to deal with systems, such as Webrecorder, which are not simply additive but allow users to delete or modify collections. For example, in Webrecorder, a user could delete a recording (specified by one or more WARCs) within a collection.
In such a case, I suppose the WARC download should return 404 immediately.

Alternatively, if the validUntil timestamp is used, the API could "freeze" the listed WARCs until the expiration time, allowing API users to download the WARCs exactly as they were at the time of the query (this may be a bit more complex).

Ideally, the simplest approach would be taken, which is probably to allow some form of open warcs and handle deletion as a 404.
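From the client's side, treating deletion as a 404 might look like the sketch below. It uses only the Python standard library; the function names and the injectable opener parameter are made up for illustration, and the entry fields follow the example listing above.

```python
import urllib.error
import urllib.request


def fetch_warc(url, opener=urllib.request.urlopen):
    """Return the WARC bytes, or None if the server answers 404 (deleted)."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is a real failure


def download_listing(files, opener=urllib.request.urlopen):
    """Download every listed WARC and report the ones that have disappeared."""
    downloaded, missing = {}, []
    for entry in files:
        for url in entry["locations"]:
            data = fetch_warc(url, opener)
            if data is not None:
                downloaded[entry["filename"]] = data
                break
        else:
            # Every location returned 404: the WARC was deleted after listing.
            missing.append(entry["filename"])
    return downloaded, missing
```

A non-empty missing list tells the caller that the listing is stale and that /webdata should be queried again.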

@ikreymer
Author

To put it more succinctly, I think there are two main options that could be implemented:

  • Support for open WARCs which may change between the call to /webdata and the download call.

OR

  • Support WARC caching and a validUntil timestamp, where any open WARC could be cached at the time of the query and kept around through the validUntil timestamp. This is a bit more heavy duty to implement, and could result in a user not getting the latest version of a WARC, but could still support a checksum for every WARC.

It would be useful to have a lastModified field for all WARCs regardless.

@nlevitt

nlevitt commented Dec 13, 2016

I expressed my thoughts to Ilya on iipc slack:

my inclination is to keep it simple
i would advocate /webdata return truth at the time of the query
no guarantee that the files won't be deleted before you try to download them
and i'm not sure i see the need to support .open files in this api
if you have a use case for that, maybe describe it on the issue?

I'm not opposed to lastModified.

@ikreymer
Author

ikreymer commented Dec 13, 2016

The use case for open files is that Webrecorder generally keeps files open until they have been idle for some period of time (an internal setting), and a user may add to any collection or recording at any time. Since replay is available immediately, the download should also reflect what the user can access.

The issue is mostly with specifying the checksum in the /webdata response, because the checksum can change between the time the WARC was listed and the time the user starts downloading it.

I guess the simplest solution is just making the checksum optional, and maybe indicating that the WARC is 'open' (in the process of being written to). I think this would solve this issue without adding extra complexity (like taking snapshots).
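On the server side, the "optional checksum plus open marker" idea could be sketched like this. The describe_warc helper, the checksum field name, the SHA-256 digest, and the timestamp format are all assumptions for illustration; how a real service decides that a file is still open is left to the caller.

```python
import hashlib
import os
from datetime import datetime, timezone


def describe_warc(path, location, is_open):
    """Build one /webdata file entry for a WARC on local disk."""
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, timezone.utc)
    entry = {
        "content-type": "application/warc",
        "filename": os.path.basename(path),
        "lastModified": modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "size": stat.st_size,
        "locations": [location],
    }
    if is_open:
        # Still being written to: mark it and leave the checksum out.
        entry["type"] = "open"
    else:
        with open(path, "rb") as handle:
            # hypothetical field; reads the whole file for brevity
            entry["checksum"] = hashlib.sha256(handle.read()).hexdigest()
    return entry
```

Closed WARCs keep a checksum, so existing consumers lose nothing, while open WARCs advertise only a size and lastModified.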
