Suggestions for /webdata endpoint, support for 'open' warcs #3
Comments
To put it more succinctly, I think there are two main options that could be implemented:
OR
It would be useful to have a
I expressed my thoughts to Ilya on iipc slack:
I'm not opposed to lastModified.
The use case for having open files is that Webrecorder generally keeps files open until they have been idle for some period of time (an internal setting), and a user may add to any collection or recording at any time. Since the replay is available immediately, the download should also reflect what the user can access. The issue is mostly with specifying the I guess the simplest solution is just making the
I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder.
One issue that I see is how long the file list returned from `/webdata` should be considered valid, and what happens if files change between a request to `/webdata` and a request to retrieve the WARC file.
It might be useful to include at least a `timestamp` in the response to indicate at what time the `/webdata` listing was retrieved.
One possible workaround is to also include a `validUntil` timestamp that should guarantee that the WARCs listed are available through that time, and that after this time, a user should not trust the `/webdata` listing. For example, if a user did not retrieve all the WARCs by that time, they should query `/webdata` again to get a more updated listing.
However, this may not be possible to guarantee in a general crawler-based system: for example, a new WARC could be added to the collection a few seconds after `/webdata` was called, making the file listing out of date anyway.
Another idea is to have an 'open' WARC type, something like:
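As a rough illustration only, a hypothetical entry of this kind is sketched below as a Python dict serialized to JSON. Only `type`, `size` and `lastModified` are named in this issue; the `filename` key and every value are made up for the example.

```python
import json

# Hypothetical /webdata entry for a WARC that is still being written to.
# Only "type", "size" and "lastModified" are named in this issue; the
# "filename" key and all values are illustrative assumptions.
open_warc_entry = {
    "filename": "rec-example.warc.gz",
    "type": "open",
    "size": 1048576,  # lower bound in bytes at listing time
    "lastModified": "2017-03-01T12:00:00+00:00",
    # No checksum field: the file may still change.
}

print(json.dumps(open_warc_entry, indent=2))
```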
By adding `"type": "open"`, the system indicates that this WARC is still being written to, and may change between the time of the `/webdata` call and the time it is retrieved. Since the WARC may be changing, the checksum is not included here, but the `size` is, and the size should be at least the specified size when downloaded. A `lastModified` field is included to indicate when this WARC file was last updated (this could be useful to add to all WARC files).
This will address the Webrecorder use case where users may be actively recording when the `/webdata` call is made, and therefore the exact size and checksum may change between this query and the actual download. This would be useful to any system that allows live updating of the archive.
A more difficult issue is how to deal with systems, such as Webrecorder, which are not simply additive but allow users to delete or modify collections. For example, in Webrecorder, a user could delete a recording (specified by one or more WARCs) within a collection.
In such a case, I suppose the WARC download should return 404 immediately.
Alternatively, if the `validUntil` timestamp is used, the API could "freeze" that particular listing until the expiration time, allowing API users to download the WARCs exactly as they were at that time (this may be a bit more complex).
Ideally, the simplest approach would be taken, which is probably to allow some form of open WARCs and handle deletion as a 404.
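A minimal client-side sketch of that simplest approach, together with the `validUntil` check discussed earlier, might look like the following. It assumes ISO 8601 timestamps, a simplified single `checksum` field holding a SHA-1 hex digest, and the hypothetical helper names `listing_expired` and `check_download`; none of these are defined by the API discussed in this issue.

```python
import hashlib
from datetime import datetime, timezone


def listing_expired(listing: dict, now: datetime) -> bool:
    """Return True once the (hypothetical) validUntil timestamp has passed."""
    valid_until = datetime.fromisoformat(listing["validUntil"])
    return now >= valid_until


def check_download(entry: dict, status: int, body: bytes) -> str:
    """Classify one download attempt against its /webdata entry."""
    if status == 404:
        # The WARC was deleted after the listing was produced.
        return "deleted"
    if entry.get("type") == "open":
        # Open WARCs may have grown: no checksum, size is only a lower bound.
        return "ok" if len(body) >= entry["size"] else "truncated"
    # Closed WARCs: size and checksum must match exactly.
    # SHA-1 hex in a single "checksum" key is an assumption of this sketch.
    digest = hashlib.sha1(body).hexdigest()
    if len(body) == entry["size"] and digest == entry["checksum"]:
        return "ok"
    return "mismatch"
```

A caller would re-query `/webdata` when `listing_expired` returns True, and drop an entry from its local copy of the listing when `check_download` reports `"deleted"`.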