Add official support for taking multiple snapshots of websites over time #179
Looking forward to this feature. Thanks for the hacky workaround as well; I have a few pages I'd like to keep monitoring for new content, but I was worried about my current backup being overwritten by a 404 page if the content went down.
I just updated the README to make the current behavior clearer as well.
Any updates on this? It would be really nice if it were possible to have versions, like the Wayback Machine does :)
You can still accomplish this right now by adding a hash at the end of the URL, e.g.:
archivebox add https://example.com/#2020-08-01
archivebox add https://example.com/#2020-09-01
Official first-class support for multiple snapshots is still on the roadmap, but don't expect it anytime in the next month or two; it's quite a large feature with big implications for how we store and dedupe snapshot data internally.
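As an aside, the hash-date workaround above is easy to script so a scheduler (e.g. a daily cron job) takes a fresh snapshot on every run. A minimal sketch, assuming `archivebox` is on your PATH; the `snapshot_url` helper is hypothetical, not part of ArchiveBox:

```python
import datetime
import subprocess
from typing import Optional

def snapshot_url(url: str, day: Optional[datetime.date] = None) -> str:
    """Append a date fragment so ArchiveBox treats each run as a distinct page."""
    day = day or datetime.date.today()
    return f"{url}#{day.isoformat()}"

def take_snapshot(url: str) -> None:
    # Each invocation archives the same page under a fresh, dated pseudo-URL.
    subprocess.run(["archivebox", "add", snapshot_url(url)], check=True)
```

Calling `take_snapshot("https://example.com/")` from cron then yields at most one new snapshot per day, matching the manual `#2020-08-01` trick above.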
It would also be nice if there were a migration path from the hash-date hack to the first-class support.
Do I get this right? Once this is available I can, for example, add a URL (not a feed). Will it be possible for ArchiveBox to notify me when there are changes, maybe by using the local MTA?
Scheduled archiving will not re-archive the initial page if snapshots already exist. At the moment ArchiveBox has no first-class support for taking multiple snapshots and no built-in diffing system; only the hash workaround above is available.
Thanks for the quick answer and the very cool application! I already run an ArchiveBox instance on my FreeNAS, and it fits the purpose perfectly. Having the feature described above would be a nice extra; I asked because something like diffs is mentioned on the archivebox.io website itself. BTW: I hope ArchiveBox will end up in the FreeNAS/TrueNAS plugins section some time. Having ArchiveBox available there with one or two clicks would be very nice.
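Since there is no built-in diffing or notification, here is a rough external sketch of what the ask would look like: compare two archived copies of a page and mail the diff through the local MTA. The file paths and the `sendmail` invocation are assumptions for illustration, not ArchiveBox APIs:

```python
import difflib
import pathlib
import subprocess

def diff_snapshots(old: pathlib.Path, new: pathlib.Path) -> str:
    """Return a unified diff between two archived copies of the same page."""
    old_lines = old.read_text(errors="replace").splitlines(keepends=True)
    new_lines = new.read_text(errors="replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        old_lines, new_lines, fromfile=str(old), tofile=str(new)))

def notify_if_changed(old: pathlib.Path, new: pathlib.Path, rcpt: str) -> bool:
    """If the snapshots differ, mail the diff through the local MTA."""
    diff = diff_snapshots(old, new)
    if not diff:
        return False
    # 'sendmail' is provided by most local MTAs (postfix, exim, etc.).
    subprocess.run(["sendmail", rcpt], input=diff.encode(), check=True)
    return True
```

This only catches textual changes in the saved HTML; a real feature inside ArchiveBox would presumably diff at the Snapshot level instead.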
This is now added in v0.6. It's not full support, but it's a step in the right direction: I just added a UI button for re-snapshotting a URL. Then later, when we add real multi-snapshot support, we can automatically migrate all the Snapshots with timestamps in their hashes to the new system.
Sometimes websites remove pages and redirect them to something completely different. An example: if you request the original URL for the Xbox 360 sub-page on xbox.com these days, I think you'll get redirected to the Xbox One S page, since that is now the "this is old-ish and it's cheap-ish, have this instead" product. Try it for yourself: Redirects at the time of writing to: Not sure if the URL sends an HTTP status code along with the redirect.

Also, I would be VERY careful about dropping URLs from the automated re-archival process after too many failures. It's not uncommon for a site to go missing for months and then come back. I'm not talking about the leagues of Microsoft, but fan sites, hobby projects, niche software developers who do it in their spare time and caught a lapsed domain registration a bit late, etc. All sorts of transient failures can be thrown at you, and a simple knockout after 3 errors would lead to silent discontinuation of archiving for something that's only temporarily gone. Maybe asking for user confirmation, at least per domain, would be the best approach:
Edit: Microsoft does send a 301 Moved Permanently with the redirect. That's kind of them, but I'm not sure how much we can rely on this in the real world. Anyone with ample experience in this?
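On the 301 question: you can check what status a URL returns without following the redirect, using only the standard library. A sketch (the helper names are mine, not ArchiveBox's):

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib to raise HTTPError instead of following.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def redirect_status(url):
    """Return (status_code, Location header or None) for the first response."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=10)
        return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")
```

A 301 here is a strong hint the page moved permanently; a 302/307, or a 200 that serves entirely different content, is much harder to classify automatically, which is exactly why a silent knockout after a few errors is risky.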
Thanks for the support @agnosticlines, I got your donation! <3 (All the donation info is here for future reference: https://github.com/ArchiveBox/ArchiveBox/wiki/Donations) This is still high on my priority list, but development speed is slow these days; I only have a day or so per month to dedicate to this project, and most of it is taken up by bugfixes. Occasionally I have a month where I sprint and do a big release, but I can't make promises on the timeline for this particular feature.
This is by far the most requested feature.
People want an easy way to take multiple snapshots of websites over time.
For people finding this issue via Google or incoming links: if you want a hacky way to take a second snapshot of a site, add the link with a new hash appended. It will be treated as a new page, and a new snapshot will be taken:
Edit: as of v0.6 there is now a button in the UI to do this ^