Web Archiving Community

Henry Wilkinson edited this page Oct 9, 2024 · 195 revisions

💬 Join us on our new ArchiveBox community chat server: https://Zulip.ArchiveBox.io

🔢 Just getting started and want to learn more about why Web Archiving is important?
     Check out this article: On the Importance of Web Archiving.


The internet archiving community is surprisingly far-reaching and almost universally friendly!

Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, this is my attempt at an index of the entire web archiving community.


The Master Lists

Indexes of archiving institutions and software maintained by other people. If there's anything archivists love doing, it's making lists.


Web Archiving Projects

           

Bookmarking Services

  • Pocket Premium Bookmarking tool that provides an archiving service in their paid version, run by Mozilla
  • Pinboard Bookmarking tool that provides archiving in a paid version, run by a single independent developer
  • Raindrop Bookmarking tool with archiving in their paid version, run by a company est. 2011
  • Instapaper Bookmarking alternative to Pocket/Pinboard (with no archiving)
  • Wallabag / Wallabag.it Self-hostable web archiving server that can import via RSS
  • Shaarli Self-hostable bookmark tagging, archiving, and sharing service
  • ReadWise A paid Pocket/Pinboard alternative that includes article snippet and highlight saving
  • Diigo Another bookmarking/annotation service with archiving as a paid feature

From the Archive.org & Archive-It teams

  • Archive.org The O.G. Wayback Machine provided publicly by the Internet Archive (Archive.org)
  • Archive.it commercial Wayback Machine solution
  • Heritrix The king of internet archiving crawlers, powers the Wayback Machine
  • Brozzler chrome headless crawler + WARC archiver maintained by Archive.org
  • WarcProx WARC proxy recording and playback utility
  • WarcTools utilities for dealing with WARCs
  • Grab-Site An easy preconfigured web crawler designed for backing up websites
  • WPull A pure python implementation of wget with WARC saving
  • More on their GitHub...

From Webrecorder

Webrecorder develops a suite of open source tools to capture websites and replay them later as accurately as possible. Webrecorder also publishes the WACZ file format spec.

  • Browsertrix Fully integrated (self hostable) SaaS web archiving platform
  • ArchiveWeb.page Chrome extension for manual, interactive archiving of websites as you browse the web. Good for capturing high-fidelity complex interactions
  • ReplayWeb.page Web archive viewer that runs entirely in the browser and doesn't require any server-hosted component to view WARC and WACZ files. Also available as a standalone electron app for local desktop use
  • Browsertrix Crawler Command-line crawling application that powers Browsertrix's core crawling features
  • pywb aka Python Wayback, the open source toolkit forked from archive.org for self-hosting your own wayback machine among other web archiving tools
  • warcit Create a WARC file out of a folder full of assets
  • warcio fast streaming asynchronous WARC reader and writer
  • More on their GitHub...

From Rhizome.org (Conifer)


From the Old Dominion University: Web Science Team

  • ipwb A distributed web archiving solution using pywb with IPFS for storage
  • archivenow tool that pushes urls into all the online archive services like Archive.is and Archive.org
  • node-warc Parse And Create Web ARChive (WARC) files with node.js
  • WAIL Web archiver GUI using Heritrix and OpenWayback
  • Squidwarc User-scriptable, archival crawler using Chrome
  • WAIL (Electron) Electron app version of the original wail for creating and interacting with web archives
  • warcreate a Chrome extension for creating WARCs from any webpage
  • More on their GitHub...

From the Archives Unleashed Team


From the IIPC team


Other Public Archiving Services


Other ArchiveBox Alternatives

There are lots more projects listed here too: https://github.com/stars/pirate/lists/internet-archiving

  • Browsertrix + ArchiveWeb.page + ReplayWeb.page Webrecorder's archiving suite has the highest fidelity, and can flawlessly archive YouTube, X, Facebook, and other complex, JS-heavy SPAs
  • SingleFile Web Extension / CLI util for Firefox and Chrome to save a web page as a single HTML file
  • Memex by Worldbrain.io a beautiful, user-friendly browser extension that archives all history with full-text search, annotation support, and more
  • Hypothes.is a web/pdf/ebook annotation tool that also archives content
  • Reminiscence extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
  • Shaarchiver very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
  • Archivy Python-based self-hosted knowledge base embedded into your filesystem
  • Polarized a desktop application for bookmarking, annotating, and archiving articles offline
  • LinkWarden Link archival and curation web app, very similar to ArchiveBox
  • Photon a fast crawler with archiving and asset extraction support
  • Scoop Create high-fidelity WARC/WACZ captures using a playwright browser, with support for signing, media extraction, PDFs, etc. (by the Perma.cc team)

Ones I haven't personally vetted:

  • Shiori Simple bookmark manager + readability archiver built with Go (like a clone of Pocket)
  • Percollate A command-line tool to turn web pages into beautiful, readable PDF, EPUB, or HTML docs.
  • LinkAce A self-hosted bookmark management tool that saves snapshots to archive.org
  • LinkDing Self-hosted bookmark manager that is designed to be minimal, fast, and easy to set up using Docker.
  • LinkWallet A self-hosted bookmark database with full-text page content search and limited archiving features
  • Espial Bookmark manager and search tool with limited archiving features
  • Diskernet Archiving tool that uses the Chrome debugger protocol to save each page as-loaded in the browser (aka 22120 by c0fe or i5ik)
  • Trilium Personal web UI based knowledge-base with web clipping and note-taking
  • Herodotus Django-based web archiving tool with a focus on collecting text-based content
  • Buku Browser-independent bookmark manager CLI written in Python3 and SQLite3
  • ReadableWebProxy A proxying archiver that downloads content from sites and can snapshot multiple versions of sites over time
  • Perkeep "Perkeep lets you permanently keep your stuff, for life."
  • Fossilo A commercial archiving solution that appears to be very similar to ArchiveBox
  • NeonLink Simple self-hosted bookmark management + Benotes note-taking app with limited archiving features
  • Archivematica web GUI for institutional long-term archiving of web and other content
  • Headless Chrome Crawler distributed web crawler built on puppeteer with screenshots
  • WWWofle old proxying recorder software similar to ArchiveBox
  • Erised Super simple CLI utility to bookmark and archive webpages
  • Zotero collect, organize, cite, and share research (mainly for technical/scientific papers & citations)
  • TiddlyWiki Non-linear bookmark and note-taking tool with archiving support
  • Joplin Desktop + mobile app for knowledge-base-style info collection and notes (w/ optional plugin for archiving)
  • Hunchly A paid web archiving / session recording tool designed for OSINT
  • Monolith CLI tool for saving complete web pages as a single HTML file
  • Obelisk Go package and CLI tool for saving a web page as a single HTML file
  • Munin Archiver Social media archiver for Facebook, Instagram and VKontakte accounts.
  • Wayback Archiving in style like ArchiveBox, but with a chat.

Smaller Utilities

Random helpful utilities for web archiving, WARC creation and replay, and more...


Reading List

A collection of blog posts and articles about internet archiving. Contact me or open an issue if you want to add a link here!


Blogs Friends of ArchiveBox


Articles We Like About Internet Archiving

If any of these links are dead, you can find an archived version on https://archive.sweeting.me or https://web.archive.org.
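As a rough illustration of looking up a dead link on https://web.archive.org, the sketch below only builds lookup URLs and makes no network requests. The helper names `wayback_url` and `availability_query` are hypothetical, made up for this example. It assumes the commonly used `https://web.archive.org/web/<timestamp>/<url>` address pattern and the `https://archive.org/wayback/available` lookup endpoint; check the Internet Archive's own documentation before relying on either.

```python
from urllib.parse import urlencode

# Assumed public URL patterns (not defined anywhere on this page).
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_url(original_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine address for original_url (hypothetical helper).

    timestamp is an optional run of 1-14 digits (YYYYMMDDhhmmss or a prefix
    of it). When it is empty, the address carries no timestamp and the
    choice of capture is left to the Wayback Machine.
    """
    if timestamp and not (timestamp.isdigit() and len(timestamp) <= 14):
        raise ValueError("timestamp must be 1-14 digits, e.g. 20240101")
    parts = [WAYBACK_PREFIX]
    if timestamp:
        parts.append(timestamp)
    # The original URL is appended as-is, following the usual address pattern.
    parts.append(original_url)
    return "/".join(parts)


def availability_query(original_url: str, timestamp: str = "") -> str:
    """Build a query URL asking whether a capture exists (hypothetical helper)."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    # urlencode percent-escapes the dead link so it is safe as a query value.
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    dead_link = "https://example.com/some/old/article"
    print(wayback_url(dead_link))
    print(wayback_url(dead_link, "20240101"))
    print(availability_query(dead_link))
```

Opening the first printed address in a browser is usually enough for a one-off lookup; the query form is more convenient when checking many links from a script.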


ArchiveBox-Specific Posts, Tutorials, and Guides

Beware: many of these may be outdated, as ArchiveBox has frequent updates and continual improvement.

ArchiveBox Discussions in News & Social Media


Communities

Most Active Communities


Web Archiving Communities

Follow these technological and organizational archiving hubs for the latest archiving news.


General Archiving Foundations, Coalitions, Initiatives, and Institutes

Find your local archiving group in the list and see how you can contribute!

You can find more organizations and initiatives on these other lists:


ArchiveBox Community Resources

ArchiveBox Chat Rooms

ArchiveBox on Social Media

ArchiveBox on Package Distribution Platforms

