DAIRE (Deep Archival Image Retrieval Engine) is an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. More details can be found in our paper:
- Tobi Adewoye, Xiao Han, Nick Ruest, Ian Milligan, Samantha Fritz, and Jimmy Lin. Content-Based Exploration of Archival Images Using Neural Networks. Proceedings of the 20th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2020), August 2020.
A live demo is available at http://daire.cs.uwaterloo.ca/, running on images from the "EnchantedForest" neighborhood of GeoCities.
This repo holds the code that runs that demo.
If you haven't set up the Archives Unleashed Toolkit, follow the instructions here.
Use the Toolkit to extract image information and place the parquet files in data/images/:
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
--extractor ImageInformationExtractor --input /path/to/warcs/* \
--output /path/to/daire/data/images --output-format parquet
Use the Toolkit to extract the image graph and place the parquet files in data/imagegraph/:
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
--extractor ImageGraphExtractor --input /path/to/warcs/* \
--output /path/to/daire/data/imagegraph --output-format parquet
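Before moving on, it can help to confirm that both extraction jobs actually wrote parquet files. The following is a minimal sketch, not part of the DAIRE repo: the helper names `count_parquet_files` and `check_extraction_output` are hypothetical, and it assumes Spark wrote its usual `*.parquet` part files somewhere under `data/images/` and `data/imagegraph/`.

```python
from pathlib import Path


def count_parquet_files(directory):
    """Return the number of *.parquet files found under `directory` (recursively)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("*.parquet"))


def check_extraction_output(base="data"):
    """Map each expected output directory name to its parquet file count."""
    return {
        name: count_parquet_files(Path(base) / name)
        for name in ("images", "imagegraph")
    }


if __name__ == "__main__":
    # Run from the repository root so that "data" resolves correctly.
    for name, count in check_extraction_output().items():
        status = "ok" if count > 0 else "missing"
        print(f"data/{name}: {count} parquet file(s) [{status}]")
```

A count of zero for either directory suggests the corresponding spark-submit job wrote to a different path or did not finish.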
Install DAIRE dependencies:
pip install -r requirements.txt
Set save_image=True in script/extract-all-parquets-multi.py and run the scripts:
python script/extract-all-parquets-multi.py
python script/extract-parquets-url-to-name.py
This will save the images to img/ and generate full_info.txt and url_to_name.txt, along with some intermediate files.
Generate the HNSW index:
python script/index-hnsw.py
The resulting index will be saved in bin/<index_number>.bin and bin/<index_number>.txt.
In future runs, you can load from an index as follows:
python script/index-hnsw.py <index_number>
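For context, an HNSW index answers approximate nearest-neighbor queries over image feature vectors. The sketch below is not the code in script/index-hnsw.py. It is a brute-force exact search that can serve as a sanity check on a small sample. It assumes cosine similarity as the metric (the metric DAIRE uses is not stated here), and the function names and toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Return the k (name, score) pairs most similar to `query`, best first.

    `vectors` maps an image name to its feature vector. Every vector is
    scored, so the cost grows linearly with the collection size; this is
    the cost an approximate index such as HNSW is meant to avoid.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "features"; real feature vectors are much longer.
    toy = {
        "a.jpg": [1.0, 0.0, 0.0],
        "b.jpg": [0.9, 0.1, 0.0],
        "c.jpg": [0.0, 1.0, 0.0],
    }
    for name, score in exact_top_k([1.0, 0.0, 0.0], toy, k=2):
        print(name, round(score, 3))
```

Comparing the exact top-k against what the index returns for the same query gives a quick recall estimate on a subset of the collection.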
The front-end is built with TypeScript and React. To make changes, follow the steps in the ui/ directory here.
Finally, start up the Flask server:
python server.py
How do I scale HNSW to more images (10^6, 10^7, 10^8)? See the discussion in this GitHub issue.
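As a rough starting point for that question, the in-memory size of an HNSW index can be estimated from the vector count. This is a back-of-envelope sketch only: it assumes float32 vectors and a commonly cited rule of thumb of roughly 8 bytes of graph links per unit of the `M` parameter per element. The dimension of 512 and `M=16` are placeholder values, not DAIRE's actual settings, and the function name is hypothetical.

```python
def estimate_hnsw_bytes(num_vectors, dim, m=16, bytes_per_float=4, link_bytes_per_m=8):
    """Rough in-memory size of an HNSW index, in bytes.

    Per element: the raw vector (dim * bytes_per_float) plus graph links
    (m * link_bytes_per_m). Both defaults are assumptions; real
    implementations add some per-element overhead on top.
    """
    vector_bytes = dim * bytes_per_float
    link_bytes = m * link_bytes_per_m
    return num_vectors * (vector_bytes + link_bytes)


if __name__ == "__main__":
    # dim=512 is a placeholder; substitute the real feature dimension.
    for n in (10**6, 10**7, 10**8):
        gib = estimate_hnsw_bytes(n, dim=512) / 2**30
        print(f"{n:>11,} vectors: ~{gib:.1f} GiB")
```

Under these assumptions the estimate grows linearly with the number of images, so the larger collection sizes are the ones most likely to need a high-memory machine, lower-dimensional features, or a sharded index.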