Ingesting CORD-19 into Solr and Elasticsearch

This document describes how to ingest the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI into Solr and Elasticsearch. If you want to build or download Lucene indexes for CORD-19, see this guide.

Getting the Data

Follow the instructions here to get access to the data. This version of the guide has been verified to work with the CORD-19 release of 2020/05/26.

Solr + Blacklight

From the Solr archives, download the Solr binary release (not the -src archive) that matches Anserini's Lucene version to the anserini/ directory.

Extract the archive:

mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1

Start Solr (adjust memory usage with -m as appropriate):

solrini/bin/solr start -c -m 8G

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd

Solr should now be available at http://localhost:8983/ for browsing.
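
When scripting these steps, it can help to wait until Solr actually answers before creating the collection. The following is a minimal sketch, not part of Anserini: `wait_for_url` is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() (
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: fail on HTTP errors, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)

# Example (needs a running Solr, so it is left commented out):
# wait_for_url http://localhost:8983/ 30 2 && echo "Solr is up"
```

The function returns a non-zero status if Solr never responds, so it can gate the `solr create` step in a script.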

Next, create the collection:

solrini/bin/solr create -n anserini -c cord19

Adjust the schema (if this fails with field-conflict errors, follow the reset instructions below, then rerun this command):

curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/cord19.json \
 http://localhost:8983/solr/cord19/schema

Note: if there are errors from field conflicts, you'll need to reset the configset and recreate the collection (select [All] for the fields to replace):

solrini/bin/solr delete -c cord19
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
solrini/bin/solr create -n anserini -c cord19

We can now index into Solr:

DATE=2020-05-26
DATA_DIR=./collections/cord19-"${DATE}"

sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
   -threads 8 -input "${DATA_DIR}" \
   -solr -solr.index cord19 -solr.zkUrl localhost:9983 \
   -storePositions -storeDocvectors -storeContents -storeRaw
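
Before starting a long indexing run, it may be worth confirming that `DATA_DIR` points at a real, non-empty directory. This is a small sketch under that assumption; `check_data_dir` is a hypothetical helper and does not validate the contents of the CORD-19 release itself.

```shell
# Hypothetical pre-flight check: succeed only if the directory exists and is non-empty.
check_data_dir() (
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  # ls -A lists every entry except . and ..; empty output means an empty directory
  if [ -z "$(ls -A "$dir")" ]; then
    echo "empty directory: $dir" >&2
    return 1
  fi
  return 0
)

DATE=2020-05-26
DATA_DIR=./collections/cord19-"${DATE}"
check_data_dir "${DATA_DIR}" || echo "fix DATA_DIR before running IndexCollection"
```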

Once indexing is complete, you can query in Solr at http://localhost:8983/solr/#/cord19/query.
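
The same check can be made from the command line through Solr's standard `select` handler. The sketch below only builds the request URL for a match-all query with `rows=0`, which asks Solr for the document count without returning any documents; `solr_select_url` is a hypothetical helper, and the host and port are the defaults used in this guide.

```shell
# Hypothetical helper: build a select URL that returns only the hit count.
solr_select_url() {
  # $1: collection name, $2: query string (assumed to be URL-safe already)
  printf 'http://localhost:8983/solr/%s/select?q=%s&rows=0\n' "$1" "$2"
}

# Print the URL for a match-all query against the cord19 collection.
solr_select_url cord19 '*:*'

# To send the request once Solr is running:
# curl -s "$(solr_select_url cord19 '*:*')"
```

The `numFound` value in the JSON response gives the number of indexed documents.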

Next, we can stand up an instance of Blacklight to provide a nice search interface; this is exactly the same instance that runs our basic (non-neural) Covidex.

To begin, ensure that you have Ruby 2.6.5+ and Ruby on Rails 6.0+ installed.

Once the appropriate versions of Ruby and Ruby on Rails are installed, navigate to a directory outside of Anserini and clone the Gooselight2 repository:

cd ..
git clone https://github.com/castorini/gooselight2.git

Then navigate into the gooselight2/covid directory and run the following commands. You may need to change the Ruby version in the Gemfile to match your installed version. If rails db:migrate fails with a yarn error, run yarn install --check-files to update yarn, then run it again:

bundle install
rails db:migrate
rails s

The Rails application should now be available at http://localhost:3000

Elasticsearch + Kibana

From the Elasticsearch site, download the correct distribution for your platform to the anserini/ directory. First, unpack and deploy Elasticsearch:

mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1
elastirini/bin/elasticsearch

Unpack and deploy Kibana:

tar -zxvf kibana*.tar.gz -C elastirini --strip-components=1
elastirini/bin/kibana

Set up the proper schema using this config:

cat src/main/resources/elasticsearch/index-config.cord19.json \
 | curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/cord19' -d @-
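
To confirm that the index was created before indexing into it, one option is to request the index and inspect the HTTP status code. This is a sketch assuming the same host, port, and credentials as the command above; `es_index_exists` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if Elasticsearch answers HTTP 200 for the index.
es_index_exists() {
  # -o /dev/null drops the body; -w prints only the HTTP status code
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    --user elastic:changeme "localhost:9200/$1")
  [ "$status" = "200" ]
}

# Example (needs a running Elasticsearch, so it is left commented out):
# es_index_exists cord19 && echo "cord19 is ready for indexing"
```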

Indexing (Abstract, Full-Text, Paragraph):

sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
 -es -es.index cord19 -threads 8 -input path/to/cord19 -storePositions -storeDocvectors -storeContents -storeRaw

sh target/appassembler/bin/IndexCollection -collection Cord19FullTextCollection -generator Cord19Generator \
 -es -es.index cord19 -threads 8 -input path/to/cord19 -storePositions -storeDocvectors -storeContents -storeRaw

sh target/appassembler/bin/IndexCollection -collection Cord19ParagraphCollection -generator Cord19Generator \
 -es -es.index cord19 -threads 8 -input path/to/cord19 -storePositions -storeDocvectors -storeContents -storeRaw
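
The three commands above differ only in the `-collection` class. As a convenience sketch, the mapping can be captured in a small function that prints the command for a given variant instead of running it. The variant names (`abstract`, `full-text`, `paragraph`) and both function names are illustrative assumptions, not Anserini options.

```shell
# Hypothetical helper: map a variant name to the collection class used above.
cord19_collection() {
  case "$1" in
    abstract)  echo Cord19AbstractCollection ;;
    full-text) echo Cord19FullTextCollection ;;
    paragraph) echo Cord19ParagraphCollection ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the indexing command for one variant.
# $1: variant name, $2: path to the CORD-19 release
print_index_command() {
  collection=$(cord19_collection "$1") || return 1
  echo "sh target/appassembler/bin/IndexCollection -collection ${collection} -generator Cord19Generator" \
       "-es -es.index cord19 -threads 8 -input $2" \
       "-storePositions -storeDocvectors -storeContents -storeRaw"
}

print_index_command full-text path/to/cord19
```

Printing the command first makes it easy to review the arguments before starting the indexing run.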

We are now able to get visualizations from Kibana at http://localhost:5601

Navigating Kibana

First, from the Kibana home tab, connect to the Elasticsearch index that we created above (or go to Settings tab -> Kibana -> Index Patterns -> Create Index Patterns).

Provide the index pattern cord19*, and use publish_time as the Time Filter field name. You can optionally give this index pattern a custom ID.

Then you can navigate to the Discover tab to run text-based searches, or to the Visualize tab to create diagrams and charts.

Replication Log

  • Confirmed by @adamyy on 2020-05-29 (commit 2947a16) that these instructions work for CORD-19 release of 2020/05/26
  • Confirmed by @yxzhu16 on 2020-07-17 (commit fad12be) that these instructions work for CORD-19 release of 2020/06/19
  • Confirmed by @LizzyZhang-tutu on 2020-07-26 (commit fad12be) that these instructions work for CORD-19 release of 2020/07/25