Scientific Semantic Search

A simple semantic search engine for scientific papers. Check out our demo here.

Installation

This repository requires Python 3.7 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.

Installing the library and dependencies

If you don't plan on modifying the source code, install directly from GitHub using pip:

pip install git+https://github.com/PathwayCommons/semantic-search.git

Otherwise, clone the repository locally and install it in editable mode:

git clone https://github.com/PathwayCommons/semantic-search.git
cd semantic-search
pip install --editable .

Finally, if you would like to take advantage of a CUDA-enabled GPU, you must also install PyTorch with CUDA support by following the instructions for your system here.

Usage

To start up the server:

uvicorn semantic_search.main:app

During development, pass the --reload flag to make the server reload whenever the source code changes.

To provide arguments to the server, pass them as environment variables, e.g.:

CUDA_DEVICE=0 MAX_LENGTH=384 uvicorn semantic_search.main:app
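If you launch the server from a Python script rather than a shell, the same settings can be passed through the child process environment. The following is a minimal sketch, not part of this repository: the helper names server_env and start_server are hypothetical, and only the CUDA_DEVICE and MAX_LENGTH variables and the uvicorn command come from the instructions above.

```python
import os
import subprocess


def server_env(cuda_device=None, max_length=None, base_env=None):
    """Return a copy of the environment with the server settings applied.

    Hypothetical helper: only CUDA_DEVICE and MAX_LENGTH are taken from the
    README; settings left as None are not added to the environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    if cuda_device is not None:
        env["CUDA_DEVICE"] = str(cuda_device)
    if max_length is not None:
        env["MAX_LENGTH"] = str(max_length)
    return env


def start_server(env):
    """Launch uvicorn as a child process; the caller must terminate it."""
    return subprocess.Popen(["uvicorn", "semantic_search.main:app"], env=env)


# Example (requires uvicorn and this package to be installed):
# process = start_server(server_env(cuda_device=0, max_length=384))
# ...
# process.terminate()
```

Environment variable values are always strings, so the sketch converts the numbers before handing them to the child process.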

Once the server is running, you can make a POST request to the /search endpoint with a JSON body, for example:

{
  "query": {
    "uid": "9887103",
    "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development."
  },
  "documents": [
    {
      "uid": "10320478",
      "text": "Drosophila dSmad2 and Atr-I transmit activin/TGFbeta signals. "
    },
    {
      "uid": "22563507",
      "text": "R-Smad competition controls activin receptor output in Drosophila. "
    },
    {
      "uid": "18820452",
      "text": "Distinct signaling of Drosophila Activin/TGF-beta family members. "
    },
    {
      "uid": "10357889"
    },
    {
      "uid": "31270814"
    }
  ],
  "top_k": 3
}

The response is a JSON array of the top_k most similar documents (top_k defaults to 10):

[
  {
    "uid": "10320478",
    "score": 0.6997108459472656
  },
  {
    "uid": "22563507",
    "score": 0.6877762675285339
  },
  {
    "uid": "18820452",
    "score": 0.6436074376106262
  }
]

If "text" is not provided for an entry, we assume its "uid" is a valid PMID and fetch the title and abstract text before embedding, indexing, and searching.

  • Notes on optional parameters
    • top_k: A positive integer (default is 10) that limits the search results to the top_k most similar neighbours (articles).
    • docs_only: A boolean (default is false) that instructs the service to return scores for the provided documents. If true, top_k is disregarded.
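The request described above can be sent from Python using only the standard library. This is a minimal sketch under stated assumptions: the server is reachable at http://127.0.0.1:8000 (the address used in the Documentation section), and the helper names make_document, build_payload, and search are hypothetical, not part of this repository.

```python
import json
import urllib.request

# Assumption: the server is running locally on port 8000.
SEARCH_URL = "http://127.0.0.1:8000/search"


def make_document(uid, text=None):
    """Build one query or document entry; "text" is omitted when not given."""
    doc = {"uid": str(uid)}
    if text is not None:
        doc["text"] = text
    return doc


def build_payload(query, documents, top_k=None, docs_only=None):
    """Assemble the JSON body; optional parameters are sent only when set."""
    payload = {"query": query, "documents": list(documents)}
    if top_k is not None:
        payload["top_k"] = top_k
    if docs_only is not None:
        payload["docs_only"] = docs_only
    return payload


def search(payload, url=SEARCH_URL):
    """POST the payload to /search and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running server):
# query = make_document("9887103", "The Drosophila activin receptor baboon ...")
# documents = [make_document("10357889"), make_document("31270814")]
# for hit in search(build_payload(query, documents, top_k=3)):
#     print(hit["uid"], hit["score"])
```

Leaving top_k and docs_only out of the body lets the service apply its own defaults.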

Running via Docker

Setup

If you intend to use a CUDA-enabled GPU, you must also install the NVIDIA Container Toolkit on the host following the instructions for your system here.

For Ubuntu 18.04:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list |\
    sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install nvidia-container-runtime

Restart Docker:

sudo systemctl stop docker
sudo systemctl start docker

Check your installation:

docker run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi

Running a container

First, build the docker image:

docker build -t semantic-search .

Then run it, replacing <PORT> with the host port you want to expose:

docker run -it -p <PORT>:8000 semantic-search

To run on a CUDA-enabled GPU:

docker run --gpus all -dt --rm --name semantic_container -p 8000:8000 --env CUDA_DEVICE=0 --env MAX_LENGTH=384 semantic-search:latest
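Because the container above is started in detached mode, a script that calls the service may need to wait until the server responds. The following is a minimal sketch that polls the /redoc documentation page until it returns HTTP 200. It assumes the container publishes host port 8000 as in the command above; wait_for_server is a hypothetical helper, not part of this repository.

```python
import time
import urllib.request


def wait_for_server(url="http://127.0.0.1:8000/redoc", timeout=60.0, interval=1.0):
    """Poll url until it returns HTTP 200 or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, reset, timed out, or an HTTP error status:
            # the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example:
# if not wait_for_server():
#     raise RuntimeError("semantic-search did not start in time")
```

Polling an HTTP route rather than the bare TCP port avoids treating the container as ready when Docker is already accepting connections on the published port but the application inside has not finished starting.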

Documentation

With the web server running, open http://127.0.0.1:8000/redoc in your browser for the API documentation.

For contributing guidelines, see CONTRIBUTING.md.