Commit

Update documentation (#170)
caufieldjh authored Aug 9, 2023
2 parents 2e335a2 + afc6b03 commit ef28de1
Showing 3 changed files with 421 additions and 223 deletions.
144 changes: 80 additions & 64 deletions README.md
@@ -5,30 +5,25 @@

## Introduction

OntoGPT is a Python package for the generation of Ontologies and Knowledge Bases using GPT. It is a knowledge extraction tool that uses a Large Language Models (LLMs) to extract semantic information from text.
OntoGPT is a Python package for the generation of Ontologies and Knowledge Bases using large language models (LLMs).

This makes use of so-called *instruction prompts* in Large Language Models (LLMs) such as GPT-4.
OntoGPT makes use of so-called *instruction prompts* in LLMs such as GPT-4.

Currently three different strategies for knowledge extraction have been implemented in the ontogpt package:
Two different strategies for knowledge extraction are currently implemented in OntoGPT:

* SPIRES: *Structured Prompt Interrogation and Recursive Extraction of Semantics*
* Zero-shot learning (ZSL) approach to extracting nested semantic structures from text
* A Zero-shot learning (ZSL) approach to extracting nested semantic structures from text
* This approach takes two inputs, 1) a LinkML schema and 2) free text, and outputs knowledge in a structure conformant with the supplied schema in JSON, YAML, RDF or OWL formats
* Uses text-davinci-003 or gpt-3.5-turbo (gpt-4 untested)
* HALO: *HAllucinating Latent Ontologies*
* Few-shot learning approach to generating/hallucinating a domain ontology given a few examples
* Uses code-davinci-002
* Uses GPT-3.5-turbo, GPT-4, or one of a variety of open LLMs on your local machine
* SPINDOCTOR: *Structured Prompt Interpolation of Narrative Descriptions Or Controlled Terms for Ontological Reporting*
* Summarize gene set descriptions (pseudo gene-set enrichment)
* Uses text-davinci-003 or gpt-3.5-turbo (gpt-4 untested)

* Uses GPT-3.5-turbo or GPT-4

## Pre-requisites

* Python 3.9+
* OpenAI account

An OpenAI key is necessary for using OpenAI's GPT models. This is a paid API and you will be charged based on usage. If you do not have an OpenAI account, [you may sign up here](https://platform.openai.com/signup). You will need to set your API key using the [Ontology Access Kit](https://github.com/INCATools/ontology-access-kit):
* OpenAI API key: necessary for using OpenAI's GPT models. This is a paid API and you will be charged based on usage. If you do not have an OpenAI account, [you may sign up here](https://platform.openai.com/signup). You will need to set your API key using the [Ontology Access Kit](https://github.com/INCATools/ontology-access-kit):

```bash
poetry run runoak set-apikey -e openai <your openai api key>
@@ -51,14 +46,6 @@ poetry run runoak set-apikey -e hfhub-key <your HuggingFace Hub api key>

## Setup

For feature development and contributing to the package:

```bash
git clone https://github.com/monarch-initiative/ontogpt.git
cd ~/path/to/ontogpt
poetry install
```

To simply start using the package in your workspace:

```bash
@@ -80,17 +67,64 @@ where `extra_name` is one of the following:
* `docs` - dependencies for building documentation
* `web` - dependencies for the web application
* `recipes` - dependencies for recipe scraping and parsing
* `gpt4all` - dependencies for loading LLMs from GPT4All
* `textract` - the textract plugin
* `huggingface` - dependencies for accessing LLMs from HuggingFace Hub, remotely or locally

## Examples
For feature development and contributing to the package:

```bash
git clone https://github.com/monarch-initiative/ontogpt.git
cd ~/path/to/ontogpt
poetry install
```

## Getting Started

OntoGPT is run from the command line. See the full list of commands with:

```bash
ontogpt --help
```

For a simple example of text completion, and to check that OntoGPT is set up correctly, create a text file named `example.txt` containing the following:

```
Why did the squid cross the coral reef?
```

Then try the following command:

```bash
ontogpt complete example.txt
```

You should get text output like the following:

```
Perhaps the squid crossed the coral reef for a variety of reasons:
1. Food: Squids are known to feed on small fish and other marine organisms, and there could have been a rich food source on the other side of the reef.
...
```

OntoGPT is intended to be used for information extraction. The following examples show how to accomplish this.

### Strategy 1: Knowledge extraction using SPIRES

#### Working Mechanism

1. You provide an arbitrary data model, describing the structure you want to extract text into. This can be nested (but see limitations below). The predefined [templates](src/ontogpt/templates/) may be used.
2. Provide your preferred annotations for grounding `NamedEntity` fields
3. OntoGPT will:
- Generate a prompt
- Feed the prompt to a language model
- Parse the results into a dictionary structure
- Ground the results using a preferred annotator (e.g., an ontology)
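The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OntoGPT's implementation: the `SCHEMA` dictionary, the `LABEL_TO_ID` lookup table with its made-up `EXAMPLE:` identifiers, the `fake_llm` stub, and every function name here are hypothetical stand-ins for a LinkML schema, an ontology annotator, and a real language model.

```python
from typing import Callable, Dict, List

# Hypothetical flat schema: field name -> description of what to extract.
SCHEMA: Dict[str, str] = {
    "genes": "semicolon-separated list of genes",
    "organism": "the organism studied",
}

# Hypothetical grounding table standing in for an ontology annotator.
# The identifiers are invented for illustration only.
LABEL_TO_ID: Dict[str, str] = {"cgas": "EXAMPLE:0001", "sting1": "EXAMPLE:0002"}


def generate_prompt(schema: Dict[str, str], text: str) -> str:
    """Build an instruction prompt listing each field to fill in."""
    lines = ["From the text below, extract the following entities in the format:", ""]
    lines += [f"{field}: <{description}>" for field, description in schema.items()]
    lines += ["", "Text:", text]
    return "\n".join(lines)


def parse_response(schema: Dict[str, str], response: str) -> Dict[str, List[str]]:
    """Parse 'field: value; value' lines into a dictionary, ignoring unknown fields."""
    result: Dict[str, List[str]] = {}
    for line in response.splitlines():
        field, separator, value = line.partition(":")
        field = field.strip()
        if separator and field in schema:
            result[field] = [v.strip() for v in value.split(";") if v.strip()]
    return result


def ground(values: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace each label with an identifier when one is known; keep the raw label otherwise."""
    return [lookup.get(value.lower(), value) for value in values]


def extract(schema: Dict[str, str], text: str, llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Prompt, query the model, parse the reply, then ground every field."""
    parsed = parse_response(schema, llm(generate_prompt(schema, text)))
    return {field: ground(values, LABEL_TO_ID) for field, values in parsed.items()}


def fake_llm(prompt: str) -> str:
    """Stub model that returns a canned reply so the sketch runs offline."""
    return "genes: cGAS; STING1\norganism: human"


print(extract(SCHEMA, "The cGAS/STING-mediated DNA-sensing signaling pathway ...", fake_llm))
```

Values with no match in the lookup table (here, `human`) are kept as plain strings, which mirrors the idea that grounding is best-effort.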

#### Input

Consider some text from one of the input files being used in the ontogpt test suite. You can find the text file [here](tests/input/cases/gocam-betacat.txt). You can download the raw file from the GitHub link to that input text file, or copy its contents over into another file, say, `abstract.txt`. An excerpt
Consider some text from one of the input files being used in the OntoGPT test suite. You can find the text file [here](tests/input/cases/gocam-betacat.txt). You can download the raw file from the GitHub link to that input text file, or copy its contents over into another file, say, `abstract.txt`. An excerpt:

> The cGAS/STING-mediated DNA-sensing signaling pathway is crucial
for interferon (IFN) production and host antiviral
@@ -144,22 +178,24 @@ gene_functions:
...
```

#### Working Mechanism
#### Local Models

1. You provide an arbitrary data model, describing the structure you want to extract text into
- This can be nested (but see limitations below)
2. Provide your preferred annotations for grounding `NamedEntity` fields
3. OntoGPT will:
- Generate a prompt
- Feed the prompt to a language model (currently OpenAI GPT models)
- Parse the results into a dictionary structure
- Ground the results using a preferred annotator
To use a local model, specify it with the `-m` or `--model` option.

## Strategy 2: HALO
Example:

```bash
ontogpt extract -t drug -i ~/path/to/abstract.txt -m nous-hermes-13b
```

See the list of all available models with this command:
```bash
ontogpt list-models
```

*Documentation to come*
The first time you specify a local model, it will be downloaded to your local system.
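If you want to script several runs, for example to compare models, the command line can be assembled in Python. The helper below is a hypothetical convenience and not part of OntoGPT; it only uses the `extract` command and the `-t`, `-i`, and `-m` options shown above.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    template: str, input_path: str, model: Optional[str] = None
) -> List[str]:
    """Assemble an `ontogpt extract` command as an argument list."""
    command = ["ontogpt", "extract", "-t", template, "-i", input_path]
    if model is not None:
        # -m selects the model, as described above.
        command += ["-m", model]
    return command


command = build_extract_command("drug", "abstract.txt", model="nous-hermes-13b")
# Print the command as a shell-quoted string for inspection.
print(shlex.join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)` in an environment where `ontogpt` is installed.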

## Strategy 3: Gene Enrichment using SPINDOCTOR
## Strategy 2: Gene Enrichment using SPINDOCTOR

Given a set of genes, OntoGPT can find similarities among them.

@@ -174,6 +210,8 @@ The default is to use ontological gene function synopses (via the Alliance API).
* To use narrative/RefSeq summaries, use the `--no-ontological-synopses` flag
* To run without any gene descriptions, use the `--no-annotations` flag

This strategy does not currently support using local models.

## Features

### Define your own extraction model using LinkML
@@ -262,7 +300,7 @@ Once you have defined your own schema / data model and placed in the correct dir

Ex.:

```
```bash
ontogpt extract -t mendelian_disease.MendelianDisease -i marfan-wikipedia.txt
```

@@ -295,9 +333,7 @@ The output of this is then passed through further SPIRES iterations.

### Text length limit

Currently SPIRES must use text-davinci-003, which has a total 4k token limit (prompt + completion).

You can pass in a parameter to split the text into chunks. Returned results will be recombined automatically, but more experiments need to be done to determined how reliable this is.
LLMs have context sizes limiting the combined length of their inputs and outputs. The text-davinci-003 model, for example, has a total 4,000 token limit (prompt + completion), while the gpt-3.5-turbo-16k model has a larger context of 16 thousand tokens.
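To make the budget concrete, here is a rough sketch of splitting input text so that each piece leaves room for the completion. It is an illustration only and does not describe how OntoGPT handles long text. The four-characters-per-token estimate is a crude assumption (a real tokenizer will give different counts), and both function names are hypothetical.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): about four characters per token, rounded up."""
    return (len(text) + 3) // 4


def split_to_fit(text: str, context_limit: int, reserved_for_output: int) -> List[str]:
    """Split text on whitespace into chunks whose estimated size fits the input budget."""
    budget = context_limit - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_limit")
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and estimate_tokens(candidate) > budget:
            # Adding this word would exceed the budget, so close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Reserve 1,000 of 4,000 tokens for the completion and split the rest.
pieces = split_to_fit("some long abstract text " * 2000, context_limit=4000, reserved_for_output=1000)
print(len(pieces), "chunks")
```

A single word longer than the budget still becomes its own chunk, so a production version would need a stricter fallback and would also have to count the tokens used by the prompt template itself.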

### Schema tips

@@ -397,23 +433,18 @@ poetry run web-ontogpt

Note: The agent running uvicorn must have the API key set, so for obvious reasons don't host this publicly without authentication, unless you want your credits drained.

## OntoGPT Limitations

1. Non-deterministic
* This relies on an existing LLM, and LLMs can be fickle in their responses
2. Coupled to OpenAI
* You will need an OpenAI account to use their API. In theory any LLM can be used but in practice the parser is tuned for OpenAI's models

### SPINDOCTOR web app

To start:

```
```bash
poetry run streamlit run src/ontogpt/streamlit/spindoctor.py
```

### HuggingFace Hub

Note: support for HuggingFace-provided models is currently a work in progress.

A select number of LLMs may be accessed through HuggingFace Hub. See the full list using `ontogpt list-models`.

Specify a model name with the `-m` option.
@@ -424,26 +455,11 @@ Example:
ontogpt extract -t mendelian_disease.MendelianDisease -i tests/input/cases/mendelian-disease-sly.txt -m FLAN_T5_BASE
```


### Using local models

OntoGPT supports using language models released by [GPT4All](https://gpt4all.io/).

Specify the name of a model when using the `extract` command with the `-m` or `--model` option and OntoGPT will retrieve the model.

For example:

```
ontogpt --verbose extract -t mendelian_disease.MendelianDisease -i mendelian-disease-sly.txt -m ggml-gpt4all-j-v1.3-groovy
```

will download the `ggml-gpt4all-j-v1.3-groovy.bin` file, generate a prompt, and try that prompt against the specified model.

## Citation

SPIRES is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning.
SPIRES is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv publication: http://arxiv.org/abs/2304.02711

arXiv publication: http://arxiv.org/abs/2304.02711
SPINDOCTOR is described further in: Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization using Large Language Models. arXiv publication: http://arxiv.org/abs/2305.13338

## Contributing
