Remove TALISMAN functions (#309)
caufieldjh authored Jan 5, 2024
2 parents 7553b3e + 24e828b commit d6c2345
Showing 13 changed files with 11 additions and 1,430 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -65,6 +65,10 @@ NOTE: We do not recommend hosting this webapp publicly without authentication.

OpenAI's functions have been evaluated on test data. Please see the full documentation for details on these evaluations and how to reproduce them.
## Related Projects
* [TALISMAN](https://github.com/monarch-initiative/talisman/), a tool for generating summaries of functions enriched within a gene set. TALISMAN uses OntoGPT to work with LLMs.
## Tutorials and Presentations
- Presentation: "Staying grounded: assembling structured biological knowledge with help from large language models" - presented by Harry Caufield as part of the AgBioData Consortium webinar series (September 2023)
@@ -81,8 +85,6 @@ OpenAI's functions have been evaluated on test data. Please see the full documen
The information extraction approach used in OntoGPT, SPIRES, is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv publication: <http://arxiv.org/abs/2304.02711>
The gene summarization approach used in OntoGPT, SPINDOCTOR, is described further in: Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization using Large Language Models. arXiv publication: <http://arxiv.org/abs/2305.13338>
## Acknowledgements
This project is part of the [Monarch Initiative](https://monarchinitiative.org/). We also gratefully acknowledge [Bosch Research](https://www.bosch.com/research) for their support of this research project.
99 changes: 4 additions & 95 deletions docs/functions.md
@@ -213,44 +213,11 @@ ontogpt convert-examples inputfile.yaml

### convert-geneset

Convert gene set to YAML.

The gene set may be in JSON (msigdb format) or text (one gene symbol per line) format.

See also the `create-gene-set` command (see below).

Options:

* `--fill` / `--no-fill` - Defaults to False (`--no-fill`). If True (`--fill`), the function will attempt to fill in missing gene values.
* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich (if not passed as arguments).

Example:

```bash
ontogpt convert-geneset -U inputfile.json
```
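
As a rough illustration of the two input formats described above, the sketch below reads gene symbols from either a JSON string or a one-symbol-per-line text string. It is not the removed `parse_gene_set` implementation. The function name `parse_gene_symbols`, the `fmt` argument, and the assumed JSON layout (a top-level object keyed by set name whose value holds a `geneSymbols` list) are all assumptions made for this example.

```python
import json


def parse_gene_symbols(text: str, fmt: str, name: str = "unnamed") -> dict:
    """Hypothetical helper: return a gene set as a plain dict."""
    if fmt == "json":
        # Assumed MSigDB-style layout: {"SET_NAME": {"geneSymbols": [...]}}
        data = json.loads(text)
        name, body = next(iter(data.items()))
        symbols = list(body["geneSymbols"])
    elif fmt == "text":
        # One gene symbol per line; blank lines are skipped.
        symbols = [line.strip() for line in text.splitlines() if line.strip()]
    else:
        raise ValueError(f"Unsupported format: {fmt}")
    return {"name": name, "gene_symbols": symbols}
```

The resulting dict could then be serialized to YAML with whichever YAML library is available.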
*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>

### create-gene-set

Create a gene set.

This is primarily relevant to the TALISMAN method for creating gene set summaries.

It creates a gene set given a set of gene annotations in two-column TSV or GAF format.

The function also requires a single argument for the term to create the gene set with.

The output is provided in YAML format.

Options:

* `-A`, `--annotation-path TEXT` - Path to a file containing annotations.

Example:

```bash
ontogpt create-gene-set -A inputfile.tsv "positive regulation of mitotic cytokinesis"
```
*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>

### diagnose

@@ -315,43 +282,7 @@ For OpenAI's "text-embedding-ada-002" model, the output will be a vector of leng

### enrichment

Gene class summary enriching. This is OntoGPT's implementation of TALISMAN.

The goal of gene summary enrichment is to assemble a textual summary of the functions of a set of genes and their products.

TALISMAN runs in three steps:

1. Map gene symbols to IDs using the resolver (unless IDs are specified)
2. Fetch gene descriptions using Alliance API
3. Create a prompt using descriptions
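
The three items listed above can be read as a small pipeline: resolve symbols, fetch descriptions, build a prompt. The sketch below shows that flow with the resolver and the description lookup passed in as plain callables. The function name `build_summary_prompt`, the prompt wording, and the placeholder IDs and descriptions are all invented for illustration; none of them come from OntoGPT, OAK, or the Alliance API.

```python
from typing import Callable, List


def build_summary_prompt(
    symbols: List[str],
    resolve: Callable[[str], str],
    describe: Callable[[str], str],
) -> str:
    """Hypothetical three-step flow: resolve, describe, assemble."""
    # Step 1: map each gene symbol to an identifier.
    ids = {sym: resolve(sym) for sym in symbols}
    # Step 2: fetch a description for each identifier.
    descriptions = {sym: describe(gene_id) for sym, gene_id in ids.items()}
    # Step 3: assemble the prompt from the descriptions.
    lines = [f"{sym}: {desc}" for sym, desc in descriptions.items()]
    return (
        "Summarize the shared functions of these genes.\n\n"
        + "\n".join(lines)
        + "\n\nSummary:"
    )


# Placeholder lookups standing in for a real resolver and description source.
PLACEHOLDER_IDS = {"GENE_A": "ID:1", "GENE_B": "ID:2"}
PLACEHOLDER_TEXT = {
    "ID:1": "placeholder description one",
    "ID:2": "placeholder description two",
}

prompt = build_summary_prompt(
    ["GENE_A", "GENE_B"],
    PLACEHOLDER_IDS.__getitem__,
    PLACEHOLDER_TEXT.__getitem__,
)
```

Passing the lookups as callables keeps the sketch independent of any particular resolver or API client.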

Options:

* `-r`, `--resolver TEXT` - OAK selector for the gene ID resolver, e.g., `sqlite:obo:hgnc` for HGNC gene IDs.
* `-C`, `--context TEXT` - domain, e.g., anatomy, industry, health-related
* `--strict` / `--no-strict` - If set, there must be a unique mapping from labels to IDs. Defaults to True.
* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich if not passed as arguments.
* `--randomize-gene-descriptions-using-file TEXT` - For evaluation only. Path to a file containing gene identifiers and descriptions; if this option is used, TALISMAN will swap out gene descriptions with those from this gene set file.
* `--ontological-synopsis` / `--no-ontological-synopsis` - If set, use automated rather than manual gene descriptions. Defaults to True.
* `--combined-synopsis` / `--no-combined-synopsis` - If set, combine gene descriptions. Defaults to False.
* `--end-marker TEXT` - Specify a character or string to end prompts with. For testing minor variants of prompts.
* `--annotations` / `--no-annotations` - If set, include annotations in the prompt. Defaults to True.
* `--prompt-template TEXT` - Path to a file containing the prompt.
* `--interactive` / `--no-interactive` - Interactive mode - rather than call the API, the function will present a walkthrough process. Defaults to False.

Example:

```bash
ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/EDS.yaml
```

In this case, the prompt will include gene summaries retrieved from the database.

The response text will include, among other fields, a summary like this:

```text
Summary: The common function among these genes is their involvement in the regulation and organization of the extracellular matrix, particularly collagen fibril organization and biosynthesis.
```
*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>

### entity-similarity

@@ -402,29 +333,7 @@ ontogpt eval --num-tests 1 EvalCTD

### eval-enrichment

Run enrichment (TALISMAN) using multiple methods.

This function runs a set of evaluations specific to the TALISMAN gene set summary process.

It will iterate through all relevant models to compare results.

The function assumes genes will have HGNC identifiers.

Options:

* `--strict` / `--no-strict` - If set, there must be a unique mapping from labels to IDs. Defaults to True.
* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich (if not passed as arguments)
* `--ontological-synopsis` / `--no-ontological-synopsis` - If set, use automated rather than manual gene descriptions. Defaults to True.
* `--combined-synopsis` / `--no-combined-synopsis` - If set, combine gene descriptions. Defaults to False.
* `--annotations` / `--no-annotations` - If set, include annotations in the prompt. Defaults to True.
* `-n`, `--number-to-drop INTEGER` - Maximum number of genes to drop if necessary.
* `-A`, `--annotations-path TEXT` - Path to file containing annotations.

Example:

```bash
ontogpt enrichment -U tests/input/genesets/EDS.yaml
```
*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>

### extract

10 changes: 1 addition & 9 deletions docs/index.md
@@ -66,21 +66,13 @@ web-ontogpt

NOTE: We do not recommend hosting this webapp publicly without authentication.

Gene enrichment has its own webapp powered by Streamlit:

```bash
streamlit run src/ontogpt/streamlit/spindoctor.py
```

## Citation

SPIRES is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv publication: <http://arxiv.org/abs/2304.02711>

SPINDOCTOR is described further in: Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization using Large Language Models. arXiv publication: <http://arxiv.org/abs/2305.13338>

## Contributing

Contributions are welcome! One way to get started with contributing to OntoGPT is to submit an
Contributions are welcome! One way to get started with contributing to OntoGPT is to submit an issue.

Contributions on recipes to test welcome from anyone! Just make a PR [here](https://github.com/monarch-initiative/ontogpt/blob/main/tests/input/recipe-urls.csv). See [this list](https://github.com/hhursev/recipe-scrapers) for accepted URLs

189 changes: 1 addition & 188 deletions src/ontogpt/cli.py
@@ -4,7 +4,7 @@
import logging
import pickle
import sys
from copy import copy, deepcopy
from copy import deepcopy
from dataclasses import dataclass
from io import BytesIO, TextIOWrapper
from pathlib import Path
@@ -27,9 +27,7 @@
from ontogpt.clients.pubmed_client import PubmedClient
from ontogpt.clients.soup_client import SoupClient
from ontogpt.clients.wikipedia_client import WikipediaClient
from ontogpt.engines import create_engine
from ontogpt.engines.embedding_similarity_engine import SimilarityEngine
from ontogpt.engines.enrichment import EnrichmentEngine
from ontogpt.engines.generic_engine import GenericEngine, QuestionCollection
from ontogpt.engines.gpt4all_engine import GPT4AllEngine # type: ignore
from ontogpt.engines.halo_engine import HALOEngine # type: ignore
@@ -41,17 +39,10 @@
from ontogpt.engines.reasoner_engine import ReasonerEngine
from ontogpt.engines.spires_engine import SPIRESEngine
from ontogpt.engines.synonym_engine import SynonymEngine
from ontogpt.evaluation.enrichment.eval_enrichment import EvalEnrichment
from ontogpt.evaluation.resolver import create_evaluator
from ontogpt.io.csv_wrapper import output_parser, write_obj_as_csv
from ontogpt.io.html_exporter import HTMLExporter
from ontogpt.io.markdown_exporter import MarkdownExporter
from ontogpt.utils.gene_set_utils import (
    GeneSet,
    _is_human,
    fill_missing_gene_set_values,
    parse_gene_set,
)
from ontogpt.utils.gpt4all_runner import chain_gpt4all_model, set_up_gpt4all_model

__all__ = [
@@ -864,184 +855,6 @@ def synonyms(model, term, context, output, output_format, **kwargs):
    output.write(out)


@main.command()
@output_option_txt
@output_format_options
@click.option(
    "--annotation-path",
    "-A",
    required=True,
)
@click.argument("term")
def create_gene_set(term, output, output_format, annotation_path, **kwargs):
    """Create a gene set."""
    logging.info(f"Creating for {term}")
    evaluator = EvalEnrichment()
    evaluator.load_annotations(annotation_path)
    gene_set = evaluator.create_gene_set_from_term(term)
    print(yaml.dump(gene_set.dict(), sort_keys=False))


@main.command()
@output_option_txt
@output_format_options
@click.option("--fill/--no-fill", default=False)
@click.option(
    "--input-file",
    "-U",
    help="File with gene IDs to enrich (if not passed as arguments)",
)
def convert_geneset(input_file, output, output_format, fill, **kwargs):
    """Convert gene set to YAML."""
    gene_set = parse_gene_set(input_file)
    if fill:
        fill_missing_gene_set_values(gene_set)
    output.write(dump_minimal_yaml(gene_set.dict()))


@main.command()
@output_option_txt
@output_format_options
@model_option
@show_prompt_option
@click.option(
    "--resolver", "-r", help="OAK selector for the gene ID resolver. E.g. sqlite:obo:hgnc"
)
@click.option(
    "-C",
    "--context",
    help="domain e.g. anatomy, industry, health-related (NOT IMPLEMENTED - currently gene only)",
)
@click.option(
    "--strict/--no-strict",
    default=True,
    show_default=True,
    help="If set, there must be a unique mappings from labels to IDs",
)
@click.option(
    "--input-file",
    "-U",
    help="File with gene IDs to enrich (if not passed as arguments)",
)
@click.option(
    "--randomize-gene-descriptions-using-file",
    help="FOR EVALUATION ONLY. Swap out gene descriptions with genes from this gene set filefile",
)
@click.option(
    "--ontological-synopsis/--no-ontological-synopsis",
    default=True,
    show_default=True,
    help="If set, use automated rather than manual gene descriptions",
)
@click.option(
    "--combined-synopsis/--no-combined-synopsis",
    default=False,
    show_default=True,
    help="If set, both gene descriptions",
)
@click.option(
    "--end-marker",
    help="For testing minor variants of prompts",
)
@click.option(
    "--annotations/--no-annotations",
    default=True,
    show_default=True,
    help="If set, include annotations in the prompt",
)
@prompt_template_option
@interactive_option
@click.argument("genes", nargs=-1)
def enrichment(
    genes,
    context,
    input_file,
    resolver,
    output,
    model,
    show_prompt,
    interactive,
    end_marker,
    output_format,
    randomize_gene_descriptions_using_file,
    **kwargs,
):
    """Gene class summary enriching (SPINDOCTOR).
    Algorithm:
    1. Map gene symbols to IDs using the resolver (unless IDs specified)
    2. Fetch gene descriptions using Alliance API
    3. Create a prompt using descriptions
    Limitations:
    It is very easy to exceed the max token length with GPT-3 models.
    Usage:
    ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/dopamine.yaml
    Usage:
    ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/dopamine.yaml
    """
    if model:
        selectmodel = get_model_by_name(model)
        model_source = selectmodel["provider"]

        if model_source != "OpenAI":
            raise NotImplementedError(
                "Model not yet supported for gene enrichment or enrichment evaluation."
            )

    if not genes and not input_file:
        raise ValueError("Either genes or input file must be passed")
    if genes:
        gene_set = GeneSet(name="TEMP", gene_symbols=genes)
    if input_file:
        if genes:
            raise ValueError("Either genes or input file must be passed")
        gene_set = parse_gene_set(input_file)
    if not gene_set:
        raise ValueError("No genes passed")
    ke = create_engine(None, EnrichmentEngine, model=model)
    if end_marker:
        ke.end_marker = end_marker
    if interactive:
        ke.client.interactive = True
    if settings.cache_db:
        ke.client.cache_db_path = settings.cache_db
    if not isinstance(ke, EnrichmentEngine):
        raise ValueError(f"Expected EnrichmentEngine, got {type(ke)}")
    if resolver:
        ke.add_resolver(resolver)
    if randomize_gene_descriptions_using_file:
        print("WARNING!! Randomly spiking gene descriptions")
        spike_gene_set = parse_gene_set(randomize_gene_descriptions_using_file)
        aliases = {}
        if not spike_gene_set.gene_symbols:
            raise ValueError("No gene symbols for spike set")
        syms = copy(gene_set.gene_symbols)
        if len(spike_gene_set.gene_symbols) < len(gene_set.gene_symbols):
            raise ValueError("Not enough genes in spike set")
        for sym in spike_gene_set.gene_symbols:
            if not syms:
                break
            aliases[sym] = syms.pop()
        results = ke.summarize(
            spike_gene_set, normalize=resolver is not None, gene_aliases=aliases, **kwargs
        )
    else:
        results = ke.summarize(gene_set, normalize=resolver is not None, **kwargs)
    if results.truncation_factor is not None and results.truncation_factor < 1.0:
        logging.warning(f"Text was truncated; factor = {results.truncation_factor}")
    output = _as_text_writer(output)
    if show_prompt:
        print(results.prompt)
    output.write(dump_minimal_yaml(results))

@main.command()
@output_option_txt
@output_format_options
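
The `--randomize-gene-descriptions-using-file` branch of the removed `enrichment` command pairs each spike-set symbol with a symbol popped from a copy of the original gene set. The standalone sketch below reproduces only that pairing logic so it can be read in isolation. The function name `pair_spike_aliases` and its list-based signature are assumptions for this example; the removed code worked on `GeneSet` objects instead.

```python
from typing import Dict, List


def pair_spike_aliases(original: List[str], spike: List[str]) -> Dict[str, str]:
    """Hypothetical helper mirroring the alias pairing in the removed command."""
    if not spike:
        raise ValueError("No gene symbols for spike set")
    if len(spike) < len(original):
        raise ValueError("Not enough genes in spike set")
    remaining = list(original)  # copy, so the caller's list is not mutated
    aliases: Dict[str, str] = {}
    for sym in spike:
        if not remaining:
            break  # every original symbol has been assigned
        # pop() takes from the end, so pairing runs in reverse order
        aliases[sym] = remaining.pop()
    return aliases


aliases = pair_spike_aliases(["A", "B"], ["X", "Y", "Z"])
# aliases == {"X": "B", "Y": "A"}; "Z" is left unpaired
```

Because `pop()` removes the last element, the first spike symbol is aliased to the last original symbol. Any surplus spike symbols receive no alias.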