Remove TALISMAN functions (#309)

TALISMAN has moved to https://github.com/monarch-initiative/talisman/
monarch-initiative · Jan 5, 2024 · d6c2345
2 parents 7553b3e + 24e828b

Showing 13 changed files with 11 additions and 1,430 deletions.
diff --git a/README.md b/README.md
@@ -65,6 +65,10 @@ NOTE: We do not recommend hosting this webapp publicly without authentication.
 
 OpenAI's functions have been evaluated on test data. Please see the full documentation for details on these evaluations and how to reproduce them.
 
+## Related Projects
+
+* [TALISMAN](https://github.com/monarch-initiative/talisman/), a tool for generating summaries of functions enriched within a gene set. TALISMAN uses OntoGPT to work with LLMs.
+
 ## Tutorials and Presentations
 
 - Presentation: "Staying grounded: assembling structured biological knowledge with help from large language models" - presented by Harry Caufield as part of the AgBioData Consortium webinar series (September 2023)
@@ -81,8 +85,6 @@ OpenAI's functions have been evaluated on test data. Please see the full documen
 
 The information extraction approach used in OntoGPT, SPIRES, is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv publication: <http://arxiv.org/abs/2304.02711>
 
-The gene summarization approach used in OntoGPT, SPINDOCTOR, is described further in: Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization using Large Language Models. arXiv publication: <http://arxiv.org/abs/2305.13338>
-
 ## Acknowledgements
 
 This project is part of the [Monarch Initiative](https://monarchinitiative.org/). We also gratefully acknowledge [Bosch Research](https://www.bosch.com/research) for their support of this research project.
diff --git a/docs/functions.md b/docs/functions.md
@@ -213,44 +213,11 @@ ontogpt convert-examples inputfile.yaml
 
 ### convert-geneset
 
-Convert gene set to YAML.
-
-The gene set may be in JSON (msigdb format) or text (one gene symbol per line) format.
-
-See also the `create-gene-set` command (see below).
-
-Options:
-
-* `--fill` / `--no-fill` - Defaults to False (`--no-fill`). If True (`--fill`), the function will attempt to fill in missing gene values.
-* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich (if not passed as arguments).
-
-Example:
-
-```bash
-ontogpt convert-geneset -U inputfile.json
-```
+*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>
 
 ### create-gene-set
 
-Create a gene set.
-
-This is primarily relevant to the TALISMAN method for creating gene set summaries.
-
-It creates a gene set given a set of gene annotations in two-column TSV or GAF format.
-
-The function also requires a single argument for the term to create the gene set with.
-
-The output is provided in YAML format.
-
-Options:
-
-* `-A`, `--annotation-path TEXT` - Path to a file containing annotations.
-
-Example:
-
-```bash
-ontogpt create-gene-set -A inputfile.tsv "positive regulation of mitotic cytokinesis"
-```
+*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>
 
 ### diagnose
 
@@ -315,43 +282,7 @@ For OpenAI's "text-embedding-ada-002" model, the output will be a vector of leng
 
 ### enrichment
 
-Gene class summary enriching. This is OntoGPT's implementation of TALISMAN.
-
-The goal of gene summary enrichment is to assemble a textual summary of the functions of a set of genes and their products.
-
-TALISMAN can run in three different ways:
-
-1. Map gene symbols to IDs using the resolver (unless IDs are specified)
-2. Fetch gene descriptions using Alliance API
-3. Create a prompt using descriptions
-
-Options:
-
-* `-r`, `--resolver TEXT` - OAK selector for the gene ID resolver, e.g., `sqlite:obo:hgnc` for HGNC gene IDs.
-* `-C`, `--context TEXT` - domain, e.g., anatomy, industry, health-related
-* `--strict` / `--no-strict` - If set, there must be a unique mappings from labels to IDs. Defaults to True.
-* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich if not passed as arguments.
-* `--randomize-gene-descriptions-using-file TEXT` - For evaluation only. Path to a file containing gene identifiers and descriptions; if this option is used, TALISMAN will swap out gene descriptions with those from this gene set file.
-* `--ontological-synopsis` / `--no-ontological-synopsis` - If set, use automated rather than manual gene descriptions. Defaults to True.
-* `--combined-synopsis` / `--no-combined-synopsis` - If set, combine gene descriptions. Defaults to False.
-* `--end-marker TEXT` - Specify a character or string to end prompts with. For testing minor variants of prompts.
-* `--annotations` / `--no-annotations` - If set, include annotations in the prompt. Defaults to True.
-* `--prompt-template TEXT` - Path to a file containing the prompt.
-* `--interactive` / `--no-interactive` - Interactive mode - rather than call the API, the function will present a walkthrough process. Defaults to False.
-
-Example:
-
-```bash
-ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/EDS.yaml
-```
-
-In this case, the prompt will include gene summaries retrieved from the database.
-
-The response text will include, among other fields, a summary like this:
-
-```text
-Summary: The common function among these genes is their involvement in the regulation and organization of the extracellular matrix, particularly collagen fibril organization and biosynthesis.
-```
+*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>
 
 ### entity-similarity
 
@@ -402,29 +333,7 @@ ontogpt eval --num-tests 1 EvalCTD
 
 ### eval-enrichment
 
-Run enrichment (TALISMAN) using multiple methods.
-
-This function runs a set of evaluations specific to the TALISMAN gene set summary process.
-
-It will iterate through all relevant models to compare results.
-
-The function assumes genes will have HGNC identifiers.
-
-Options:
-
-* `--strict` / `--no-strict` - If set, there must be a unique mappings from labels to IDs. Defaults to True.
-* `-U`, `--input-file TEXT` - Path to a file with gene IDs to enrich (if not passed as arguments)
-* `--ontological-synopsis` / `--no-ontological-synopsis` - If set, use automated rather than manual gene descriptions. Defaults to True.
-* `--combined-synopsis` / `--no-combined-synopsis` - If set, combine gene descriptions. Defaults to False.
-* `--annotations` / `--no-annotations` - If set, include annotations in the prompt. Defaults to True.
-* `-n`, `--number-to-drop INTEGER` - Maximum number of genes to drop if necessary.
-* `-A`, `--annotations-path TEXT` - Path to file containing annotations.
-
-Example:
-
-```bash
-ontogpt enrichment -U tests/input/genesets/EDS.yaml
-```
+*This command has been deprecated. It is now available through the TALISMAN package at:* <https://github.com/monarch-initiative/talisman>
 
 ### extract
 

diff --git a/docs/index.md b/docs/index.md
@@ -66,21 +66,13 @@ web-ontogpt
 
 NOTE: We do not recommend hosting this webapp publicly without authentication.
 
-Gene enrichment has its own webapp powered by Streamlit:
-
-```bash
-streamlit run src/ontogpt/streamlit/spindoctor.py
-```
-
 ## Citation
 
 SPIRES is described further in: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv publication: <http://arxiv.org/abs/2304.02711>
 
-SPINDOCTOR is described further in: Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization using Large Language Models. arXiv publication: <http://arxiv.org/abs/2305.13338>
-
 ## Contributing
 
-Contributions are welcome! One way to get started with contributing to OntoGPT is to submit an 
+Contributions are welcome! One way to get started with contributing to OntoGPT is to submit an issue.
 
 Contributions on recipes to test welcome from anyone! Just make a PR [here](https://github.com/monarch-initiative/ontogpt/blob/main/tests/input/recipe-urls.csv). See [this list](https://github.com/hhursev/recipe-scrapers) for accepted URLs
 

diff --git a/src/ontogpt/cli.py b/src/ontogpt/cli.py
@@ -4,7 +4,7 @@
 import logging
 import pickle
 import sys
-from copy import copy, deepcopy
+from copy import deepcopy
 from dataclasses import dataclass
 from io import BytesIO, TextIOWrapper
 from pathlib import Path
@@ -27,9 +27,7 @@
 from ontogpt.clients.pubmed_client import PubmedClient
 from ontogpt.clients.soup_client import SoupClient
 from ontogpt.clients.wikipedia_client import WikipediaClient
-from ontogpt.engines import create_engine
 from ontogpt.engines.embedding_similarity_engine import SimilarityEngine
-from ontogpt.engines.enrichment import EnrichmentEngine
 from ontogpt.engines.generic_engine import GenericEngine, QuestionCollection
 from ontogpt.engines.gpt4all_engine import GPT4AllEngine  # type: ignore
 from ontogpt.engines.halo_engine import HALOEngine  # type: ignore
@@ -41,17 +39,10 @@
 from ontogpt.engines.reasoner_engine import ReasonerEngine
 from ontogpt.engines.spires_engine import SPIRESEngine
 from ontogpt.engines.synonym_engine import SynonymEngine
-from ontogpt.evaluation.enrichment.eval_enrichment import EvalEnrichment
 from ontogpt.evaluation.resolver import create_evaluator
 from ontogpt.io.csv_wrapper import output_parser, write_obj_as_csv
 from ontogpt.io.html_exporter import HTMLExporter
 from ontogpt.io.markdown_exporter import MarkdownExporter
-from ontogpt.utils.gene_set_utils import (
-    GeneSet,
-    _is_human,
-    fill_missing_gene_set_values,
-    parse_gene_set,
-)
 from ontogpt.utils.gpt4all_runner import chain_gpt4all_model, set_up_gpt4all_model
 
 __all__ = [
@@ -864,184 +855,6 @@ def synonyms(model, term, context, output, output_format, **kwargs):
     output.write(out)
 
 
-@main.command()
-@output_option_txt
-@output_format_options
-@click.option(
-    "--annotation-path",
-    "-A",
-    required=True,
-)
-@click.argument("term")
-def create_gene_set(term, output, output_format, annotation_path, **kwargs):
-    """Create a gene set."""
-    logging.info(f"Creating for {term}")
-    evaluator = EvalEnrichment()
-    evaluator.load_annotations(annotation_path)
-    gene_set = evaluator.create_gene_set_from_term(term)
-    print(yaml.dump(gene_set.dict(), sort_keys=False))
-
-
-@main.command()
-@output_option_txt
-@output_format_options
-@click.option("--fill/--no-fill", default=False)
-@click.option(
-    "--input-file",
-    "-U",
-    help="File with gene IDs to enrich (if not passed as arguments)",
-)
-def convert_geneset(input_file, output, output_format, fill, **kwargs):
-    """Convert gene set to YAML."""
-    gene_set = parse_gene_set(input_file)
-    if fill:
-        fill_missing_gene_set_values(gene_set)
-    output.write(dump_minimal_yaml(gene_set.dict()))
-
-
-@main.command()
-@output_option_txt
-@output_format_options
-@model_option
-@show_prompt_option
-@click.option(
-    "--resolver", "-r", help="OAK selector for the gene ID resolver. E.g. sqlite:obo:hgnc"
-)
-@click.option(
-    "-C",
-    "--context",
-    help="domain e.g. anatomy, industry, health-related (NOT IMPLEMENTED - currently gene only)",
-)
-@click.option(
-    "--strict/--no-strict",
-    default=True,
-    show_default=True,
-    help="If set, there must be a unique mappings from labels to IDs",
-)
-@click.option(
-    "--input-file",
-    "-U",
-    help="File with gene IDs to enrich (if not passed as arguments)",
-)
-@click.option(
-    "--randomize-gene-descriptions-using-file",
-    help="FOR EVALUATION ONLY. Swap out gene descriptions with genes from this gene set filefile",
-)
-@click.option(
-    "--ontological-synopsis/--no-ontological-synopsis",
-    default=True,
-    show_default=True,
-    help="If set, use automated rather than manual gene descriptions",
-)
-@click.option(
-    "--combined-synopsis/--no-combined-synopsis",
-    default=False,
-    show_default=True,
-    help="If set, both gene descriptions",
-)
-@click.option(
-    "--end-marker",
-    help="For testing minor variants of prompts",
-)
-@click.option(
-    "--annotations/--no-annotations",
-    default=True,
-    show_default=True,
-    help="If set, include annotations in the prompt",
-)
-@prompt_template_option
-@interactive_option
-@click.argument("genes", nargs=-1)
-def enrichment(
-    genes,
-    context,
-    input_file,
-    resolver,
-    output,
-    model,
-    show_prompt,
-    interactive,
-    end_marker,
-    output_format,
-    randomize_gene_descriptions_using_file,
-    **kwargs,
-):
-    """Gene class summary enriching (SPINDOCTOR).
-
-    Algorithm:
-
-    1. Map gene symbols to IDs using the resolver (unless IDs specified)
-    2. Fetch gene descriptions using Alliance API
-    3. Create a prompt using descriptions
-
-    Limitations:
-
-    It is very easy to exceed the max token length with GPT-3 models.
-
-    Usage:
-
-        ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/dopamine.yaml
-
-    Usage:
-
-        ontogpt enrichment -r sqlite:obo:hgnc -U tests/input/genesets/dopamine.yaml
-    """
-    if model:
-        selectmodel = get_model_by_name(model)
-        model_source = selectmodel["provider"]
-
-        if model_source != "OpenAI":
-            raise NotImplementedError(
-                "Model not yet supported for gene enrichment or enrichment evaluation."
-            )
-
-    if not genes and not input_file:
-        raise ValueError("Either genes or input file must be passed")
-    if genes:
-        gene_set = GeneSet(name="TEMP", gene_symbols=genes)
-    if input_file:
-        if genes:
-            raise ValueError("Either genes or input file must be passed")
-        gene_set = parse_gene_set(input_file)
-    if not gene_set:
-        raise ValueError("No genes passed")
-    ke = create_engine(None, EnrichmentEngine, model=model)
-    if end_marker:
-        ke.end_marker = end_marker
-    if interactive:
-        ke.client.interactive = True
-    if settings.cache_db:
-        ke.client.cache_db_path = settings.cache_db
-    if not isinstance(ke, EnrichmentEngine):
-        raise ValueError(f"Expected EnrichmentEngine, got {type(ke)}")
-    if resolver:
-        ke.add_resolver(resolver)
-    if randomize_gene_descriptions_using_file:
-        print("WARNING!! Randomly spiking gene descriptions")
-        spike_gene_set = parse_gene_set(randomize_gene_descriptions_using_file)
-        aliases = {}
-        if not spike_gene_set.gene_symbols:
-            raise ValueError("No gene symbols for spike set")
-        syms = copy(gene_set.gene_symbols)
-        if len(spike_gene_set.gene_symbols) < len(gene_set.gene_symbols):
-            raise ValueError("Not enough genes in spike set")
-        for sym in spike_gene_set.gene_symbols:
-            if not syms:
-                break
-            aliases[sym] = syms.pop()
-        results = ke.summarize(
-            spike_gene_set, normalize=resolver is not None, gene_aliases=aliases, **kwargs
-        )
-    else:
-        results = ke.summarize(gene_set, normalize=resolver is not None, **kwargs)
-    if results.truncation_factor is not None and results.truncation_factor < 1.0:
-        logging.warning(f"Text was truncated; factor = {results.truncation_factor}")
-    output = _as_text_writer(output)
-    if show_prompt:
-        print(results.prompt)
-    output.write(dump_minimal_yaml(results))
-
-
 @main.command()
 @output_option_txt
 @output_format_options