
Commit 3019204

A simple way to create and evaluate given a 'task' in the catalog and python data structure (#1413)

* - added API methods to create and evaluate datasets from python data structures
- created simplified QA task and QA metric catalogs
- added default templates to Task, so the user does not need to specify one
- added input checks in llm_as_judges and standard recipe

Signed-off-by: Yoav Katz <[email protected]>

* Added example of create_and_evaluate for rag

Signed-off-by: Yoav Katz <[email protected]>

* Added type to Rag end to end response

Also added documentation

Signed-off-by: Yoav Katz <[email protected]>

* Added option for integer ids in Retrieval metrics

Signed-off-by: Yoav Katz <[email protected]>

* Moved to integer context_ids

Signed-off-by: Yoav Katz <[email protected]>

* Improved Task field mismatch error message

Signed-off-by: Yoav Katz <[email protected]>

* Added option to pass post processor to create_and_evaluate_dataset

Signed-off-by: Yoav Katz <[email protected]>

* Added descriptions and default values to classification

Added example of classification

Signed-off-by: Yoav Katz <[email protected]>

* Added descriptions

Signed-off-by: Yoav Katz <[email protected]>

* Reorganized extracted qa task and templates.

Signed-off-by: Yoav Katz <[email protected]>

* Updated json

Signed-off-by: Yoav Katz <[email protected]>

* Added documentation of default task and improved error message

Signed-off-by: Yoav Katz <[email protected]>

* Updated extraction template

Signed-off-by: Yoav Katz <[email protected]>

* Updated recommended template

Signed-off-by: Yoav Katz <[email protected]>

* Simplified and improved LLM as Judge

Signed-off-by: Yoav Katz <[email protected]>

* Improved error message.

Signed-off-by: Yoav Katz <[email protected]>

* Fixed classification default task

Signed-off-by: Yoav Katz <[email protected]>

* Fixed classification default task

Signed-off-by: Yoav Katz <[email protected]>

* Fixed unitxt

Signed-off-by: Yoav Katz <[email protected]>

* Fixed typo in template

Signed-off-by: Yoav Katz <[email protected]>

* Improved documentation

Signed-off-by: Yoav Katz <[email protected]>

* Simplified API (removed create_and_evaluate_dataset)

Signed-off-by: Yoav Katz <[email protected]>

* Improved error message

Signed-off-by: Yoav Katz <[email protected]>

* Changed wikitq back to extractive.

Signed-off-by: Yoav Katz <[email protected]>

* Fixed examples

Signed-off-by: Yoav Katz <[email protected]>

* Removed location reference in template.

Signed-off-by: Yoav Katz <[email protected]>

* Modified extractive template

Signed-off-by: Yoav Katz <[email protected]>

---------

Signed-off-by: Yoav Katz <[email protected]>
yoavkatz authored Dec 11, 2024
1 parent 688f500 commit 3019204
Showing 65 changed files with 768 additions and 291 deletions.
25 changes: 25 additions & 0 deletions docs/docs/adding_task.rst
@@ -58,6 +58,31 @@ in this case, the metrics calculate the accuracy of the sum of two integers, exp
It is the responsibility of the templates, via their post processors, to convert the model's textual predictions
into the `prediction_type`.
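
For illustration only (this sketch is not part of the commit's diff), post processors are attached to the template as a list of catalog processor names. The names below are taken from the classification example later on this page; which processors fit a given task is an assumption:

.. code-block:: python

    from unitxt.templates import InputOutputTemplate

    # Illustrative sketch: normalize the raw model output by keeping only the
    # first word and lowercasing it before it is compared with the references.
    template = InputOutputTemplate(
        input_format="Classify the sentiment of the following text: {text}",
        output_format="{label}",
        postprocessors=["processors.take_first_word", "processors.lower_case"],
    )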

A Task can define a `default_template` attribute that determines the default :ref:`Template <adding_template>` used to create the input to the model,
when the template is not explicitly specified by the user in the :ref:`Card <adding_dataset>`
definition or in the :ref:`load_dataset <evaluating_datasets>` template argument.

.. code-block:: python

    from unitxt.blocks import Task
    from unitxt.templates import InputOutputTemplate

    template = InputOutputTemplate(
        input_format="How much is {num1} plus {num2}?",
        output_format="{sum}",
    )

    task = Task(
        input_fields={"num1": int, "num2": int},
        reference_fields={"sum": int},
        prediction_type=int,
        metrics=[
            "metrics.sum_accuracy",
            "metrics.sum_accuracy_approximate"
        ],
        default_template=template
    )

To register the task to the catalog:

.. code-block:: python
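
    # The actual registration code is collapsed in this diff view; the lines
    # below are an illustrative sketch, assuming unitxt's add_to_catalog helper
    # and a hypothetical catalog name "tasks.sum" -- they are not the exact
    # lines from the repository.
    from unitxt.catalog import add_to_catalog

    add_to_catalog(task, "tasks.sum", overwrite=True)
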
45 changes: 31 additions & 14 deletions docs/docs/examples.rst
@@ -11,21 +11,11 @@ Each example comes with a self contained python file that you can run and later
Basic Usage
------------


Evaluate an existing dataset from the Unitxt catalog (No installation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.

Evaluate an existing dataset from the Unitxt catalog (with Unitxt installation)
Evaluate an existing dataset from the Unitxt catalog.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Demonstrates how to evaluate an existing entailment dataset (wnli) using Unitxt native APIs.
This approach is faster than using Huggingface APIs.
Demonstrates how to evaluate an existing entailment dataset using Unitxt.
Unitxt is used to load the dataset, generate the input to the model, run inference and evaluate the results.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_with_install.py>`__

@@ -51,6 +41,15 @@ It also shows how to use preprocessing steps to align the raw input of the datas

Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`, :ref:`Inference Engines <inference>`.

Evaluate a custom dataset - with existing predictions
=====================================================

These examples demonstrate how to evaluate datasets for different tasks when predictions are already available and no inference is required.

`Example code for QA task <https://github.com/IBM/unitxt/blob/main/examples/evaluate_qa_dataset_with_given_predictions.py>`__
`Example code for classification task <https://github.com/IBM/unitxt/blob/main/examples/evaluate_classification_dataset_with_given_predictions.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
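
The core pattern in these examples is to build the dataset with `create_dataset` and pass the ready-made predictions to `evaluate`. A minimal sketch, condensed from the QA example added in this commit (data simplified and made up; the task's default template and metrics are assumed to apply):

.. code-block:: python

    from unitxt.api import create_dataset, evaluate

    # Ready-made predictions -- no inference engine is involved.
    data = [{"question": "What is the capital of Texas?", "answers": ["Austin"]}]
    predictions = ["Austin"]

    dataset = create_dataset(task="tasks.qa.open", test_set=data)
    results = evaluate(predictions, dataset["test"])
    print(results[0]["score"]["global"])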

Evaluation usecases
-----------------------
@@ -228,6 +227,15 @@ and use the existing metrics to evaluate model results.

Related documentation: :ref:`RAG Guide <rag_support>`, :ref:`Response generation task <catalog.tasks.rag.response_generation>`, :ref:`Inference Engines <inference>`.

Evaluate RAG End-to-End - with existing predictions
=====================================================

This example demonstrates how to evaluate an end-to-end RAG system when the RAG system's outputs are already available.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_rag_end_to_end_dataset_with_given_predictions.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
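
Condensed from the example file added in this commit, the call pattern looks roughly as follows; note that the example passes `split="test"` to `create_dataset` and then feeds the returned dataset directly to `evaluate`, with predictions given as dictionaries in the RAG end-to-end response format (the single instance here is made up):

.. code-block:: python

    from unitxt.api import create_dataset, evaluate

    # One made-up instance; fields follow the tasks.rag.end_to_end example.
    test_set = [{
        "question": "What is the capital of Texas?",
        "question_id": 0,
        "reference_answers": ["Austin"],
        "reference_contexts": ["Austin is the capital of Texas."],
        "reference_context_ids": [0],
        "is_answerable_label": True,
    }]
    predictions = [{
        "answer": "Austin",
        "contexts": ["Austin is the capital of Texas."],
        "context_ids": [0],
        "is_answerable": True,
    }]

    dataset = create_dataset(
        task="tasks.rag.end_to_end", test_set=test_set, split="test", postprocessors=[]
    )
    results = evaluate(predictions, dataset)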

Multi-Modality
--------------

@@ -258,7 +266,7 @@ Evaluate Image-Text to Text Models with different templates and explore the sens

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.

Types and Serializers
Advanced topics
----------------------------

Custom Types and Serializers
@@ -270,3 +278,12 @@ This example show how to define new data types as well as the way these data typ

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`, :ref:`Inference Engines <inference>`.


Evaluate an existing dataset from the Unitxt catalog (No installation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.
74 changes: 4 additions & 70 deletions docs/docs/installation.rst
@@ -4,78 +4,12 @@
Installation
==============

Unitxt conforms to the HuggingFace Datasets and Metrics APIs, so it can be used without explicitly installing the unitxt package.

.. code-block:: python

    import evaluate
    from datasets import load_dataset
    from transformers import pipeline

    dataset = load_dataset('unitxt/data', 'card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20', trust_remote_code=True)
    testset = dataset["test"]
    model_inputs = testset["source"]
    model = pipeline(model='google/flan-t5-base')
    predictions = [output['generated_text'] for output in model(model_inputs, max_new_tokens=30)]
    metric = evaluate.load("unitxt/metric", trust_remote_code=True)
    dataset_with_scores = metric.compute(predictions=predictions, references=testset)
    [print(item) for item in dataset_with_scores[0]['score']['global'].items()]

Note that the `trust_remote_code=True` flag is required because, in the background, the HuggingFace API downloads and installs the
latest version of Unitxt from https://huggingface.co/datasets/unitxt/data/tree/main.

The core of Unitxt has minimal dependencies (none beyond HuggingFace evaluate).
Note that specific metrics or other operators may require additional dependencies, which are checked before they are first used.
An error message is printed if any required dependencies are missing.

The benefit of using the HuggingFace API approach is that you can load a Unitxt dataset just like any other HuggingFace dataset,
so it can be used in preexisting code without modifications.
However, this approach incurs extra overhead when HuggingFace downloads the unitxt package, and it does not expose all unitxt capabilities
(e.g., defining new datasets, metrics, templates, and more).

To get the full capabilities of Unitxt, install Unitxt locally from pip:
Install Unitxt locally from pip:

.. code-block:: bash

    pip install unitxt

You can then use the API:

.. code-block:: python

    from unitxt import load_dataset, evaluate
    from unitxt.inference import HFPipelineBasedInferenceEngine

    dataset = load_dataset('card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20')
    test_dataset = dataset["test"]

    model_name = "google/flan-t5-large"
    inference_model = HFPipelineBasedInferenceEngine(model_name=model_name, max_new_tokens=32)
    predictions = inference_model.infer(test_dataset)

    dataset_with_scores = evaluate(predictions=predictions, data=test_dataset)
    [print(item) for item in dataset_with_scores[0]['score']['global'].items()]

.. warning::

    It's important not to mix calls to the Unitxt direct APIs and the HuggingFace APIs in the same program. Use either
    the direct Unitxt APIs or the HuggingFace APIs to load datasets and metrics.

If you get an error message like:

.. code-block::

    datasets_modules.datasets.unitxt--data.df049865776d8814049d4543a4068e50cda79b1558dc933047f4a41d087cc120.hf_utils.UnitxtVersionsConflictError:
    Located installed unitxt version 1.9.0 that is older than unitxt HuggingFace dataset version 1.10.0.

It means that you are loading datasets using the HuggingFace API, but you also have a local version of Unitxt
installed, and the versions are not compatible. To fix this issue, you should choose one of the following three options:
* Update the locally installed Unitxt to the Unitxt HuggingFace dataset version,
* Uninstall the local Unitxt package (in case you don't require access to the Unitxt direct APIs), or
* Change the code to load the datasets using the direct Unitxt APIs, without using the HuggingFace API.

The core of Unitxt has a minimal set of requirements.
Specific metrics, inference APIs, and other assets may require additional dependencies.
A clear error message stating what needs to be installed is provided if these dependencies are not available when the specific asset is used.
36 changes: 36 additions & 0 deletions examples/evaluate_classification_dataset_with_given_predictions.py
@@ -0,0 +1,36 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()


classes = ["positive", "negative"]


# Set up classification examples as a list of dictionaries
dataset = [
    {"text": "I am happy.", "label": "positive", "classes": classes},
    {"text": "It was a great movie.", "label": "positive", "classes": classes},
    {"text": "I never felt so bad", "label": "negative", "classes": classes},
]

predictions = ["Positive.", "negative.", "negative"]

dataset = create_dataset(
    task="tasks.classification.multi_class",
    test_set=dataset,
    postprocessors=["processors.take_first_word", "processors.lower_case"],
)

evaluated_dataset = evaluate(predictions, dataset["test"])

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "prediction",
            "references",
            "score",
        ],
    )
33 changes: 33 additions & 0 deletions examples/evaluate_qa_dataset_with_given_predictions.py
@@ -0,0 +1,33 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()

# Set up question-answer pairs as a list of dictionaries
dataset = [
    {"question": "What is the capital of Texas?", "answers": ["Austin"]},
    {"question": "What is the color of the sky?", "answers": ["Blue"]},
]

predictions = ["San Antonio", "blue"]

dataset = create_dataset(
    task="tasks.qa.open",
    test_set=dataset,
    metrics=[
        "metrics.qa.open.recommended_no_gpu",
        "metrics.qa.open.recommended_llm_as_judge",
    ],
)

evaluated_dataset = evaluate(predictions, dataset["test"])

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "score",
        ],
    )
64 changes: 64 additions & 0 deletions examples/evaluate_rag_end_to_end_dataset_with_given_predictions.py
@@ -0,0 +1,64 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()

# Context passages referenced by the instances below
contexts = [
    "Austin is the capital of Texas.",
    "Houston is in Texas",
    "Houston is the largest city in the state but not the capital of it.",
]

# Set up the RAG reference data as a list of dictionaries
dataset = [
    {
        "question": "What is the capital of Texas?",
        "question_id": 0,
        "reference_answers": ["Austin"],
        "reference_contexts": [contexts[0]],
        "reference_context_ids": [0],
        "is_answerable_label": True,
    },
    {
        "question": "Which is the largest city in Texas?",
        "question_id": 1,
        "reference_answers": ["Houston"],
        "reference_contexts": [contexts[1], contexts[2]],
        "reference_context_ids": [1, 2],
        "is_answerable_label": True,
    },
]

predictions = [
    {
        "answer": "Houston",
        "contexts": [contexts[2]],
        "context_ids": [2],
        "is_answerable": True,
    },
    {
        "answer": "Houston",
        "contexts": [contexts[2]],
        "context_ids": [2],
        "is_answerable": True,
    },
]

dataset = create_dataset(
    task="tasks.rag.end_to_end",
    test_set=dataset,
    split="test",
    postprocessors=[],
)

evaluated_dataset = evaluate(predictions, dataset)

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "score",
        ],
    )