
Commit 3019204

A simple way to create and evaluate given a 'task' in the catalog and python data structure (#1413)

* - added API methods to create and evaluate datasets from python data structures
- created simplified QA task and QA metric catalogs
- added default templates to Task, so the user does not need to specify one
- added input checks in llm_as_judges and standard recipe

Signed-off-by: Yoav Katz <[email protected]>

* Added example of create_and_evaluate for rag

Signed-off-by: Yoav Katz <[email protected]>

* Added type to Rag end to end response

Also added documentation

Signed-off-by: Yoav Katz <[email protected]>

* Added option for integer ids in Retrieval metrics

Signed-off-by: Yoav Katz <[email protected]>

* Moved to integer context_ids

Signed-off-by: Yoav Katz <[email protected]>

* Improved Task field mismatch error message

Signed-off-by: Yoav Katz <[email protected]>

* Added option to pass post processor to create_and_evaluate_dataset

Signed-off-by: Yoav Katz <[email protected]>

* Added descriptions and default values to classification

Added example of classification

Signed-off-by: Yoav Katz <[email protected]>

* Added descriptions

Signed-off-by: Yoav Katz <[email protected]>

* Reorganized extracted qa task and templates.

Signed-off-by: Yoav Katz <[email protected]>

* Updated json

Signed-off-by: Yoav Katz <[email protected]>

* Added documentation of default task and improved error message

Signed-off-by: Yoav Katz <[email protected]>

* Updated extraction template

Signed-off-by: Yoav Katz <[email protected]>

* Updated recommended template

Signed-off-by: Yoav Katz <[email protected]>

* Simplified and improved LLM as Judge

Signed-off-by: Yoav Katz <[email protected]>

* Improved error message.

Signed-off-by: Yoav Katz <[email protected]>

* Fixed classification default task

Signed-off-by: Yoav Katz <[email protected]>

* Fixed classification default task

Signed-off-by: Yoav Katz <[email protected]>

* Fixed unitxt

Signed-off-by: Yoav Katz <[email protected]>

* Fixed typo in template

Signed-off-by: Yoav Katz <[email protected]>

* Improved documentation

Signed-off-by: Yoav Katz <[email protected]>

* Simplified API (removed create_and_evaluate_dataset)

Signed-off-by: Yoav Katz <[email protected]>

* Improved error message

Signed-off-by: Yoav Katz <[email protected]>

* Changed wikitq back to extractive.

Signed-off-by: Yoav Katz <[email protected]>

* Fixed examples

Signed-off-by: Yoav Katz <[email protected]>

* Removed location reference in template.

Signed-off-by: Yoav Katz <[email protected]>

* Modified extractive template

Signed-off-by: Yoav Katz <[email protected]>

---------

Signed-off-by: Yoav Katz <[email protected]>
yoavkatz authored Dec 11, 2024
1 parent 688f500 commit 3019204
Showing 65 changed files with 768 additions and 291 deletions.
25 changes: 25 additions & 0 deletions docs/docs/adding_task.rst
@@ -58,6 +58,31 @@ in this case, the metrics calculate the accuracy of the sum of two integers, exp
It is the responsibility of the templates, via their post processors, to convert the model's textual predictions
into the `prediction_type`.
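
For illustration only (this sketch is not part of the commit's diff), post processors are attached to the template as a list of catalog processor names. The names below are taken from the classification example later on this page; which processors fit a given task is an assumption:

.. code-block:: python

    from unitxt.templates import InputOutputTemplate

    # Illustrative sketch: normalize the raw model output by keeping only the
    # first word and lowercasing it before it is compared with the references.
    template = InputOutputTemplate(
        input_format="Classify the sentiment of the following text: {text}",
        output_format="{label}",
        postprocessors=["processors.take_first_word", "processors.lower_case"],
    )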

A Task can define a `default_template` attribute that determines the default :ref:`Template <adding_template>` used to create the input to the model,
when the template is not explicitly specified by the user in the :ref:`Card <adding_dataset>`
definition or in the :ref:`load_dataset <evaluating_datasets>` template argument.

.. code-block:: python

    from unitxt.blocks import Task
    from unitxt.templates import InputOutputTemplate

    template = InputOutputTemplate(
        input_format="How much is {num1} plus {num2}?",
        output_format="{sum}",
    )

    task = Task(
        input_fields={"num1": int, "num2": int},
        reference_fields={"sum": int},
        prediction_type=int,
        metrics=[
            "metrics.sum_accuracy",
            "metrics.sum_accuracy_approximate"
        ],
        default_template=template
    )

To register the task to the catalog:

.. code-block:: python
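
    # The actual registration code is collapsed in this diff view; the lines
    # below are an illustrative sketch, assuming unitxt's add_to_catalog helper
    # and a hypothetical catalog name "tasks.sum" -- they are not the exact
    # lines from the repository.
    from unitxt.catalog import add_to_catalog

    add_to_catalog(task, "tasks.sum", overwrite=True)
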
45 changes: 31 additions & 14 deletions docs/docs/examples.rst
@@ -11,21 +11,11 @@ Each example comes with a self contained python file that you can run and later
Basic Usage
------------


Evaluate an existing dataset from the Unitxt catalog (No installation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.

Evaluate an existing dataset from the Unitxt catalog (with Unitxt installation)
Evaluate an existing dataset from the Unitxt catalog.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Demonstrates how to evaluate an existing entailment dataset (wnli) using Unitxt native APIs.
This approach is faster than using Huggingface APIs.
Demonstrates how to evaluate an existing entailment dataset using Unitxt.
Unitxt is used to load the dataset, generate the input to the model, run inference and evaluate the results.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_with_install.py>`__

@@ -51,6 +41,15 @@ It also shows how to use preprocessing steps to align the raw input of the datas

Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`, :ref:`Inference Engines <inference>`.

Evaluate a custom dataset - with existing predictions
=====================================================

These examples demonstrate how to evaluate datasets for different tasks when predictions are already available and no inference is required.

`Example code for QA task <https://github.com/IBM/unitxt/blob/main/examples/evaluate_qa_dataset_with_given_predictions.py>`__
`Example code for classification task <https://github.com/IBM/unitxt/blob/main/examples/evaluate_classification_dataset_with_given_predictions.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
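
The core pattern in these examples is to build the dataset with `create_dataset` and pass the ready-made predictions to `evaluate`. A minimal sketch, condensed from the QA example added in this commit (data simplified and made up; the task's default template and metrics are assumed to apply):

.. code-block:: python

    from unitxt.api import create_dataset, evaluate

    # Ready-made predictions -- no inference engine is involved.
    data = [{"question": "What is the capital of Texas?", "answers": ["Austin"]}]
    predictions = ["Austin"]

    dataset = create_dataset(task="tasks.qa.open", test_set=data)
    results = evaluate(predictions, dataset["test"])
    print(results[0]["score"]["global"])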

Evaluation usecases
-----------------------
@@ -228,6 +227,15 @@ and use the existing metrics to evaluate model results.

Related documentation: :ref:`RAG Guide <rag_support>`, :ref:`Response generation task <catalog.tasks.rag.response_generation>`, :ref:`Inference Engines <inference>`.

Evaluate RAG End-to-End - with existing predictions
=====================================================

This example demonstrates how to evaluate an end-to-end RAG system when the RAG system's outputs are already available.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_rag_end_to_end_dataset_with_given_predictions.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
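
Condensed from the example file added in this commit, the call pattern looks roughly as follows; note that the example passes `split="test"` to `create_dataset` and then feeds the returned dataset directly to `evaluate`, with predictions given as dictionaries in the RAG end-to-end response format (the single instance here is made up):

.. code-block:: python

    from unitxt.api import create_dataset, evaluate

    # One made-up instance; fields follow the tasks.rag.end_to_end example.
    test_set = [{
        "question": "What is the capital of Texas?",
        "question_id": 0,
        "reference_answers": ["Austin"],
        "reference_contexts": ["Austin is the capital of Texas."],
        "reference_context_ids": [0],
        "is_answerable_label": True,
    }]
    predictions = [{
        "answer": "Austin",
        "contexts": ["Austin is the capital of Texas."],
        "context_ids": [0],
        "is_answerable": True,
    }]

    dataset = create_dataset(
        task="tasks.rag.end_to_end", test_set=test_set, split="test", postprocessors=[]
    )
    results = evaluate(predictions, dataset)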

Multi-Modality
--------------

@@ -258,7 +266,7 @@ Evaluate Image-Text to Text Models with different templates and explore the sens

Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.

Types and Serializers
Advanced topics
----------------------------

Custom Types and Serializers
@@ -270,3 +278,12 @@ This example show how to define new data types as well as the way these data typ

Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`, :ref:`Inference Engines <inference>`.


Evaluate an existing dataset from the Unitxt catalog (No installation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.

`Example code <https://github.com/IBM/unitxt/blob/main/examples/evaluate_existing_dataset_no_install.py>`__

Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.
74 changes: 4 additions & 70 deletions docs/docs/installation.rst
@@ -4,78 +4,12 @@
Installation
==============

Unitxt conforms to the HuggingFace Datasets and Metrics APIs, so it can be used without explicitly installing the unitxt package.

.. code-block:: python

    import evaluate
    from datasets import load_dataset
    from transformers import pipeline

    dataset = load_dataset('unitxt/data', 'card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20', trust_remote_code=True)
    testset = dataset["test"]
    model_inputs = testset["source"]
    model = pipeline(model='google/flan-t5-base')
    predictions = [output['generated_text'] for output in model(model_inputs, max_new_tokens=30)]
    metric = evaluate.load("unitxt/metric", trust_remote_code=True)
    dataset_with_scores = metric.compute(predictions=predictions, references=testset)
    [print(item) for item in dataset_with_scores[0]['score']['global'].items()]

Note that the `trust_remote_code=True` flag is required because, in the background, the HuggingFace API downloads and installs the
latest version of Unitxt from https://huggingface.co/datasets/unitxt/data/tree/main.

The core of Unitxt has minimal dependencies (none beyond HuggingFace evaluate).
Note that specific metrics or other operators may require additional dependencies, which are checked before they are first used.
An error message is printed if any required dependencies are missing.

The benefit of using the HuggingFace API approach is that you can load a Unitxt dataset just like any other HuggingFace dataset,
so it can be used in preexisting code without modifications.
However, this approach incurs extra overhead when HuggingFace downloads the unitxt package, and it does not expose all unitxt capabilities
(e.g., defining new datasets, metrics, templates, and more).

To get the full capabilities of Unitxt, install Unitxt locally from pip:
Install Unitxt locally from pip:

.. code-block:: bash

    pip install unitxt

You can then use the API:

.. code-block:: python

    from unitxt import load_dataset, evaluate
    from unitxt.inference import HFPipelineBasedInferenceEngine

    dataset = load_dataset('card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20')
    test_dataset = dataset["test"]

    model_name = "google/flan-t5-large"
    inference_model = HFPipelineBasedInferenceEngine(model_name=model_name, max_new_tokens=32)
    predictions = inference_model.infer(test_dataset)

    dataset_with_scores = evaluate(predictions=predictions, data=test_dataset)
    [print(item) for item in dataset_with_scores[0]['score']['global'].items()]

.. warning::

    It's important not to mix calls to the Unitxt direct APIs and the HuggingFace APIs in the same program. Use either
    the direct Unitxt APIs or the HuggingFace APIs to load datasets and metrics.

If you get an error message like:

.. code-block::

    datasets_modules.datasets.unitxt--data.df049865776d8814049d4543a4068e50cda79b1558dc933047f4a41d087cc120.hf_utils.UnitxtVersionsConflictError:
    Located installed unitxt version 1.9.0 that is older than unitxt HuggingFace dataset version 1.10.0.

It means that you are loading datasets using the HuggingFace API, but you also have a local version of Unitxt
installed, and the versions are not compatible. To fix this issue, you should choose one of the following three options:
* Update the locally installed Unitxt to the Unitxt HuggingFace dataset version,
* Uninstall the local Unitxt package (in case you don't require access to the Unitxt direct APIs), or
* Change the code to load the datasets using the direct Unitxt APIs, without using the HuggingFace API.

The core of Unitxt has a minimal set of requirements.
Specific metrics, inference APIs, and other assets may require additional dependencies.
A clear error message stating what needs to be installed is provided if these dependencies are not available when the specific asset is used.
36 changes: 36 additions & 0 deletions examples/evaluate_classification_dataset_with_given_predictions.py
@@ -0,0 +1,36 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()


classes = ["positive", "negative"]


# Set up classification examples as a list of dictionaries
dataset = [
    {"text": "I am happy.", "label": "positive", "classes": classes},
    {"text": "It was a great movie.", "label": "positive", "classes": classes},
    {"text": "I never felt so bad", "label": "negative", "classes": classes},
]

predictions = ["Positive.", "negative.", "negative"]

dataset = create_dataset(
    task="tasks.classification.multi_class",
    test_set=dataset,
    postprocessors=["processors.take_first_word", "processors.lower_case"],
)

evaluated_dataset = evaluate(predictions, dataset["test"])

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "prediction",
            "references",
            "score",
        ],
    )
33 changes: 33 additions & 0 deletions examples/evaluate_qa_dataset_with_given_predictions.py
@@ -0,0 +1,33 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()

# Set up question-answer pairs as a list of dictionaries
dataset = [
    {"question": "What is the capital of Texas?", "answers": ["Austin"]},
    {"question": "What is the color of the sky?", "answers": ["Blue"]},
]

predictions = ["San Antonio", "blue"]

dataset = create_dataset(
    task="tasks.qa.open",
    test_set=dataset,
    metrics=[
        "metrics.qa.open.recommended_no_gpu",
        "metrics.qa.open.recommended_llm_as_judge",
    ],
)

evaluated_dataset = evaluate(predictions, dataset["test"])

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "score",
        ],
    )
64 changes: 64 additions & 0 deletions examples/evaluate_rag_end_to_end_dataset_with_given_predictions.py
@@ -0,0 +1,64 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate
from unitxt.text_utils import print_dict

logger = get_logger()

# Context passages referenced by the instances below
contexts = [
    "Austin is the capital of Texas.",
    "Houston is in Texas",
    "Houston is the largest city in the state but not the capital of it.",
]

# Set up the RAG reference data as a list of dictionaries
dataset = [
    {
        "question": "What is the capital of Texas?",
        "question_id": 0,
        "reference_answers": ["Austin"],
        "reference_contexts": [contexts[0]],
        "reference_context_ids": [0],
        "is_answerable_label": True,
    },
    {
        "question": "Which is the largest city in Texas?",
        "question_id": 1,
        "reference_answers": ["Houston"],
        "reference_contexts": [contexts[1], contexts[2]],
        "reference_context_ids": [1, 2],
        "is_answerable_label": True,
    },
]

predictions = [
    {
        "answer": "Houston",
        "contexts": [contexts[2]],
        "context_ids": [2],
        "is_answerable": True,
    },
    {
        "answer": "Houston",
        "contexts": [contexts[2]],
        "context_ids": [2],
        "is_answerable": True,
    },
]

dataset = create_dataset(
    task="tasks.rag.end_to_end",
    test_set=dataset,
    split="test",
    postprocessors=[],
)

evaluated_dataset = evaluate(predictions, dataset)

# Print results
for instance in evaluated_dataset:
    print_dict(
        instance,
        keys_to_print=[
            "score",
        ],
    )