diff --git a/docs/evaluation.md b/docs/evaluation.md
index 5102ec901..719fc3225 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -1,7 +1,7 @@
 # Evaluation
 
-Evaluations are a form of testing that helps you validate your LLM's responses
-and ensure they meet your quality bar.
+Evaluations are a form of testing that helps you validate the responses of your LLM
+and ensure they meet your quality standards.
 
 Firebase Genkit supports third-party evaluation tools through plugins, paired
 with powerful observability features that provide insight into the runtime state
@@ -10,100 +10,169 @@ data including inputs, outputs, and information from intermediate steps to
 evaluate the end-to-end quality of LLM responses as well as understand the
 performance of your system's building blocks.
 
-For example, if you have a RAG flow, Genkit will extract the set of documents
-that was returned by the retriever so that you can evaluate the quality of your
-retriever while it runs in the context of the flow as shown below with the
-Genkit faithfulness and answer relevancy metrics:
+### Types of evaluations
 
-```ts
-import { genkit } from 'genkit';
-import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';
-import { vertexAI, textEmbedding004, gemini15Flash } from '@genkit-ai/vertexai';
+Genkit supports two types of evaluations:
 
-const ai = genkit({
-  plugins: [
-    vertexAI(),
-    genkitEval({
-      judge: gemini15Flash,
-      metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
-      embedder: textEmbedding004, // GenkitMetric.ANSWER_RELEVANCY requires an embedder
-    }),
-  ],
-  // ...
-});
-```
+* **Inference-based evaluation**: This type of evaluation is run against a collection of pre-determined inputs, and the corresponding outputs are assessed for quality.
+
+  This is the most common evaluation type, suitable for most use cases. This approach tests the actual output of a system for each evaluation run.
 
-**Note:** The configuration above requires installing the `genkit`,
-`@genkit-ai/googleai`, `@genkit-ai/evaluator` and `@genkit-ai/vertexai`
-packages.
+  The quality assessment can be done manually, by visually inspecting the results, or automated, by using an evaluation metric.
 
-```posix-terminal
-  npm install @genkit-ai/evaluator @genkit-ai/vertexai
-```
+* **Raw evaluation**: This type of evaluation directly assesses the quality of the provided inputs without running any inference. It is typically used with automated evaluation using metrics. All required fields for evaluation (`context`, `output`) must be present in the input dataset. This is useful when your data comes from an external source (e.g., collected from your production traces) and you simply want an objective measurement of the quality of the collected data.
 
-Start by defining a set of inputs that you want to use as an input dataset
-called `testInputs.json`. This input dataset represents the test cases you will
-use to generate output for evaluation.
+  We cover this approach in more detail in the Advanced use section below.
 
-```json
-[
+Let's take a look at how to run inference-based evaluations using Genkit.
+
+## Quick start
+
+### Setup
+1. Use an existing Genkit app or create a new one by following our [Getting started](get-started) guide.
+2. Add the following code to define a simple RAG application to evaluate. For this guide, we use a dummy retriever that always returns the same documents.
+ +```js +import { genkit, z, Document } from "genkit"; +import { + googleAI, + gemini15Flash, + gemini15Pro, +} from "@genkit-ai/googleai"; + +// Initialize Genkit +export const ai = genkit ({ + plugins: [ + googleAI(), + ] +}); + +// Dummy retriever that always returns the same docs +export const dummyRetriever = ai.defineRetriever( { - "input": "What is the French word for Cheese?" + name: "dummyRetriever", }, - { - "input": "What green vegetable looks like cauliflower?" + async (i) => { + const facts = [ + "Dog is man's best friend", + "Dogs have evolved and were domesticated from wolves", + ]; + // Just return facts as documents. + return { documents: facts.map((t) => Document.fromText(t)) }; } -] -``` - -If the evaluator requires a reference output for evaluating a flow, you can pass both -input and reference output using this format instead: +); -```json -[ - { - "input": "What is the French word for Cheese?", - "reference": "Fromage" +// A simple question answering flow +export const qaFlow = ai.defineFlow({ + name: 'qaFlow', + inputSchema: z.string(), + outputSchema: z.string(), }, - { - "input": "What green vegetable looks like cauliflower?", - "reference": "Broccoli" + async (query) => { + const factDocs = await ai.retrieve({ + retriever: dummyRetriever, + query, + options: { k: 2 }, + }); + + const llmResponse = await ai.generate({ + model: gemini15Flash, + prompt: `Answer this question with the given context ${query}`, + docs: factDocs, + }); + return llmResponse.text; } -] +); ``` -Note that you can use any JSON data type in the input JSON file. Genkit will pass them along with the same data type to your flow. +3. (Optional) Add evaluation metrics to your application to use while evaluating. In this guide, we will use the `MALICIOUSNESS` metric from the `genkitEval` plugin. -You can then use the `eval:flow` command to evaluate your flow against the test -cases provided in `testInputs.json`. +```js +import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator"; +import { gemini15Pro } from "@genkit-ai/googleai"; -```posix-terminal -genkit eval:flow menuSuggestionFlow --input testInputs.json +export const ai = genkit ({ + plugins: [ + ... + // Add this plugin to your Genkit initialization block + genkitEval({ + judge: gemini15Pro, + metrics: [GenkitMetric.MALICIOUSNESS], + }), + ] +}); ``` -If your flow requires auth, you may specify it using the `--auth` argument: +**Note:** The configuration above requires installing the `@genkit-ai/evaluator` package. ```posix-terminal -genkit eval:flow menuSuggestionFlow --input testInputs.json --auth "{\"email_verified\": true}" + npm install @genkit-ai/evaluator ``` -You can then see evaluation results in the Developer UI by running: +4. Start your Genkit application ```posix-terminal -genkit start +genkit start -- ``` -Then navigate to `localhost:4000/evaluate`. +### Create a dataset -Alternatively, you can provide an output file to inspect the output in a JSON -file. +Let us create a dataset to define the examples we want to use for evaluating our flow. -```posix-terminal -genkit eval:flow menuSuggestionFlow --input testInputs.json --output eval-result.json -``` +1. Go to the Dev UI at `http://localhost:4000` and click the **Datasets** button to open the Datasets page. + +2. Click on the **Create Dataset** button to open the create dataset dialog. + + a. Provide a `datasetId` to your new dataset. In this guide, we will be using `myFactsQaDataset`. + + b. Select `Flow` dataset type. + + c. 
Leave the validation target field empty and click **Save**.
+
+3. You will be redirected to your new dataset page. This dataset is empty; let's add a few examples to it.
+
+    a. Click the **Add example** button to open the example editor panel.
+
+    b. Only the `input` field is required. Enter `"Who is man's best friend?"` in the input field, and click **Save**. You will see that this example has been added to your dataset.
+
+    c. Repeat steps (a) and (b) a couple more times to add more examples. In this guide, we add the following example inputs to the dataset:
+
+    ```
+    "Can I give milk to my cats?"
+    "Which animals did dogs evolve from?"
+    ```
+
+    By the end of this step, your dataset should contain three examples with the values listed above.
 
-**Note:** Below you can see an example of how an LLM can help you generate the
-test cases.
+### Run evaluation and view results
+
+Now we're ready to evaluate our flow. Visit the `Evaluations` tab in the Dev UI and click the **Run new evaluation** button to get started.
+
+1. Select the `Flow` radio button, as we want to evaluate a flow.
+
+2. Select `qaFlow` as the target flow to evaluate.
+
+3. Select `myFactsQaDataset` as the target dataset to use for evaluation.
+
+4. (Optional) If you have installed an evaluator metric using a Genkit plugin, you can see those metrics on this page. Select the metrics that you want to use with this evaluation run. This step is optional; if you skip it, the evaluation run still returns results, just without any metric scores attached.
+
+5. Finally, click **Run evaluation** to start the evaluation. Depending on the flow you're testing, this may take a while. Once the evaluation is complete, you will see a success message with a link to view the results. Click the link to go to the Evaluation details page.
+
+You can see the details of your evaluation run on this page, including the original input, the extracted context, and metric scores (if any).
+
+## Core concepts
+
+### Terminology
+
+- **Evaluation**: An evaluation is a process in which the performance of a system is assessed. In Genkit, such a system is usually a Genkit primitive, such as a flow or a model. An evaluation can be automated or manual (human evaluation).
+
+- **Bulk inference**: Inference is simply the act of running an input through a flow or model to get the corresponding output. Bulk inference involves performing inference on multiple inputs simultaneously.
+
+- **Metric**: An evaluation metric is a criterion against which an inference is scored, e.g. accuracy, faithfulness, maliciousness, whether the output is in English, etc.
+
+- **Dataset**: A dataset is a collection of examples to use for inference-based evaluation. A dataset typically consists of `input` and optional `reference` fields. The `reference` field does not affect the inference step of evaluation, but it is passed verbatim to any evaluation metrics. In Genkit, you can create a dataset through the Dev UI.
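+
+  For example, an entry in a dataset for `qaFlow` might pair a question with an optional reference answer. The values below are illustrative; the `reference` is only consumed by evaluation metrics, never by the flow itself:
+
+  ```ts
+  // Illustrative dataset entries for qaFlow (hypothetical values).
+  const exampleEntries = [
+    { input: "Who is man's best friend?", reference: "The dog" },
+    { input: "Can I give milk to my cats?" }, // `reference` is optional
+  ];
+  ```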
 
 ## Supported evaluators
 
@@ -112,9 +181,9 @@ test cases.
 Genkit includes a small number of native evaluators, inspired by RAGAS, to help
 you get started:
 
-* Faithfulness
-* Answer Relevancy
-* Maliciousness
+* Faithfulness -- Measures the factual consistency of the generated answer against the given context
+* Answer Relevancy -- Assesses how pertinent the generated answer is to the given prompt
+* Maliciousness -- Measures whether the generated output intends to deceive, harm, or exploit
 
 ### Evaluator plugins
 
@@ -122,58 +191,146 @@ Genkit supports additional evaluators through plugins like the VertexAI Rapid Ev
 
 ## Advanced use
 
-`eval:flow` is a convenient way to quickly evaluate the flow, but sometimes you
-might need more control over evaluation steps. This may occur if you are using a
-different framework and already have some output you would like to evaluate. You
-can perform all the steps that `eval:flow` performs semi-manually.
+### Evaluation using the CLI
 
-You can batch run your Genkit flow and add a unique label to the run which then
-will be used to extract an evaluation dataset (a set of inputs, outputs, and
-contexts).
+The Genkit CLI provides a rich set of commands for performing evaluations. This is especially useful in environments where the Dev UI is not available (e.g. in a CI/CD workflow).
+
+The Genkit CLI provides three main evaluation commands: `eval:flow`, `eval:extractData`, and `eval:run`.
+
+#### `eval:flow` command
 
-Run the flow over your test inputs:
+The `eval:flow` command runs inference-based evaluation on an input dataset. This dataset may be provided either as a JSON file or by referencing an existing dataset in your Genkit runtime.
+
 ```posix-terminal
-genkit flow:batchRun myRagFlow test_inputs.json --output flow_outputs.json --label customLabel
-```
+# Referencing an existing dataset
+genkit eval:flow qaFlow --input myFactsQaDataset
 
-Extract the evaluation data:
+# or, using a dataset from a file
+genkit eval:flow qaFlow --input testInputs.json
+```
+
+Note: Make sure that you start your Genkit app before running these CLI commands:
 
 ```posix-terminal
-genkit eval:extractData myRagFlow --label customLabel --output customLabel_dataset.json
+genkit start --
 ```
 
-The exported data will be output as a JSON file with each testCase in the
-following format:
+Here, `testInputs.json` should be an array of objects containing an `input` field and an optional `reference` field, as shown below:
 
 ```json
 [
   {
-    "testCaseId": string,
-    "input": string,
-    "output": string,
-    "context": array of strings,
-    "traceIds": array of strings,
+    "input": "What is the French word for Cheese?"
+  },
+  {
+    "input": "What green vegetable looks like cauliflower?",
+    "reference": "Broccoli"
   }
 ]
 ```
 
-The data extractor will automatically locate retrievers and add the produced
-docs to the context array. By default, `eval:run` will run against all
-configured evaluators, and like `eval:flow`, results for `eval:run` will appear
-in the evaluation page of Developer UI, located at `localhost:4000/evaluate`.
+If your flow requires auth, you may specify it using the `--auth` argument:
+
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json --auth "{\"email_verified\": true}"
+```
+
+By default, the `eval:flow` and `eval:run` commands use all available metrics for evaluation. To run on a subset of the configured evaluators, use the `--evaluators` flag and
+provide a comma-separated list of evaluators by name:
+
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
+```
+
+You can view the results of your evaluation run in the Dev UI at `localhost:4000/evaluate`.
+
+#### `eval:extractData` and `eval:run` commands
+
+To support *raw evaluation*, Genkit provides tools to extract data from traces and run evaluation metrics on the extracted data. This is useful, for example, if you are using a
+different framework for evaluation or if you are collecting inferences from a different environment and want to test the output quality locally.
+
+You can batch run your Genkit flow and add a unique label to the run, which is then
+used to extract an *evaluation dataset*. An evaluation dataset is a collection of inputs for evaluation metrics that can be scored *without* running any further inference.
+
+Run your flow over your test inputs:
+
+```posix-terminal
+genkit flow:batchRun qaFlow testInputs.json --label firstRunSimple
+```
+
+Extract the evaluation data:
+
+```posix-terminal
+genkit eval:extractData qaFlow --label firstRunSimple --output factsEvalDataset.json
+```
+
+The exported data has a different format from the dataset format you've seen earlier, because it is intended to be used with evaluation metrics directly, without any inference step. The extracted data has the following structure:
+
+```ts
+Array<{
+  testCaseId: string,
+  input: any,
+  output: any,
+  context: any[],
+  traceIds: string[],
+}>;
+```
+
+The data extractor will automatically locate retrievers and add the produced docs to the context array. You can run evaluation metrics on this extracted dataset using the `eval:run` command.
+
+```posix-terminal
+genkit eval:run factsEvalDataset.json
+```
+
+By default, `eval:run` will run against all configured evaluators, and, like `eval:flow`, results for `eval:run` will appear in the evaluation page of the Developer UI, located at `localhost:4000/evaluate`.
 
 ### Custom extractors
 
-You can also provide custom extractors to be used in `eval:extractData` and
-`eval:flow` commands. Custom extractors allow you to override the default
-extraction logic giving you more power in creating datasets and evaluating them.
+
+Genkit provides reasonable default logic for extracting the necessary fields (`input`, `output` and `context`)
+while doing an evaluation. However, you may find that you need more control over the extraction logic for these fields. Genkit supports custom extractors to achieve this. You can provide custom extractors to be used in the `eval:extractData` and
+`eval:flow` commands.
+
+Let us first introduce an auxiliary step in our `qaFlow` example:
+
+```js
+export const qaFlow = ai.defineFlow({
+    name: 'qaFlow',
+    inputSchema: z.string(),
+    outputSchema: z.string(),
+  },
+  async (query) => {
+    const factDocs = await ai.retrieve({
+      retriever: dummyRetriever,
+      query,
+      options: { k: 2 },
+    });
+    // `run` is imported from the 'genkit' package.
+    const factDocsModified = await run('factModified', async () => {
+      // Keep only facts that are considered silly. This is a hypothetical step
+      // for demo purposes; you may perform any arbitrary task inside a step
+      // and reference it in custom extractors.
+      //
+      // Assume you have a method that checks if a fact is silly.
+      return factDocs.filter(d => isSillyFact(d.text));
+    });
+
+    const llmResponse = await ai.generate({
+      model: gemini15Flash,
+      prompt: `Answer this question with the given context ${query}`,
+      // Generation still uses the original docs; the step output above is
+      // only surfaced to evaluation through the custom extractor below.
+      docs: factDocs,
+    });
+    return llmResponse.text;
+  }
+);
+```
+
+Now let us configure a custom extractor to use the output of the `factModified` step when evaluating this flow. To configure custom extractors, add a tools config file named `genkit-tools.conf.js` to your project root if you don't have one already.
 
 ```posix-terminal
-cd $GENKIT_PROJECT_HOME
+cd /path/to/your/genkit/app
 touch genkit-tools.conf.js
 ```
 
@@ -184,19 +341,22 @@ In the tools config file, add the following code:
 
 module.exports = {
   evaluators: [
     {
-      actionRef: '/flow/myFlow',
+      actionRef: '/flow/qaFlow',
       extractors: {
-        context: { outputOf: 'foo-step' },
-        output: 'bar-step',
+        context: { outputOf: 'factModified' },
       },
     },
   ],
 };
 ```
 
-In this sample, you configure an extractor for `myFlow` flow. The config
-overrides the extractors for `context` and `output` fields and uses the default
-logic for the `input` field.
+This config overrides the default extractors of Genkit's tooling, specifically changing what is considered `context` when evaluating this flow.
+
+Run the evaluation again, and you will see that the context is now populated with the output of the `factModified` step.
+
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json
+```
 
 The specification of the evaluation extractors is as follows:
 
@@ -212,59 +372,42 @@ The specification of the evaluation extractors is as follows:
     inputOf: 'foo-step' }` would extract the input of step `foo-step` for this key.
-  * `(trace) => string;` - For further flexibility, you can provide a
-    function that accepts a Genkit trace and returns a `string`, and specify
+  * `(trace) => any;` - For further flexibility, you can provide a
+    function that accepts a Genkit trace and returns a value of `any` type, and specify
     the extraction logic inside this function. Refer to
     `genkit/genkit-tools/common/src/types/trace.ts` for the exact TraceData
     schema.
 
-**Note:** The extracted data for all these steps will be a JSON string. The
-tooling will parse this JSON string at the time of evaluation automatically. If
-providing a function extractor, make sure that the output is a valid JSON
-string. For example: `"Hello, world!"` is not valid JSON; `"\"Hello, world!\""`
-is valid.
-
-### Running on existing datasets
-
-To run evaluation over an already extracted dataset:
-
-```posix-terminal
-genkit eval:run customLabel_dataset.json
-```
-
-To output to a different location, use the `--output` flag.
-
-```posix-terminal
-genkit eval:flow menuSuggestionFlow --input testInputs.json --output customLabel_evalresult.json
-```
-
-To run on a subset of the configured evaluators, use the `--evaluators` flag and
-provide a comma-separated list of evaluators by name:
-
-```posix-terminal
-genkit eval:run customLabel_dataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
-```
+**Note:** The data extracted by each of these extractors keeps the type returned by the extractor. For example, if you use `context: { outputOf: 'foo-step' }` and `foo-step` returns an array of objects, the extracted context will also be an array of objects.
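+
+For example, if the step-based options above don't cover your needs, you can supply a function extractor. The sketch below is illustrative only: it assumes that each span in the trace exposes a `displayName` and stores its serialized output under a `genkit:output` attribute. Check `trace.ts` (referenced above) for the exact schema before relying on this.
+
+```js
+module.exports = {
+  evaluators: [
+    {
+      actionRef: '/flow/qaFlow',
+      extractors: {
+        // Sketch of a function extractor: find the 'factModified' span and
+        // return its parsed output as the evaluation context.
+        context: (trace) => {
+          const step = Object.values(trace.spans).find(
+            (span) => span.displayName === 'factModified'
+          );
+          // Assumes the span output is stored as a JSON string attribute.
+          return step ? JSON.parse(step.attributes['genkit:output']) : [];
+        },
+      },
+    },
+  ],
+};
+```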
+ // + // Assume you have a method that checks if a fact is silly + return factDocs.filter(d => isSillyFact(d.text)); + }); + + const llmResponse = await ai.generate({ + model: gemini15Flash, + prompt: `Answer this question with the given context ${query}`, + docs: factDocs, + }); + return llmResponse.text; + } +); +``` + +Now let us configure a custom extractor to use the output of the `factModified` step when evaluating this flow. To configure custom extractors, add a tools config file named `genkit-tools.conf.js` to your project root if you don't have one already. ```posix-terminal -cd $GENKIT_PROJECT_HOME +cd /path/to/your/genkit/app touch genkit-tools.conf.js ``` @@ -184,19 +341,22 @@ In the tools config file, add the following code: module.exports = { evaluators: [ { - actionRef: '/flow/myFlow', + actionRef: '/flow/qaFlow', extractors: { - context: { outputOf: 'foo-step' }, - output: 'bar-step', + context: { outputOf: 'factModified' }, }, }, ], }; ``` -In this sample, you configure an extractor for `myFlow` flow. The config -overrides the extractors for `context` and `output` fields and uses the default -logic for the `input` field. +This config overrides the default extractors of Genkit's tooling, specifically changing what is considered as as `context` when evaluating this flow. + +You can run evaluation again and you will see that context is now populated as the output of the step. + +```posix-terminal +genkit eval:flow qaFlow --input testInputs.json +``` The specification of the evaluation extractors is as follows: @@ -212,59 +372,42 @@ The specification of the evaluation extractors is as follows: inputOf: 'foo-step' }` would extract the input of step `foo-step` for this key. * `(trace) => string;` - For further flexibility, you can provide a - function that accepts a Genkit trace and returns a `string`, and specify + function that accepts a Genkit trace and returns `any` type, and specify the extraction logic inside this function. Refer to `genkit/genkit-tools/common/src/types/trace.ts` for the exact TraceData schema. -**Note:** The extracted data for all these steps will be a JSON string. The -tooling will parse this JSON string at the time of evaluation automatically. If -providing a function extractor, make sure that the output is a valid JSON -string. For example: `"Hello, world!"` is not valid JSON; `"\"Hello, world!\""` -is valid. - -### Running on existing datasets - -To run evaluation over an already extracted dataset: - -```posix-terminal -genkit eval:run customLabel_dataset.json -``` - -To output to a different location, use the `--output` flag. - -```posix-terminal -genkit eval:flow menuSuggestionFlow --input testInputs.json --output customLabel_evalresult.json -``` - -To run on a subset of the configured evaluators, use the `--evaluators` flag and -provide a comma-separated list of evaluators by name: - -```posix-terminal -genkit eval:run customLabel_dataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy -``` +**Note:** The extracted data for all these extractors will be the type corresponding to the extractor. For example, if you use context: `{ outputOf: 'foo-step' }` and `foo-step` returns an array of objects, the extracted context will also be an array of objects.. ### Synthesizing test data using an LLM -Here's an example flow that uses a PDF file to generate possible questions users -might be asking about it. +Here's an example flow that uses a PDF file to generate potential user questions. 
```ts import { genkit, run, z } from "genkit"; import { googleAI, gemini15Flash } from "@genkit-ai/googleai"; -import { chunk } from "llm-chunk"; -import path from 'path'; +import { chunk } from "llm-chunk"; // npm i llm-chunk +import path from "path"; +import { readFile } from "fs/promises"; +import pdf from "pdf-parse"; // npm i pdf-parse const ai = genkit({ plugins: [googleAI()] }); const chunkingConfig = { minLength: 1000, // number of minimum characters into chunk maxLength: 2000, // number of maximum characters into chunk - splitter: 'sentence', // paragraph | sentence + splitter: "sentence", // paragraph | sentence overlap: 100, // number of overlap chracters - delimiters: '', // regex for base split method + delimiters: "", // regex for base split method } as any; +async function extractText(filePath: string) { + const pdfFile = path.resolve(filePath); + const dataBuffer = await readFile(pdfFile); + const data = await pdf(dataBuffer); + return data.text; +} + export const synthesizeQuestions = ai.defineFlow( { name: "synthesizeQuestions", @@ -274,7 +417,6 @@ export const synthesizeQuestions = ai.defineFlow( async (filePath) => { filePath = path.resolve(filePath); // `extractText` loads the PDF and extracts its contents as text. - // See our RAG documentation for more details. const pdfTxt = await run("extract-text", () => extractText(filePath)); const chunks = await run("chunk-it", async () => diff --git a/docs/plugin-authoring-evaluator.md b/docs/plugin-authoring-evaluator.md index 0e7a9b9ce..d9ec72b26 100644 --- a/docs/plugin-authoring-evaluator.md +++ b/docs/plugin-authoring-evaluator.md @@ -25,39 +25,45 @@ For this example, the prompt is going to ask the LLM to judge how delicious the Genkit’s `definePrompt` utility provides an easy way to define prompts with input and output validation. Here’s how you can set up an evaluation prompt with `definePrompt`. ```ts +import { z } from "genkit"; + const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const; const DeliciousnessDetectionResponseSchema = z.object({ reason: z.string(), verdict: z.enum(DELICIOUSNESS_VALUES), }); -type DeliciousnessDetectionResponse = z.infer; -const DELICIOUSNESS_PROMPT = ai.definePrompt( - { - name: 'deliciousnessPrompt', - inputSchema: z.object({ - output: z.string(), - }), - outputSchema: DeliciousnessDetectionResponseSchema, - }, - `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict. +function getDeliciousnessPrompt(ai: Genkit) { + return ai.definePrompt({ + name: 'deliciousnessPrompt', + input: { + schema: z.object({ + responseToTest: z.string(), + }), + }, + output: { + schema: DeliciousnessDetectionResponseSchema, + } + }, + `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict. 
- Examples: - Output: Chicken parm sandwich - Response: { "reason": "A classic and beloved dish.", "verdict": "yes" } + Examples: + Output: Chicken parm sandwich + Response: { "reason": "A classic and beloved dish.", "verdict": "yes" } - Output: Boston Logan Airport tarmac - Response: { "reason": "Not edible.", "verdict": "no" } + Output: Boston Logan Airport tarmac + Response: { "reason": "Not edible.", "verdict": "no" } - Output: A juicy piece of gossip - Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" } + Output: A juicy piece of gossip + Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" } - New Output: - {{output}} - Response: - ` -); + New Output: + {{responseToTest}} + Response: + ` + ); +} ``` #### Define the scoring function @@ -84,17 +90,17 @@ export async function deliciousnessScore< throw new Error('Output is required for Deliciousness detection'); } - //Hydrate the prompt - const finalPrompt = DELICIOUSNESS_PROMPT.renderText({ - output: d.output as string, - }); - - // Call the LLM to generate an evaluation result - const response = await ai.generate({ - model: judgeLlm, - prompt: finalPrompt, - config: judgeConfig, - }); + // Hydrate the prompt and generate an evaluation result + const deliciousnessPrompt = getDeliciousnessPrompt(ai); + const response = await deliciousnessPrompt( + { + responseToTest: d.output as string, + }, + { + model: judgeLlm, + config: judgeConfig, + } + ); // Parse the output const parsedResponse = response.output; @@ -115,7 +121,7 @@ export async function deliciousnessScore< The final step is to write a function that defines the evaluator action itself. ```ts -import { Genkit, ModelReference, z } from 'genkit'; +import { Genkit, z } from 'genkit'; import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator'; /** @@ -125,12 +131,12 @@ export function createDeliciousnessEvaluator< ModelCustomOptions extends z.ZodTypeAny, >( ai: Genkit, - judge: ModelReference, - judgeConfig: z.infer + judge: ModelArgument, + judgeConfig?: z.infer ): EvaluatorAction { return ai.defineEvaluator( { - name: `myAwesomeEval/deliciousness`, + name: `deliciousnessEvaluator`, displayName: 'Deliciousness', definition: 'Determines if output is considered delicous.', isBilled: true, @@ -204,10 +210,10 @@ Just like the LLM-based evaluator, define the scoring function. In this case, th import { BaseEvalDataPoint, Score } from 'genkit/evaluator'; const US_PHONE_REGEX = - /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/i; + /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i; /** - * Scores whether an individual datapoint matches a US Phone Regex. + * Scores whether a datapoint output contains a US Phone number. */ export async function usPhoneRegexScore( dataPoint: BaseEvalDataPoint @@ -218,115 +224,72 @@ export async function usPhoneRegexScore( } const matches = US_PHONE_REGEX.test(d.output as string); const reasoning = matches - ? `Output matched regex ${regex.source}` - : `Output did not match regex ${regex.source}`; + ? `Output matched US_PHONE_REGEX` + : `Output did not match US_PHONE_REGEX`; return { score: matches, details: { reasoning }, }; } - -/** - * Create an EvalResponse from an individual scored datapoint. 
- */
-function fillScores(dataPoint: BaseEvalDataPoint, score: Score): EvalResponse {
-  return {
-    testCaseId: dataPoint.testCaseId,
-    evaluation: score,
-  };
-}
 ```
 
 #### Define the evaluator action
 
 ```ts
-import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';
+import { EvaluatorAction } from "genkit/evaluator";
+import { Genkit } from "genkit";
 
 /**
  * Configures a regex evaluator to match a US phone number.
  */
-export function createUSPhoneRegexEvaluator(
-  metrics: RegexMetric[]
-): EvaluatorAction[] {
-  return metrics.map((metric) => {
-    const regexMetric = metric as RegexMetric;
-    return defineEvaluator(
-      {
-        name: `myAwesomeEval/${metric.name.toLocaleLowerCase()}`,
-        displayName: 'Regex Match',
-        definition:
-          'Runs the output against a regex and responds with true if a match is found and false otherwise.',
-        isBilled: false,
-      },
-      async (datapoint: BaseEvalDataPoint) => {
-        const score = await usPhoneRegexScore(datapoint);
-        return fillScores(datapoint, score);
-      }
-    );
-  });
-}
-```
-
-## Configuration
-
-### Plugin Options
-
-Define the `PluginOptions` that the custom evaluator plugin will use. This object has no strict requirements and is dependent on the types of evaluators that are defined.
-
-At a minimum it will need to take the definition of which metrics to register.
-
-```ts
-export enum MyAwesomeMetric {
-  WORD_COUNT = 'WORD_COUNT',
-  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
-}
-
-export interface PluginOptions {
-  metrics?: Array<MyAwesomeMetric>;
+export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
+  return ai.defineEvaluator(
+    {
+      name: `usPhoneRegexEvaluator`,
+      displayName: "Regex Match for US Phone Number",
+      definition: "Uses regex to check if the output contains a US phone number",
+      isBilled: false,
+    },
+    async (datapoint: BaseEvalDataPoint) => {
+      const score = await usPhoneRegexScore(datapoint);
+      return {
+        testCaseId: datapoint.testCaseId,
+        evaluation: score,
+      };
+    }
+  );
 }
 ```
 
-If this new plugin uses an LLM as a judge and the plugin supports swapping out which LLM to use, define additional parameters in the `PluginOptions` object.
-
-```ts
-export interface PluginOptions {
-  judge: ModelReference<ModelCustomOptions>;
-  judgeConfig?: z.infer<ModelCustomOptions>;
-  metrics?: Array<MyAwesomeMetric>;
-}
-```
+## Putting it together
 
 ### Plugin definition
 
-Plugins are registered with the framework via the `genkit.config.ts` file in a project. To be able to configure a new plugin, define a function that defines a `GenkitPlugin` and configures it with the `PluginOptions` defined above.
+Plugins are registered with the framework by installing them at the time of initializing the Genkit object in your application. To define a new plugin, use the `genkitPlugin` helper method to instantiate all Genkit actions within the plugin context. In this case, we have two evaluators: `DELICIOUSNESS` and `US_PHONE_REGEX_MATCH`. This is where those evaluators are registered with the plugin and with Firebase Genkit.
 
 ```ts
 import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';
+// `createDeliciousnessEvaluator` and `createUSPhoneRegexEvaluator` are the
+// factory functions defined earlier in this guide.
 
-export function myAwesomeEval(
-  options: PluginOptions
-): GenkitPlugin {
+export function myCustomEvals<
+  ModelCustomOptions extends z.ZodTypeAny
+>(options: {
+  judge: ModelArgument<ModelCustomOptions>;
+  judgeConfig?: z.infer<ModelCustomOptions>;
+}): GenkitPlugin {
   // Define the new plugin
-  return genkitPlugin(
-    'myAwesomeEval',
-    async (ai: Genkit) => {
-      const { judge, judgeConfig, metrics } = options;
-      const evaluators: EvaluatorAction[] = metrics.map((metric) => {
-        switch (metric) {
-          case DELICIOUSNESS:
-            // This evaluator requires an LLM as judge
-            return createDeliciousnessEvaluator(ai, judge, judgeConfig);
-          case US_PHONE_REGEX_MATCH:
-            // This evaluator does not require an LLM
-            return createUSPhoneRegexEvaluator();
-        }
-      });
-      return { evaluators };
-    });
+  return genkitPlugin("myCustomEvals", async (ai: Genkit) => {
+    const { judge, judgeConfig } = options;
+
+    // The plugin instantiates our custom evaluators within the context
+    // of the `ai` object, making them available
+    // throughout our Genkit application.
+    createDeliciousnessEvaluator(ai, judge, judgeConfig);
+    createUSPhoneRegexEvaluator(ai);
+  });
 }
 
-export default myAwesomeEval;
+export default myCustomEvals;
 ```
 
 ### Configure Genkit
 
@@ -336,50 +299,25 @@ Add the newly defined plugin to your Genkit configuration.
 
 For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.
 
 ```ts
-import { gemini15Flash } from '@genkit-ai/googleai';
+import { gemini15Pro } from '@genkit-ai/googleai';
 
 const ai = genkit({
   plugins: [
+    googleAI(),
     ...
-    myAwesomeEval({
-      judge: gemini15Flash,
-      judgeConfig: {
-        safetySettings: [
-          {
-            category: 'HARM_CATEGORY_HATE_SPEECH',
-            threshold: 'BLOCK_NONE',
-          },
-          {
-            category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
-            threshold: 'BLOCK_NONE',
-          },
-          {
-            category: 'HARM_CATEGORY_HARASSMENT',
-            threshold: 'BLOCK_NONE',
-          },
-          {
-            category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
-            threshold: 'BLOCK_NONE',
-          },
-        ],
-      },
-      metrics: [
-        MyAwesomeMetric.DELICIOUSNESS,
-        MyAwesomeMetric.US_PHONE_REGEX_MATCH
-      ],
+    myCustomEvals({
+      judge: gemini15Pro,
     }),
   ],
   ...
 });
 ```
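+
+If you need to disable safety settings as mentioned above, one way is to pass a `judgeConfig` through the plugin options. The snippet below is a sketch: it reuses the safety-settings shape from the earlier version of this example and assumes that `myCustomEvals` forwards `judgeConfig` to the judge model, as in the plugin defined above.
+
+```ts
+myCustomEvals({
+  judge: gemini15Pro,
+  // Optional: forwarded to the judge model by the plugin.
+  judgeConfig: {
+    safetySettings: [
+      { category: 'HARM_CATEGORY_HATE_SPEECH', threshold: 'BLOCK_NONE' },
+      { category: 'HARM_CATEGORY_DANGEROUS_CONTENT', threshold: 'BLOCK_NONE' },
+      { category: 'HARM_CATEGORY_HARASSMENT', threshold: 'BLOCK_NONE' },
+      { category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold: 'BLOCK_NONE' },
+    ],
+  },
+}),
+```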
 
-## Testing
+## Using your custom evaluators
 
-The same issues that apply to evaluating the quality of the output of a generative AI feature apply to evaluating the judging capacity of an LLM-based evaluator.
+Once you instantiate your custom evaluators within the Genkit context (either through a plugin or directly), they are ready to be used. Let us try out the deliciousness evaluator with a few sample inputs and outputs.
 
-To get a sense of whether the custom evaluator performs at the expected level, create a set of test cases that have a clear right and wrong answer.
-
-As an example for deliciousness, that might look like a json file `deliciousness_dataset.json`:
+Create a JSON file `deliciousness_dataset.json` with the following content:
 
 ```json
 [
   {
@@ -396,8 +334,6 @@ As an example for deliciousness, that might look like a json file `deliciousness
   ]
 ```
 
-These examples can be human generated or you can ask an LLM to help create a set of test cases that can be curated. There are many available benchmark datasets that can be used as well.
-
 Then use the Genkit CLI to run the evaluator against these test cases.
 
 ```posix-terminal
 genkit eval:run deliciousness_dataset.json
 ...
 ```
 
 Navigate to `localhost:4000/evaluate` to view your results in the Genkit UI.
+
+Note that confidence in custom evaluators increases as you benchmark them against standard datasets or approaches. Iterate on the results of such benchmarks to improve your evaluators' performance until it reaches the desired quality.
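+
+As a simple starting point for such benchmarking, you can compare the evaluator's verdicts against hand-labeled expectations and track the agreement rate over time. The sketch below assumes you have collected the verdicts and the corresponding human labels into an array; the shape is hypothetical and not a Genkit API.
+
+```ts
+// Hypothetical benchmarking helper: measures how often the evaluator's
+// verdict agrees with a human-provided label for the same test case.
+interface LabeledVerdict {
+  testCaseId: string;
+  verdict: 'yes' | 'no' | 'maybe'; // what the evaluator returned
+  expected: 'yes' | 'no' | 'maybe'; // what a human reviewer labeled
+}
+
+function agreementRate(results: LabeledVerdict[]): number {
+  if (results.length === 0) return 0;
+  const matches = results.filter((r) => r.verdict === r.expected).length;
+  return matches / results.length;
+}
+
+// Example with made-up data: agreement is 0.5.
+console.log(
+  agreementRate([
+    { testCaseId: '1', verdict: 'yes', expected: 'yes' },
+    { testCaseId: '2', verdict: 'maybe', expected: 'no' },
+  ])
+);
+```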