[Obs AI Assistant] Improve LLM evaluation framework #204574

Merged
merged 26 commits into from
Dec 31, 2024
26 commits
0c52af5
[Obs AI Assistant] Update evaluation framework readme
viduni94 Dec 13, 2024
e056f20
[Obs AI Assistant] Fix auth for the kibana url when custom elasticsea…
viduni94 Dec 13, 2024
5d043ae
[Obs AI Assistant] Create dataview if it doesn't exist
viduni94 Dec 13, 2024
f04074a
[Obs AI Assistant] Logs for service urls
viduni94 Dec 13, 2024
63e1618
[Obs AI Assistant] Temp skip for scenarios except alerts
viduni94 Dec 13, 2024
f2e8d57
[Obs AI Assistant] Add header to enable accessing internal APIs
viduni94 Dec 13, 2024
405678d
[Obs AI Assistant] Fix apm afterAll hook
viduni94 Dec 13, 2024
5869768
[Obs AI Assistant] Update error handling
viduni94 Dec 17, 2024
6196f53
[Obs AI Assistant] Update calls to internal urls
viduni94 Dec 17, 2024
3ddb0de
[Obs AI Assistant] Improve data view creation
viduni94 Dec 17, 2024
2e1b6e8
[Obs AI Assistant] Change internal origin to Kibana
viduni94 Dec 17, 2024
2a10441
[Obs AI Assistant] Improve scopes handling in the chat client
viduni94 Dec 17, 2024
b837d89
[Obs AI Assistant] Update elasticsearch and es|ql scope before/after …
viduni94 Dec 17, 2024
9f8dc78
[Obs AI Assistant] Fix eslint issues
viduni94 Dec 18, 2024
f6a7e21
[Obs AI Assistant] Fix eslint issues
viduni94 Dec 18, 2024
92c98b4
[Obs AI Assistant] Add new scenario/test for KB retrieval
viduni94 Dec 18, 2024
dfc5026
[Obs AI Assistant] Add new scenario for documentation and improve log…
viduni94 Dec 18, 2024
da19ba3
[Obs AI Assistant] Improve readme
viduni94 Dec 20, 2024
7c00158
[CI] Auto-commit changed files from 'node scripts/lint_ts_projects --…
kibanamachine Dec 23, 2024
2d97d5b
[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…
kibanamachine Dec 23, 2024
385f020
[Obs AI Assistant] Address PR comments
viduni94 Dec 24, 2024
620c5e2
[Obs AI Assistant] Revert auth change as it's not necessary
viduni94 Dec 24, 2024
932ba2c
[Obs AI Assistant] Make scope a part of the complete function
viduni94 Dec 24, 2024
faec300
[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…
kibanamachine Dec 24, 2024
63e5be1
[Obs AI Assistant] remove comment
viduni94 Dec 24, 2024
dbb34cc
[Obs AI Assistant] Avoid passing data only if data is empty
viduni94 Dec 30, 2024
Expand Up @@ -2,7 +2,7 @@

## Overview

This tool is developed for our team working on the Elastic Observability platform, specifically focusing on evaluating the Observability AI Assistant. It simplifies scripting and evaluating various scenarios with the Large Language Model (LLM) integration.
This tool is developed for our team working on the Elastic Observability platform, specifically focusing on evaluating the Observability AI Assistant. It simplifies scripting and evaluating various scenarios with Large Language Model (LLM) integrations.

## Setup requirements

Expand All @@ -12,26 +12,40 @@ This tool is developed for our team working on the Elastic Observability platfor

## Running evaluations

Run the tool using:

`$ node x-pack/solutions/observability/plugins/observability_solution/observability_ai_assistant_app/scripts/evaluation/index.js`

This will evaluate all existing scenarios, and write the evaluation results to the terminal.

### Configuration

#### Kibana and Elasticsearch

By default, the tool will look for a Kibana instance running locally (at `http://localhost:5601`, which is the default address for running Kibana in development mode). It will also attempt to read the Kibana config file for the Elasticsearch address & credentials. If you want to override these settings, use `--kibana` and `--es`. Only basic auth is supported, e.g. `--kibana http://username:password@localhost:5601`. If you want to use a specific space, use `--spaceId`
#### To run the evaluation using a local Elasticsearch and Kibana instance:

#### Connector
- Run Elasticsearch locally: `yarn es snapshot --license trial`
- Start Kibana (Default address for Kibana in dev mode: `http://localhost:5601`)
- Run this command to start evaluating:
`$ node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js`

Use `--connectorId` to specify a `.gen-ai` or `.bedrock` connector to use. If none are given, it will prompt you to select a connector based on the ones that are available. If only a single supported connector is found, it will be used without prompting.

#### Persisting conversations

By default, completed conversations are not persisted. If you do want to persist them, for instance for reviewing purposes, set the `--persist` flag to store them. This will also generate a clickable link in the output of the evaluation that takes you to the conversation.

If you want to clear conversations on startup, use the `--clear` flag. This only works when `--persist` is enabled. If `--spaceId` is set, only conversations for the current space will be cleared.
This will evaluate all existing scenarios, and write the evaluation results to the terminal.

When storing conversations, the name of the scenario is used as a title. Set the `--autoTitle` flag to have the LLM generate a title for you.
#### To run the evaluation using a hosted deployment:
- Add the credentials of Elasticsearch to `kibana.dev.yml` as follows:
```
elasticsearch.hosts: https://<hosted-url>:<port>
elasticsearch.username: <username>
elasticsearch.password: <password>
elasticsearch.ssl.verificationMode: none
elasticsearch.ignoreVersionMismatch: true
```
- Start Kibana
- Run this command to start evaluating: `node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js --kibana http://<username>:<password>@localhost:5601`

By default, the script uses the Elasticsearch credentials specified in `kibana.dev.yml`. To override them, pass the `--es` flag when running the evaluation script, e.g.:
`node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js --kibana http://<username>:<password>@localhost:5601 --es https://<username>:<password>@<hosted-url>:<port>`

The `--kibana` and `--es` flags override the default credentials. Only basic auth is supported.
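
Because both flags take credentials embedded in the URL, characters such as `@` or `:` in a password can make the URL ambiguous. The sketch below shows one way to assemble such a URL with percent-encoded credentials. The helper name `withBasicAuth` is hypothetical and not part of the evaluation script, and the diff does not show how the script parses or decodes the credentials, so treat this as an illustration only.

```typescript
// Hypothetical helper (not part of the evaluation script): embeds basic-auth
// credentials into an absolute URL, percent-encoding both parts.
function withBasicAuth(baseUrl: string, username: string, password: string): string {
  const separator = '://';
  const index = baseUrl.indexOf(separator);
  if (index === -1) {
    throw new Error(`Expected an absolute URL, got: ${baseUrl}`);
  }
  const scheme = baseUrl.slice(0, index + separator.length);
  const rest = baseUrl.slice(index + separator.length);
  return `${scheme}${encodeURIComponent(username)}:${encodeURIComponent(password)}@${rest}`;
}

// Example value that could be passed to --kibana
const exampleKibanaUrl = withBasicAuth('http://localhost:5601', 'elastic', 'p@ss:word');
console.log(exampleKibanaUrl);
```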

## Other (optional) configuration flags
- `--connectorId` - Specify a generative AI connector to use. If none are given, it will prompt you to select a connector based on the ones that are available. If only a single supported connector is found, it will be used without prompting.
- `--evaluateWith`: The connector ID to evaluate with. Leave empty to use the same connector, use "other" to get a selection menu.
- `--spaceId` - Specify the space ID if you want to use a specific space.
- `--persist` - By default, completed conversations are not persisted. If you want to persist them, for instance for reviewing purposes, include this flag when running the evaluation script. This will also generate a clickable link in the output of the evaluation that takes you to the conversation in Kibana.
- `--clear` - If you want to clear conversations on startup, include this flag when running the evaluation script. This only works when `--persist` is enabled. If `--spaceId` is set, only conversations for the current space will be cleared.
- `--autoTitle`: When storing conversations, the name of the scenario is used as a title. Set this flag to have the LLM generate a title for you. This only works when `--persist` is enabled.
- `--files`: A file or list of files containing the scenarios to evaluate. Defaults to all.
- `--grep`: A string or regex to filter scenarios by.
Expand Up @@ -37,6 +37,8 @@ function runEvaluations() {
kibana: argv.kibana,
});

log.info(`Elasticsearch URL: ${serviceUrls.esUrl}`);

const kibanaClient = new KibanaClient(log, serviceUrls.kibanaUrl, argv.spaceId);
const esClient = new Client({
node: serviceUrls.esUrl,
Expand Down Expand Up @@ -100,7 +102,7 @@ function runEvaluations() {
evaluationConnectorId: evaluationConnector.id!,
persist: argv.persist,
suite: mocha.suite,
scopes: ['all'],
scopes: ['observability'],
});

const header: string[][] = [
Expand Up @@ -26,7 +26,7 @@ import { Message, MessageRole } from '@kbn/observability-ai-assistant-plugin/com
import { streamIntoObservable } from '@kbn/observability-ai-assistant-plugin/server';
import { ToolingLog } from '@kbn/tooling-log';
import axios, { AxiosInstance, AxiosResponse, isAxiosError } from 'axios';
import { isArray, omit, pick, remove } from 'lodash';
import { omit, pick, remove } from 'lodash';
import pRetry from 'p-retry';
import {
concatMap,
Expand Down Expand Up @@ -59,13 +59,14 @@ interface Options {
screenContexts?: ObservabilityAIAssistantScreenContext[];
}

type CompleteFunction = (
...args:
| [StringOrMessageList]
| [StringOrMessageList, Options]
| [string | undefined, StringOrMessageList]
| [string | undefined, StringOrMessageList, Options]
) => Promise<{
interface CompleteFunctionParams {
messages: StringOrMessageList;
conversationId?: string;
options?: Options;
scope?: AssistantScope;
}

type CompleteFunction = (params: CompleteFunctionParams) => Promise<{
conversationId?: string;
messages: InnerMessage[];
errors: ChatCompletionErrorEvent[];
Expand All @@ -74,7 +75,6 @@ type CompleteFunction = (
export interface ChatClient {
chat: (message: StringOrMessageList) => Promise<InnerMessage>;
complete: CompleteFunction;

evaluate: (
{}: { conversationId?: string; messages: InnerMessage[]; errors: ChatCompletionErrorEvent[] },
criteria: string[]
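
The hunk above replaces the positional overloads of `complete` with a single params object. A minimal sketch of what that means for callers follows. The types are local stand-ins (the scope values and the `Options` shape are assumptions), and `describeCompleteCall` is a hypothetical helper that only reports how the arguments would be interpreted, without calling Kibana.

```typescript
// Local stand-ins for types the real code imports from the Kibana plugins.
type AssistantScope = 'observability' | 'search' | 'all'; // assumed set of scope values

interface SimpleMessage {
  role: 'user' | 'assistant';
  content: string;
}

type StringOrMessageList = string | SimpleMessage[];

interface CompleteFunctionParams {
  messages: StringOrMessageList;
  conversationId?: string;
  options?: Record<string, unknown>; // stand-in for the real Options interface
  scope?: AssistantScope;
}

// Hypothetical helper: shows how a params object is read, with no network I/O.
function describeCompleteCall({
  messages,
  conversationId,
  options = {},
  scope,
}: CompleteFunctionParams) {
  return {
    messageCount: typeof messages === 'string' ? 1 : messages.length,
    continuesConversation: conversationId !== undefined,
    optionKeys: Object.keys(options),
    scope: scope || 'observability', // same fallback as in the diff
  };
}

// Named fields replace argument-position guessing:
console.log(describeCompleteCall({ messages: 'List my alerts' }));
console.log(
  describeCompleteCall({
    conversationId: 'abc',
    messages: [{ role: 'user', content: 'And the open ones?' }],
    scope: 'all',
  })
);
```

With named fields, adding `scope` did not require another overload or a runtime check on argument types.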
Expand Down Expand Up @@ -124,10 +124,10 @@ export class KibanaClient {
return this.axios<T>({
method,
url,
data: data || {},
...(method.toLowerCase() === 'delete' && !data ? {} : { data: data || {} }),
sorenlouv (Member) commented on Jan 2, 2025:

What about simply:

...(data ? { data } : {}),

headers: {
'kbn-xsrf': 'true',
'x-elastic-internal-origin': 'foo',
'x-elastic-internal-origin': 'Kibana',
},
}).catch((error) => {
if (isAxiosError(error)) {
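
The conditional `data` spread from the diff and the simpler alternative from the review comment behave differently when no body is passed. The sketch below compares them in isolation. The function names and the `RequestConfig` type are hypothetical, and no HTTP request is made.

```typescript
interface RequestConfig {
  method: string;
  url: string;
  data?: Record<string, unknown>;
  headers: Record<string, string>;
}

// Headers as they appear in the diff after the change.
const INTERNAL_HEADERS: Record<string, string> = {
  'kbn-xsrf': 'true',
  'x-elastic-internal-origin': 'Kibana',
};

// Variant from the diff: omit the body only for DELETE requests without data;
// every other request without data still sends an empty object.
function configFromDiff(
  method: string,
  url: string,
  data?: Record<string, unknown>
): RequestConfig {
  return {
    method,
    url,
    ...(method.toLowerCase() === 'delete' && !data ? {} : { data: data || {} }),
    headers: { ...INTERNAL_HEADERS },
  };
}

// Variant from the review comment: omit the body whenever no data is given.
function configFromReview(
  method: string,
  url: string,
  data?: Record<string, unknown>
): RequestConfig {
  return {
    method,
    url,
    ...(data ? { data } : {}),
    headers: { ...INTERNAL_HEADERS },
  };
}

console.log('data' in configFromDiff('GET', '/api/example'));
console.log('data' in configFromReview('GET', '/api/example'));
```

The two variants agree for DELETE without data and for any request with data. They differ for non-DELETE requests without data, where the diff's version keeps sending `{}`.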
Expand All @@ -148,7 +148,7 @@ export class KibanaClient {
}

async installKnowledgeBase() {
this.log.debug('Checking to see whether knowledge base is installed');
this.log.info('Checking whether the knowledge base is installed');

const {
data: { ready },
Expand All @@ -157,7 +157,7 @@ export class KibanaClient {
});

if (ready) {
this.log.info('Knowledge base is installed');
this.log.success('Knowledge base is already installed');
return;
}

Expand All @@ -176,15 +176,15 @@ export class KibanaClient {
{ retries: 10 }
);

this.log.info('Knowledge base installed');
this.log.success('Knowledge base installed');
}

async createSpaceIfNeeded() {
if (!this.spaceId) {
return;
}

this.log.debug(`Checking if space ${this.spaceId} exists`);
this.log.info(`Checking if space ${this.spaceId} exists`);

const spaceExistsResponse = await this.callKibana<{
id?: string;
Expand All @@ -204,7 +204,7 @@ export class KibanaClient {
});

if (spaceExistsResponse.data.id) {
this.log.debug(`Space id ${this.spaceId} found`);
this.log.success(`Space id ${this.spaceId} found`);
return;
}

Expand All @@ -223,14 +223,26 @@ export class KibanaClient {
);

if (spaceCreatedResponse.status === 200) {
this.log.info(`Created space ${this.spaceId}`);
this.log.success(`Created space ${this.spaceId}`);
} else {
throw new Error(
`Error creating space: ${spaceCreatedResponse.status} - ${spaceCreatedResponse.data}`
);
}
}

getMessages(message: string | Array<Message['message']>): Array<Message['message']> {
if (typeof message === 'string') {
return [
{
content: message,
role: MessageRole.User,
},
];
}
return message;
}
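
A standalone sketch of this normalization step follows, together with the timestamp wrapping that `chat` and `complete` apply to its result. `MessageRole` and `InnerMessageLike` are local stand-ins for the plugin's types, and the `now` parameter is an addition for illustration (the diff calls `new Date()` inline).

```typescript
// Local stand-ins for the plugin's MessageRole enum and message shape.
enum MessageRole {
  User = 'user',
  Assistant = 'assistant',
}

interface InnerMessageLike {
  role: MessageRole;
  content: string;
}

// A bare string becomes a single user message; a list passes through unchanged.
function getMessages(message: string | InnerMessageLike[]): InnerMessageLike[] {
  if (typeof message === 'string') {
    return [{ content: message, role: MessageRole.User }];
  }
  return message;
}

// Wraps each normalized message with an ISO timestamp, as the chat client does.
function withTimestamps(message: string | InnerMessageLike[], now: Date = new Date()) {
  return getMessages(message).map((msg) => ({
    message: msg,
    '@timestamp': now.toISOString(),
  }));
}

console.log(withTimestamps('What alerts are active?'));
```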

createChatClient({
connectorId,
evaluationConnectorId,
Expand All @@ -244,22 +256,11 @@ export class KibanaClient {
suite?: Mocha.Suite;
scopes: AssistantScope[];
}): ChatClient {
function getMessages(message: string | Array<Message['message']>): Array<Message['message']> {
if (typeof message === 'string') {
return [
{
content: message,
role: MessageRole.User,
},
];
}
return message;
}

const that = this;

let currentTitle: string = '';
let firstSuiteName: string = '';
let currentScopes = scopes;

if (suite) {
suite.beforeEach(function () {
Expand Down Expand Up @@ -362,23 +363,27 @@ export class KibanaClient {
that.log.info('Chat', name);

const chat$ = defer(() => {
that.log.debug(`Calling chat API`);
that.log.info('Calling the /chat API');
const params: ObservabilityAIAssistantAPIClientRequestParamsOf<'POST /internal/observability_ai_assistant/chat'>['params']['body'] =
{
name,
messages,
connectorId: connectorIdOverride || connectorId,
functions: functions.map((fn) => pick(fn, 'name', 'description', 'parameters')),
functionCall,
scopes,
scopes: currentScopes,
};

return that.axios.post(
that.getUrl({
pathname: '/internal/observability_ai_assistant/chat',
}),
params,
{ responseType: 'stream', timeout: NaN }
{
responseType: 'stream',
timeout: NaN,
headers: { 'x-elastic-internal-origin': 'Kibana' },
}
);
}).pipe(
switchMap((response) => streamIntoObservable(response.data)),
Expand All @@ -400,54 +405,33 @@ export class KibanaClient {
return {
chat: async (message) => {
const messages = [
...getMessages(message).map((msg) => ({
...this.getMessages(message).map((msg) => ({
message: msg,
'@timestamp': new Date().toISOString(),
})),
];
return chat('chat', { messages, functions: [] });
},
complete: async (...args) => {
that.log.info(`Complete`);
let messagesArg: StringOrMessageList | undefined;
let conversationId: string | undefined;
let options: Options = {};

function isMessageList(arg: any): arg is StringOrMessageList {
return isArray(arg) || typeof arg === 'string';
}
complete: async ({
messages: messagesArg,
conversationId,
options = {},
scope: newScope,
}: CompleteFunctionParams) => {
that.log.info('Calling complete');

// | [StringOrMessageList]
// | [StringOrMessageList, Options]
// | [string, StringOrMessageList]
// | [string, StringOrMessageList, Options]
if (args.length === 1) {
messagesArg = args[0];
} else if (args.length === 2 && !isMessageList(args[1])) {
messagesArg = args[0];
options = args[1];
} else if (
args.length === 2 &&
(typeof args[0] === 'string' || typeof args[0] === 'undefined') &&
isMessageList(args[1])
) {
conversationId = args[0];
messagesArg = args[1];
} else if (args.length === 3) {
conversationId = args[0];
messagesArg = args[1];
options = args[2];
}
// set scope
currentScopes = [newScope || 'observability'];

const messages = [
...getMessages(messagesArg!).map((msg) => ({
...this.getMessages(messagesArg!).map((msg) => ({
message: msg,
'@timestamp': new Date().toISOString(),
})),
];

const stream$ = defer(() => {
that.log.debug(`Calling /chat/complete API`);
that.log.info(`Calling /chat/complete API`);
return from(
that.axios.post(
that.getUrl({
Expand All @@ -460,9 +444,13 @@ export class KibanaClient {
connectorId,
persist,
title: currentTitle,
scopes,
scopes: currentScopes,
},
{ responseType: 'stream', timeout: NaN }
{
responseType: 'stream',
timeout: NaN,
headers: { 'x-elastic-internal-origin': 'Kibana' },
}
)
);
}).pipe(
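
The scope handling in these hunks relies on a mutable `currentScopes` variable shared by `chat` and `complete` through the enclosing closure. The sketch below isolates that pattern with a hypothetical `createScopedClient` factory. The scope values are assumed, and the methods only return the scopes they would send.

```typescript
type AssistantScope = 'observability' | 'search' | 'all'; // assumed set of scope values

// Hypothetical factory that mirrors the closure state in createChatClient.
function createScopedClient(initialScopes: AssistantScope[]) {
  let currentScopes = initialScopes;

  return {
    // chat reads whatever scopes are current at call time
    chat: () => ({ scopes: currentScopes }),
    // complete resets the scopes on every call, defaulting to 'observability'
    complete: ({ scope }: { scope?: AssistantScope } = {}) => {
      currentScopes = [scope || 'observability'];
      return { scopes: currentScopes };
    },
  };
}

const client = createScopedClient(['all']);
console.log(client.chat().scopes); // scopes given at creation
client.complete({ scope: 'search' });
console.log(client.chat().scopes); // scope set by the previous complete call
```

One consequence of this design is that a `chat` call sees the scope chosen by the most recent `complete` call, so the order of calls within a scenario matters.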
Expand Down Expand Up @@ -615,7 +603,7 @@ export class KibanaClient {
})
.concat({
score: errors.length === 0 ? 1 : 0,
criterion: 'The conversation encountered errors',
criterion: 'The conversation did not encounter any errors',
reasoning: errors.length
? `The following errors occurred: ${errors.map((error) => error.error.message)}`
: 'No errors occurred',
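
The renamed criterion makes the score read naturally: a score of 1 now means the positive statement holds. A minimal sketch of that result entry, using hypothetical stand-in types; unlike the diff, which relies on the default array-to-string conversion, this version joins the messages explicitly.

```typescript
// Stand-in for ChatCompletionErrorEvent; only the field used here is modeled.
interface ChatErrorLike {
  error: { message: string };
}

interface EvaluationResult {
  score: number;
  criterion: string;
  reasoning: string;
}

function errorCriterion(errors: ChatErrorLike[]): EvaluationResult {
  return {
    // 1 when the criterion is satisfied (no errors), 0 otherwise
    score: errors.length === 0 ? 1 : 0,
    criterion: 'The conversation did not encounter any errors',
    reasoning: errors.length
      ? `The following errors occurred: ${errors.map((e) => e.error.message).join(', ')}`
      : 'No errors occurred',
  };
}

console.log(errorCriterion([]));
console.log(errorCriterion([{ error: { message: 'Connector timed out' } }]));
```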