
[Obs AI Assistant] Improve LLM evaluation framework #204574

Merged

26 commits merged into elastic:main on Dec 31, 2024

Conversation

viduni94
Contributor

@viduni94 viduni94 commented Dec 17, 2024

Closes #203122

Summary

Problem

The Obs AI Assistant LLM evaluation framework cannot currently run successfully from the `main` branch, and several scenarios are missing.

Problems identified:

  • Unable to run the evaluation with a local Elasticsearch instance
  • Alerts and APM results are skipped entirely when reporting the final result on the terminal (due to consistent test failures)
  • State contamination between runs makes the script throw errors when it is run multiple times
  • Authentication issues when calling `/internal` APIs

Solution

As part of spacetime, this PR fixes the current issues in the LLM evaluation framework and improves and enhances it.

Fixes

| Problem | RC (Root Cause) | Fixed? |
|---|---|---|
| Running with a local Elasticsearch instance | Service URLs were not picking up the correct auth because of the format specified in `kibana.dev.yml` | ✅ |
| Alerts and APM results skipped in the final result | Most (if not all) tests were failing in the alerts and APM suites, so no final results were reported | ✅ (all test scenarios fixed) |
| State contamination between runs | Some `after` hooks were not running successfully because of an error in the `callKibana` method | ✅ |
| Authentication issues when calling `/internal` APIs | The required headers were not present in the request | ✅ |
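The authentication fix comes down to always sending the headers Kibana expects on `/internal` routes. A minimal sketch of the idea — the exact header set used by the framework is an assumption here, not taken from this PR's diff:

```typescript
// Sketch: attach the headers Kibana requires before calling /internal APIs.
// The specific header names below are assumptions for illustration.
function withInternalApiHeaders(
  headers: Record<string, string> = {}
): Record<string, string> {
  return {
    ...headers,
    'kbn-xsrf': 'true', // Kibana rejects API requests without this header
    'x-elastic-internal-origin': 'kibana', // marks the call as an internal one
  };
}
```

Existing headers (e.g. `Authorization`) are preserved, so this can wrap whatever the caller already sends.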

Enhancements / Improvements

| What was added | How it enhances the framework |
|---|---|
| New KB retrieval test in the KB scenario | More scenarios covered |
| New scenario for the `retrieve_elastic_doc` function | Covers newly added functions that were missing from the suite |
| Correct scope applied to each scenario | The scope determines the wording of the system message. Certain scenarios need to be scoped to observability (e.g. `alerts`) to produce the best result. Previously all scenarios used the scope `all`, which is not ideal and doesn't align with the actual functionality of the AI Assistant |
| Guard rails to avoid unnecessary console errors (e.g. not creating a data view if it already exists) | Makes it easier to navigate the results printed on the terminal |
| Improved readme | Easier to configure and use the framework and to discover all available options |
| Improved logging | Easier to navigate the terminal output |
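The guard-rail row describes making setup idempotent: check whether a resource (such as a data view) already exists before creating it, so repeated runs don't fail. A hypothetical sketch of that pattern — the client interface here is invented for illustration, not the framework's actual API:

```typescript
// Hypothetical client; the real framework talks to Kibana's data view APIs.
interface DataViewClient {
  exists(title: string): Promise<boolean>;
  create(title: string): Promise<void>;
}

// Create the data view only if it is not already present, so repeated runs
// do not throw duplicate-creation errors onto the terminal.
async function ensureDataView(
  client: DataViewClient,
  title: string
): Promise<void> {
  if (await client.exists(title)) {
    return; // already there; skip creation instead of erroring
  }
  await client.create(title);
}
```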

Checklist

  • The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

@viduni94 viduni94 added release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting v9.0.0 Team:Obs AI Assistant Observability AI Assistant labels Dec 17, 2024
@viduni94 viduni94 self-assigned this Dec 17, 2024
@viduni94 viduni94 force-pushed the improve-llm-evaluation-framework branch 6 times, most recently from 57b281e to e82e4e6 Compare December 20, 2024 15:52
@viduni94 viduni94 marked this pull request as ready for review December 23, 2024 13:50
@viduni94 viduni94 requested a review from a team as a code owner December 23, 2024 13:50
@elasticmachine
Contributor

Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)

@viduni94 viduni94 force-pushed the improve-llm-evaluation-framework branch from b60786e to 5d7fe68 Compare December 23, 2024 13:50
@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Dec 23, 2024
Contributor

🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

Member

@dgieselaar dgieselaar left a comment


Looks great @viduni94! just a few nits

```diff
@@ -124,13 +124,13 @@ export class KibanaClient {
     return this.axios<T>({
       method,
       url,
-      data: data || {},
+      ...(method.toLowerCase() !== 'delete' ? { data: data || {} } : {}),
```
Member


why is this needed?

Contributor Author

@viduni94 viduni94 Dec 24, 2024


Without this condition, deleting ruleIds fails here - https://github.com/elastic/kibana/pull/204574/files#diff-23cc9139c91a064a3ca574552ad823023c579cc2c68ff7f277c392102a0d526aL139

Because the DELETE method doesn't allow an undefined or empty body.


Member


Can we change this to checking whether data is empty and if so, not setting the key? IIRC there actually are some routes in Kibana that allow for a request body with the DELETE method (whether that's a good idea or not 😄 )

Contributor Author


Sure, will do.

@viduni94 viduni94 requested a review from dgieselaar December 24, 2024 15:49
@viduni94
Contributor Author

Results after the updates:

```
-------------------------------------------
Model azure-gpt4 scored 112.6 out of 123
-------------------------------------------
-------------------------------------------
Model azure-gpt4 Scores per Category
-------------------------
Category: Alert function - Scored 10 out of 10
-------------------------
Category: APM - Scored 11 out of 17
-------------------------
Category: Retrieve documentation function - Scored 14 out of 14
-------------------------
Category: Elasticsearch functions - Scored 19 out of 19
-------------------------
Category: ES|QL query generation - Scored 43.6 out of 48
-------------------------
Category: Knowledge base - Scored 15 out of 15
-------------------------------------------
```
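The overall score is the sum of the per-category scores; a quick arithmetic check of the numbers reported above:

```typescript
// Per-category results as reported: [category, scored, maximum].
const categories: Array<[string, number, number]> = [
  ['Alert function', 10, 10],
  ['APM', 11, 17],
  ['Retrieve documentation function', 14, 14],
  ['Elasticsearch functions', 19, 19],
  ['ES|QL query generation', 43.6, 48],
  ['Knowledge base', 15, 15],
];
const scored = categories.reduce((sum, [, s]) => sum + s, 0);
const max = categories.reduce((sum, [, , m]) => sum + m, 0);
// scored sums to 112.6 and max to 123, matching the reported total.
```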

@viduni94 viduni94 force-pushed the improve-llm-evaluation-framework branch from 40c6445 to d889b3e Compare December 27, 2024 13:45
@viduni94 viduni94 force-pushed the improve-llm-evaluation-framework branch from d889b3e to 63e5be1 Compare December 30, 2024 13:03
@elasticmachine
Contributor

elasticmachine commented Dec 30, 2024

💚 Build Succeeded

  • Buildkite Build
  • Commit: dbb34cc
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-204574-dbb34cc0d67b

Metrics [docs]

✅ unchanged

History

cc @viduni94

@viduni94 viduni94 merged commit 38310a5 into elastic:main Dec 31, 2024
8 checks passed
stratoula pushed a commit to stratoula/kibana that referenced this pull request Jan 2, 2025
```diff
@@ -124,10 +124,10 @@ export class KibanaClient {
     return this.axios<T>({
       method,
       url,
-      data: data || {},
+      ...(method.toLowerCase() === 'delete' && !data ? {} : { data: data || {} }),
```
Member

@sorenlouv sorenlouv Jan 2, 2025


What about simply:

```ts
...(data ? { data } : {}),
```
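The suggested one-liner keeps the request config free of a `data` key whenever no body was supplied, which covers the DELETE case discussed above while still letting routes that accept a DELETE body receive one. A small self-contained sketch of the pattern — axios itself is omitted, and `buildConfig` is an illustrative stand-in for the config object passed to `this.axios`:

```typescript
interface RequestConfig {
  method: string;
  url: string;
  data?: unknown;
}

// Attach `data` only when a body was actually provided: DELETE calls with no
// payload send no body at all, while callers that pass one still get it.
function buildConfig(
  method: string,
  url: string,
  data?: unknown
): RequestConfig {
  return {
    method,
    url,
    ...(data ? { data } : {}),
  };
}
```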

benakansara pushed a commit to benakansara/kibana that referenced this pull request Jan 2, 2025
cqliu1 pushed a commit to cqliu1/kibana that referenced this pull request Jan 2, 2025
Labels
backport:skip This commit does not require backporting ci:project-deploy-observability Create an Observability project release_note:skip Skip the PR/issue when compiling release notes Team:Obs AI Assistant Observability AI Assistant v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Obs AI Assistant] Improve OOTB experience with evaluation framework
5 participants