Merge feature/explorer-query-assistant to main #1325

Merged
merged 94 commits into main from feature/explorer-query-assistant
Jan 12, 2024

Conversation

joshuali925
Member

@joshuali925 joshuali925 commented Dec 22, 2023

Description

Merges the feature branch feature/explorer-query-assistant into main. Most of the commits should be reviewed. The conflict resolution diff can be viewed here: 36a82fc, and the diff after the merge commit can be viewed here: 36a82fc...feature/explorer-query-assistant

This PR adds the Query Assist feature, where users can use ml-commons agents to generate PPL based on their natural language question, and get a summarized response of the query results in Event Analytics.

To use this feature, users should set the following keys in opensearch_dashboards.yml:

observability.query_assist.enabled: true
observability.query_assist.ppl_agent_id: "<id>"
observability.query_assist.response_summary_agent_id: "<id>"
observability.query_assist.error_summary_agent_id: "<id>"
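As a rough sketch of how these keys fit together, the snippet below renders the four settings from three agent ids and refuses to print them if an id is empty or the literal string `null` (which is what `jq -r` prints for a missing field). The function name `render_query_assist_config` is hypothetical and not part of this PR; only the config keys are taken from the list above.

```shell
# Hypothetical helper (not part of this PR): print the Query Assist settings
# for opensearch_dashboards.yml.
# Arguments: $1 = PPL agent id, $2 = response summary agent id,
#            $3 = error summary agent id.
render_query_assist_config() {
  for id in "$1" "$2" "$3"; do
    # An empty value or the string "null" means the agent id is missing.
    if [ -z "$id" ] || [ "$id" = "null" ]; then
      echo "missing agent id" >&2
      return 1
    fi
  done
  printf 'observability.query_assist.enabled: true\n'
  printf 'observability.query_assist.ppl_agent_id: "%s"\n' "$1"
  printf 'observability.query_assist.response_summary_agent_id: "%s"\n' "$2"
  printf 'observability.query_assist.error_summary_agent_id: "%s"\n' "$3"
}
```

The output can be appended to the config file, for example `render_query_assist_config "<id>" "<id>" "<id>" >> opensearch_dashboards.yml`, where the file location depends on the installation.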
Sample requests to create the agents:
endpoint=localhost:9200
# Set this to the id of the deployed ml-commons model used by the tools below
model="<model_id>"
response_summary_agent_id=$(curl -s -k "${endpoint}/_plugins/_ml/agents/_register" -XPOST -H 'Content-Type: application/json' --data-binary @- << EOF | jq -r '.agent_id'
{
  "name": "response_summary_agent",
  "type": "flow",
  "description": "Olly summarize success",
  "app_type": "Olly",
  "tools": [
    {
      "type": "MLModelTool",
      "name": "SummarizeSuccess",
      "description": "Use this tool to summarize a success response",
      "parameters": {
        "model_id": "$model",
        "prompt": "\n\nHuman: You will be given a search response, summarize it as a concise paragraph while considering the following:\nUser's question on index '\${parameters.index}': \${parameters.question}\nPPL (Piped Processing Language) query used: \${parameters.query}\n\nGive some documents to support your point.\nNote that the output could be truncated, summarize what you see. Don't mention about total items returned and don't mention about the fact that output is truncated if you see 'Output is too long, truncated' in the response.\n\nSkip the introduction; go straight into the summarization.\n\nUse the following pieces of context to answer the users question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n\${parameters.response}\n\nAssistant:",
        "response_filter": "$.completion"
      }
    }
  ]
}
EOF
)
echo "response_summary_agent_id: ${response_summary_agent_id}"

error_summary_agent_id=$(curl -s -k "${endpoint}/_plugins/_ml/agents/_register" -XPOST -H 'Content-Type: application/json' --data-binary @- << EOF | jq -r '.agent_id'
{
  "name": "error_summary_agent",
  "type": "flow",
  "description": "Olly summarize error",
  "app_type": "Olly",
  "tools": [
    {
      "type": "MLModelTool",
      "name": "SummarizeError",
      "description": "Use this tool to summarize an error response",
      "include_output_in_agent_response": true,
      "parameters": {
        "model_id": "$model",
        "prompt": "\n\nHuman: You will be given an API response with errors, summarize it as a concise paragraph. Do not try to answer the user's question.\nIf the error cannot be fixed, eg. no such field or function not supported, then give suggestions to rephrase the question.\nIt is imperative that you must not give suggestions on how to fix the error or alternative PPL query, or answers to the question.\n\nConsider the following:\nUser's question on index '\${parameters.index}': \${parameters.question}\nPPL (Piped Processing Language) query used: \${parameters.query}\n\nSkip the introduction; go straight into the summarization.\n\nUse the following pieces of context to answer the users question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n\${parameters.response}\n\nAssistant:",
        "response_filter": "$.completion"
      }
    },
    {
      "type": "MLModelTool",
      "name": "Suggestions",
      "description": "Use this tool to generate possible questions for an index",
      "include_output_in_agent_response": true,
      "parameters": {
        "model_id": "$model",
        "prompt": "\n\nHuman: OpenSearch index: \${parameters.index}\n\nRecommend 2 or 3 possible questions on this index given the fields below. Only give the questions, do not give descriptions of questions and do not give PPL queries.\n\nThe format for a field is\n\`\`\`\n- field_name: field_type (sample field value)\n\`\`\`\n\nFields:\n\${parameters.fields}\n\nPut each question in a <question> tag.\n\nAssistant:",
        "response_filter": "$.completion"
      }
    }
  ]
}
EOF
)
echo "error_summary_agent_id: ${error_summary_agent_id}"

ppl_agent_id=$(curl -s -k "${endpoint}/_plugins/_ml/agents/_register" -XPOST -H 'Content-Type: application/json' -d '{
  "name": "ppl_agent",
  "type": "flow",
  "description": "Olly ppl agent",
  "app_type": "Olly",
  "tools": [
    {
      "type": "PPLTool",
      "name": "TransferQuestionToPPLAndExecuteTool",
      "description": "Use this tool to transfer natural language to generate PPL and execute PPL to query inside. Use this tool after you know the index name, otherwise, call IndexRoutingTool first. The input parameters are: {index:IndexName, question:UserQuestion}",
      "parameters": {
        "model_id": "'"$model"'",
        "prompt": "\n\nHuman:You will be given a question about some metrics from a user.\nUse context provided to write a PPL query that can be used to retrieve the information.\n\nHere is a sample PPL query:\nsource=`<index>` | where `<field>` = '"'"'`<value>`'"'"'\n\nHere are some sample questions and the PPL query to retrieve the information. The format for fields is\n```\n- field_name: field_type (sample field value)\n```\n\nFor example, below is a field called `timestamp`, it has a field type of `date`, and a sample value of it could look like `1686000665919`.\n```\n- timestamp: date (1686000665919)\n```\n----------------\n\nThe following text contains fields and questions/answers for the '"'"'accounts'"'"' index\n\nFields:\n- account_number: long (101)\n- address: text ('"'"'880 Holmes Lane'"'"')\n- age: long (32)\n- balance: long (39225)\n- city: text ('"'"'Brogan'"'"')\n- email: text ('"'"'[email protected]'"'"')\n- employer: text ('"'"'Pyrami'"'"')\n- firstname: text ('"'"'Amber'"'"')\n- gender: text ('"'"'M'"'"')\n- lastname: text ('"'"'Duke'"'"')\n- state: text ('"'"'IL'"'"')\n- registered_at: date (1686000665919)\n\nQuestion: Give me some documents in index '"'"'accounts'"'"'\nPPL: source=`accounts` | head\n\nQuestion: Give me 5 oldest people in index '"'"'accounts'"'"'\nPPL: source=`accounts` | sort -age | head 5\n\nQuestion: Give me first names of 5 youngest people in index '"'"'accounts'"'"'\nPPL: source=`accounts` | sort +age | head 5 | fields `firstname`\n\nQuestion: Give me some addresses in index '"'"'accounts'"'"'\nPPL: source=`accounts` | fields `address`\n\nQuestion: Find the documents in index '"'"'accounts'"'"' where firstname is '"'"'Hattie'"'"'\nPPL: source=`accounts` | where `firstname` = '"'"'Hattie'"'"'\n\nQuestion: Find the emails where firstname is '"'"'Hattie'"'"' or lastname is '"'"'Frank'"'"' in index '"'"'accounts'"'"'\nPPL: source=`accounts` | where `firstname` = '"'"'Hattie'"'"' OR `lastname` = '"'"'frank'"'"' | fields `email`\n\nQuestion: Find the documents in index '"'"'accounts'"'"' where firstname is not '"'"'Hattie'"'"' and lastname is not '"'"'Frank'"'"'\nPPL: source=`accounts` | where `firstname` != '"'"'Hattie'"'"' AND `lastname` != '"'"'frank'"'"'\n\nQuestion: Find the emails that contain '"'"'.com'"'"' in index '"'"'accounts'"'"'\nPPL: source=`accounts` | where QUERY_STRING(['"'"'email'"'"'], '"'"'.com'"'"') | fields `email`\n\nQuestion: Find the documents in index '"'"'accounts'"'"' where there is an email\nPPL: source=`accounts` | where ISNOTNULL(`email`)\n\nQuestion: Count the number of documents in index '"'"'accounts'"'"'\nPPL: source=`accounts` | stats COUNT() AS `count`\n\nQuestion: Count the number of people with firstname '"'"'Amber'"'"' in index '"'"'accounts'"'"'\nPPL: source=`accounts` | where `firstname` ='"'"'Amber'"'"' | stats COUNT() AS `count`\n\nQuestion: How many people are older than 33? index is '"'"'accounts'"'"'\nPPL: source=`accounts` | where `age` > 33 | stats COUNT() AS `count`\n\nQuestion: How many distinct ages? index is '"'"'accounts'"'"'\nPPL: source=`accounts` | stats DISTINCT_COUNT(age) AS `distinct_count`\n\nQuestion: How many males and females in index '"'"'accounts'"'"'?\nPPL: source=`accounts` | stats COUNT() AS `count` BY `gender`\n\nQuestion: What is the average, minimum, maximum age in '"'"'accounts'"'"' index?\nPPL: source=`accounts` | stats AVG(`age`) AS `avg_age`, MIN(`age`) AS `min_age`, MAX(`age`) AS `max_age`\n\nQuestion: Show all states sorted by average balance. index is '"'"'accounts'"'"'\nPPL: source=`accounts` | stats AVG(`balance`) AS `avg_balance` BY `state` | sort +avg_balance\n\n----------------\n\nThe following text contains fields and questions/answers for the '"'"'ecommerce'"'"' index\n\nFields:\n- category: text ('"'"'Men'"'"'s Clothing'"'"')\n- currency: keyword ('"'"'EUR'"'"')\n- customer_birth_date: date (null)\n- customer_first_name: text ('"'"'Eddie'"'"')\n- customer_full_name: text ('"'"'Eddie Underwood'"'"')\n- customer_gender: keyword ('"'"'MALE'"'"')\n- customer_id: keyword ('"'"'38'"'"')\n- customer_last_name: text ('"'"'Underwood'"'"')\n- customer_phone: keyword ('"'"''"'"')\n- day_of_week: keyword ('"'"'Monday'"'"')\n- day_of_week_i: integer (0)\n- email: keyword ('"'"'[email protected]'"'"')\n- event.dataset: keyword ('"'"'sample_ecommerce'"'"')\n- geoip.city_name: keyword ('"'"'Cairo'"'"')\n- geoip.continent_name: keyword ('"'"'Africa'"'"')\n- geoip.country_iso_code: keyword ('"'"'EG'"'"')\n- geoip.location: geo_point ([object Object])\n- geoip.region_name: keyword ('"'"'Cairo Governorate'"'"')\n- manufacturer: text ('"'"'Elitelligence,Oceanavigations'"'"')\n- order_date: date (2023-06-05T09:28:48+00:00)\n- order_id: keyword ('"'"'584677'"'"')\n- products._id: text (null)\n- products.base_price: half_float (null)\n- products.base_unit_price: half_float (null)\n- products.category: text (null)\n- products.created_on: date (null)\n- products.discount_amount: half_float (null)\n- products.discount_percentage: half_float (null)\n- products.manufacturer: text (null)\n- products.min_price: half_float (null)\n- products.price: half_float (null)\n- products.product_id: long (null)\n- products.product_name: text (null)\n- products.quantity: integer (null)\n- products.sku: keyword (null)\n- products.tax_amount: half_float (null)\n- products.taxful_price: half_float (null)\n- products.taxless_price: half_float (null)\n- products.unit_discount_amount: half_float (null)\n- sku: keyword ('"'"'ZO0549605496,ZO0299602996'"'"')\n- taxful_total_price: half_float (36.98)\n- taxless_total_price: half_float (36.98)\n- total_quantity: integer (2)\n- total_unique_products: integer (2)\n- type: keyword ('"'"'order'"'"')\n- user: keyword ('"'"'eddie'"'"')\n\nQuestion: What is the average price of products in clothing category ordered in the last 7 days? index is '"'"'ecommerce'"'"'\nPPL: source=`ecommerce` | where QUERY_STRING(['"'"'category'"'"'], '"'"'clothing'"'"') AND `order_date` > DATE_SUB(NOW(), INTERVAL 7 DAY) | stats AVG(`taxful_total_price`) AS `avg_price`\n\nQuestion: What is the average price of products in each city ordered today by every 2 hours? index is '"'"'ecommerce'"'"'\nPPL: source=`ecommerce` | where `order_date` > DATE_SUB(NOW(), INTERVAL 24 HOUR) | stats AVG(`taxful_total_price`) AS `avg_price` by SPAN(`order_date`, 2h) AS `span`, `geoip.city_name`\n\nQuestion: What is the total revenue of shoes each day in this week? index is '"'"'ecommerce'"'"'\nPPL: source=`ecommerce` | where QUERY_STRING(['"'"'category'"'"'], '"'"'shoes'"'"') AND `order_date` > DATE_SUB(NOW(), INTERVAL 1 WEEK) | stats SUM(`taxful_total_price`) AS `revenue` by SPAN(`order_date`, 1d) AS `span`\n\n----------------\n\nThe following text contains fields and questions/answers for the '"'"'events'"'"' index\nFields:\n- timestamp: long (1686000665919)\n- attributes.data_stream.dataset: text ('"'"'nginx.access'"'"')\n- attributes.data_stream.namespace: text ('"'"'production'"'"')\n- attributes.data_stream.type: text ('"'"'logs'"'"')\n- body: text ('"'"'172.24.0.1 - - [02/Jun/2023:23:09:27 +0000] '"'"'GET / HTTP/1.1'"'"' 200 4955 '"'"'-'"'"' '"'"'Mozilla/5.0 zgrab/0.x'"'"''"'"')\n- communication.source.address: text ('"'"'127.0.0.1'"'"')\n- communication.source.ip: text ('"'"'172.24.0.1'"'"')\n- container_id: text (null)\n- container_name: text (null)\n- event.category: text ('"'"'web'"'"')\n- event.domain: text ('"'"'nginx.access'"'"')\n- event.kind: text ('"'"'event'"'"')\n- event.name: text ('"'"'access'"'"')\n- event.result: text ('"'"'success'"'"')\n- event.type: text ('"'"'access'"'"')\n- http.flavor: text ('"'"'1.1'"'"')\n- http.request.method: text ('"'"'GET'"'"')\n- http.response.bytes: long (4955)\n- http.response.status_code: keyword ('"'"'200'"'"')\n- http.url: text ('"'"'/'"'"')\n- log: text (null)\n- observerTime: date (1686000665919)\n- source: text (null)\n- span_id: text ('"'"'abcdef1010'"'"')\n- trace_id: text ('"'"'102981ABCD2901'"'"')\n\nQuestion: What are recent logs with errors and contains word '"'"'test'"'"'? index is '"'"'events'"'"'\nPPL: source=`events` | where QUERY_STRING(['"'"'http.response.status_code'"'"'], '"'"'4* OR 5*'"'"') AND QUERY_STRING(['"'"'body'"'"'], '"'"'test'"'"') AND `observerTime` > DATE_SUB(NOW(), INTERVAL 5 MINUTE)\n\nQuestion: What is the total number of log with a status code other than 200 in 2023 February? index is '"'"'events'"'"'\nPPL: source=`events` | where QUERY_STRING(['"'"'http.response.status_code'"'"'], '"'"'!200'"'"') AND `observerTime` >= '"'"'2023-02-01 00:00:00'"'"' AND `observerTime` < '"'"'2023-03-01 00:00:00'"'"' | stats COUNT() AS `count`\n\nQuestion: Count the number of business days that have web category logs last week? index is '"'"'events'"'"'\nPPL: source=`events` | where `category` = '"'"'web'"'"' AND `observerTime` > DATE_SUB(NOW(), INTERVAL 1 WEEK) AND DAY_OF_WEEK(`observerTime`) >= 2 AND DAY_OF_WEEK(`observerTime`) <= 6 | stats DISTINCT_COUNT(DATE_FORMAT(`observerTime`, '"'"'yyyy-MM-dd'"'"')) AS `distinct_count`\n\nQuestion: What are the top traces with largest bytes? index is '"'"'events'"'"'\nPPL: source=`events` | stats SUM(`http.response.bytes`) AS `sum_bytes` by `trace_id` | sort -sum_bytes | head\n\nQuestion: Give me log patterns? index is '"'"'events'"'"'\nPPL: source=`events` | patterns `body` | stats take(`body`, 1) AS `sample_pattern` by `patterns_field` | fields `sample_pattern`\n\nQuestion: Give me log patterns for logs with errors? index is '"'"'events'"'"'\nPPL: source=`events` | where QUERY_STRING(['"'"'http.response.status_code'"'"'], '"'"'4* OR 5*'"'"') | patterns `body` | stats take(`body`, 1) AS `sample_pattern` by `patterns_field` | fields `sample_pattern`\n\n----------------\n\nUse the following steps to generate the PPL query:\n\nStep 1. Find all field entities in the question.\n\nStep 2. Pick the fields that are relevant to the question from the provided fields list using entities. Rules:\n#01 Consider the field name, the field type, and the sample value when picking relevant fields. For example, if you need to filter flights departed from '"'"'JFK'"'"', look for a `text` or `keyword` field with a field name such as '"'"'departedAirport'"'"', and the sample value should be a 3 letter IATA airport code. Similarly, if you need a date field, look for a relevant field name with type `date` and not `long`.\n#02 You must pick a field with `date` type when filtering on date/time.\n#03 You must pick a field with `date` type when aggregating by time interval.\n#04 You must not use the sample value in PPL query, unless it is relevant to the question.\n#05 You must only pick fields that are relevant, and must pick the whole field name from the fields list.\n#06 You must not use fields that are not in the fields list.\n#07 You must not use the sample values unless relevant to the question.\n#08 You must pick the field that contains a log line when asked about log patterns. Usually it is one of `log`, `body`, `message`.\n\nStep 3. Use the chosen fields to write the PPL query. Rules:\n#01 Always use comparisons to filter date/time, eg. '"'"'where `timestamp` > DATE_SUB(NOW(), INTERVAL 1 DAY)'"'"'; or by absolute time: '"'"'where `timestamp` > '"'"'yyyy-MM-dd HH:mm:ss'"'"''"'"', eg. '"'"'where `timestamp` < '"'"'2023-01-01 00:00:00'"'"''"'"'. Do not use `DATE_FORMAT()`.\n#02 Only use PPL syntax and keywords appeared in the question or in the examples.\n#03 If user asks for current or recent status, filter the time field for last 5 minutes.\n#04 The field used in '"'"'SPAN(`<field>`, <interval>)'"'"' must have type `date`, not `long`.\n#05 When aggregating by `SPAN` and another field, put `SPAN` after `by` and before the other field, eg. '"'"'stats COUNT() AS `count` by SPAN(`timestamp`, 1d) AS `span`, `category`'"'"'.\n#06 You must put values in quotes when filtering fields with `text` or `keyword` field type.\n#07 To find documents that contain certain phrases in string fields, use `QUERY_STRING` which supports multiple fields and wildcard, eg. '"'"'where QUERY_STRING(['"'"'field1'"'"', '"'"'field2'"'"'], '"'"'prefix*'"'"')'"'"'.\n#08 To find 4xx and 5xx errors using status code, if the status code field type is numeric (eg. `integer`), then use '"'"'where `status_code` >= 400'"'"'; if the field is a string (eg. `text` or `keyword`), then use '"'"'where QUERY_STRING(['"'"'status_code'"'"'], '"'"'4* OR 5*'"'"')'"'"'.\n\n----------------\nPlease only contain PPL inside your response.\n----------------\nQuestion: %s? index is `%s`\nFields:\n%s\n\nAssistant:",
        "response_filter": "$.completion"
      }
    }
  ]
}' | jq -r '.agent_id')
echo "ppl_agent_id: ${ppl_agent_id}"
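Once an agent is registered, it can be exercised directly through the ml-commons agent execute endpoint. The sketch below assumes the PPL agent accepts the `index` and `question` parameters named in the tool description above. The helper names are hypothetical and not part of this PR, and the response is printed as raw JSON because its shape is not described here.

```shell
# Hypothetical helpers (not part of this PR) for calling a registered agent.

# Build the execute URL. Arguments: $1 = endpoint, $2 = agent id.
agent_execute_url() {
  printf '%s/_plugins/_ml/agents/%s/_execute' "$1" "$2"
}

# Build the request body. Arguments: $1 = index name, $2 = question.
# jq handles JSON escaping of the question text.
ppl_request_body() {
  jq -cn --arg index "$1" --arg question "$2" \
    '{parameters: {index: $index, question: $question}}'
}

# Send a question to the PPL agent and print the raw JSON response.
# Arguments: $1 = endpoint, $2 = agent id, $3 = index name, $4 = question.
ask_ppl_agent() {
  curl -s -k "$(agent_execute_url "$1" "$2")" -XPOST \
    -H 'Content-Type: application/json' \
    -d "$(ppl_request_body "$3" "$4")"
}
```

For example, `ask_ppl_agent "$endpoint" "$ppl_agent_id" accounts 'How many people are older than 33'` would send one of the sample questions from the prompt to the agent created above.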

There will be another PR to change the config keys from *_id to *_name and to look up the agent id at run time through the ml-commons agent search API, once that API is ready.

Screenshots:

When query assist is enabled and configured:

image

When it is not enabled (default option):

image

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

paulstn and others added 30 commits November 5, 2023 17:57
Signed-off-by: Paul Sebastian <[email protected]>
Changes are from git diff 338fba7..7344cfa -- public ':!public/components/llm_chat'
Signed-off-by: Joshua Li <[email protected]>
@paulstn paulstn added the enhancement New feature or request label Jan 10, 2024
@paulstn paulstn self-requested a review January 10, 2024 23:36
@joshuali925 joshuali925 force-pushed the feature/explorer-query-assistant branch from f138ee9 to 7822356 Compare January 11, 2024 18:16
@paulstn paulstn force-pushed the feature/explorer-query-assistant branch from ff0d2ca to f9f1dfe Compare January 12, 2024 01:18
@paulstn paulstn merged commit 38957cd into main Jan 12, 2024
10 of 31 checks passed
@joshuali925 joshuali925 mentioned this pull request Jan 12, 2024
6 tasks
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/dashboards-observability/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/dashboards-observability/backport-2.x
# Create a new branch
git switch --create backport/backport-1325-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 38957cd501b5bea93982a8f21c1f2e784d9f9b7d
# Push it to GitHub
git push --set-upstream origin backport/backport-1325-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/dashboards-observability/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1325-to-2.x.

Labels
backport 2.x enhancement New feature or request
5 participants