[Auto Import] CSV format support #194386
Conversation
Pinging @elastic/security-scalability (Team:Security-Scalability)
@elasticmachine merge upstream
x-pack/plugins/integration_assistant/server/graphs/log_type_detection/csv.ts
@elasticmachine merge upstream
merge conflict between base and head
…ranch 'main' of github.com:elastic/kibana into auto-import/csv-format
@elasticmachine merge upstream
@elasticmachine merge upstream
💔 Build Failed
Failed CI Steps
cc @ilyannn
LGTM
Starting backport for target branches: 8.x https://github.com/elastic/kibana/actions/runs/11325642463
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
Backport

This will backport the following commits from `main` to `8.x`: [[Auto Import] CSV format support (#194386)](#194386)

Questions? Please refer to the Backport tool documentation

Co-authored-by: Ilya Nikokoshev <[email protected]>
Release Notes
Automatic Import can now create integrations for logs in the CSV format. Given the maturity of log format support, we also remove the message about requiring the JSON/NDJSON format.
Summary
Added: the CSV feature
The issue is #194342
When the user adds a log sample whose format is recognized as CSV by the LLM, we now parse the samples and insert the csv processor into the generated pipeline.
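As an illustrative sketch, building such a processor definition might look like the helper below. The helper and the `message`/column names are hypothetical; only the `csv` processor options (`field`, `target_fields`, `ignore_missing`) come from the Elasticsearch ingest processor docs.

```typescript
// Illustrative helper (not the actual Automatic Import code) that builds a
// `csv` ingest processor definition for the generated pipeline.
interface CsvProcessor {
  csv: {
    field: string;            // field holding the raw CSV line
    target_fields: string[];  // one target field per CSV column
    ignore_missing?: boolean;
  };
}

function makeCsvProcessor(columns: string[]): CsvProcessor {
  return {
    csv: {
      field: 'message',
      target_fields: columns,
      ignore_missing: true,
    },
  };
}

const csvProcessor = makeCsvProcessor(['timestamp', 'level', 'msg']);
```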
If the header is present, we use it for the field names and add a drop processor that removes a header from the document stream by comparing the values to the header values.
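A minimal sketch of the header-dropping idea, assuming a `drop` processor with a Painless `if` condition that fires when every parsed column value still equals its own column name (the helper and the generated condition string are illustrative, not the actual implementation):

```typescript
// Illustrative sketch: a `drop` processor whose condition matches the header
// row, i.e. a document where each parsed field equals its column name.
function makeDropHeaderProcessor(columns: string[]): { drop: { if: string } } {
  const condition = columns
    .map((name) => `ctx['${name}'] == '${name}'`)
    .join(' && ');
  return { drop: { if: condition } };
}

const dropProcessor = makeDropHeaderProcessor(['source_ip', 'user']);
// dropProcessor.drop.if:
//   "ctx['source_ip'] == 'source_ip' && ctx['user'] == 'user'"
```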
If the header is missing, we ask the LLM to generate a list of column names, providing some context like package and data stream title.
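The naming rules for these columns (falling back to `column1`, `column2`, … for unusable names, and `_2`-style postfixes for duplicates) could be sketched as follows; the helper and its validity check are hypothetical, not the actual code:

```typescript
// Hypothetical sketch of the column-naming rules: names that are missing or
// unsuitable fall back to columnN, and repeated names get a _2, _3, … postfix.
function normalizeColumns(raw: Array<string | undefined>): string[] {
  const seen = new Map<string, number>();
  return raw.map((name, i) => {
    const base =
      name && /^[A-Za-z_][A-Za-z0-9_]*$/.test(name)
        ? name
        : `column${i + 1}`; // fallback for missing or unsuitable names
    const count = seen.get(base) ?? 0;
    seen.set(base, count + 1);
    return count === 0 ? base : `${base}_${count + 1}`; // postfix duplicates
  });
}

normalizeColumns(['ip', 'ip', 'bad name']); // → ["ip", "ip_2", "column3"]
```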
Should the header or LLM suggestion prove unsuitable for a specific column, we use `column1`, `column2`, and so on as a fallback. To avoid duplicate column names, we add postfixes like `_2` as necessary.

If the format appears to be CSV but the `csv` processor fails, we bubble up an error using the recently introduced `ErrorThatHandlesItsOwnResponse` class. We also provide the first example of passing the additional attributes of an error (in this case, the original CSV error) back to the client. The error message is composed on the client side.

Removed: supported formats message
The message that asks the user to upload the logs in `JSON/NDJSON format` is removed in this PR.

Refactoring

The refactoring makes the "→JSON" conversion process more uniform across different chains and centralizes processor definitions in `.../server/util/processors.ts`.

The log format chain now expects the LLM to follow the `SamplesFormat` when providing the information, rather than an ad-hoc format.

When testing, the `fail` method is [not supported in `jest`](https://stackoverflow.com/a/54244479/23968144), so it is removed.

Examples
Postgres logs (original)
When using the original Postgres log file from the integrations repository we get the following error:
The reason is that the file is indeed not a correct CSV file, due to line breaks leading to unmatched quotes:
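As a hypothetical illustration (not the actual Postgres sample), a quoted field containing an embedded newline leaves an odd number of quotes on each physical line, which is why parsing the file line by line fails:

```typescript
// Two hypothetical physical lines from a Postgres-style log where a quoted
// message field contains an embedded newline. Counting quotes per physical
// line shows why a line-by-line CSV parse sees unmatched quotes.
const physicalLines = [
  `2024-10-08 10:32:00 UTC,ERROR,"syntax error at or near`, // quote opens here…
  `SELECT"`,                                                // …and closes here
];

const quoteCounts = physicalLines.map(
  (line) => (line.match(/"/g) ?? []).length
);
// Each physical line has an odd quote count, so neither is a complete CSV row.
```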
Postgres logs (fixed)
If we remove the line breaks and perform some additional fixes to the reserved fields, the integration ai_postgres_202410081032-1.0.0.zip is generated successfully, with the LLM filling in some of the column names:
Example event
Recorded Future
This sample log file has a header. We generate the ai_rf_url_202410082213-1.0.0.zip with the following sample:
PanOS System Log
This sample log file does not have a header. We provide a "Format" string from the docs as part of the datastream title.
We produce the integration ai_panos_202410082244-1.0.0.zip:
Note the usage of separate `FUTURE_USE_N` columns.

Without the hints, the results (ai_panos_202410082250-1.0.0.zip) are less impressive:
PanOS Tunnel Inspection Log
Example event
Todo
Completed items
Possible follow-up
CSV feature:
Log formats:
`unsupported`. Instead we can ask again. → [Automatic Import] Retry if the LLM does not return a valid log format answer #196038

Bugs:

`@timestamp` – a lot of our integrations don't have this field as a result. → [Automatic Import] Ensure the @timestamp field is present #196040

`long`, which makes `elastic-package test` unhappy. → [Automatic Import] Conversion to target field type #196041