
[Auto Import] CSV format support #194386

Merged: 42 commits into elastic:main on Oct 14, 2024

Conversation

ilyannn (Contributor) commented Sep 30, 2024

Release Notes

Automatic Import can now create integrations for logs in the CSV format. Since log format support is now sufficiently mature, we also remove the verbiage about requiring the JSON/NDJSON format.

Summary

Added: the CSV feature

The issue is #194342

When the user adds log samples whose format the LLM recognizes as CSV, we now parse the samples and insert the csv processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html) into the generated pipeline.
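
For illustration, a generated processor definition might look like the following sketch (the column names here are hypothetical, not taken from a real generated pipeline):

```ts
// A minimal sketch, assuming hypothetical column names; the real
// target_fields come from the header row or the LLM, as described below.
const csvProcessor = {
  csv: {
    field: 'message',
    target_fields: ['timestamp', 'user_name', 'database', 'session_id'],
    description: 'Parse the CSV log line into named columns',
    tag: 'parse_csv',
  },
};
```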

If a header is present, we use it for the field names and add a drop processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html) that removes the header row from the document stream by comparing each document's values to the header values.
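
As a sketch, the drop condition can exploit the fact that, after the csv processor runs, the document produced by the header row has every column equal to its own name (again with hypothetical column names):

```ts
// A sketch only: drop the document that came from the header row.
const dropHeaderProcessor = {
  drop: {
    if: "ctx.timestamp == 'timestamp' && ctx.user_name == 'user_name'",
    description: 'Remove the CSV header row from the document stream',
  },
};
```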

If the header is missing, we ask the LLM to generate a list of column names, providing some context such as the package and data stream titles.

Should the header or the LLM suggestion prove unsuitable for a specific column, we fall back to column1, column2, and so on. To avoid duplicate column names, we append suffixes such as _2 as necessary.
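
A minimal sketch of that fallback and deduplication logic (a hypothetical helper, not the actual implementation):

```ts
// Hypothetical helper: normalize suggested column names, falling back to
// columnN for unsuitable names and suffixing _2, _3, ... on duplicates.
function toUniqueColumnNames(suggested: Array<string | undefined>): string[] {
  const seen = new Map<string, number>();
  return suggested.map((name, i) => {
    const base =
      name && /^[A-Za-z_][A-Za-z0-9_]*$/.test(name) ? name : `column${i + 1}`;
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    return count === 1 ? base : `${base}_${count}`;
  });
}

// Example: ['timestamp', undefined, 'timestamp'] -> ['timestamp', 'column2', 'timestamp_2']
```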

If the format appears to be CSV but the csv processor fails, we bubble up an error using the recently introduced ErrorThatHandlesItsOwnResponse class. This is also the first example of passing additional error attributes (in this case, the original CSV error) back to the client; the error message is composed on the client side.
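
The pattern looks roughly like the following sketch (hypothetical names and response shape; the actual ErrorThatHandlesItsOwnResponse class lives in the Kibana source):

```ts
// Hypothetical sketch of the pattern only; not the actual Kibana API.
interface ResponseFactoryLike {
  customError(opts: { statusCode: number; body: unknown }): unknown;
}

class CSVParseError extends Error {
  constructor(message: string, private readonly originalCSVError: string) {
    super(message);
  }
  // The error builds its own HTTP response, attaching the original csv
  // processor error as an attribute so the client can compose the message.
  handleResponse(res: ResponseFactoryLike) {
    return res.customError({
      statusCode: 422,
      body: {
        message: this.message,
        attributes: { originalCSVError: this.originalCSVError },
      },
    });
  }
}
```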

Removed: supported formats message

The message that asks the user to upload the logs in JSON/NDJSON format is removed in this PR:

[image]

Refactoring

The refactoring makes the "→JSON" conversion process more uniform across different chains and centralizes processor definitions in .../server/util/processors.ts.

The log format chain now expects the LLM to provide its answer following the SamplesFormat structure rather than an ad-hoc format.
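
For orientation, such a structure might look like this sketch (an illustrative assumption, not the actual SamplesFormat definition):

```ts
// Illustrative shape only (assumption), not the actual type definition.
interface SamplesFormatSketch {
  name: 'json' | 'ndjson' | 'csv' | 'structured' | 'unstructured' | 'unsupported';
  header?: boolean;     // csv: whether a header row is present
  columns?: string[];   // csv: column names, when known
  json_path?: string[]; // json: path to the array of sample events
}
```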

The fail method is not supported in Jest, so it has been removed from the tests.

Examples

Postgres logs (original)

When using the original Postgres log file from the integrations repository, we get the following error:

[image]

The reason is that the file is indeed not valid CSV: a line break inside a quoted field leaves the quotes unmatched when the file is parsed line by line:

2021-01-04 01:07:04.364 UTC,"postgres","postgres",86,"172.24.0.1:45126",5ff26a0c.56,9,"idle",2021-01-04 01:06:20 UTC,3/8,0,LOG,00000,"statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings   WHERE context != 'internal'   UNION ALL SELECT 'all') ss  WHERE substring(name,1,3)='log'
LIMIT 1000",,,,,,,,,"psql","client backend"

Postgres logs (fixed)

If we remove the line breaks and apply some additional fixes to the reserved fields, the integration ai_postgres_202410081032-1.0.0.zip is generated successfully, with the LLM filling in some of the column names:

[image]
Example event
{
  "ai_postgres_202410081032": {
    "logs": {
      "backend_type": "postmaster",
      "column11": "0",
      "column4": "1",
      "column7": "7",
      "error_code": "00000",
      "session_start_time": "2021-01-04 01:06:13 UTC",
      "timestamp": "2021-01-04 01:07:20.001 UTC"
    }
  },
  "ecs": {
    "version": "8.11.0"
  },
  "event": {
    "category": [
      "database",
      "configuration"
    ],
    "created": "2021-01-04T01:07:20.001Z",
    "original": "2021-01-04 01:07:20.001 UTC,,,1,,5ff26a05.1,7,,2021-01-04 01:06:13 UTC,,0,LOG,00000,\"parameter \"\"log_min_duration_statement\"\" changed to \"\"0\"\"\",,,,,,,,,\"\",\"postmaster\"",
    "start": "2021-01-04T01:06:13.000Z",
    "type": [
      "info",
      "change"
    ]
  },
  "log": {
    "level": "LOG"
  },
  "message": "parameter \"log_min_duration_statement\" changed to \"0\"",
  "tags": [
    "preserve_original_event"
  ]
}

Recorded Future

This sample log file has a header. We generate the ai_rf_url_202410082213-1.0.0.zip with the following sample:

[image]

PanOS System Log

This sample log file does not have a header. We provide a "Format" string from the docs as part of the data stream title.

We produce the integration ai_panos_202410082244-1.0.0.zip:

[image]

Note the use of separate FUTURE_USE_N columns.

Without the hints, the results (ai_panos_202410082250-1.0.0.zip) are less impressive:

[image]

PanOS Tunnel Inspection Log

Example event
        {
            "ai_panos_tunnel_202410132313": {
                "log": {
                    "A_Slice_Differentiator": "asd",
                    "A_Slice_Service_Type": "ast",
                    "Action_Flags": "user",
                    "Action_Source": "action",
                    "Application_Category": "cat",
                    "Application_Characteristic": "char",
                    "Application_Container": "con",
                    "Application_Risk": "4",
                    "Application_SaaS": "no",
                    "Application_Sanctioned_State": "no",
                    "Application_Subcategory": "app-sc",
                    "Application_Technology": "tech",
                    "Destination_External_Dynamic_List": "100",
                    "Destination_Location": "dst-loc",
                    "Destination_Zone": "d-zone",
                    "Device_Group_Hierarchy_Level_1": "0",
                    "Device_Group_Hierarchy_Level_2": "0",
                    "Device_Group_Hierarchy_Level_3": "0",
                    "Device_Group_Hierarchy_Level_4": "0",
                    "Dynamic_User_Group": "dug",
                    "FUTURE_USE": "1",
                    "FUTURE_USE_2": "2561",
                    "Flags": "0",
                    "Generated_Time": "2021/11/23 00:44:44",
                    "High_Resolution_Timestamp": "2021-11-23T00:44:44.930-08:00",
                    "Inbound_Interface": "inbound",
                    "Log_Action": "log",
                    "Maximum_Encapsulation": "20",
                    "Monitor_Tag_IMEI": "imei",
                    "PCAP_ID": "pcap",
                    "PDU_Session_ID": "100",
                    "Packets_Sent": "10",
                    "Parent_Session_ID": "1000",
                    "Parent_Start_Time": "1000",
                    "Receive_Time": "2021/11/23 00:44:44",
                    "Remote_User_ID": "100",
                    "Remote_User_IP": "81.2.69.192",
                    "Sequence_Number": "1000",
                    "Serial_Number": "1234567890",
                    "Sessions_Closed": "1000",
                    "Sessions_Created": "1000",
                    "Source_External_Dynamic_List": "100",
                    "Source_Location": "src-loc",
                    "Source_Zone": "s-zone",
                    "Start_Time": "2021-11-23T00:44:44.930-08:00",
                    "Strict_Check": "75",
                    "Subtype": "start",
                    "Tunnel": "1000",
                    "Tunnel_Fragment": "50",
                    "Tunnel_ID_IMSI": "imsi",
                    "Tunnel_Inspection_Rule": "rule1",
                    "Type": "START",
                    "Unknown_Protocol": "200",
                    "Virtual_System": "vsys",
                    "Virtual_System_Name": "vsys-name"
                }
            },
            "destination": {
                "bytes": "10",
                "geo": {
                    "city_name": "Changchun",
                    "continent_name": "Asia",
                    "country_iso_code": "CN",
                    "country_name": "China",
                    "location": {
                        "lat": 43.88,
                        "lon": 125.3228
                    },
                    "region_iso_code": "CN-22",
                    "region_name": "Jilin Sheng"
                },
                "ip": "175.16.199.1",
                "nat": {
                    "ip": "10.0.0.30",
                    "port": "9300"
                },
                "packets": "10",
                "port": "9550",
                "user": {
                    "name": "d-user"
                }
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "action": "action",
                "category": [
                    "network",
                    "authentication",
                    "session"
                ],
                "duration": "1234567890",
                "id": "id",
                "original": "1,2021/11/23 00:44:44,1234567890,START,start,2561,2021/11/23 00:44:44,10.0.0.10,175.16.199.1,10.0.0.20,10.0.0.30,rule,,d-user,app,vsys,s-zone,d-zone,inbound,outbound,log,,id,100,9000,9550,9200,9300,0,tcp,action,4,1000,user,src-loc,dst-loc,0,0,0,0,vsys-name,d-name,imsi,imei,1000,1000,1000,10,10,10,10,10,10,20,200,75,50,1000,1000,end,action,2021-11-23T00:44:44.930-08:00,1234567890,rule1,81.2.69.192,100,100,pcap,dug,100,100,2021-11-23T00:44:44.930-08:00,asd,ast,100,app-sc,cat,tech,4,char,con,no,no",
                "reason": "end",
                "sequence": "100",
                "severity": "4",
                "start": "2021-11-23T08:44:44.930Z",
                "type": [
                    "start",
                    "connection",
                    "info",
                    "end"
                ]
            },
            "host": {
                "name": "d-name"
            },
            "network": {
                "application": "app",
                "bytes": "10",
                "name": "outbound",
                "packets": "10",
                "transport": "tcp"
            },
            "related": {
                "hosts": [
                    "d-name"
                ],
                "ip": [
                    "10.0.0.30",
                    "175.16.199.1",
                    "10.0.0.20",
                    "10.0.0.10",
                    "81.2.69.192"
                ],
                "user": [
                    "d-user"
                ]
            },
            "rule": {
                "name": "rule",
                "uuid": "100"
            },
            "source": {
                "bytes": "10",
                "ip": "10.0.0.10",
                "nat": {
                    "ip": "10.0.0.20",
                    "port": "9200"
                },
                "port": "9000"
            },
            "tags": [
                "preserve_original_event"
            ]
        }

Todo

Completed items
  • Unit or functional tests were updated or added to match the most common scenarios
  • Strings are internationalized
  • Upload an example generated for CSV
  • Upload more examples

Possible follow-up

Follow-up items are grouped into three areas: the CSV feature, log formats, and bugs.

ilyannn added the release_note:feature and Team:Security-Scalability labels Sep 30, 2024
ilyannn self-assigned this Sep 30, 2024
ilyannn marked this pull request as ready for review September 30, 2024 13:45
ilyannn requested a review from a team as a code owner September 30, 2024 13:45
elasticmachine (Contributor) commented

Pinging @elastic/security-scalability (Team:Security-Scalability)

ilyannn added the backport:skip label Sep 30, 2024
ilyannn (Contributor, Author) commented Oct 1, 2024

@elasticmachine merge upstream

ilyannn (Contributor, Author) commented Oct 4, 2024

@elasticmachine merge upstream

elasticmachine (Contributor) commented

merge conflict between base and head

ilyannn marked this pull request as draft October 4, 2024 22:38
ilyannn (Contributor, Author) commented Oct 11, 2024

@elasticmachine merge upstream

ilyannn added the backport:prev-minor label and removed the backport:prev-major label Oct 13, 2024
ilyannn (Contributor, Author) commented Oct 13, 2024

@elasticmachine merge upstream

ilyannn enabled auto-merge (squash) October 13, 2024 20:14
ilyannn disabled auto-merge October 13, 2024 20:27
elasticmachine (Contributor) commented Oct 13, 2024

💔 Build Failed

Failed CI Steps

History

cc @ilyannn

bhapas (Contributor) left a comment

LGTM

ilyannn merged commit 6a72037 into elastic:main Oct 14, 2024
20 checks passed
ilyannn deleted the auto-import/csv-format branch October 14, 2024 10:25
kibanamachine (Contributor) commented

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11325642463

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 14, 2024
Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 6a72037)
kibanamachine (Contributor) commented

💚 All backports created successfully

Branch: 8.x (backport created successfully)

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 14, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] CSV format support
(#194386)](#194386)


### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ilya
Nikokoshev","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-14T10:24:58Z","message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["v9.0.0","release_note:feature","backport:prev-minor","Team:Security-Scalability","Feature:AutomaticImport"],"title":"[Auto
Import] CSV format
support","number":194386,"url":"https://github.com/elastic/kibana/pull/194386","mergeCommit":{"message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/194386","number":194386,"mergeCommit":{"message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89"}}]}]
BACKPORT-->

Co-authored-by: Ilya Nikokoshev <[email protected]>
Labels: backport:prev-minor, Feature:AutomaticImport, release_note:feature, Team:Security-Scalability, v8.16.0, v9.0.0

7 participants