
[Auto Import] CSV format support #194386

Merged: 42 commits into elastic:main on Oct 14, 2024

Conversation

ilyannn (Contributor) commented Sep 30, 2024

Release Notes

Automatic Import can now create integrations for logs in the CSV format. Since log format support is now sufficiently mature, we also remove the verbiage about requiring the JSON/NDJSON format.

Summary

Added: the CSV feature

The issue is #194342

When the user adds log samples whose format the LLM recognizes as CSV, we now parse the samples and insert the csv processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html) into the generated pipeline.
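
For illustration, a generated processor definition might look like the following sketch (the column names here are hypothetical, not taken from a real generated pipeline):

```ts
// A minimal sketch, assuming hypothetical column names; the real
// target_fields come from the header row or the LLM, as described below.
const csvProcessor = {
  csv: {
    field: 'message',
    target_fields: ['timestamp', 'user_name', 'database', 'session_id'],
    description: 'Parse the CSV log line into named columns',
    tag: 'parse_csv',
  },
};
```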

If a header is present, we use it for the field names and add a drop processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html) that removes the header row from the document stream by comparing each document's values to the header values.
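
As a sketch, the drop condition can exploit the fact that, after the csv processor runs, the document produced by the header row has every column equal to its own name (again with hypothetical column names):

```ts
// A sketch only: drop the document that came from the header row.
const dropHeaderProcessor = {
  drop: {
    if: "ctx.timestamp == 'timestamp' && ctx.user_name == 'user_name'",
    description: 'Remove the CSV header row from the document stream',
  },
};
```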

If the header is missing, we ask the LLM to generate a list of column names, providing some context such as the package and data stream titles.

Should the header or the LLM suggestion prove unsuitable for a specific column, we fall back to column1, column2, and so on. To avoid duplicate column names, we append suffixes such as _2 as necessary.
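
A minimal sketch of that fallback and deduplication logic (a hypothetical helper, not the actual implementation):

```ts
// Hypothetical helper: normalize suggested column names, falling back to
// columnN for unsuitable names and suffixing _2, _3, ... on duplicates.
function toUniqueColumnNames(suggested: Array<string | undefined>): string[] {
  const seen = new Map<string, number>();
  return suggested.map((name, i) => {
    const base =
      name && /^[A-Za-z_][A-Za-z0-9_]*$/.test(name) ? name : `column${i + 1}`;
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    return count === 1 ? base : `${base}_${count}`;
  });
}

// Example: ['timestamp', undefined, 'timestamp'] -> ['timestamp', 'column2', 'timestamp_2']
```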

If the format appears to be CSV but the csv processor fails, we bubble up an error using the recently introduced ErrorThatHandlesItsOwnResponse class. This is also the first example of passing additional error attributes (in this case, the original CSV error) back to the client; the error message is composed on the client side.
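
The pattern looks roughly like the following sketch (hypothetical names and response shape; the actual ErrorThatHandlesItsOwnResponse class lives in the Kibana source):

```ts
// Hypothetical sketch of the pattern only; not the actual Kibana API.
interface ResponseFactoryLike {
  customError(opts: { statusCode: number; body: unknown }): unknown;
}

class CSVParseError extends Error {
  constructor(message: string, private readonly originalCSVError: string) {
    super(message);
  }
  // The error builds its own HTTP response, attaching the original csv
  // processor error as an attribute so the client can compose the message.
  handleResponse(res: ResponseFactoryLike) {
    return res.customError({
      statusCode: 422,
      body: {
        message: this.message,
        attributes: { originalCSVError: this.originalCSVError },
      },
    });
  }
}
```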

Removed: supported formats message

The message that asks the user to upload the logs in JSON/NDJSON format is removed in this PR:

[image]

Refactoring

The refactoring makes the "→JSON" conversion process more uniform across different chains and centralizes processor definitions in .../server/util/processors.ts.

The log format chain now expects the LLM to provide its answer following the SamplesFormat structure rather than an ad-hoc format.
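
For orientation, such a structure might look like this sketch (an illustrative assumption, not the actual SamplesFormat definition):

```ts
// Illustrative shape only (assumption), not the actual type definition.
interface SamplesFormatSketch {
  name: 'json' | 'ndjson' | 'csv' | 'structured' | 'unstructured' | 'unsupported';
  header?: boolean;     // csv: whether a header row is present
  columns?: string[];   // csv: column names, when known
  json_path?: string[]; // json: path to the array of sample events
}
```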

The fail method is not supported in Jest, so it has been removed from the tests.

Examples

Postgres logs (original)

When using the original Postgres log file from the integrations repository, we get the following error:

[image]

The reason is that the file is indeed not valid CSV: a line break inside a quoted field leaves the quotes unmatched when the file is parsed line by line:

2021-01-04 01:07:04.364 UTC,"postgres","postgres",86,"172.24.0.1:45126",5ff26a0c.56,9,"idle",2021-01-04 01:06:20 UTC,3/8,0,LOG,00000,"statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings   WHERE context != 'internal'   UNION ALL SELECT 'all') ss  WHERE substring(name,1,3)='log'
LIMIT 1000",,,,,,,,,"psql","client backend"

Postgres logs (fixed)

If we remove the line breaks and apply some additional fixes to the reserved fields, the integration ai_postgres_202410081032-1.0.0.zip is generated successfully, with the LLM filling in some of the column names:

[image]
Example event
{
  "ai_postgres_202410081032": {
    "logs": {
      "backend_type": "postmaster",
      "column11": "0",
      "column4": "1",
      "column7": "7",
      "error_code": "00000",
      "session_start_time": "2021-01-04 01:06:13 UTC",
      "timestamp": "2021-01-04 01:07:20.001 UTC"
    }
  },
  "ecs": {
    "version": "8.11.0"
  },
  "event": {
    "category": [
      "database",
      "configuration"
    ],
    "created": "2021-01-04T01:07:20.001Z",
    "original": "2021-01-04 01:07:20.001 UTC,,,1,,5ff26a05.1,7,,2021-01-04 01:06:13 UTC,,0,LOG,00000,\"parameter \"\"log_min_duration_statement\"\" changed to \"\"0\"\"\",,,,,,,,,\"\",\"postmaster\"",
    "start": "2021-01-04T01:06:13.000Z",
    "type": [
      "info",
      "change"
    ]
  },
  "log": {
    "level": "LOG"
  },
  "message": "parameter \"log_min_duration_statement\" changed to \"0\"",
  "tags": [
    "preserve_original_event"
  ]
}

Recorded Future

This sample log file has a header. We generate the ai_rf_url_202410082213-1.0.0.zip with the following sample:

[image]

PanOS System Log

This sample log file does not have a header. We provide a "Format" string from the docs as part of the data stream title.

We produce the integration ai_panos_202410082244-1.0.0.zip:

[image]

Note the use of separate FUTURE_USE_N columns.

Without the hints, the results (ai_panos_202410082250-1.0.0.zip) are less impressive:

[image]

PanOS Tunnel Inspection Log

Example event
        {
            "ai_panos_tunnel_202410132313": {
                "log": {
                    "A_Slice_Differentiator": "asd",
                    "A_Slice_Service_Type": "ast",
                    "Action_Flags": "user",
                    "Action_Source": "action",
                    "Application_Category": "cat",
                    "Application_Characteristic": "char",
                    "Application_Container": "con",
                    "Application_Risk": "4",
                    "Application_SaaS": "no",
                    "Application_Sanctioned_State": "no",
                    "Application_Subcategory": "app-sc",
                    "Application_Technology": "tech",
                    "Destination_External_Dynamic_List": "100",
                    "Destination_Location": "dst-loc",
                    "Destination_Zone": "d-zone",
                    "Device_Group_Hierarchy_Level_1": "0",
                    "Device_Group_Hierarchy_Level_2": "0",
                    "Device_Group_Hierarchy_Level_3": "0",
                    "Device_Group_Hierarchy_Level_4": "0",
                    "Dynamic_User_Group": "dug",
                    "FUTURE_USE": "1",
                    "FUTURE_USE_2": "2561",
                    "Flags": "0",
                    "Generated_Time": "2021/11/23 00:44:44",
                    "High_Resolution_Timestamp": "2021-11-23T00:44:44.930-08:00",
                    "Inbound_Interface": "inbound",
                    "Log_Action": "log",
                    "Maximum_Encapsulation": "20",
                    "Monitor_Tag_IMEI": "imei",
                    "PCAP_ID": "pcap",
                    "PDU_Session_ID": "100",
                    "Packets_Sent": "10",
                    "Parent_Session_ID": "1000",
                    "Parent_Start_Time": "1000",
                    "Receive_Time": "2021/11/23 00:44:44",
                    "Remote_User_ID": "100",
                    "Remote_User_IP": "81.2.69.192",
                    "Sequence_Number": "1000",
                    "Serial_Number": "1234567890",
                    "Sessions_Closed": "1000",
                    "Sessions_Created": "1000",
                    "Source_External_Dynamic_List": "100",
                    "Source_Location": "src-loc",
                    "Source_Zone": "s-zone",
                    "Start_Time": "2021-11-23T00:44:44.930-08:00",
                    "Strict_Check": "75",
                    "Subtype": "start",
                    "Tunnel": "1000",
                    "Tunnel_Fragment": "50",
                    "Tunnel_ID_IMSI": "imsi",
                    "Tunnel_Inspection_Rule": "rule1",
                    "Type": "START",
                    "Unknown_Protocol": "200",
                    "Virtual_System": "vsys",
                    "Virtual_System_Name": "vsys-name"
                }
            },
            "destination": {
                "bytes": "10",
                "geo": {
                    "city_name": "Changchun",
                    "continent_name": "Asia",
                    "country_iso_code": "CN",
                    "country_name": "China",
                    "location": {
                        "lat": 43.88,
                        "lon": 125.3228
                    },
                    "region_iso_code": "CN-22",
                    "region_name": "Jilin Sheng"
                },
                "ip": "175.16.199.1",
                "nat": {
                    "ip": "10.0.0.30",
                    "port": "9300"
                },
                "packets": "10",
                "port": "9550",
                "user": {
                    "name": "d-user"
                }
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "action": "action",
                "category": [
                    "network",
                    "authentication",
                    "session"
                ],
                "duration": "1234567890",
                "id": "id",
                "original": "1,2021/11/23 00:44:44,1234567890,START,start,2561,2021/11/23 00:44:44,10.0.0.10,175.16.199.1,10.0.0.20,10.0.0.30,rule,,d-user,app,vsys,s-zone,d-zone,inbound,outbound,log,,id,100,9000,9550,9200,9300,0,tcp,action,4,1000,user,src-loc,dst-loc,0,0,0,0,vsys-name,d-name,imsi,imei,1000,1000,1000,10,10,10,10,10,10,20,200,75,50,1000,1000,end,action,2021-11-23T00:44:44.930-08:00,1234567890,rule1,81.2.69.192,100,100,pcap,dug,100,100,2021-11-23T00:44:44.930-08:00,asd,ast,100,app-sc,cat,tech,4,char,con,no,no",
                "reason": "end",
                "sequence": "100",
                "severity": "4",
                "start": "2021-11-23T08:44:44.930Z",
                "type": [
                    "start",
                    "connection",
                    "info",
                    "end"
                ]
            },
            "host": {
                "name": "d-name"
            },
            "network": {
                "application": "app",
                "bytes": "10",
                "name": "outbound",
                "packets": "10",
                "transport": "tcp"
            },
            "related": {
                "hosts": [
                    "d-name"
                ],
                "ip": [
                    "10.0.0.30",
                    "175.16.199.1",
                    "10.0.0.20",
                    "10.0.0.10",
                    "81.2.69.192"
                ],
                "user": [
                    "d-user"
                ]
            },
            "rule": {
                "name": "rule",
                "uuid": "100"
            },
            "source": {
                "bytes": "10",
                "ip": "10.0.0.10",
                "nat": {
                    "ip": "10.0.0.20",
                    "port": "9200"
                },
                "port": "9000"
            },
            "tags": [
                "preserve_original_event"
            ]
        }

Todo

Completed items
  • Unit or functional tests were updated or added to match the most common scenarios
  • Strings are internationalized
  • Upload an example generated for CSV
  • Upload more examples

Possible follow-up

Follow-up items are grouped into three areas: the CSV feature, log formats, and bugs.

ilyannn added the release_note:feature and Team:Security-Scalability labels Sep 30, 2024
ilyannn self-assigned this Sep 30, 2024
ilyannn marked this pull request as ready for review September 30, 2024 13:45
ilyannn requested a review from a team as a code owner September 30, 2024 13:45
elasticmachine (Contributor) commented

Pinging @elastic/security-scalability (Team:Security-Scalability)

ilyannn added the backport:skip label Sep 30, 2024
ilyannn (Contributor, Author) commented Oct 1, 2024

@elasticmachine merge upstream

ilyannn (Contributor, Author) commented Oct 4, 2024

@elasticmachine merge upstream

elasticmachine (Contributor) commented

merge conflict between base and head

ilyannn marked this pull request as draft October 4, 2024 22:38
ilyannn (Contributor, Author) commented Oct 11, 2024

@elasticmachine merge upstream

ilyannn added the backport:prev-minor label and removed the backport:prev-major label Oct 13, 2024
ilyannn (Contributor, Author) commented Oct 13, 2024

@elasticmachine merge upstream

ilyannn enabled auto-merge (squash) October 13, 2024 20:14
ilyannn disabled auto-merge October 13, 2024 20:27
elasticmachine (Contributor) commented Oct 13, 2024

💔 Build Failed

Failed CI Steps

History

cc @ilyannn

bhapas (Contributor) left a comment

LGTM

ilyannn merged commit 6a72037 into elastic:main Oct 14, 2024
20 checks passed
ilyannn deleted the auto-import/csv-format branch October 14, 2024 10:25
kibanamachine (Contributor) commented

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11325642463

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 14, 2024
Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 6a72037)
kibanamachine (Contributor) commented

💚 All backports created successfully

Branch: 8.x (backport created successfully)

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 14, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] CSV format support
(#194386)](#194386)


### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ilya
Nikokoshev","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-14T10:24:58Z","message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["v9.0.0","release_note:feature","backport:prev-minor","Team:Security-Scalability","Feature:AutomaticImport"],"title":"[Auto
Import] CSV format
support","number":194386,"url":"https://github.com/elastic/kibana/pull/194386","mergeCommit":{"message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/194386","number":194386,"mergeCommit":{"message":"[Auto
Import] CSV format support (#194386)\n\n## Release
Notes\r\n\r\nAutomatic Import can now create integrations for logs in
the CSV format.\r\nOwing to the maturity of log format support, we thus
remove the verbiage\r\nabout requiring the JSON/NDJSON format.\r\n\r\n##
Summary\r\n\r\n**Added: the CSV feature**\r\n\r\nThe issue is
#194342 \r\n\r\nWhen the user
adds a log sample whose format is recognized as CSV by the\r\nLLM, we
now parse the samples and insert
the\r\n[csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html)\r\nprocessor
into the generated pipeline.\r\n\r\nIf the header is present, we use it
for the field names and add
a\r\n[drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html)\r\nprocessor
that removes a header from the document stream by comparing\r\nthe
values to the header values.\r\n\r\nIf the header is missing, we ask the
LLM to generate a list of column\r\nnames, providing some context like
package and data stream title.\r\n\r\nShould the header or LLM
suggestion provide unsuitable for a specific\r\ncolumn, we use
`column1`, `column2` and so on as a fallback. To avoid\r\nduplicate
column names, we can add postfixes like `_2` as necessary.\r\n\r\nIf the
format appears to be CSV, but the `csv` processor returns fails,\r\nwe
bubble up an error using the recently
introduced\r\n`ErrorThatHandlesItsOwnResponse` class. We also provide
the first\r\nexample of passing the additional attributes of an error
(in this case,\r\nthe original CSV error) back to the client. The error
message is\r\ncomposed on the client side.\r\n\r\n**Removed: supported
formats message**\r\n \r\nThe message that asks the user to upload the
logs in `JSON/NDJSON\r\nformat` is removed in this PR:\r\n\r\n<img
width=\"741\"
alt=\"image\"\r\nsrc=\"https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12\">\r\n\r\n\r\n**Refactoring**\r\n
\r\nThe refactoring makes the \"→JSON\" conversion process more uniform
across\r\ndifferent chains and centralizes processor definitions
in\r\n`.../server/util/processors.ts`.\r\n\r\nLog format chain now
expects the LLM to follow the `SamplesFormat` when\r\nproviding the
information rather than an ad-hoc format.\r\n \r\nWhen testing, the
`fail` method is [not supported
in\r\n`jest`](https://stackoverflow.com/a/54244479/23968144), so it
is\r\nremoved.\r\n\r\nSee the PR for examples and
follow-up.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"6a72037007d8f71504f444911c9fa25adfb1bb89"}}]}]
BACKPORT-->

Co-authored-by: Ilya Nikokoshev <[email protected]>
Labels: backport:prev-minor, Feature:AutomaticImport, release_note:feature, Team:Security-Scalability, v8.16.0, v9.0.0

7 participants