[8.1] [ML] Text structure finder caps exclude lines pattern at 1000 characters #84239

droberts195
Contributor

Because of the way Filebeat parses CSV files, the text structure finder
needs to generate a regular expression that will ignore the header row
of the CSV file.

It does this by concatenating the column names, separated by the delimiter
and with optional quoting. However, if there are hundreds of columns, this
can lead to a very long regular expression, potentially one that cannot be
evaluated by some programming languages.
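
To illustrate the construction described above, here is a minimal Python
sketch. It is not the Elasticsearch implementation (which is written in
Java); the function name header_pattern and its parameters are hypothetical
and chosen only for this example. It assumes each column name may or may
not be wrapped in the quote character.

```python
import re


def header_pattern(column_names, delimiter=",", quote='"'):
    """Build a regex that matches a CSV header row (hypothetical helper).

    Each column name is escaped so it is matched literally, wrapped in an
    optional quote on both sides, and joined with the escaped delimiter.
    """
    opt_quote = re.escape(quote) + "?"
    parts = [opt_quote + re.escape(name) + opt_quote for name in column_names]
    return "^" + re.escape(delimiter).join(parts)


pattern = header_pattern(["time", "host", "message"])
```

With this sketch the pattern grows by one element per column, which is why
a file with hundreds of columns yields a very long expression.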

This change limits the length of the regular expression to 1000 characters
by including elements only for the leading columns when there are many.
Matching 1000 characters of header should be sufficient to reliably
identify the header row even when it is much longer. It is extremely
unlikely that there would be a data row whose first 1000 characters
exactly matched the header but whose subsequent fields then diverged.
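
The capping idea can be sketched in Python as follows. This is an
illustration under stated assumptions, not the code from the change itself:
the names capped_header_pattern and MAX_PATTERN_LENGTH are hypothetical, and
the choice to always keep the first column is a decision made for this
sketch only.

```python
import re

MAX_PATTERN_LENGTH = 1000  # limit taken from the description above


def capped_header_pattern(column_names, delimiter=",", quote='"',
                          max_length=MAX_PATTERN_LENGTH):
    """Build a header-row regex, stopping before it exceeds max_length.

    Hypothetical helper: columns are appended in order until adding the
    next one would push the pattern past the limit. The first column is
    always kept so the pattern never degenerates to a bare "^".
    """
    opt_quote = re.escape(quote) + "?"
    escaped_delimiter = re.escape(delimiter)
    pattern = "^"
    for index, name in enumerate(column_names):
        element = opt_quote + re.escape(name) + opt_quote
        if index > 0:
            element = escaped_delimiter + element
            if len(pattern) + len(element) > max_length:
                break  # remaining columns are left out of the pattern
        pattern += element
    return pattern


columns = ["column_%d" % i for i in range(500)]
pattern = capped_header_pattern(columns)
header = ",".join(columns)
data_row = ",".join(str(i) for i in range(500))
```

Because the pattern is anchored only at the start of the line, the truncated
expression still matches the full header row, while an ordinary data row
fails on its first field.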

Backport of #84236

droberts195 added the backport, auto-merge-without-approval (automatically merge pull request when CI checks pass; NB doesn't wait for reviews!), and v8.1.1 labels on Feb 22, 2022
elasticsearchmachine merged commit ae83bd2 into elastic:8.1 on Feb 22, 2022
droberts195 deleted the limit_exclude_lines_pattern_length_81 branch on February 22, 2022 at 18:55