[ML] Text structure finder caps exclude lines pattern at 1000 characters #84236

Merged

Conversation

droberts195
Contributor

Because of the way Filebeat parses CSV files, the text structure finder
needs to generate a regular expression that will ignore the header row
of the CSV file.

It does this by concatenating the column names, separated by the delimiter,
with optional quoting. However, if there are hundreds of columns, this can
lead to a very long regular expression, potentially one that cannot be
evaluated by some programming languages.
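
To make the construction above concrete, here is a minimal sketch in Python. It is not the Elasticsearch code (which is Java and may differ in detail); the function name `make_exclude_lines_pattern` and its parameters are hypothetical and chosen only for illustration.

```python
import re


def make_exclude_lines_pattern(columns, delimiter=",", quote='"'):
    """Build a regex that matches a CSV header row (hypothetical helper)."""
    opt_quote = re.escape(quote) + "?"        # quoting around a name is optional
    delimiter_pattern = re.escape(delimiter)
    # Escape each column name and allow it to be wrapped in quotes.
    fields = [opt_quote + re.escape(name) + opt_quote for name in columns]
    return "^" + delimiter_pattern.join(fields)


small_pattern = make_exclude_lines_pattern(["time", "message", "bytes"])

# With hundreds of columns the pattern grows without bound.
wide_pattern = make_exclude_lines_pattern([f"column_{i}" for i in range(500)])
```

With three columns the pattern is short, but for the 500-column header it runs to several thousand characters, which is the problem this change addresses.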

This change limits the length of the regular expression to 1000 characters
by including elements only for the leading columns when there are many.
Matching 1000 characters of header should be sufficient to reliably
identify the header row even when it is much longer. It is extremely
unlikely that there would be a data row where the first 1000 characters
exactly matched the header but then subsequent fields diverged.
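
A capped version of the pattern builder might look roughly like the sketch below. It is written in Python for illustration only, not taken from the Elasticsearch source. The names `MAX_PATTERN_LENGTH` and `make_capped_exclude_lines_pattern` are hypothetical, and two details are assumptions rather than facts from this PR: the first column is always kept, and a trailing `.*` stands in for the omitted columns.

```python
import re

MAX_PATTERN_LENGTH = 1000  # the cap described in this PR


def make_capped_exclude_lines_pattern(columns, delimiter=",", quote='"',
                                      max_length=MAX_PATTERN_LENGTH):
    """Build a header-matching regex covering only the leading columns."""
    opt_quote = re.escape(quote) + "?"
    delimiter_pattern = re.escape(delimiter)
    wildcard = ".*"  # assumption: stands in for the columns left out
    pattern = "^"
    for index, name in enumerate(columns):
        field = opt_quote + re.escape(name) + opt_quote
        if index > 0:
            field = delimiter_pattern + field
        # Assumption: the first column is always kept, so a single extremely
        # long column name could still push the pattern past the cap.
        if index > 0 and len(pattern) + len(field) + len(wildcard) > max_length:
            # Stop adding columns and let the rest of the line match anything.
            return pattern + wildcard
        pattern += field
    return pattern


columns = [f"column_{i}" for i in range(500)]
capped_pattern = make_capped_exclude_lines_pattern(columns)
short_pattern = make_capped_exclude_lines_pattern(["a", "b"])
```

In this sketch the capped pattern still matches the full 500-column header line, because only the leading fields are checked literally. A data row would have to reproduce all of those leading header values exactly to be excluded by mistake.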

Fixes #83434

@elasticmachine elasticmachine added the Team:ML (Meta label for the ML team) label Feb 22, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @droberts195, I've created a changelog YAML for you.

@droberts195 droberts195 merged commit 362351f into elastic:master Feb 22, 2022
@elasticsearchmachine
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:
An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 84236

@droberts195 droberts195 deleted the limit_exclude_lines_pattern_length branch February 22, 2022 18:13
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 22, 2022
…haracters

Backport of elastic#84236
elasticsearchmachine pushed a commit that referenced this pull request Feb 22, 2022
…haracters (#84239)

Backport of #84236
probakowski pushed a commit to probakowski/elasticsearch that referenced this pull request Feb 23, 2022
…ers (elastic#84236)

Fixes elastic#83434
Labels
>bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.1.1, v8.2.0
Development

Successfully merging this pull request may close these issues.

[ML] Structure finder should not generate regexes more than 1000 characters long
4 participants