skip_header => "true" not working #68

Open
espogian opened this issue Aug 7, 2018 · 11 comments

@espogian

espogian commented Aug 7, 2018

Hi, this issue consistently appears when using the following Logstash filter configuration:

filter {
	csv {
		skip_header => "true"
		columns => [
			...

Even with the skip_header => "true" option set, Logstash still indexes the very first row of the CSV.
Reproduced with Logstash 6.3.0.

@MorrieAtElastic

I have a customer seeing this problem in 6.2.4; I have subsequently reproduced it in both 6.2.4 and 6.4.0:

Test configuration file:

input {
  file {
    path => ["/home/TestData/example.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    separator => ","
    add_field => { "SESSION_ID" => "%{FLOW_FILE_ID}%{ID}" }
    skip_header => "true"
  }
  mutate { convert => ["Year", "integer"] }
}

output {
  elasticsearch {
    action => "index"
    hosts => "localhost:9200"
    index => "example"
    user => "elastic"
    password => "password"
    workers => 1
  }
  stdout {}
}

Test Data - example.csv

FLOW_FILE_ID,ID,Name,Year 
123,1,Kiano Reevs,1976 
456,2,Lee Childs ,1978

@guyboertje

@espogian @MorrieAtElastic

The skip_header code is:

        if (@skip_header && (!@columns.empty?) && (@columns == values))
          event.cancel
          return
        end

The above means that the columns setting must be specified, and that its value must be exactly equal to the parsed line represented as an array.

Please verify that there are no extra spaces or similar in the sample data header line.

If there were spaces lurking in the data, then you should add those spaces to the columns setting too.
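
For example, with the sample data above, whose header line ends in a trailing space, a columns setting that reproduces the header exactly (trailing space included) might look like this sketch:

filter {
  csv {
    separator => ","
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year "]
    skip_header => "true"
  }
}

Note the trailing space in "Year ": with it, @columns == values holds for the header line and the event is cancelled. The trade-off is that data rows then populate a field literally named "Year " as well, so trimming the source data is usually the better fix.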

@guyboertje

This test config is successful

input {
  generator {
    lines => ["red,amber,green", "stop,wait for it,go go go - man"]
    count => 1
  }
}

filter {
  csv {
    separator => ","
    columns => ["red","amber","green"]
    skip_header => "true"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Results

bin/logstash -f /elastic/tmp/testing/confs/test-csv-skip_headers.conf
Sending Logstash's logs to /elastic/tmp/logstash-6.3.2/logs which is now configured via log4j2.properties
[2018-08-28T10:21:08,573][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-08-28T10:21:09,194][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.3.2"}
[2018-08-28T10:21:12,276][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-08-28T10:21:12,367][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 sleep>"}
[2018-08-28T10:21:12,438][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2018-08-28T10:21:12,968][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
{
           "red" => "stop",
      "@version" => "1",
         "amber" => "wait for it",
         "green" => "go go go - man",
       "message" => "stop,wait for it,go go go - man",
    "@timestamp" => 2018-08-28T09:21:12.411Z,
          "host" => "Elastics-MacBook-Pro.local",
      "sequence" => 0
}
[2018-08-28T10:21:13,079][INFO ][logstash.pipeline        ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 run>"}

@MorrieAtElastic

I was able to identify a trailing space in the header line based on the debug output; note the trailing space in the "message" field below. When I eliminated that trailing space, the "skip_header" function did work.

{
"@Version" => "1",
"message" => "FLOW_FILE_ID,ID,Name,Year ",
"FLOW_FILE_ID" => "FLOW_FILE_ID",
"Year" => 0,
"Name" => "Name",
"path" => "/home/morrie/TestData/example4.csv",
"@timestamp" => 2018-08-28T10:18:44.401Z,
"ID" => "ID",
"host" => "morrie-Elasti"
}

My problem with this is that it is a rather weak definition of a header. It seems to me that the header is, by definition, the first line of the file. Is an enhancement request reasonable, to redefine the meaning of "header" for a CSV file as it exists in the code?

@espogian
Author

@guyboertje when you specify that "the columns array value must be exactly the same as the parsed line as an array", do you mean that the column names specified in the Logstash configuration must be exactly the same as those in the first row of the file? In other words, is it not possible to change column names in this particular situation?

@MorrieAtElastic

My customer states that he "completely agrees with" my assessment:

"This logic is incorrect. All they need to do in this case is to skip the first line of the CSV file. This case is error prone and causing us spend a lot of time fixing something that not suppose to be fixed in the first place. I appreciate it if they can supply a simpler solution."

The definition of a header is the first line of the file. If this is problematic, then provide a parameter that allows customers to define how many lines their header spans and what offset, if any, the header occupies in the file. The current code's definition of a header is quite problematic and does not reflect the reality of headers in a CSV file. This is something that should be changed as soon as reasonably possible.

@guyboertje - if you'd prefer me to raise an enhancement request, please let me know in which repo to file it. If you prefer to file said enhancement request yourself, please do so at your earliest convenience. Either way, the current processing logic is definitely wrong and should be changed.

And thanks for moving on this so quickly.

@guyboertje

@espogian @MorrieAtElastic

How does the CSV filter know it is looking at the first line? The CSV filter is not exclusively associated with files.

  • We have multiple workers; there is no guarantee that, even if the CSV filter were stateful, the first line it sees would be the first line read from a file.
  • Filebeat does include the offset, so in theory we could assume that a line from position 0 is the first line.
  • The File input does not currently include the offset.
  • Some CSV files repeat the header.
  • Some sources of CSV lines are not files, e.g. Redis, S3, JMS, Kafka, RabbitMq etc.
  • If the CSV filter retains state, how should it respond when LS is restarted? It would need another setting to control "clean start", mirroring the inputs that have a similar setting.

With all the above considered, the best we could do was to allow the user to specify the line to be skipped.

In other words, is it not possible to change column names in this particular situation?

This is perhaps the biggest oversight in the current implementation. If any enhancement is needed, it is to add a new setting, skip_lines_file_path, where users can define a file of strings that are (full) lines to skip, or (TBD) a regex-like string that skips any line it matches.
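
In the meantime, a possible workaround under the current behavior (a sketch only; the renamed field name is illustrative) is to keep columns identical to the file's header so that skip_header matches, and then rename fields afterwards:

filter {
  csv {
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    skip_header => "true"
  }
  mutate {
    # illustrative: rename the header-derived column to the name you actually want
    rename => { "FLOW_FILE_ID" => "flow_file_id" }
  }
}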

@MorrieAtElastic

  • I think the regex idea has merit and I have asked my customer to review it.

  • I think we clearly need an improvement over the current system: every CSV file can be expected to have a header, and we need to be able to handle that header.

@guyboertje

@MorrieAtElastic

every CSV file can be expected to have a header

While I agree with that statement, without user-supplied hints the plugin really can't know for certain whether the line it is currently processing is a header line or not.

I like the regex idea. I'm on the fence about whether it should be a file (with periodic reloading) or an array. If a file then users can add to the file while LS is running etc.

An alternative is to add a conditional before the CSV filter...

  if [message] =~ "^(start-of-header1|start-of-header2)" {
    drop {}
  }
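
Applied to the sample data in this thread, that conditional might look like the following (the pattern is illustrative, keyed to the first column name of the header):

  if [message] =~ "^FLOW_FILE_ID," {
    drop {}
  }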

@marksparrish

I think the best way to handle "noise" in your file is to use the drop filter:

if [FIRST_COLUMN] in ["FIRST_COLUMN"] { drop { } }

This way you can add more "noise" patterns as needed.
You can also handle this in the output by writing the "noise" to a file for later inspection, to see whether it needs to be handled:
if [FIRST_COLUMN] in ["FIRST_COLUMN"] { #write to noise file } else { #write to es }
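
A fuller sketch of that output branching (the file path and elasticsearch settings are illustrative):

output {
  if [FIRST_COLUMN] in ["FIRST_COLUMN"] {
    # write the "noise" events to a file for later inspection
    file {
      path => "/tmp/csv-noise.log"
    }
  } else {
    # index everything else as usual
    elasticsearch {
      hosts => "localhost:9200"
      index => "example"
    }
  }
}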
Maybe the csv filter could provide a "noise" setting to do this automatically, but then you would lose the ability to branch the output.

@colinsurprenant
Contributor

The new csv codec should help with this; in particular, when paired with the file input, a separate codec instance is used per file, so different files can have different headers.
