skip_header => "true" not working #68
I have a customer seeing this problem in 6.2.4; I have subsequently reproduced it in both 6.2.4 and 6.4.0.

Test configuration file:
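The configuration block itself did not survive in this copy of the issue; a minimal sketch that matches the report, assuming a file input reading example.csv and hypothetical column names, would be:

```
input {
  file {
    path => "/path/to/example.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["first_name", "last_name", "age"]
    skip_header => "true"
  }
}

output {
  stdout { codec => rubydebug }
}
```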
Test Data - example.csv
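A hypothetical two-row sample in the same shape (the actual data was not preserved here):

```
first_name,last_name,age
John,Doe,42
```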
The skip_header code is:
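The snippet was not preserved here; paraphrasing the check from the maintainer's explanation below, the logic is roughly equivalent to this Ruby sketch (not the verbatim plugin source):

```ruby
require "csv"

# Sketch: a line is treated as a skippable header only when the parsed
# values are exactly equal, element for element, to the configured columns.
values = CSV.parse_line(source, :col_sep => @separator, :quote_char => @quote_char)
event.cancel if @skip_header && @columns == values
```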
The above means that the columns array value must be exactly the same as the parsed line, as an array. Please verify that there are no extra spaces or similar in the sample data header line. If there were spaces lurking in the data, then you should add those spaces to the columns setting as well.
This test config is successful:
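A sketch of what such a working config looks like, with hypothetical column names: the columns values are copied character for character from the first line of the file, so the comparison described above succeeds and the header row is cancelled.

```
filter {
  csv {
    columns => ["first_name", "last_name", "age"]
    skip_header => "true"
  }
}
```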
Results:
I was able to identify a leading space in the header line based on the debug output. Check the leading space in the "message" field of the debug output. When I eliminated that leading space, the skip_header function did work.

My problem with this is that it is a rather weak definition of a header. It seems to me that the header is, by definition, the first line of the file. Is an enhancement request reasonable, to redefine the meaning of "header" for a CSV file as it exists in the code?
@guyboertje when you specify that "the columns array value must be exactly the same as the parsed line as an array", do you mean that the column names specified in the Logstash configuration should be exactly the same as those in the first row of the file? In other words, is it not possible to change column names in this particular situation?
My customer states that he "completely agrees with" my assessment: "This logic is incorrect. All they need to do in this case is to skip the first line of the CSV file. This case is error prone and is causing us to spend a lot of time fixing something that was not supposed to need fixing in the first place. I would appreciate it if they could supply a simpler solution."

The definition of a header is the first line of the file. If this is problematic, then provide a parameter which allows customers to define how many lines their headers are and what offset, if any, the header occupies in the file. But the current code's definition of a header is quite problematic and does not reflect the reality of headers in a CSV file. This is something that should be changed as soon as reasonably possible.

@guyboertje, if you'd prefer me to raise an enhancement request, please let me know in which repo to file it. If you would prefer to file said enhancement request yourself, please do so at your earliest convenience. But the current processing logic is definitely wrong and should be changed. And thanks for moving on this so quickly.
How does the CSV filter know it is looking at the first line? The CSV filter is not exclusively associated with files.
With all the above considered, the best we could do was to allow the user to specify the line to be skipped.
This is perhaps the biggest oversight in the current implementation. If any enhancement is needed, it is to add a new setting.
While I agree with that statement, without user-supplied hints the plugin really can't know for certain whether the line it is currently processing is a header line or not. I like the regex idea. I'm on the fence about whether it should be a file (with periodic reloading) or an array; if a file, then users can add to the file while Logstash is running, and so on. An alternative is to add a conditional before the CSV filter, as sketched below...
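For example, a hypothetical guard like the following drops any line whose message matches a header pattern before the csv filter ever sees it (the pattern and column names are assumptions, not from the thread):

```
filter {
  # hypothetical header pattern; the \s* also tolerates stray leading spaces
  if [message] =~ /^\s*first_name,last_name,age\s*$/ {
    drop { }
  } else {
    csv {
      columns => ["first_name", "last_name", "age"]
    }
  }
}
```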
I think the best way to handle "noise" in your file is to use the drop filter.
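For instance, a sketch that drops blank lines and comment-style noise (the pattern here is hypothetical):

```
filter {
  if [message] =~ /^\s*(#|$)/ {
    drop { }
  }
}
```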
The new csv codec should help with this; in particular, when paired with the file input, a separate codec instance is used per file, so different files can have different headers.
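A sketch of that pairing, assuming the codec's autodetect_column_names option, in which the first line seen in each file is consumed as that file's column names:

```
input {
  file {
    path => "/path/to/*.csv"
    codec => csv {
      autodetect_column_names => true
    }
  }
}
```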
Hi, this issue apparently always shows up when using a Logstash CSV filter configuration like the one sketched below. Even when using the skip_header => "true" option, Logstash is indexing the very first row of the CSV. Reproduced with Logstash 6.3.0.
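A minimal sketch of such a filter block, with hypothetical column names (the reporter's exact configuration was not quoted):

```
filter {
  csv {
    separator => ","
    columns => ["col1", "col2", "col3"]
    skip_header => "true"
  }
}
```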