Feature/2108 - csv parser #4439
plugins/parsers/csv/parser.go
Outdated
if p.Delimiter != "" {
	runeStr := []rune(p.Delimiter)
	if len(runeStr) > 1 {
		return nil, fmt.Errorf("delimiter must be a single character, got: %v", p.Delimiter)
Pedantic, but can you use the non-default verb to print (%s in this case)?
plugins/parsers/csv/parser.go
Outdated
}

for _, fieldName := range p.FieldColumns {
	if recordFields[fieldName] == "" {
Consider using value, ok := recordFields[fieldName], then check if !ok and return; line 115 becomes unnecessary.
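A minimal sketch of the comma-ok lookup suggested above. The helper name lookupField and the sample data are hypothetical and not taken from the PR; the point is only that an empty string is a legitimate CSV value, so presence should be checked separately from content.

```go
package main

import "fmt"

// lookupField returns the value for name and an error if the column is absent.
// Presence is reported via the comma-ok form instead of comparing against "",
// so a present-but-empty value is not mistaken for a missing column.
func lookupField(recordFields map[string]string, name string) (string, error) {
	value, ok := recordFields[name]
	if !ok {
		return "", fmt.Errorf("could not find field: %s", name)
	}
	return value, nil
}

func main() {
	// Hypothetical record: "note" is present but empty.
	recordFields := map[string]string{"host": "a", "note": ""}

	v, err := lookupField(recordFields, "note")
	fmt.Printf("%q %v\n", v, err)

	_, err = lookupField(recordFields, "missing")
	fmt.Println(err)
}
```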
plugins/parsers/csv/parser.go
Outdated
	return metrics, nil
}

//does not use any information in header and assumes DataColumns is set
If a comment is directly above an exported function, start it with // FunctionName ...
plugins/parsers/registry.go
Outdated
	nameColumn string,
	timestampColumn string,
	timestampFormat string,
	defaultTags map[string]string) (Parser, error) {
Returning an error isn't useful here. Since it's an unexported function not matching any interface, I'd remove it, or remove the function altogether and just instantiate a CSVParser on line 154.
plugins/parsers/csv/parser.go
Outdated
	"github.com/influxdata/telegraf/metric"
)

type CSVParser struct {
From https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/#parsing:
I think you need to allow comments, quote characters, column skipping, row skipping, header row count, and trimming.
It doesn't look like the quote character can be customized when using the Go csv parser, and we would want to stick to this implementation.
The other options sound great, but I don't think they are must have and we could add them later depending on available time.
I can definitely add features to allow comments and trimming. I know quote characters aren't supported by the Go csv parser. Regarding column skipping, there is already functionality for that by simply not adding the column name to either csv_tag_columns or csv_field_columns.
The header row count raises a few issues about how a header with more than one line would be interpreted, unless we decide to skip it entirely. It would most likely not mesh well with the function to extract column names from the header. We would have to decide how we want to configure that; I think we could probably pair it with the row skipping configuration.
Here is some clarification on how these options should work:
The csv_skip_rows option is an integer that controls the number of lines at the beginning of the file that should be skipped over. I would think you would want a bufio.Reader so you can call ReadLine() this many times before passing this reader into csv.Reader.
The csv_skip_columns option is an integer that controls the number of columns, from the left, to skip over.
Finally, csv_header_row_count would replace csv_header. It would be an integer that is the number of rows to treat as the column names; the values would be concatenated for each column. This is applied after csv_skip_rows. Here is an example:
	foo,bar
	1,2,3
This would produce the column names: ["foo1", "bar2", "3"]. Make sure to allow for lines of differing length.
plugins/parsers/csv/parser.go
Outdated
//does not use any information in header and assumes DataColumns is set
func (p *CSVParser) ParseLine(line string) (telegraf.Metric, error) {
	r := bytes.NewReader([]byte(line))
ParseLine does not do the same validation as Parse; I don't see delimiter set, for example.
Perhaps extract building the csv reader into a new function that both Parse and ParseLine use.
@maxunt Don't forget to rebase/merge so that the unrelated grok documentation is no longer present.
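One way the shared-reader suggestion might look, as a rough sketch. The parser type below is a hypothetical stand-in for CSVParser with a single option, and the method names are illustrative; the idea is only that both entry points obtain their csv.Reader from the same function, so validation cannot drift between them.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// parser is a hypothetical stand-in for CSVParser with only one option.
type parser struct {
	Delimiter string
}

// newReader applies the same validation and settings for every caller.
func (p *parser) newReader(r io.Reader) (*csv.Reader, error) {
	cr := csv.NewReader(r)
	cr.FieldsPerRecord = -1 // tolerate records of differing length
	if p.Delimiter != "" {
		runes := []rune(p.Delimiter)
		if len(runes) > 1 {
			return nil, fmt.Errorf("delimiter must be a single character, got: %s", p.Delimiter)
		}
		cr.Comma = runes[0]
	}
	return cr, nil
}

// parse reads every record in data.
func (p *parser) parse(data string) ([][]string, error) {
	cr, err := p.newReader(strings.NewReader(data))
	if err != nil {
		return nil, err
	}
	return cr.ReadAll()
}

// parseLine reads a single record from line.
func (p *parser) parseLine(line string) ([]string, error) {
	cr, err := p.newReader(strings.NewReader(line))
	if err != nil {
		return nil, err
	}
	return cr.Read()
}

func main() {
	p := &parser{Delimiter: ";"}
	rows, err := p.parse("a;b\nc;d\n")
	fmt.Println(rows, err)
	rec, err := p.parseLine("e;f")
	fmt.Println(rec, err)
}
```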
docs/DATA_FORMATS_INPUT.md
Outdated
## By default, this is the name of the plugin
## the `name_override` config overrides this
# csv_name_column = ""
Call this csv_measurement_column
docs/DATA_FORMATS_INPUT.md
Outdated
## Columns listed here will be added as fields
## the field type is infered from the value of the field
csv_field_columns = []
I think we should add all non-tag columns as fields. If someone wants to skip a field they can use fieldpass/fielddrop
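A rough sketch of what "all non-tag columns become fields" could mean in code. The function splitRecord and its parameters are hypothetical, and type conversion of field values is left out to keep the example short.

```go
package main

import "fmt"

// splitRecord sorts a record's values into tags and fields. Columns named in
// tagColumns become tags; every other named column implicitly becomes a field.
func splitRecord(columnNames, record, tagColumns []string) (map[string]string, map[string]interface{}) {
	isTag := make(map[string]bool, len(tagColumns))
	for _, name := range tagColumns {
		isTag[name] = true
	}

	tags := map[string]string{}
	fields := map[string]interface{}{}
	for i, name := range columnNames {
		if i >= len(record) {
			break // short record: remaining columns have no value
		}
		if isTag[name] {
			tags[name] = record[i]
		} else {
			fields[name] = record[i]
		}
	}
	return tags, fields
}

func main() {
	columns := []string{"host", "cpu", "mem"}
	record := []string{"a", "0.5", "1024"}
	tags, fields := splitRecord(columns, record, []string{"host"})
	fmt.Println(tags, fields)
}
```

With this shape there is no separate list of field columns to configure; unwanted fields would be filtered afterwards with fieldpass/fielddrop.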
docs/DATA_FORMATS_INPUT.md
Outdated
## as there are columns of data
## If `csv_header` is set to false, this config must be used
csv_data_columns = []
Call this csv_column_names
internal/config/config.go
Outdated
	val, _ := strconv.ParseBool(str.Value)
	c.CSVTrimSpace = val
} else {
	//for config with quotes
No need to have these else clauses; if it's not a bool then it should be an error. This is actually a bug throughout this function: when the type is wrong for the field name, it looks like we currently delete the field, when we should return an error and refuse to start Telegraf.
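The strict behaviour asked for here could look something like the sketch below. It assumes the config value has already been decoded into an interface{}; Telegraf's real config code inspects TOML AST nodes instead, so the function boolOption is purely illustrative.

```go
package main

import "fmt"

// boolOption returns value as a bool, or an error naming the option when the
// decoded value has any other type (for example a quoted string "true").
func boolOption(name string, value interface{}) (bool, error) {
	b, ok := value.(bool)
	if !ok {
		return false, fmt.Errorf("config option %s: expected bool, got %T", name, value)
	}
	return b, nil
}

func main() {
	v, err := boolOption("csv_trim_space", true)
	fmt.Println(v, err)

	// A quoted value is a string, not a bool, so startup should fail.
	_, err = boolOption("csv_trim_space", "true")
	fmt.Println(err)
}
```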
plugins/parsers/csv/parser_test.go
Outdated
require.NoError(t, err2)

//deep equal fields
require.True(t, reflect.DeepEqual(goodMetric.Fields(), returnedMetric.Fields()))
require.Equal(t, goodMetric.Fields(), returnedMetric.Fields())
Call these expected/actual, or want/got.
plugins/parsers/csv/parser_test.go
Outdated
metrics, err := p.Parse([]byte(testCSV))
require.NoError(t, err)
require.Equal(t, true, reflect.DeepEqual(expectedFields, metrics[0].Fields()))
require.Equal(t, expectedFields, metrics[0].Fields())
Check other tests and make sure you are using this everywhere.
plugins/parsers/csv/parser_test.go
Outdated
metrics, err := p.Parse([]byte(testCSV))
for k := range metrics[0].Fields() {
	log.Printf("want: %v, %T", expectedFields[k], expectedFields[k])
	log.Printf("got: %v, %T", metrics[0].Fields()[k], metrics[0].Fields()[k])
Boo! no logging in tests
plugins/parsers/csv/parser.go
Outdated
}

func (p *CSVParser) parseRecord(record []string) (telegraf.Metric, error) {
	recordFields := make(map[string]string)
I think you won't need this intermediate map if you make fields implicit.
plugins/parsers/csv/parser.go
Outdated
// attempt type conversions
if iValue, err := strconv.Atoi(value); err == nil {
	fields[fieldName] = iValue
} else if fValue, err := strconv.ParseFloat(value, 64); err == nil {
This will require all floats to have a decimal part to avoid type mismatch errors. @goller Is this going to work for us in the future?
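To illustrate the concern, here is a small sketch that mirrors the conversion order in the snippet (integer first, then float). The helper name inferType is hypothetical, and the string fallback is an assumption about what happens when neither conversion succeeds. A column holding 3 in one row and 3.5 in the next would yield an int and then a float64 for the same field.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType tries an integer conversion first, then a float conversion, and
// otherwise keeps the value as a string (the fallback is an assumption).
func inferType(value string) interface{} {
	if i, err := strconv.Atoi(value); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(value, 64); err == nil {
		return f
	}
	return value
}

func main() {
	for _, v := range []string{"3", "3.0", "3.5", "abc"} {
		// Prints int, float64, float64, string respectively.
		fmt.Printf("%q -> %T\n", v, inferType(v))
	}
}
```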
docs/DATA_FORMATS_INPUT.md
Outdated
@@ -2,6 +2,17 @@

Telegraf is able to parse the following input data formats into metrics:

<<<<<<< HEAD
This file needs to be fixed up due to merge issues. I just merged your updates to the JSON parser, so you may need to update again too.
I'm not sure I correctly followed your new format for the INPUT_DATA_FORMATS file when I resolved the merge conflict; could you take a look at that? I think the csv entry is missing the proper link to its section.
52810e9 to c058db6
plugins/parsers/csv/parser.go
Outdated
//concatenate header names
for i := range header {
	name := header[i]
	name = strings.Trim(name, " ")
I think we may not want to trim the strings.
csvReader := csv.NewReader(r)
// ensures that the reader reads records of different lengths without an error
csvReader.FieldsPerRecord = -1
if p.Delimiter != "" {
Let's compute the effective delimiter and comment values as part of a New() function; then when Parse is called we can create the reader without needing to redo this work.
plugins/parsers/csv/parser.go
Outdated
	}
} else if p.HeaderRowCount == 0 && len(p.ColumnNames) == 0 {
	// if there is no header and no DataColumns, that's an error
	return nil, fmt.Errorf("there must be a header if `csv_data_columns` is not specified")
This test can also go in the New function.
// if there is nothing in DataColumns, ParseLine will fail
if len(p.ColumnNames) == 0 {
	return nil, fmt.Errorf("[parsers.csv] data columns must be specified")
Can go in New function
plugins/parsers/csv/parser.go
Outdated
for i, fieldName := range p.ColumnNames {
	if i < len(record) {
		value := record[i]
		value = strings.Trim(value, " ")
Not sure if we want to trim
plugins/parsers/csv/parser.go
Outdated
// will default to plugin name
measurementName := p.MetricName
if recordFields[p.MeasurementColumn] != nil {
	measurementName = recordFields[p.MeasurementColumn].(string)
This could panic if the column is not a string; perhaps we should pull the value from record?
plugins/parsers/csv/parser.go
Outdated
if recordFields[tagName] == nil {
	return nil, fmt.Errorf("could not find field: %v", tagName)
}
tags[tagName] = recordFields[tagName].(string)
Will panic if not a string.
plugins/parsers/csv/parser.go
Outdated
if recordFields[p.TimestampColumn] == nil {
	return nil, fmt.Errorf("timestamp column: %v could not be found", p.TimestampColumn)
}
tStr := recordFields[p.TimestampColumn].(string)
Could panic, maybe use record with the two-return-value form:
	timeColumn := record[p.TimestampColumn]
	if timeColumn != nil {
		if timeString, ok := timeColumn.(string); ok {
			// use timeString
		}
	}
Might be easier to deal with errors if you put this into a function.
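Pulling the check into a function, as suggested, might look like this sketch. The helper stringColumn is hypothetical and assumes the record is held as a map[string]interface{}, as in the snippet under review; it turns both failure modes into errors rather than panics.

```go
package main

import "fmt"

// stringColumn returns the named column as a string, or an error if the
// column is missing or holds a non-string value. The two-value type
// assertion never panics, unlike the single-value form v.(string).
func stringColumn(recordFields map[string]interface{}, name string) (string, error) {
	v, ok := recordFields[name]
	if !ok || v == nil {
		return "", fmt.Errorf("column %s could not be found", name)
	}
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("column %s is %T, expected string", name, v)
	}
	return s, nil
}

func main() {
	// Hypothetical record where "count" was already converted to an int.
	recordFields := map[string]interface{}{"time": "2018-07-01", "count": 3}

	fmt.Println(stringColumn(recordFields, "time"))
	fmt.Println(stringColumn(recordFields, "count"))
	fmt.Println(stringColumn(recordFields, "missing"))
}
```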
plugins/parsers/csv/parser.go
Outdated
	return nil, fmt.Errorf("timestamp column: %v could not be found", p.TimestampColumn)
}
tStr := recordFields[p.TimestampColumn].(string)
if p.TimestampFormat == "" {
You can verify this in the New function, and then treat this as a struct invariant.
closes: #2108
Required for all PRs: