-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce option to auto-convert unquoted field values to float or integer #54
Labels
Comments
Some notes on my suggestion:
|
A simple example, with a header followed by two data rows:
A variation, with quotes in the header row only:
|
Should be easy to leverage the converters feature in the csv library. Changing line: values = CSV.parse_line(source, :col_sep => @separator, :quote_char => @quote_char) to: values = CSV.parse_line(source, :converters => :all, :col_sep => @separator, :quote_char => @quote_char) Makes processing go from:
to:
If there's any takers, it's just a matter of exposing a new setting to auto detect and set this parameter in the |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Background to this suggestion
I help to develop a product that forwards logs to Logstash over TCP as either CSV or JSON Lines.
The product can forward hundreds of types of logs. The original logs are in proprietary binary formats. The product parses the original logs and converts them before forwarding.
With JSON Lines, a single Logstash config works for all of these log types.
However, with CSV, the Logstash configs are not only specific to each log type, but are also specific to the particular set of fields (columns) being forwarded from that log type.
With CSV, each Logstash config contains:
columns
option that identifies the specific set of columns that the config is intended forconvert
option that converts numeric fields to eitherinteger
orfloat
Without that
convert
option, Logstash forwards all CSV field values as strings.For my personal use cases, I'm reasonably happy forwarding JSON Lines, so, for me, using CSV is more-or-less unnecessary. But when I think about users who—for whatever reason—want to use CSV, I look at those
columns
andconvert
options in the CSV flavor of the configs, and I wonder: can we avoid those?I understand that I can avoid
columns
by usingautodetect_column_names
, but what aboutconvert
?Note:
My suggestion
Introduce a new option to the CSV filter that enables it to determine data types from the first row of input CSV data:
I understand that this introduces specific expectations about the formatting of the CSV input data. I'm comfortable with that because I have the luxury of being able to control the format of that data. I understand that this suggestion won't apply to all CSV-format data.
A possible variation, instead of using field values to distinguish string vs numeric: use column names. That is, if a column name (in the header row) is enclosed in quotes, then it's a string field. Otherwise, it's numeric; and float vs integer can be determined by the presence or absence of a decimal point in the first data row (after the header). This would mean that only the header would need to involve quotes; I realize that the first option I described explicitly mentions using the first row of input data, so, strictly speaking, subsequent rows would not need such highlighting, but I'm slightly more "squeamish" about applying such formatting to a single data row than I am about applying that formatting to the header row.
(The product already—deliberately, by design—outputs all values of float fields with a decimal point; if the value of a float field happens to be an integer, then that value is output with a trailing decimal point followed by a single zero. I get that JSON—what Logstash forwards to Elasticsearch—distinguishes between strings and numbers, but not between integers and floating-point numbers.)
Thoughts? (Here's one: forget it, CSV was never supposed to specify data types; don't try to make it.)
The text was updated successfully, but these errors were encountered: