Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CSV encoder configurable #17261

Closed
scMarkus opened this issue May 1, 2023 · 3 comments · Fixed by #18320
Closed

Make CSV encoder configurable #17261

scMarkus opened this issue May 1, 2023 · 3 comments · Fixed by #18320
Labels
domain: codecs Anything related to Vector's codecs (encoding/decoding) type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@scMarkus
Copy link
Contributor

scMarkus commented May 1, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Currently the CSV encoder used in multiple places like the S3 sink is using a standard configuration in terms of column separator (,) and quotation escaping ("). This conforms to rfc 4180. Anyway we are currently using a setup where we are using a tab delimited (TSV: \t) variant of it. Furthermore the respective data is read via Apache Spark which uses \ as quotation escape character as its default. We would like to transparently move the logging pipeline towards vector without disrupting the downstream processes which would make said configuration options mandatory to us when writing the data.

Attempted Solutions

As far as I can see there is no documented configuration option to change the described behavior. Looking into the vector source it seams to me as the relevant part is:

let mut wtr = csv::Writer::from_writer(buffer.writer());

This is using from_writer of the respective CSV library which sets default values. Alternatively a WriterBuilder of the same library could be used which has option to set a delimiter and quotation escape char as well as some other configs.

Proposal

What indeed does exists in the documentation already is the required encoding.csv.fields option. In my mind it would be a nice fit to extend encoding.csv by encoding.csv.delimiter and encoding.csv.quotation_escape and use those to configure the csv::WriterBuilder.

References

feat(codec): Add csv encoding for sinks #16603

Version

vector 0.29.0 (x86_64-unknown-linux-gnu 33b3868 2023-04-11 22:19:54.512366263)

@scMarkus scMarkus added the type: feature A value-adding code addition that introduce new functionality. label May 1, 2023
@spencergilbert spencergilbert added type: enhancement A value-adding code change that enhances its existing functionality. domain: codecs Anything related to Vector's codecs (encoding/decoding) and removed type: feature A value-adding code addition that introduce new functionality. labels May 1, 2023
@scMarkus
Copy link
Contributor Author

Would the proposed approach be reasonable? I may find some time trying to implement this. If so would one be so kind and hint me at how to implement new config options?

@spencergilbert
Copy link
Contributor

I'll see if I can get @vectordotdev/ux-team to respond by the end of the week - we have started adding some options on codecs like decoding.<codec>.<option> (ref).

So I could see the encoding.csv.<options> fitting in with that pattern.

@jszwedko
Copy link
Member

Agreed, the options can be nested as suggested like:

encoding.csv.delimiter = ...
encoding.csv.quotes = ...

We'd be happy to see a PR for this if you feel motivated @scMarkus. The options would be added here:

pub struct CsvSerializerOptions {
/// Configures the fields that will be encoded, as well as the order in which they
/// appear in the output.
///
/// If a field is not present in the event, the output will be an empty string.
///
/// Values of type `Array`, `Object`, and `Regex` are not supported and the
/// output will be an empty string.
pub fields: Vec<ConfigTargetPath>,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: codecs Anything related to Vector's codecs (encoding/decoding) type: enhancement A value-adding code change that enhances its existing functionality.
Projects
None yet
3 participants