
Private fields used for processing that are not encoded #1448

Closed
pheanex opened this issue Dec 27, 2019 · 11 comments · Fixed by #1915
Labels: domain: data model, domain: logs, needs: requirements, type: enhancement

Comments

pheanex commented Dec 27, 2019

Problem:

  • Currently it does not seem possible to configure Vector to use event fields *only* for templating sink configuration parameters (like key_prefix in the aws_s3 sink) without also forwarding those fields through the sink.

Ideas:

  • Add a sink-configuration option to blacklist a list of fields (sketched below)
  • Introduce a meta-field concept to distinguish fields to be shipped from fields used only for routing/configuration
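
A minimal sketch of the first idea, using a purely hypothetical blacklist option on the sink:

[sinks.my_s3]
  type = "aws_s3"
  inputs = ["my_transform"]
  key_prefix = "{{ environment }}/"
  # hypothetical option: drop these fields before encoding/uploading
  blacklist = ["environment"]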
@binarylogic (Contributor)

Ah, that makes sense, thanks for letting us know. We have discussed exactly this feature in the past. To help us understand its full scope, would you mind sharing your exact example? Specifically, how the field is initially inserted into the event and how you plan to use it in the aws_s3 sink. This will help us determine the ergonomics of private/public fields.

@binarylogic added the event type: log and type: enhancement labels Dec 27, 2019
@binarylogic added this to the Improve data processing milestone Dec 27, 2019
@binarylogic (Contributor)

@a-rodin I'm adding this to the "Improve data processing" milestone since you're re-approaching the data model. I think it would be worthwhile to consider private fields as you make those changes.

@binarylogic changed the title Feature: Add option to filter fields in sink aws_s3 to Private fields used for processing that are not encoded Dec 27, 2019
@binarylogic assigned ghost Dec 27, 2019
@binarylogic added the needs: requirements label Dec 27, 2019
@binarylogic (Contributor)

Noting that we need a spec addressing the following:

  • How can users explicitly mark field(s) as private/public? E.g., they ingest a field from an upstream source and want to reactively mark it as private.
  • How are name conflicts handled between public and private fields? If conflicts are allowed, how will users specify public/private via our interpolation syntax ({{field_name}})? See the sketch below.
  • How does this work for other transforms like lua, add_fields, remove_fields, etc.?
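
To illustrate the second point: if a public and a private field could both be named host, the bare interpolation syntax would be ambiguous (all syntax below is hypothetical):

key_prefix = "{{ host }}"          # public host, private host, or both?
key_prefix = "{{ _private.host }}" # one possible explicit form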


ghost commented Dec 27, 2019

It is not a spec yet, but one thought is to approach this problem from a slightly different angle: if we implement proper nesting of fields, it would be possible to move all public fields into a dedicated subfield with a configurable name.

For example, the stdin source could create events looking like this:

{
    "data": {
        "message": "message",
        "timestamp": "2019-12-28T12:34:56.789Z",
        "host": "localhost"
    }
}

Here the data subfield would be the default target field name, but it could be changed as part of the sink configuration.

Then making fields public/private could be as easy as renaming them (#377):

[transforms.hide_host]
  type = "rename_fields"
  [transforms.hide_host.fields]
    host = "data.host"

All sinks would take the name of the field to send, which by default would again be data.
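
For example, a console sink might expose a hypothetical option naming the subfield to encode:

[sinks.out]
  type = "console"
  inputs = ["hide_host"]
  field = "data" # hypothetical option: only this subtree gets encoded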

Benefits of this approach:

  • whether a field is public or private can be derived from its fully qualified name;
  • no name conflicts would be possible;
  • transforms would simply use data.* for public fields and other names for private fields.

@binarylogic (Contributor)

That is an interesting approach, although I think the behavior should be inverted: public by default. I believe that to be the path of least surprise. We could always provide a global, or source-level, option to change this default.

To demonstrate with your example, the stdin source would produce events like this by default:

{
  "message": "message",
  "timestamp": "2019-12-28T12:34:56.789Z",
  "host": "localhost"
}

A special @private (or _private) namespace could then be used for private fields:

[transforms.hide_host]
  type = "rename_fields"
  [transforms.hide_host.fields]
    host = "_private.host"

Alternatively, the @ character is an interesting flag to mark a field as private.
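
That could look like this (again, hypothetical syntax):

[transforms.hide_host]
  type = "rename_fields"
  [transforms.hide_host.fields]
    host = "@host" # the @ prefix would mark the field as private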


ghost commented Dec 30, 2019

I believe that to be the path of least surprise.

I agree that at least for existing users it would be much less surprising.

I'm somewhat concerned about possible collisions, since users might already have fields whose names start with @private., _private., or @, but that could probably be solved by making the prefix configurable.
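
For example, a hypothetical global option could let users pick any prefix that does not collide with their data:

[log_schema]
  private_prefix = "@vector_private." # hypothetical option; any unused prefix would do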

pheanex (Author) commented Jan 2, 2020

@binarylogic we are trying to reduce S3 volume by using (mostly static) fields from parsed JSON to template the S3 key_prefix rather than including this data in every log line.

[sources.files]
  type = "file"
  include = ["/logs/*.log"]
[transforms.json_parser]
  type = "json_parser"
  inputs = ["files"]
  field = "log" # contains {"field1": 1, "field2": 2, "field3": 3, "field4": 4}
[sinks.AWSS3]
  type = "aws_s3"
  inputs = ["json_parser"]
  bucket = "<bucket>"
  region = "<region"
  encoding = "ndjson"
  key_prefix = "{{ field1 }}/{{ field2 }}/{{ field3 }}/"
  compression = "gzip"
  # We use fields1-3 for prefixing and upload only field4:
  # blacklist = ["field1", "field2", "field3"]

@lukesteensen (Member)

To avoid repeating the explicit/implicit fields split that we only recently ripped out, I'd suggest that we consider handling this at the encoder level. That seems to be where the real payoff is and it lets us avoid mucking around too much in the data model.

For example, we could expand the encoding = "json" type of config options into something that optionally takes a list of fields to include or exclude. With properly nested fields, this allows users to implement a very simple data/metadata split and avoids the problem of collisions.
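
Roughly, something like this (option names are only a sketch):

[sinks.my_sink]
  type = "aws_s3"

  [sinks.my_sink.encoding]
    format = "json"
    exclude = ["field1", "field2"] # or an `include` list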

@anton-ryzhov (Contributor)

Totally agree with @lukesteensen: adjusting the encoder with an include/exclude list of properties would solve this issue perfectly, while introducing a concept of private/metadata properties would overcomplicate processing.

Another thought: "metadata" for one sink could be ordinary data for another.
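
To illustrate with hypothetical per-sink include/exclude options, the same field could be excluded by one sink and shipped by another:

[sinks.s3]
  type = "aws_s3"
  inputs = ["json_parser"]
  encoding.exclude = ["request_id"] # metadata for this sink

[sinks.debug]
  type = "console"
  inputs = ["json_parser"]
  encoding.format = "json" # request_id ships as ordinary data here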

binarylogic (Contributor) commented Jan 29, 2020

I wanted to propose a few configuration examples and agree on them before we begin work:

1. Backward compatibility

If possible, I'd like to preserve (but deprecate) the current syntax, since this is a popular option:

[sinks.my_sink]
  type = "clickhouse"
  encoding = "json"

2. New syntax

[sinks.my_sink]
  type = "clickhouse"
  encoding.format = "json"
  encoding.only_fields = ["timestamp", "message"]
  encoding.except_fields = ["_meta"]

only_fields and except_fields are mutually exclusive; if both are provided, only_fields should take priority. This, of course, can be represented as a table as well:

[sinks.my_sink]
  type = "clickhouse"

  [sinks.my_sink.encoding]
    format = "json"
    only_fields = ["timestamp", "message"]
    except_fields = ["_meta"]

3. Embedding the coercer transform?

This feels like overstepping a bit, but it might be simple if we can reuse the coercer code. It's also a progressively stricter option.

[sinks.my_sink]
  type = "clickhouse"

  [sinks.my_sink.encoding]
    format = "json"

  [sinks.my_sink.types]
    message = "string"
    timestamp = "timestamp"
    my_custom_field = "int"

4. Global defaults

This is closely related to #1446, but it would be nice to have global defaults as well:

[log_schema]
except_fields = ["_*"] # glob support; perhaps this should be regex instead? I'm curious about the performance.

@Hoverbear (Contributor)

Let's support option 2 first!

@binarylogic added the domain: logs and domain: data model labels and removed the event type: log label Aug 6, 2020