Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(codecs): Add syslog encoder #21307

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

polarathene
Copy link

@polarathene polarathene commented Sep 18, 2024

Replaces: #17668

  • Each commit has a scoped change. I built off the from the great work @syedriko originally put together and refactored that heavily, applying the review feedback I had provided there among other fixes.
  • Each commit provides additional context for the changes. Thus if something is unclear in the review diff, using git blame on my branch (Git web UI works fine for that) may provide the extra context needed 👍

NOTE:

  • This PR has been opened at the request of others for providing feedback on the general direction and approach.
  • The branch does not yet reflect the current progress. I had been working on a refactor of some changes earlier this year that ran into a blocker to completion for committing. I intended to resolve that first, but have not yet been able to afford the time.
  • As I have not been actively working on this feature for a while, I may not be able to respond to some feedback as well as I would like to at this stage.

I'll try to allocate time within October to get those changes applied to this branch, along with a proper PR description at that point.

The conflicts were already resolved locally IIRC, they'll be taken care of with the future update.

syedriko and others added 16 commits March 15, 2024 17:23
Original commit from syedriko
This is only a temporary change to make the diffs for future commits easier to follow.
- Introduce a `Pri` struct with fields for severity and facility as enum values.
  - `Pri` uses `strum` crate to parse string values into their appropriate enum variant.
  - Handles the responsibility of encoding the two enum values ordinal values into the `PRIVAL` value for the encoder.
- As `Facility` and `Severity` enums better represent their ordinal mapping directly
  - The `Fixed` + `Field` subtyping with custom deserializer isn't necessary. Parsing a string that represents the enum by name or its ordinal representation is much simpler.
  - Likewise this removes the need for the get methods as the enum can provide both the `String` or `u8` representation as needed.
`SyslogSerializer::encode()` has been simplified.
- Only  matching `Event::Log` is relevant, an `if let` bind instead of `match` helps remove a redundant level of nesting.
- This method only focuses on boilerplate now, delegating the rest to `ConfigDecanter` (_adapt `LogEvent` + encoder config_) and `SyslogMessage` (_encode into syslog message string_).
- This removes some complexity during actual encoding logic, which should only be concerned about directly encoding from one representation to another, not complimentary features related to Vector config or it's type system.

The new `ConfigDecanter` is where many of the original helper methods that were used by `SyslogSerializer::encode()` now reside. This change better communicates the scope of their usage.
- Any interaction with `LogEvent` is now contained within the methods of this new struct. Likewise for the consumption of the encoder configuration (instead of queries to config throughout encoding).
- The `decant_config()` method better illustrates an overview of the data we're encoding and where that's being sourced from via the new `SyslogMessage` struct, which splits off the actual encoding responsibility (see next commit).
`SyslogSerializerConfig` has been simplified.
- Facility / Severity deserializer methods aren't needed, as per their prior refactor with `strum`.
- The `app_name` default is set via `decant_config()` when not configured explicitly.
- The other two fields calling a `default_nil_value()` method instead use an option value which encodes `None` into the expected `-` value.
- Everything else does not need a serde attribute to apply a default, the `Default` trait on the struct is sufficient.
- `trim_prefix` was removed as it didn't seem relevant. `tag` was also removed as it's represented by several subfields in RFC 5424 which RFC 3164 can also use.

`SyslogMessage::encode()` refactors the original PR encoding logic:
- Syslog Header fields focused, the PRI and final message value have already been prepared prior. They are only referenced at the end of `encode()` to combine into the final string output.
- While less efficient than `push_str()`, each match variant has a clear structure returned via the array `join(" ")` which minimizes the noise of `SP` from the original PR. Value preparation prior to this is clear and better documented.
- `Tag` is a child struct to keep the main logic easy to grok. `StructuredData` is a similar case.
No changes beyond relocating the code into a single file.
- Drop notes referring to original PR differences + StructuredData adaption references. None of it should be relevant going forward.
- Revise some other notes.
- Drop `add_log_source` method (introduced from the original PR author) in favor of using `StructuredData` support instead.
This should be simple and lightweight enough to justify for the DRY benefit?

This way the method doesn't need to be duplicated redundantly. That was required because there is no trait for `FromRepr` provided via `strum`. That would require a similar amount of lines for the small duplication here.

The `akin` macro duplicates the `impl` block for each value in the `&enums` array.
- `ConfigDecanter::get_message()` replaces the fallback method in favor of `to_string_lossy()` (a dedicated equivalent for converting `Value` type to a String type (_technically it is a CoW str, hence the follow-up with `to_string()`_)).
  - This also encodes the value better, especially for the default `log_namespace: false` as the message value (when `String`) is not quote wrapped, which matches the behaviour of the `text` encoder output.
  - Additionally uses the `LogEvent` method `get_message()` directly from `lib/vector-core/src/event
/log_event.rs`. This can better retrieve the log message regardless of the `log_namespace` setting.
- Encoding of RFC 5424 fields has changed to inline the `version` constant directly, instead of via a redundant variable. If there's ever multiple versions that need to be supported, it could be addressed then.
- The RFC 5424 timestamp has a max precision of microseconds, thus this should be rounded and `AutoSi` can be used (_or `Micros` if it should have fixed padding instead of truncating trailing `000`_).
- The original PR author appears to have relied on a hard-coded timestamp key here.
- `DateTime<Local>` would render the timestamp field with the local timezone offset, but other than that `DateTime<Utc>` would seem more consistent with usage in Vector, especially since any original TZ context is lost by this point?
- Notes adjusted accordingly, with added TODO query for each encoding mode to potentially support configurable timezone.
- Move encoder config settings under a single `syslog` config field. This better mirrors configuration options for existing encoders like Avro and CSV.
- `ConfigDecanter::value_by_key()` appears to accomplish roughly the same as the existing helper method `to_string_lossy()`. Prefer that instead. This also makes the `StructuredData` helper `value_to_string()` redundant too at a glance?
- Added some reference for the priority value `PRIVAL`.
- `Pri::from_str_variants()` uses the existing defaults for fallback, communicate that more clearly. Contextual note is no longer useful, removed.
To better communicate the allowed values, these two config fields can change from the `String` type to their appropriate enum type.
- This relies on serde to deserialize the config value to the enum which adds a bit more noise to grok.
- It does make `Pri::from_str_variants()` redundant, while the `into_variant()` methods are refactored to `deserialize()` with a proper error message emitted to match the what serde would normally emit for failed enum variant deserialization.
- A drawback of this change is that these two config fields lost the ability to reference a different value path in the `LogEvent`. That'll be addressed in a future commit.
In a YAML config a string can optionally be wrapped with quotes, while a number that isn't quote wrapped will be treated as a number type.

The current support was only for string numbers, this change now supports flexibility for config using ordinal values in YAML regardless of quote usage.

The previous `Self::into_variant(&s)` logic could have been used instead of bringing in `serde-aux`, but the external helper attribute approach seems easier to grok/follow as the intermediary container still seems required for a terse implementation.

The match statement uses a reference (_which requires a deref for `from_repr`_) to appease the borrow checker for the later borrow needed by `value` in the error message.
This seems redundant given the context? Mostly adds unnecessary noise.

Could probably `impl Configurable` or similar to try workaround the requirement. The metadata description could generate the variant list similar to how it's been handled for error message handling?
Not sure if this is worthwhile, but it adopts error message convention elsewhere I've seen by managing them via Snafu.
@bits-bot
Copy link

bits-bot commented Sep 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added domain: sinks Anything related to the Vector's sinks domain: codecs Anything related to Vector's codecs (encoding/decoding) labels Sep 18, 2024
@gaby
Copy link

gaby commented Nov 7, 2024

@polarathene Any updates on this ?

@polarathene
Copy link
Author

Any updates on this ?

No sorry. Several unexpected events happened in a short time period, I was unsuccessful at avoiding burnout which I'm still in the process of exiting from.

Context

FWIW, I over committed myself in 2023, while facing several setbacks and issues in 2024. This isn't the only work where I'm pinged for updates to move things forward that have come to a halt.

I do understand the frustration of wanting to have this become actively developed again and merged, but without any assistance it's a task I struggle to allocate time for. More on that below.


PR was opened due to demand

When I raised a concern about the lack of feedback received thus far, it was insisted that I open a PR, even if it's not in the state desired for handing to reviewers.

#17668 (comment)

Efforts are appreciated but AFAIK it is not equivalent to a PR; I'm pretty certain you are only able to comment and provide input with a pull request.

#17668 (comment)

in order for reviewers to leave proper feedback for a branch it needs to have an issue or PR ticket. This would allow reviewers to make comparison between branches and might proceed with the better one.

However opening the PR early has made no difference vs the diff link I had previously provided. Which was anticipated:

#17668 (comment)

When you say that we have nothing and making no progress without a PR, are you offering to participate in review?
So far no one has provided any feedback on the linked branch/diff, so if a new PR left in a WIP status is opened will anything change?


My previous input

None of this has changed.

  • I've been seeking feedback/guidance since July but none received yet, even after opening the PR as requested.
  • I still want to work on it, but it is difficult to prioritize in my current situation.

#17668 (comment)

I was hoping to just get general feedback if it was going in the right direction or if there was anything that stood out at a glance.

#17668 (comment)

It would probably help to know how you'd like to use/configure it.

  • How important is it to have config for setting default/fallback values?
  • How important is it to have customization for keys vs managing it all via VRL?

Feedback on the last shared iteration would also be helpful to know if that's going in the right direction or not. I might have more time spare by October or November to finish up my branch.

#17668 (comment)

I'd love to finish the PR sooner but have no employer at present, thus it's tricky to afford dedicating time to this over other priorities 😓

#17668 (comment)

My work on it is purely voluntary, no employer backing, so dedicating my spare time on it is tricky 😓

#17668 (comment)

I do want to complete the work, it's just been difficult to budget my volunteer time for.
So if nobody helps, I will eventually be able to block time out again for coming back to this. I sunk a considerable chunk of time into the branch already, would be an awful waste to leave it to rot.


Personal friction

When it comes to finding time to allocate to this PR, there's a bit more to it than covered below, but it's effectively a high cost item for me to tackle.

I need to have:

  • No other events with higher priorities to resolve, or that unintentionally become time sinks (ADHD related)
  • Sufficient time to context switch to this PR / task, and the energy to figure out and resolve any blocker or how to move forward (receiving the feedback and guidance requested would be helpful).
  • Sufficient hardware resources. Often I am lacking enough memory on my system to confidently work on this PR. I juggle quite a lot and experience crashes where I lose a fair amount of work (I do make an effort to save and add notes, but it's a tad complicated). I also have very low storage spare on the system which has been a point of friction in the past for this PR.

As such, the most optimal time is when I have nothing pressing elsewhere and I'm able to save everything I can to perform a system reboot (I only have a few of these a year as preparing for one can take multiple days for me). I'd much rather have projects properly isolated to individual VMs which would address quite a lot of these problems and allow me to work much more efficiently, but that requires time.

If I had a separate system temporarily dedicated for working on this PR I could tackle it much more easily. Presently my goals are:

  • Get a new phone (the one I've used since 2018 broke a few days ago)
  • Finish up a documentation PR
  • Prepare for system reboot and hope my system boots (sometimes it appears bricked, or it fails in other ways).
  • Block out time for Vector and hope nothing interrupts getting to the sync point to update this PR branch.

Optimistically the system reboot will be completed by the 17th if nothing else interferes, so there should be a new push on this branch before the end of the month.

If that happens, I'd toggle of the draft status and ping for review/feedback. If I receive feedback in a timely manner progress will be quick from then on, but if there's a notable delay (over a week) I'd likely have switched to something else and it could take a while to return back to the PR. I'm slightly concerned about timing as we approach December, but hopefully we can get this PR back on track 💪

@jcantrill
Copy link

@StephenWakely you commented in the previous PR, is it possible for you to provide additional feedback? My limited familiarity with RUST makes my input less then useful

@polarathene polarathene mentioned this pull request Nov 8, 2024
@StephenWakely
Copy link
Contributor

Sorry, I had assumed that since the PR is in draft you weren't looking for feedback yet, but I can definitely take a look

@pront pront marked this pull request as ready for review November 13, 2024 15:15
@pront pront requested a review from a team as a code owner November 13, 2024 15:15
@polarathene
Copy link
Author

polarathene commented Nov 13, 2024

I had assumed that since the PR is in draft you weren't looking for feedback yet

Not after anything too deep, just a glance over by someone more familiar with Vector. My local branch still has some changes to get pushed up that refactor some of the current changes, so only a light review for providing guidance / expectations is needed.

I left it in draft due to not having found time to sync my local branch to this PR yet, and that the PR itself is still not ready to merge as it's not functioning as desired. I ran into a blocker on the local branch earlier this year and will bring that up to discuss here or on Discord once the branch is sync'd (presently the local changes are only partially committed / stashed).


I'm still on schedule per my prior comment timeline for returning to this PR next week. Ideally someone can look over and provide feedback before Tuesday.

Copy link
Author

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some parts I can recall that I would like to receive feedback on, I'll bring up other concerns next week.

@@ -10,11 +10,13 @@ name = "generate-avro-fixtures"
path = "tests/bin/generate-avro-fixtures.rs"

[dependencies]
akin = { version = "0.4", optional = true }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to know if there's any policies about the additional crate deps I've added to support the PR. They're all marked optional and assigned to the syslog feature.

I could minimize the deps but it'd make the feature code itself less maintainer friendly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally if it is needed and the license is ok then we are fine to pull it in like you have done here.

However, I don't really see that this crate is adding enough to warrant it's inclusion. It's only used in one place from what I can see. There it would be more idiomatic to just use a simple macro:

macro_rules! deserialize_impl {
    ($enum:ty) => {
	impl $enum {
	    fn deserialize<'de, D>(deserializer: D) -> Result<Self, D::Error>
	
...

}

deserialize_impl!(Facility);
deserialize_impl!(Severity);

lib/codecs/src/encoding/format/syslog.rs Show resolved Hide resolved
Comment on lines +46 to +73
/// Syslog serializer options.
#[configurable_component]
#[derive(Clone, Debug, Default)]
// Serde default makes all config keys optional.
// Each field assigns either a fixed value, or field name (lookup field key to retrieve dynamic value per `LogEvent`).
#[serde(default)]
pub struct SyslogSerializerOptions {
/// RFC
rfc: SyslogRFC,
/// Facility
#[serde(deserialize_with = "Facility::deserialize")]
facility: Facility,
/// Severity
#[serde(deserialize_with = "Severity::deserialize")]
severity: Severity,

/// App Name
app_name: Option<String>,
/// Proc ID
proc_id: Option<String>,
/// Msg ID
msg_id: Option<String>,

/// Payload key
payload_key: String,

// Q: The majority of the fields above pragmatically only make sense as config for keys to query?
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was adapted from the original PR, payload_key has since been removed in favor of the actual Vector log message when available or could be set via VRL.

In my local branch I was looking into having most of these field that have an equivalent syslog struct field to instead share the same struct:

pub struct SyslogSerializerOptions {
    /// RFC
    rfc: SyslogRFC,

    #[serde(flatten)]
    #[configurable(derived)]
    priority: Pri,

    #[serde(flatten)]
    #[configurable(derived)]
    tag: Tag,
}

There was another variation that instead of configuring overrides to the defaults, would use ConfigTargetPath as a more appropriate config approach by allowing the user to adjust the VRL keys to map into the syslog struct fields:

pub struct SyslogSerializerOptions {
    /// Facility
    #[default = "facility"]
    facility: ConfigTargetPath,
    /// Severity
    #[default = "severity"]
    severity: ConfigTargetPath,

    /// Application Name
    #[default = "appname"]
    app_name: ConfigTargetPath,
    /// Process ID
    #[default = "procid"]
    proc_id: ConfigTargetPath,
    /// Message ID
    #[default = "msgid"]
    msg_id: ConfigTargetPath,

    /// Hostname
    #[default = "hostname"]
    hostname: ConfigTargetPath,
    /// Message
    message: Option<ConfigTargetPath>,
    /// Timestamp
    timestamp: Option<ConfigTargetPath>,
    /// Structured Data
    structured_data: Option<ConfigTargetPath>,
}

Probably the right way to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really these values should ideally be extracted by their meaning from the event. I've added some links to the comments I made on the other PR about them.

I added some explanation of what we actually mean by meaning here. The concept of meaning is in a bit of transition in Vector at the minute, hence perhaps it's not completely clear how it all fits in - but hopefully in time it will!

Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your efforts in putting this together! I appreciate it has been a bit of journey.

Mostly the changes needed are to make it work better when we have log_namespace: true set.

There's a few other checks that need to pass in order to merge this. See here for more details.

Any questions, please give me a shout.

rfc: SyslogRFC,
/// Facility
#[serde(deserialize_with = "Facility::deserialize")]
facility: Facility,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I said here should apply to this.

facility: Facility,
/// Severity
#[serde(deserialize_with = "Severity::deserialize")]
severity: Severity,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I said here should apply to this.

msg_id: Option<String>,

/// Payload key
payload_key: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| Remove this option. This should always be the field with the meaning message.

I would disagree with this suggestion as it reduces the flexibility of using the protocol. This implies you always want the "message" portion of the payload to be what vector determines is the "message" and not for instance how an administrator may have transformed it in a prior pipeline to be something else. The default if not specified IMO should be "message" but allow the flexibility; there is no reason not to

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies you always want the "message" portion of the payload to be what vector determines is the "message" and not for instance how an administrator may have transformed it in a prior pipeline to be something else.

Could you elaborate why you'd have introduced a difference in semantics? Please document a valid use-case for why you'd want such functionality so we know it's not an XY problem.

When would you transform the log message to not represent the message, yet provide it as input to an encoder to but require a different key for the actual message content?


I am familiar with the original PR use with add_source to compose metadata related to k8s with the actual message. You could do that with VRL, or for the metadata leverage structured data?

Copy link

@jcantrill jcantrill Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why you'd have introduced a difference in semantics? Please document a valid use-case for why you'd want such functionality so we know it's not an XY problem.

Vector is configurable as a transformation engine. Logs coming in do not need to equal logs going out. Consider almost any case where my desired payload is a subset of the original message, a value in the message which was a JSON string. In a prior transform, I could parse the message and now it is available to me as key/value pairs, where I only care about some portion of the message.

By removing this option does it mean in some prior transform do I always have to modify the event and set the key "message" to the desired outcome? Maybe I'm missing where the encoding ends and transforms and sinks begin 🤷‍♂️

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By removing this option does it mean in some prior transform do I always have to modify the event and set the key "message" to the desired outcome? Maybe I'm missing where the encoding ends and transforms and sinks begin 🤷‍♂️

Encoding is A => B, given input A produce output B (syslog formatted log in this case).

Perhaps I should phrase this a bit differently?:

  • If Vector supports encoding your input to other existing log encoders the way you want, you should expect to get the same with this syslog encoder.
  • If it does not, then there would need to be a very valid reason why this encoder is an exception vs what you'd otherwise need to do with any other encoder already supported by Vector.

As such, you should find that what you're requesting isn't specific to this encoder implemented via the PR, but something one would expect to be encoder agnostic.

That would be a separate feature request to raise, which may be rejected on the basis that VRL would be advised for your use-case to pre-process your input for the encoder instead.


I'm not an active Vector user myself yet, but from what I do understand message is an internal convention and the encoders rely on that to be semantically correct as input, thus you'd be expected to use existing Vector features to handle that prior to encoding.

@StephenWakely will chime in if I'm mistaken, and you're also welcome to correct me too.

severity: Severity,

/// App Name
app_name: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// App Name
app_name: Option<String>,
/// Proc ID
proc_id: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +121 to +124
let y = |v| self.replace_if_proxied_opt(v);
let app_name = y(&config.app_name).unwrap_or("vector".to_owned());
let proc_id = y(&config.proc_id);
let msg_id = y(&config.msg_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should all be set like:

field = config.value.unwrap_or_else(|| log.get_field_by_meaning("service"))

Comment on lines +218 to +224
let payload = if config.payload_key.is_empty() {
self.log.get_message().map(|v| v.to_string_lossy().to_string() )
} else {
self.value_by_key(&config.payload_key)
};

payload.unwrap_or_default()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No payload key, so just something like this should do:

Suggested change
let payload = if config.payload_key.is_empty() {
self.log.get_message().map(|v| v.to_string_lossy().to_string() )
} else {
self.value_by_key(&config.payload_key)
};
payload.unwrap_or_default()
self.log.get_message().map(|v| v.to_string_lossy().to_string())

// https://datatracker.ietf.org/doc/html/rfc3164#section-4.1.2
// https://docs.rs/chrono/latest/chrono/format/strftime/index.html
//
// TODO: Should this remain as UTC or adjust to the local TZ of the environment (or Vector config)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep it UTC.

&self.message,
].concat()

// Q: RFC 5424 MSG part should technically ensure UTF-8 message begins with BOM?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want the challenge you are welcome to embark on the journey to make this comply. However, I feel that this occurs in practice rarely enough that it won't be a blocker to getting this merged.

}

fn get_structured_data(&self) -> Option<StructuredData> {
self.log.get("structured_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Vector namespace, (when log_namespace: true) we want to get this by meaning instead of name:

Suggested change
self.log.get("structured_data")
match self.log.namespace() {
LogNamespace::Vector => self.log.get_by_meaning("structured_data")
LogNamespace::Legacy => self.log.get("structured_data")
}

Copy link
Author

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the speedy feedback! ❤️

I'll return to this next week to go over your feedback properly then and address it, cheers 😁

//
// Q: Why `$.message.` as the prefix? (Appears to be JSONPath syntax?)
// NOTE: Originally named in PR as: `get_field_or_config()`
fn replace_if_proxied(&self, value: &str) -> Option<String> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original PR IIRC had implemented a feature to support config that overrides a default value (such as always X severity), presumably because the log event did not provide an equivalent mapping. Or the config key would reference a field to lookup and get the value of from the event, which I think their $.message prefix was specific to their own use-case.

I'd rather drop it, but wasn't sure how much of the original PR was the correct approach as I'm not familiar enough with Vector myself. Thus this PR focused on refactoring the functionality that was already in the existing PR and moving forward from there 😅


So this method was mostly a helper:

        value
            .strip_prefix("$.message.")
            .map_or(
                Some(value.to_owned()),
                |field_key| self.value_by_key(field_key),
            )
    }

    fn value_by_key(&self, field_key: &str) -> Option<String> {
        self.log.get(field_key).map(|field_value| {
            field_value.to_string_lossy().to_string()
        })
    }

Used earlier like:

    fn decant_config(&self, config: &SyslogSerializerOptions) -> SyslogMessage {
        let y = |v| self.replace_if_proxied_opt(v);
        let app_name = y(&config.app_name).unwrap_or("vector".to_owned());
        let proc_id = y(&config.proc_id);
        let msg_id = y(&config.msg_id);

Since they're options, if no lookup key or default value was configured, y would return None:

  • proc_id and msg_id would later be formatted as -.
  • app_name default is meant to fallback to something IIRC so vector was the default chosen. In my local branch I used the SmartDefault crate to add a default to the actual syslog Tag struct for the app_name field instead.

I'm not familiar with log.get_field_by_meaning or if that was available when I started working on this PR early this year. I'll look into that if it's appropriate (I'm aware of this reference you shared in another feedback comment, I haven't gone over that yet).

IIRC this logic may have been refactored on my local branch.

@jcantrill
Copy link

@StephenWakely @polarathene

I don't fully understand all of what this entails but my general impression is the suggested changes are moving from flexible to an opinionated implementation, relying upon the impl to assume values. Maybe I'm reading incorrectly but I am pushing to ask for flexibility and configurability. The spec defines specific attributes and layout of a message but leaves quite a bit open for interpretation by an admin. As a user of vector in a cloud containerized environment that already is forwarding logs over syslog, we have use cases where it makes sense to allow administrators to set various fields.

The original PR as a straight port of a fluend plugin and what we have now for our vector implementation. Reviewing the syslog spec I can see where several config knobs should be dropped (e.g. tag, add_source) but IMO the others should remain

@polarathene
Copy link
Author

my general impression is the suggested changes are moving from flexible to an opinionated implementation, relying upon the impl to assume values.
Maybe I'm reading incorrectly but I am pushing to ask for flexibility and configurability.

I haven't had time to go over full review feedback, but I'll do so next week when I return to sync my local changes to this branch.

  • Any input from me thus far in the review process is mostly focused on the encoder config settings which IMO were not implemented well in the original PR.
  • My primary focus will be on getting the features core functionality sorted for review and approval. At which point I'd reach out for subscribed users to test it out for feedback so that we can verify the feature is useful your use-cases.

Config wise, I think any property/field mapping can be handled via VRL with remap. We can discuss that in future if it's inconvenient and could be improved, but given my limited time I hope you understand why I don't want too much bike shedding over that right now.


The spec defines specific attributes and layout of a message but leaves quite a bit open for interpretation by an admin.
As a user of vector in a cloud containerized environment that already is forwarding logs over syslog, we have use cases where it makes sense to allow administrators to set various fields.

When this PR reaches the next milestone (when I ping for proper review / testing), I'd be happy to discuss this with you here. As per my review on the original PR, it should be evident that I care about respecting the spec. Anything that should be - (NIL) when not present is handled, and you should have the ability to override each field IIRC, just via Vector remap transforms.

FWIW: My interest in Vector with this feature is also for usage in a container.


The original PR is a straight port of a fluentd plugin and what we have now for our vector implementation.

This PR adapted the original, you can follow the commit history for logical commits with descriptions that reason each change.

IIRC the original had various concerns that I've addressed or intend to (these should be detailed in the commit messages and/or inline commentary).

I remember the add_source functionality and dropping it here in one of the commits as it seemed to be redundant (could be handled via VRL), thus I think most of your concerns aren't likely to be a problem and you'd still have the control/flexibility for your requirements, it just may look a little different config wise.

@StephenWakely
Copy link
Contributor

@StephenWakely @polarathene

I don't fully understand all of what this entails but my general impression is the suggested changes are moving from flexible to an opinionated implementation, relying upon the impl to assume values. Maybe I'm reading incorrectly but I am pushing to ask for flexibility and configurability.

It's pushing the flexibility back into VRL, which we already have. This codec should be opinionated about what it expects the log message to look like. Any flexibility you need should be handled in a remap transform.

@pront
Copy link
Member

pront commented Nov 18, 2024

@StephenWakely @polarathene
I don't fully understand all of what this entails but my general impression is the suggested changes are moving from flexible to an opinionated implementation, relying upon the impl to assume values. Maybe I'm reading incorrectly but I am pushing to ask for flexibility and configurability.

It's pushing the flexibility back into VRL, which we already have. This codec should be opinionated about what it expects the log message to look like. Any flexibility you need should be handled in a remap transform.

This sounds like a good strategy in general. Remap can do any kind of event transformation you need. The encoder should specify the interface it expects so it can efficiently batch and send requests downstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: codecs Anything related to Vector's codecs (encoding/decoding) domain: sinks Anything related to the Vector's sinks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants