-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(codecs): Add syslog encoder #21307
base: master
Are you sure you want to change the base?
Conversation
Original commit from syedriko
This is only a temporary change to make the diffs for future commits easier to follow.
- Introduce a `Pri` struct with fields for severity and facility as enum values. - `Pri` uses `strum` crate to parse string values into their appropriate enum variant. - Handles the responsibility of encoding the two enum values ordinal values into the `PRIVAL` value for the encoder. - As `Facility` and `Severity` enums better represent their ordinal mapping directly - The `Fixed` + `Field` subtyping with custom deserializer isn't necessary. Parsing a string that represents the enum by name or its ordinal representation is much simpler. - Likewise this removes the need for the get methods as the enum can provide both the `String` or `u8` representation as needed.
`SyslogSerializer::encode()` has been simplified. - Only matching `Event::Log` is relevant, an `if let` bind instead of `match` helps remove a redundant level of nesting. - This method only focuses on boilerplate now, delegating the rest to `ConfigDecanter` (_adapt `LogEvent` + encoder config_) and `SyslogMessage` (_encode into syslog message string_). - This removes some complexity during actual encoding logic, which should only be concerned about directly encoding from one representation to another, not complimentary features related to Vector config or it's type system. The new `ConfigDecanter` is where many of the original helper methods that were used by `SyslogSerializer::encode()` now reside. This change better communicates the scope of their usage. - Any interaction with `LogEvent` is now contained within the methods of this new struct. Likewise for the consumption of the encoder configuration (instead of queries to config throughout encoding). - The `decant_config()` method better illustrates an overview of the data we're encoding and where that's being sourced from via the new `SyslogMessage` struct, which splits off the actual encoding responsibility (see next commit).
`SyslogSerializerConfig` has been simplified. - Facility / Severity deserializer methods aren't needed, as per their prior refactor with `strum`. - The `app_name` default is set via `decant_config()` when not configured explicitly. - The other two fields calling a `default_nil_value()` method instead use an option value which encodes `None` into the expected `-` value. - Everything else does not need a serde attribute to apply a default, the `Default` trait on the struct is sufficient. - `trim_prefix` was removed as it didn't seem relevant. `tag` was also removed as it's represented by several subfields in RFC 5424 which RFC 3164 can also use. `SyslogMessage::encode()` refactors the original PR encoding logic: - Syslog Header fields focused, the PRI and final message value have already been prepared prior. They are only referenced at the end of `encode()` to combine into the final string output. - While less efficient than `push_str()`, each match variant has a clear structure returned via the array `join(" ")` which minimizes the noise of `SP` from the original PR. Value preparation prior to this is clear and better documented. - `Tag` is a child struct to keep the main logic easy to grok. `StructuredData` is a similar case.
No changes beyond relocating the code into a single file.
- Drop notes referring to original PR differences + StructuredData adaption references. None of it should be relevant going forward. - Revise some other notes. - Drop `add_log_source` method (introduced from the original PR author) in favor of using `StructuredData` support instead.
This should be simple and lightweight enough to justify for the DRY benefit? This way the method doesn't need to be duplicated redundantly. That was required because there is no trait for `FromRepr` provided via `strum`. That would require a similar amount of lines for the small duplication here. The `akin` macro duplicates the `impl` block for each value in the `&enums` array.
- `ConfigDecanter::get_message()` replaces the fallback method in favor of `to_string_lossy()` (a dedicated equivalent for converting `Value` type to a String type (_technically it is a CoW str, hence the follow-up with `to_string()`_)). - This also encodes the value better, especially for the default `log_namespace: false` as the message value (when `String`) is not quote wrapped, which matches the behaviour of the `text` encoder output. - Additionally uses the `LogEvent` method `get_message()` directly from `lib/vector-core/src/event /log_event.rs`. This can better retrieve the log message regardless of the `log_namespace` setting. - Encoding of RFC 5424 fields has changed to inline the `version` constant directly, instead of via a redundant variable. If there's ever multiple versions that need to be supported, it could be addressed then. - The RFC 5424 timestamp has a max precision of microseconds, thus this should be rounded and `AutoSi` can be used (_or `Micros` if it should have fixed padding instead of truncating trailing `000`_).
- The original PR author appears to have relied on a hard-coded timestamp key here. - `DateTime<Local>` would render the timestamp field with the local timezone offset, but other than that `DateTime<Utc>` would seem more consistent with usage in Vector, especially since any original TZ context is lost by this point? - Notes adjusted accordingly, with added TODO query for each encoding mode to potentially support configurable timezone.
- Move encoder config settings under a single `syslog` config field. This better mirrors configuration options for existing encoders like Avro and CSV. - `ConfigDecanter::value_by_key()` appears to accomplish roughly the same as the existing helper method `to_string_lossy()`. Prefer that instead. This also makes the `StructuredData` helper `value_to_string()` redundant too at a glance? - Added some reference for the priority value `PRIVAL`. - `Pri::from_str_variants()` uses the existing defaults for fallback, communicate that more clearly. Contextual note is no longer useful, removed.
To better communicate the allowed values, these two config fields can change from the `String` type to their appropriate enum type. - This relies on serde to deserialize the config value to the enum which adds a bit more noise to grok. - It does make `Pri::from_str_variants()` redundant, while the `into_variant()` methods are refactored to `deserialize()` with a proper error message emitted to match the what serde would normally emit for failed enum variant deserialization. - A drawback of this change is that these two config fields lost the ability to reference a different value path in the `LogEvent`. That'll be addressed in a future commit.
In a YAML config a string can optionally be wrapped with quotes, while a number that isn't quote wrapped will be treated as a number type. The current support was only for string numbers, this change now supports flexibility for config using ordinal values in YAML regardless of quote usage. The previous `Self::into_variant(&s)` logic could have been used instead of bringing in `serde-aux`, but the external helper attribute approach seems easier to grok/follow as the intermediary container still seems required for a terse implementation. The match statement uses a reference (_which requires a deref for `from_repr`_) to appease the borrow checker for the later borrow needed by `value` in the error message.
This seems redundant given the context? Mostly adds unnecessary noise. Could probably `impl Configurable` or similar to try workaround the requirement. The metadata description could generate the variant list similar to how it's been handled for error message handling?
Not sure if this is worthwhile, but it adopts error message convention elsewhere I've seen by managing them via Snafu.
@polarathene Any updates on this ? |
No sorry. Several unexpected events happened in a short time period, I was unsuccessful at avoiding burnout which I'm still in the process of exiting from. ContextFWIW, I over committed myself in 2023, while facing several setbacks and issues in 2024. This isn't the only work where I'm pinged for updates to move things forward that have come to a halt. I do understand the frustration of wanting to have this become actively developed again and merged, but without any assistance it's a task I struggle to allocate time for. More on that below. PR was opened due to demandWhen I raised a concern about the lack of feedback received thus far, it was insisted that I open a PR, even if it's not in the state desired for handing to reviewers.
However opening the PR early has made no difference vs the diff link I had previously provided. Which was anticipated:
My previous inputNone of this has changed.
Personal frictionWhen it comes to finding time to allocate to this PR, there's a bit more to it than covered below, but it's effectively a high cost item for me to tackle. I need to have:
As such, the most optimal time is when I have nothing pressing elsewhere and I'm able to save everything I can to perform a system reboot (I only have a few of these a year as preparing for one can take multiple days for me). I'd much rather have projects properly isolated to individual VMs which would address quite a lot of these problems and allow me to work much more efficiently, but that requires time. If I had a separate system temporarily dedicated for working on this PR I could tackle it much more easily. Presently my goals are:
Optimistically the system reboot will be completed by the 17th if nothing else interferes, so there should be a new push on this branch before the end of the month. If that happens, I'd toggle of the draft status and ping for review/feedback. If I receive feedback in a timely manner progress will be quick from then on, but if there's a notable delay (over a week) I'd likely have switched to something else and it could take a while to return back to the PR. I'm slightly concerned about timing as we approach December, but hopefully we can get this PR back on track 💪 |
@StephenWakely you commented in the previous PR, is it possible for you to provide additional feedback? My limited familiarity with RUST makes my input less then useful |
Sorry, I had assumed that since the PR is in draft you weren't looking for feedback yet, but I can definitely take a look |
Not after anything too deep, just a glance over by someone more familiar with Vector. My local branch still has some changes to get pushed up that refactor some of the current changes, so only a light review for providing guidance / expectations is needed. I left it in draft due to not having found time to sync my local branch to this PR yet, and that the PR itself is still not ready to merge as it's not functioning as desired. I ran into a blocker on the local branch earlier this year and will bring that up to discuss here or on Discord once the branch is sync'd (presently the local changes are only partially committed / stashed). I'm still on schedule per my prior comment timeline for returning to this PR next week. Ideally someone can look over and provide feedback before Tuesday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is some parts I can recall that I would like to receive feedback on, I'll bring up other concerns next week.
@@ -10,11 +10,13 @@ name = "generate-avro-fixtures" | |||
path = "tests/bin/generate-avro-fixtures.rs" | |||
|
|||
[dependencies] | |||
akin = { version = "0.4", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to know if there's any policies about the additional crate deps I've added to support the PR. They're all marked optional and assigned to the syslog
feature.
I could minimize the deps but it'd make the feature code itself less maintainer friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally if it is needed and the license is ok then we are fine to pull it in like you have done here.
However, I don't really see that this crate is adding enough to warrant it's inclusion. It's only used in one place from what I can see. There it would be more idiomatic to just use a simple macro:
macro_rules! deserialize_impl {
($enum:ty) => {
impl $enum {
fn deserialize<'de, D>(deserializer: D) -> Result<Self, D::Error>
...
}
deserialize_impl!(Facility);
deserialize_impl!(Severity);
/// Syslog serializer options. | ||
#[configurable_component] | ||
#[derive(Clone, Debug, Default)] | ||
// Serde default makes all config keys optional. | ||
// Each field assigns either a fixed value, or field name (lookup field key to retrieve dynamic value per `LogEvent`). | ||
#[serde(default)] | ||
pub struct SyslogSerializerOptions { | ||
/// RFC | ||
rfc: SyslogRFC, | ||
/// Facility | ||
#[serde(deserialize_with = "Facility::deserialize")] | ||
facility: Facility, | ||
/// Severity | ||
#[serde(deserialize_with = "Severity::deserialize")] | ||
severity: Severity, | ||
|
||
/// App Name | ||
app_name: Option<String>, | ||
/// Proc ID | ||
proc_id: Option<String>, | ||
/// Msg ID | ||
msg_id: Option<String>, | ||
|
||
/// Payload key | ||
payload_key: String, | ||
|
||
// Q: The majority of the fields above pragmatically only make sense as config for keys to query? | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was adapted from the original PR, payload_key
has since been removed in favor of the actual Vector log message
when available or could be set via VRL.
In my local branch I was looking into having most of these field that have an equivalent syslog struct field to instead share the same struct:
pub struct SyslogSerializerOptions {
/// RFC
rfc: SyslogRFC,
#[serde(flatten)]
#[configurable(derived)]
priority: Pri,
#[serde(flatten)]
#[configurable(derived)]
tag: Tag,
}
There was another variation that instead of configuring overrides to the defaults, would use ConfigTargetPath
as a more appropriate config approach by allowing the user to adjust the VRL keys to map into the syslog struct fields:
pub struct SyslogSerializerOptions {
/// Facility
#[default = "facility"]
facility: ConfigTargetPath,
/// Severity
#[default = "severity"]
severity: ConfigTargetPath,
/// Application Name
#[default = "appname"]
app_name: ConfigTargetPath,
/// Process ID
#[default = "procid"]
proc_id: ConfigTargetPath,
/// Message ID
#[default = "msgid"]
msg_id: ConfigTargetPath,
/// Hostname
#[default = "hostname"]
hostname: ConfigTargetPath,
/// Message
message: Option<ConfigTargetPath>,
/// Timestamp
timestamp: Option<ConfigTargetPath>,
/// Structured Data
structured_data: Option<ConfigTargetPath>,
}
Probably the right way to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really these values should ideally be extracted by their meaning from the event. I've added some links to the comments I made on the other PR about them.
I added some explanation of what we actually mean by meaning here. The concept of meaning is in a bit of transition in Vector at the minute, hence perhaps it's not completely clear how it all fits in - but hopefully in time it will!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your efforts in putting this together! I appreciate it has been a bit of journey.
Mostly the changes needed are to make it work better when we have log_namespace: true
set.
There's a few other checks that need to pass in order to merge this. See here for more details.
Any questions, please give me a shout.
rfc: SyslogRFC, | ||
/// Facility | ||
#[serde(deserialize_with = "Facility::deserialize")] | ||
facility: Facility, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I said here should apply to this.
facility: Facility, | ||
/// Severity | ||
#[serde(deserialize_with = "Severity::deserialize")] | ||
severity: Severity, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I said here should apply to this.
msg_id: Option<String>, | ||
|
||
/// Payload key | ||
payload_key: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Remove this option. This should always be the field with the meaning message.
I would disagree with this suggestion as it reduces the flexibility of using the protocol. This implies you always want the "message" portion of the payload to be what vector determines is the "message" and not for instance how an administrator may have transformed it in a prior pipeline to be something else. The default if not specified IMO should be "message" but allow the flexibility; there is no reason not to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies you always want the "message" portion of the payload to be what vector determines is the "message" and not for instance how an administrator may have transformed it in a prior pipeline to be something else.
Could you elaborate why you'd have introduced a difference in semantics? Please document a valid use-case for why you'd want such functionality so we know it's not an XY problem.
When would you transform the log message to not represent the message, yet provide it as input to an encoder to but require a different key for the actual message content?
I am familiar with the original PR use with add_source
to compose metadata related to k8s with the actual message. You could do that with VRL, or for the metadata leverage structured data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why you'd have introduced a difference in semantics? Please document a valid use-case for why you'd want such functionality so we know it's not an XY problem.
Vector is configurable as a transformation engine. Logs coming in do not need to equal logs going out. Consider almost any case where my desired payload is a subset of the original message, a value in the message which was a JSON string. In a prior transform, I could parse the message and now it is available to me as key/value pairs, where I only care about some portion of the message.
By removing this option does it mean in some prior transform do I always have to modify the event and set the key "message" to the desired outcome? Maybe I'm missing where the encoding ends and transforms and sinks begin 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By removing this option does it mean in some prior transform do I always have to modify the event and set the key "message" to the desired outcome? Maybe I'm missing where the encoding ends and transforms and sinks begin 🤷♂️
Encoding is A => B
, given input A produce output B (syslog formatted log in this case).
Perhaps I should phrase this a bit differently?:
- If Vector supports encoding your input to other existing log encoders the way you want, you should expect to get the same with this syslog encoder.
- If it does not, then there would need to be a very valid reason why this encoder is an exception vs what you'd otherwise need to do with any other encoder already supported by Vector.
As such, you should find that what you're requesting isn't specific to this encoder implemented via the PR, but something one would expect to be encoder agnostic.
That would be a separate feature request to raise, which may be rejected on the basis that VRL would be advised for your use-case to pre-process your input for the encoder instead.
I'm not an active Vector user myself yet, but from what I do understand message
is an internal convention and the encoders rely on that to be semantically correct as input, thus you'd be expected to use existing Vector features to handle that prior to encoding.
@StephenWakely will chime in if I'm mistaken, and you're also welcome to correct me too.
severity: Severity, | ||
|
||
/// App Name | ||
app_name: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// App Name | ||
app_name: Option<String>, | ||
/// Proc ID | ||
proc_id: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let y = |v| self.replace_if_proxied_opt(v); | ||
let app_name = y(&config.app_name).unwrap_or("vector".to_owned()); | ||
let proc_id = y(&config.proc_id); | ||
let msg_id = y(&config.msg_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all be set like:
field = config.value.unwrap_or_else(|| log.get_field_by_meaning("service"))
let payload = if config.payload_key.is_empty() { | ||
self.log.get_message().map(|v| v.to_string_lossy().to_string() ) | ||
} else { | ||
self.value_by_key(&config.payload_key) | ||
}; | ||
|
||
payload.unwrap_or_default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No payload key, so just something like this should do:
let payload = if config.payload_key.is_empty() { | |
self.log.get_message().map(|v| v.to_string_lossy().to_string() ) | |
} else { | |
self.value_by_key(&config.payload_key) | |
}; | |
payload.unwrap_or_default() | |
self.log.get_message().map(|v| v.to_string_lossy().to_string()) |
// https://datatracker.ietf.org/doc/html/rfc3164#section-4.1.2 | ||
// https://docs.rs/chrono/latest/chrono/format/strftime/index.html | ||
// | ||
// TODO: Should this remain as UTC or adjust to the local TZ of the environment (or Vector config)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep it UTC.
&self.message, | ||
].concat() | ||
|
||
// Q: RFC 5424 MSG part should technically ensure UTF-8 message begins with BOM? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want the challenge you are welcome to embark on the journey to make this comply. However, I feel that this occurs in practice rarely enough that it won't be a blocker to getting this merged.
} | ||
|
||
fn get_structured_data(&self) -> Option<StructuredData> { | ||
self.log.get("structured_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Vector namespace, (when log_namespace: true
) we want to get this by meaning instead of name:
self.log.get("structured_data") | |
match self.log.namespace() { | |
LogNamespace::Vector => self.log.get_by_meaning("structured_data") | |
LogNamespace::Legacy => self.log.get("structured_data") | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the speedy feedback! ❤️
I'll return to this next week to go over your feedback properly then and address it, cheers 😁
// | ||
// Q: Why `$.message.` as the prefix? (Appears to be JSONPath syntax?) | ||
// NOTE: Originally named in PR as: `get_field_or_config()` | ||
fn replace_if_proxied(&self, value: &str) -> Option<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original PR IIRC had implemented a feature to support config that overrides a default value (such as always X severity), presumably because the log event did not provide an equivalent mapping. Or the config key would reference a field to lookup and get the value of from the event, which I think their $.message
prefix was specific to their own use-case.
I'd rather drop it, but wasn't sure how much of the original PR was the correct approach as I'm not familiar enough with Vector myself. Thus this PR focused on refactoring the functionality that was already in the existing PR and moving forward from there 😅
So this method was mostly a helper:
value
.strip_prefix("$.message.")
.map_or(
Some(value.to_owned()),
|field_key| self.value_by_key(field_key),
)
}
fn value_by_key(&self, field_key: &str) -> Option<String> {
self.log.get(field_key).map(|field_value| {
field_value.to_string_lossy().to_string()
})
}
Used earlier like:
fn decant_config(&self, config: &SyslogSerializerOptions) -> SyslogMessage {
let y = |v| self.replace_if_proxied_opt(v);
let app_name = y(&config.app_name).unwrap_or("vector".to_owned());
let proc_id = y(&config.proc_id);
let msg_id = y(&config.msg_id);
Since they're options, if no lookup key or default value was configured, y
would return None
:
proc_id
andmsg_id
would later be formatted as-
.app_name
default is meant to fallback to something IIRC sovector
was the default chosen. In my local branch I used theSmartDefault
crate to add a default to the actual syslogTag
struct for theapp_name
field instead.
I'm not familiar with log.get_field_by_meaning
or if that was available when I started working on this PR early this year. I'll look into that if it's appropriate (I'm aware of this reference you shared in another feedback comment, I haven't gone over that yet).
IIRC this logic may have been refactored on my local branch.
I don't fully understand all of what this entails but my general impression is the suggested changes are moving from flexible to an opinionated implementation, relying upon the impl to assume values. Maybe I'm reading incorrectly but I am pushing to ask for flexibility and configurability. The spec defines specific attributes and layout of a message but leaves quite a bit open for interpretation by an admin. As a user of vector in a cloud containerized environment that already is forwarding logs over syslog, we have use cases where it makes sense to allow administrators to set various fields. The original PR as a straight port of a fluend plugin and what we have now for our vector implementation. Reviewing the syslog spec I can see where several config knobs should be dropped (e.g. tag, add_source) but IMO the others should remain |
I haven't had time to go over full review feedback, but I'll do so next week when I return to sync my local changes to this branch.
Config wise, I think any property/field mapping can be handled via VRL with remap. We can discuss that in future if it's inconvenient and could be improved, but given my limited time I hope you understand why I don't want too much bike shedding over that right now.
When this PR reaches the next milestone (when I ping for proper review / testing), I'd be happy to discuss this with you here. As per my review on the original PR, it should be evident that I care about respecting the spec. Anything that should be FWIW: My interest in Vector with this feature is also for usage in a container.
This PR adapted the original, you can follow the commit history for logical commits with descriptions that reason each change. IIRC the original had various concerns that I've addressed or intend to (these should be detailed in the commit messages and/or inline commentary). I remember the |
It's pushing the flexibility back into VRL, which we already have. This codec should be opinionated about what it expects the log message to look like. Any flexibility you need should be handled in a remap transform. |
This sounds like a good strategy in general. Remap can do any kind of event transformation you need. The encoder should specify the interface it expects so it can efficiently batch and send requests downstream. |
Replaces: #17668
git blame
on my branch (Git web UI works fine for that) may provide the extra context needed 👍NOTE:
I'll try to allocate time within October to get those changes applied to this branch, along with a proper PR description at that point.
The conflicts were already resolved locally IIRC, they'll be taken care of with the future update.