-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Add support for complex capture group names in parse_regex. #505
base: master
Are you sure you want to change the base?
Conversation
9cc461f
to
17f52e6
Compare
// rewriteComplexCaptures translates disallowed capture group names into placeholders and a transformation to rename them back in the log record. | ||
func rewriteComplexCaptures(regex, tag string) (string, []fluentbit.Component) { | ||
// Short-circuit regexes that don't have disallowed capture group names | ||
disallowed := regexp.MustCompile(`\(\?P<(?:[A-Za-z0-9_]*[^A-Za-z0-9_>])+(?:[A-Za-z0-9_]*)>`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's possible to parse a regex with a regex, because of the nesting parentheses problem. You would need a smarter parser/lexer. We may yet have to go down that road to adopt Stanza, but I really don't want to if we can avoid it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, you're correct. Luckily, this isn't actually parsing a regex — it's only looking for the named capturing groups, which have a much simpler syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure I could construct a regex that this code does not parse correctly. e.g. I don't think it would correctly handle
(?P<groupone>\(?P<notagroup>content\))
which would match the string
P<notagroup>content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. It's not really a nesting problem, as much as an escaping problem, and a potential limitation of Go's regexp engine (which, e.g., does not allow lookbehind). It may be possible to slurp up escaped backslashes prefixed by the start anchor or a non-backslash character, which will take care of that. Let me fix the regex (and I can add your example to the test to make sure it stays untouched).
// Rename all captured fields | ||
oc := make([][2]string, len(rewrites)) | ||
for i, g := range rewrites { | ||
oc = append(oc, [2]string{"Rename", fmt.Sprintf("__%d %q", i+1, g)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't allow users to set complicated fields they might want to set such as httpRequest.status
. I think you need to use modify_fields
here to perform the rename operation so that it can parse all the field names correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it doesn't, but this wasn't the intended goal. This was supposed to match the parse_json
functionality. As I understand it, parse_json
would not transform {"httpRequest.status": 200}
into LogEntry(httpRequest={status=200})
either… For that, you'd need something like the modify_fields
processor you're adding in #474.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not match parse_json
's behavior here. It intentionally has a different input shape.
Either way, you're still not giving people a way to set nested fields like this. e.g. you could give parse_json
{
"logging.googleapis.com/labels": {
"hello": "world"
}
}
which in OA config would be called labels.hello
, but either way you can't extract into that field. In fact I think severity is one of the only fields you can set that doesn't require nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
-
(?<logging.googleapis.com/labels/hello>...)
-
(?<"logging.googleapis.com/labels".hello>...)
The second form is basically the filter syntax, except it uses a non-LogEntry structure which would be confusing...
But if severity
is the only one that doesn't need nesting, and severity
is the one causing the most trouble for users, what if we just forget about nested fields? Seems like it would still add value even if severity
is the only "special" that really works with the syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. PTAL.
// rewriteComplexCaptures translates disallowed capture group names into placeholders and a transformation to rename them back in the log record. | ||
func rewriteComplexCaptures(regex, tag string) (string, []fluentbit.Component) { | ||
// Short-circuit regexes that don't have disallowed capture group names | ||
disallowed := regexp.MustCompile(`\(\?P<(?:[A-Za-z0-9_]*[^A-Za-z0-9_>])+(?:[A-Za-z0-9_]*)>`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, you're correct. Luckily, this isn't actually parsing a regex — it's only looking for the named capturing groups, which have a much simpler syntax.
// Rename all captured fields | ||
oc := make([][2]string, len(rewrites)) | ||
for i, g := range rewrites { | ||
oc = append(oc, [2]string{"Rename", fmt.Sprintf("__%d %q", i+1, g)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it doesn't, but this wasn't the intended goal. This was supposed to match the parse_json
functionality. As I understand it, parse_json
would not transform {"httpRequest.status": 200}
into LogEntry(httpRequest={status=200})
either… For that, you'd need something like the modify_fields
processor you're adding in #474.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This will allow capturing special fields using
parse_regex
.b/227803515