Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add support for complex capture group names in parse_regex. #505

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

igorpeshansky
Copy link
Member

This will allow capturing special fields using parse_regex.

b/227803515

@igorpeshansky igorpeshansky force-pushed the igorpeshansky-parse_regex-capture-groups branch from 9cc461f to 17f52e6 Compare April 2, 2022 00:04
// rewriteComplexCaptures translates disallowed capture group names into placeholders and a transformation to rename them back in the log record.
func rewriteComplexCaptures(regex, tag string) (string, []fluentbit.Component) {
// Short-circuit regexes that don't have disallowed capture group names
disallowed := regexp.MustCompile(`\(\?P<(?:[A-Za-z0-9_]*[^A-Za-z0-9_>])+(?:[A-Za-z0-9_]*)>`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's possible to parse a regex with a regex, because of the nesting parentheses problem. You would need a smarter parser/lexer. We may yet have to go down that road to adopt Stanza, but I really don't want to if we can avoid it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, you're correct. Luckily, this isn't actually parsing a regex — it's only looking for the named capturing groups, which have a much simpler syntax.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I could construct a regex that this code does not parse correctly. e.g. I don't think it would correctly handle

(?P<groupone>\(?P<notagroup>content\))

which would match the string

P<notagroup>content

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. It's not really a nesting problem, as much as an escaping problem, and a potential limitation of Go's regexp engine (which, e.g., does not allow lookbehind). It may be possible to slurp up escaped backslashes prefixed by the start anchor or a non-backslash character, which will take care of that. Let me fix the regex (and I can add your example to the test to make sure it stays untouched).

// Rename all captured fields
oc := make([][2]string, len(rewrites))
for i, g := range rewrites {
oc = append(oc, [2]string{"Rename", fmt.Sprintf("__%d %q", i+1, g)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't allow users to set complicated fields they might want to set such as httpRequest.status. I think you need to use modify_fields here to perform the rename operation so that it can parse all the field names correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it doesn't, but this wasn't the intended goal. This was supposed to match the parse_json functionality. As I understand it, parse_json would not transform {"httpRequest.status": 200} into LogEntry(httpRequest={status=200}) either… For that, you'd need something like the modify_fields processor you're adding in #474.

Copy link
Member

@quentinmit quentinmit Apr 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not match parse_json's behavior here. It intentionally has a different input shape.

Either way, you're still not giving people a way to set nested fields like this. e.g. you could give parse_json

{
 "logging.googleapis.com/labels": {
  "hello": "world"
 }
}

which in OA config would be called labels.hello, but either way you can't extract into that field. In fact I think severity is one of the only fields you can set that doesn't require nesting.

Copy link
Member

@jefferbrecht jefferbrecht Apr 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

  1. (?<logging.googleapis.com/labels/hello>...)

  2. (?<"logging.googleapis.com/labels".hello>...)

The second form is basically the filter syntax, except it uses a non-LogEntry structure which would be confusing...

But if severity is the only one that doesn't need nesting, and severity is the one causing the most trouble for users, what if we just forget about nested fields? Seems like it would still add value even if severity is the only "special" that really works with the syntax.

Copy link
Member Author

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. PTAL.

// rewriteComplexCaptures translates disallowed capture group names into placeholders and a transformation to rename them back in the log record.
func rewriteComplexCaptures(regex, tag string) (string, []fluentbit.Component) {
// Short-circuit regexes that don't have disallowed capture group names
disallowed := regexp.MustCompile(`\(\?P<(?:[A-Za-z0-9_]*[^A-Za-z0-9_>])+(?:[A-Za-z0-9_]*)>`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, you're correct. Luckily, this isn't actually parsing a regex — it's only looking for the named capturing groups, which have a much simpler syntax.

// Rename all captured fields
oc := make([][2]string, len(rewrites))
for i, g := range rewrites {
oc = append(oc, [2]string{"Rename", fmt.Sprintf("__%d %q", i+1, g)})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it doesn't, but this wasn't the intended goal. This was supposed to match the parse_json functionality. As I understand it, parse_json would not transform {"httpRequest.status": 200} into LogEntry(httpRequest={status=200}) either… For that, you'd need something like the modify_fields processor you're adding in #474.

Copy link

github-actions bot commented Jan 7, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants