feat: Add new Drain tokenizer that splits on most punctuation #13143
Conversation
@@ -0,0 +1 @@
+package output
??
@@ -139,7 +141,7 @@ func DefaultConfig() *Config {
 	// MaxClusterDepth and SimTh, the less the chance that there will be
 	// "similar" clusters, but the greater the footprint.
 	SimTh:       0.3,
-	MaxChildren: 100,
+	MaxChildren: 15,
Is that better?
 type LineTokenizer interface {
-	Tokenize(line string) []string
+	Tokenize(line string) ([]string, interface{})
 	Join(tokens []string) string
I wonder if generics would work here, just a thought. I know interfaces have a cost when casting, for instance.
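To illustrate the suggestion: a generic `LineTokenizer` could carry the per-line state as a type parameter instead of `interface{}`, avoiding the boxing allocation and the type assertion on every line. This is a hedged sketch, not the PR's code; `spaceTokenizer`, the `S` parameter, and the extra `state` argument to `Join` are all hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical generic variant: the tokenizer's per-line state is a
// type parameter S instead of interface{}, so callers avoid the cast
// (and the boxing allocation) on every line.
type LineTokenizer[S any] interface {
	Tokenize(line string) ([]string, S)
	Join(tokens []string, state S) string
}

// spaceTokenizer is a toy implementation whose state is the token
// count, just to show the parameterization compiles and runs.
type spaceTokenizer struct{}

func (spaceTokenizer) Tokenize(line string) ([]string, int) {
	tokens := strings.Fields(line)
	return tokens, len(tokens)
}

func (spaceTokenizer) Join(tokens []string, n int) string {
	return strings.Join(tokens[:n], " ")
}

func main() {
	var t LineTokenizer[int] = spaceTokenizer{}
	tokens, n := t.Tokenize("a b c")
	fmt.Println(n, t.Join(tokens, n))
}
```

The trade-off: the concrete state type then leaks into every user of the interface (`LineTokenizer[int]` vs `LineTokenizer[myState]` are distinct types), which can be awkward when different tokenizer implementations need different state.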
func (p *punctuationTokenizer) Tokenize(line string) ([]string, interface{}) {
	tokens := make([]string, len(line))                  // Maximum size is if every character is punctuation
	spacesAfter := make([]int, strings.Count(line, " ")) // Could be a bitmap, but it's not worth it for a few bytes.
You might want to use a pool for this one. Prometheus has a good sync.Pool that works in buckets.
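As a rough illustration of that suggestion: the per-line scratch slices can come from a pool instead of being allocated on every `Tokenize` call. This is a minimal sketch using the standard library's `sync.Pool` with a single bucket; Prometheus's bucketed pool additionally groups slices by capacity. The `tokenize`/`release` helpers here are hypothetical, not the PR's code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// tokenPool reuses the per-line token slices so tokenize doesn't
// allocate on every call. A bucketed pool would additionally group
// slices by capacity; this is the single-bucket version.
var tokenPool = sync.Pool{
	New: func() any { return make([]string, 0, 64) },
}

// tokenize splits on spaces, borrowing its scratch slice from the
// pool. The caller must hand the slice back with release when done.
func tokenize(line string) []string {
	tokens := tokenPool.Get().([]string)[:0]
	tokens = append(tokens, strings.Fields(line)...)
	return tokens
}

func release(tokens []string) {
	tokenPool.Put(tokens[:0])
}

func main() {
	tokens := tokenize("GET /api/v1/query 200")
	fmt.Println(len(tokens), tokens[0])
	release(tokens)
}
```

One caveat with pooling here: the returned slice must not outlive the `release` call, so the Drain code would need clear ownership of when a line's tokens are done being used.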
LGTM
Let's try it!
What this PR does / why we need it:
-
characters are treated as part of a single token. Perf-wise, this PR shows ~50% higher CPU usage compared to the previous Drain, but far fewer allocations (so hopefully less GC). I will continue with some perf optimizations in a separate PR to try to improve this.
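The splitting rule the PR title describes can be sketched roughly as follows. This is an illustration of "split on most punctuation", not Loki's implementation: the real tokenizer keeps some character runs together (as the description above notes) and also records where the spaces were so `Join` can rebuild the line exactly.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitOnPunctuation sketches the idea behind a punctuation-splitting
// tokenizer: each punctuation or symbol character becomes its own
// token, spaces separate tokens, and everything else stays together.
func splitOnPunctuation(line string) []string {
	var tokens []string
	start := -1 // start index of the current non-punctuation run, or -1
	flush := func(end int) {
		if start >= 0 {
			tokens = append(tokens, line[start:end])
			start = -1
		}
	}
	for i, r := range line {
		switch {
		case r == ' ':
			flush(i)
		case unicode.IsPunct(r) || unicode.IsSymbol(r):
			flush(i)
			tokens = append(tokens, string(r))
		default:
			if start < 0 {
				start = i
			}
		}
	}
	flush(len(line))
	return tokens
}

func main() {
	fmt.Println(splitOnPunctuation(`msg="query done" status=200`))
}
```

The benefit for Drain is that structurally identical log lines (same punctuation skeleton, different values) produce token sequences of the same length, which makes the similarity comparison more stable than whitespace splitting alone.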
Data 1:
Benchmark for using the new "punctuation" tokenizer vs the old "splitting" tokenizer:
Data 2:
Benchmark for my custom deduplicatePlaceholders vs a solution using regexp.MustCompile("<_>+").ReplaceAllLiteralString:
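To make the comparison concrete, here is a hedged sketch of the two approaches being benchmarked. The hand-rolled version is an illustration, not the PR's `deduplicatePlaceholders`; the regex uses the grouped form `(<_>)+` so the quantifier applies to the whole `<_>` placeholder rather than only the trailing `>`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Regex baseline: collapse runs of the <_> placeholder into one.
var placeholderRun = regexp.MustCompile("(<_>)+")

func dedupeRegex(s string) string {
	return placeholderRun.ReplaceAllLiteralString(s, "<_>")
}

// Hand-rolled equivalent (an illustrative sketch, not the PR's code):
// scan once, copying everything except repeated placeholders.
func dedupeManual(s string) string {
	const ph = "<_>"
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); {
		if strings.HasPrefix(s[i:], ph) {
			b.WriteString(ph)
			i += len(ph)
			for strings.HasPrefix(s[i:], ph) { // skip the rest of the run
				i += len(ph)
			}
		} else {
			b.WriteByte(s[i])
			i++
		}
	}
	return b.String()
}

func main() {
	line := "ts=<_><_><_> msg=done"
	fmt.Println(dedupeRegex(line))
	fmt.Println(dedupeManual(line))
}
```

The manual scan avoids the regex engine's per-call overhead and writes straight into a pre-sized `strings.Builder`, which is the usual reason a hand-rolled loop wins this kind of benchmark.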