-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix calculating utf-8 string length of event #330
Conversation
Signed-off-by: Wesley Pettit <[email protected]>
Can you share the link of the original issue? Does other plugins (S3 / Kinesis) have the same issue? |
@lubingfeng the issue is internal. This is CloudWatch API aspect, CW API processes as UTF-8, and the other APIs do not. Other APIs are not affected. |
return len(event) + perEventBytes | ||
effectiveBytes := perEventBytes | ||
for _, rune := range event { | ||
effectiveBytes += utf8.RuneLen(rune) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you to @ShelbyZ for noting the go doc: https://pkg.go.dev/unicode/utf8#RuneLen
We might need to explicitly handle the -1
return value here, which AWSLogs driver does not... that may be a bug... we should possibly fix the log driver as well..
Will test this..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this code works and is fine: I tried this in the go playground:
// You can edit this code!
// Click here and start typing.
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "Hello, 世界"
s2 := "◺"
s3 := string([]byte{65, 66, 239, 191, 0})
s4 := string([]byte{239, 191, 0})
s5 := string([]byte{239, 191})
s6 := string([]byte{194, 194, 192, 193})
s7 := string([]byte{195, 40})
s8 := "valid"
record := []byte{195, 40}
record = append(record, -66)
test(s)
test(s2)
test(s3)
test(s4)
test(s5)
test(s6)
test(s7)
test(s8)
// the only values I have found that returns -1 from RuneLen are negative values
// not valid go: s6 := string([]byte{-66, 194, 192, 193})
// in go a byte is a unsigned type
fmt.Println(utf8.RuneLen(rune(int64(-66))))
}
func test(s string) {
fmt.Println(s)
fmt.Printf("Runes: %d\n", utf8.RuneCountInString(s))
fmt.Printf("Byte Len in valid utf-8: %d\n", cloudwatchLen(s))
}
func cloudwatchLen(event string) int {
effectiveBytes := 0
for _, rune := range event {
effectiveBytes += utf8.RuneLen(rune)
}
return effectiveBytes
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RuneLen and the range for loop handles invalid characters properly by replacing them with 3 bytes for the replacement char.
The only way I could get RuneLen
to return a negative value is to give it a negative value... this can not happen in our code since we process the logs as a byte slice and bytes are an unsigned type... so we could never get a negative value into this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Wesley Pettit <[email protected]>
Signed-off-by: Wesley Pettit <[email protected]>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.