Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculating utf-8 string length of event #330

Merged
merged 3 commits into from
May 22, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 13 additions & 1 deletion cloudwatch/cloudwatch.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ import (
"sort"
"strings"
"time"
"unicode/utf8"

"github.com/aws/amazon-kinesis-firehose-for-fluent-bit/plugins"
"github.com/aws/aws-sdk-go/aws"
Expand Down Expand Up @@ -841,8 +842,19 @@ func (output *OutputPlugin) processRejectedEventsInfo(response *cloudwatchlogs.P
}
}

// counts the effective number of bytes in the string, after
// UTF-8 normalization. UTF-8 normalization includes replacing bytes that do
// not constitute valid UTF-8 encoded Unicode codepoints with the Unicode
// replacement codepoint U+FFFD (a 3-byte UTF-8 sequence, represented in Go as
// utf8.RuneError)
// this works because Go range will parse the string as UTF-8 runes
// copied from AWSLogs driver: https://github.com/moby/moby/commit/1e8ef386279e2e28aff199047e798fad660efbdd
func cloudwatchLen(event string) int {
return len(event) + perEventBytes
effectiveBytes := perEventBytes
for _, rune := range event {
effectiveBytes += utf8.RuneLen(rune)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you to @ShelbyZ for noting the go doc: https://pkg.go.dev/unicode/utf8#RuneLen

We might need to explicitly handle the -1 return value here, which AWSLogs driver does not... that may be a bug... we should possibly fix the log driver as well..

Will test this..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this code works and is fine: I tried this in the go playground:

// You can edit this code!
// Click here and start typing.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "Hello, 世界"
	s2 := "◺"
	s3 := string([]byte{65, 66, 239, 191, 0})
	s4 := string([]byte{239, 191, 0})
	s5 := string([]byte{239, 191})
	s6 := string([]byte{194, 194, 192, 193})
	s7 := string([]byte{195, 40})
	s8 := "valid"

	record := []byte{195, 40}
	record = append(record, -66)

	test(s)
	test(s2)
	test(s3)
	test(s4)
	test(s5)
	test(s6)
	test(s7)
	test(s8)

	// the only values I have found that returns -1 from RuneLen are negative values
	// not valid go: s6 := string([]byte{-66, 194, 192, 193})
	// in go a byte is a unsigned type
	fmt.Println(utf8.RuneLen(rune(int64(-66))))
}

func test(s string) {
	fmt.Println(s)
	fmt.Printf("Runes: %d\n", utf8.RuneCountInString(s))
	fmt.Printf("Byte Len in valid utf-8: %d\n", cloudwatchLen(s))
}

func cloudwatchLen(event string) int {
	effectiveBytes := 0
	for _, rune := range event {
		effectiveBytes += utf8.RuneLen(rune)
	}
	return effectiveBytes
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RuneLen and the range for loop handles invalid characters properly by replacing them with 3 bytes for the replacement char.

The only way I could get RuneLen to return a negative value is to give it a negative value... this can not happen in our code since we process the logs as a byte slice and bytes are an unsigned type... so we could never get a negative value into this code.

}
return effectiveBytes
}

func (stream *logStream) logBatchSpan(timestamp time.Time) time.Duration {
Expand Down