Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculating utf-8 string length of event #330

Merged
merged 3 commits into from
May 22, 2023

Conversation

PettitWesley
Copy link
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@PettitWesley PettitWesley requested a review from a team as a code owner May 21, 2023 22:07
@lubingfeng
Copy link

Can you share the link of the original issue? Does other plugins (S3 / Kinesis) have the same issue?

@PettitWesley
Copy link
Contributor Author

@lubingfeng the issue is internal. This is CloudWatch API aspect, CW API processes as UTF-8, and the other APIs do not. Other APIs are not affected.

return len(event) + perEventBytes
effectiveBytes := perEventBytes
for _, rune := range event {
effectiveBytes += utf8.RuneLen(rune)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you to @ShelbyZ for noting the go doc: https://pkg.go.dev/unicode/utf8#RuneLen

We might need to explicitly handle the -1 return value here, which AWSLogs driver does not... that may be a bug... we should possibly fix the log driver as well..

Will test this..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this code works and is fine: I tried this in the go playground:

// You can edit this code!
// Click here and start typing.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "Hello, 世界"
	s2 := "◺"
	s3 := string([]byte{65, 66, 239, 191, 0})
	s4 := string([]byte{239, 191, 0})
	s5 := string([]byte{239, 191})
	s6 := string([]byte{194, 194, 192, 193})
	s7 := string([]byte{195, 40})
	s8 := "valid"

	record := []byte{195, 40}
	record = append(record, -66)

	test(s)
	test(s2)
	test(s3)
	test(s4)
	test(s5)
	test(s6)
	test(s7)
	test(s8)

	// the only values I have found that returns -1 from RuneLen are negative values
	// not valid go: s6 := string([]byte{-66, 194, 192, 193})
	// in go a byte is a unsigned type
	fmt.Println(utf8.RuneLen(rune(int64(-66))))
}

func test(s string) {
	fmt.Println(s)
	fmt.Printf("Runes: %d\n", utf8.RuneCountInString(s))
	fmt.Printf("Byte Len in valid utf-8: %d\n", cloudwatchLen(s))
}

func cloudwatchLen(event string) int {
	effectiveBytes := 0
	for _, rune := range event {
		effectiveBytes += utf8.RuneLen(rune)
	}
	return effectiveBytes
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RuneLen and the range for loop handles invalid characters properly by replacing them with 3 bytes for the replacement char.

The only way I could get RuneLen to return a negative value is to give it a negative value... this can not happen in our code since we process the logs as a byte slice and bytes are an unsigned type... so we could never get a negative value into this code.

Copy link

@zwj102030 zwj102030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants