Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

Merged
merged 3 commits into from
May 10, 2021

Conversation

knz
Copy link
Contributor

@knz knz commented May 8, 2021

Backport 2/2 commits from #64900.

/cc @cockroachdb/release


Alleviates #61681.
Fixes #64896.

Release note (cli change): Long lines in the crdb-v2 output
logging format are now split when they exceed 16K, instead
of 1K as previously.

Release note (bug fix): The automatic long line splitting behavior
of the new crdb-v2 logging format now avoids splitting lines in the
middle of UTF-8 sequences, and avoids unbalanced redaction markers.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):

// TODO(knz): This could also be made configurable by the user.
//
// NB: the value of this variable might be mentioned in the format's

"Might be"? Is it or not?

If the limit is documented, it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)

Should the limit be specified in terms of bytes or runes? (either way, document which it is)


pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):

			// If we see an opening redaction marker, remember this fact
			// so that we close/open it properly.
			if strings.HasPrefix(msg[i:], startRedactionMarker) {

This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a range loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)


pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):

		// Avoid breaking in the middle of UTF-8 sequences.
		_, width := utf8.DecodeRuneInString(msg[i:])

If we're calling DecodeRuneInString every byte, it's definitely going to be cheaper to just use a range loop over the string. But we could also avoid this check and only worry about multi-byte runes at the point where we decide to split the line.


pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):

☃☃☃☃☃☃☃☃☃☃☃☃
----
I000101 00:00:00.000000 0 :0 ⋮ [-]   ☃☃☃☃☃☃☃

I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am replying to your comments in this reviewable session, but I have implemented the changes in the other PR - #64900 - let's move the convo over there.

I will update this backport PR after we're happy about the main PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):

"Might be"? Is it or not?

We have an agreement with the doc writers that they can edit the text above. So it may or may not be there any more. Hence the phrasing.

Should the limit be specified in terms of bytes or runes? (either way, document which it is)

Bytes, good idea to document it. Done.

it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)

I think we want this effective limit to be honored. Even though it currently happens not to be documented, we have requests to make it configurable (and thus become documented), and then it would likely turn into another security problem if folk truncate the data to fit in a buffer sized after the configured size.

Added a commit to enforce it (with tests).


pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a range loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)

IMHO this is premature optimization - the common case is when there are no UTF-8 sequences, and that is covered by the conditional just above.


pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):
That is what this condition above already is about:

if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }

pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.

image.png

What do you have on your screen?

(I also checked in a hex dump that the utf-8 sequence is preserved.)

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):

Previously, knz (kena) wrote…

That is what this condition above already is about:

if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }

(Although that condition was wrong, I just realized after replying to your comment. Fixed over in #64900.

knz added 3 commits May 10, 2021 18:35
Release note (cli change): Long lines in the `crdb-v2` output
logging format are now split when they exceed 16K, instead
of 1K as previously.
Release note (bug fix): The automatic long line splitting behavior
of the new `crdb-v2` logging format now avoids splitting lines in the
middle of UTF-8 sequences, and avoids unbalanced redaction markers.
This way if/when we ever make it configurable and document it,
there won't be any surprises.

Release note: None
@knz knz force-pushed the backport21.1-64900 branch from 2fa0b95 to 35fea82 Compare May 10, 2021 16:36
@knz
Copy link
Contributor Author

knz commented May 10, 2021

Refreshed to reflect the latest changes in #64900.

@knz knz merged commit eb9e1a7 into cockroachdb:release-21.1 May 10, 2021
@knz knz deleted the backport21.1-64900 branch May 10, 2021 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants