release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

knz · 2021-05-08T11:33:51Z

Backport 2/2 commits from #64900.

/cc @cockroachdb/release

Alleviates #61681.
Fixes #64896.

Release note (cli change): Long lines in the crdb-v2 output
logging format are now split when they exceed 16K, instead
of 1K as previously.

Release note (bug fix): The automatic long line splitting behavior
of the new crdb-v2 logging format now avoids splitting lines in the
middle of UTF-8 sequences, and avoids unbalanced redaction markers.

cockroach-teamcity · 2021-05-08T11:33:59Z

This change is

bdarnell

Reviewed 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):

// TODO(knz): This could also be made configurable by the user.
//
// NB: the value of this variable might be mentioned in the format's

"Might be"? Is it or not?

If the limit is documented, it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)

Should the limit be specified in terms of bytes or runes? (either way, document which it is)

pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):

			// If we see an opening redaction marker, remember this fact
			// so that we close/open it properly.
			if strings.HasPrefix(msg[i:], startRedactionMarker) {

This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a range loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)

pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):

		// Avoid breaking in the middle of UTF-8 sequences.
		_, width := utf8.DecodeRuneInString(msg[i:])

If we're calling DecodeRuneInString every byte, it's definitely going to be cheaper to just use a range loop over the string. But we could also avoid this check and only worry about multi-byte runes at the point where we decide to split the line.

pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):

☃☃☃☃☃☃☃☃☃☃☃☃
----
I000101 00:00:00.000000 0 :0 ⋮ [-]   ☃☃☃☃☃☃☃

I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.

knz

I am replying to your comments in this reviewable session, but I have implemented the changes in the other PR - #64900 - let's move the convo over there.

I will update this backport PR after we're happy about the main PR.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):

"Might be"? Is it or not?

We have an agreement with the doc writers that they can edit the text above. So it may or may not be there any more. Hence the phrasing.

Should the limit be specified in terms of bytes or runes? (either way, document which it is)

Bytes, good idea to document it. Done.

it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)

I think we want this effective limit to be honored. Even though it currently happens not to be documented, we have requests to make it configurable (and thus become documented), and then it would likely turn into another security problem if folk truncate the data to fit in a buffer sized after the configured size.

Added a commit to enforce it (with tests).

pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a range loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)

IMHO this is premature optimization - the common case is when there are no UTF-8 sequences, and that is covered by the conditional just above.

pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):
That is what this condition above already is about:

if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }

pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.

What do you have on your screen?

(I also checked in a hex dump that the utf-8 sequence is preserved.)

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):

Previously, knz (kena) wrote…

That is what this condition above already is about:
if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }

(Although that condition was wrong, I just realized after replying to your comment. Fixed over in #64900.

Release note (cli change): Long lines in the `crdb-v2` output logging format are now split when they exceed 16K, instead of 1K as previously.

Release note (bug fix): The automatic long line splitting behavior of the new `crdb-v2` logging format now avoids splitting lines in the middle of UTF-8 sequences, and avoids unbalanced redaction markers.

This way if/when we ever make it configurable and document it, there won't be any surprises. Release note: None

knz · 2021-05-10T16:36:32Z

Refreshed to reflect the latest changes in #64900.

knz assigned bdarnell May 8, 2021

bdarnell reviewed May 8, 2021

View reviewed changes

bdarnell mentioned this pull request May 8, 2021

util/log: fix the long line splitting behavior with UTF-8 sequences #64900

Merged

knz commented May 10, 2021

View reviewed changes

knz added 3 commits May 10, 2021 18:35

util/log: default the line split size to 16K instead of 1K

386d87c

Release note (cli change): Long lines in the `crdb-v2` output logging format are now split when they exceed 16K, instead of 1K as previously.

util/log: fix line splitting behavior

3b2828e

Release note (bug fix): The automatic long line splitting behavior of the new `crdb-v2` logging format now avoids splitting lines in the middle of UTF-8 sequences, and avoids unbalanced redaction markers.

util/log: better honor the maximum line length

35fea82

This way if/when we ever make it configurable and document it, there won't be any surprises. Release note: None

knz force-pushed the backport21.1-64900 branch from 2fa0b95 to 35fea82 Compare May 10, 2021 16:36

bdarnell approved these changes May 10, 2021

View reviewed changes

knz merged commit eb9e1a7 into cockroachdb:release-21.1 May 10, 2021

knz deleted the backport21.1-64900 branch May 10, 2021 17:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

knz commented May 8, 2021

cockroach-teamcity commented May 8, 2021

bdarnell left a comment

knz left a comment

knz left a comment

knz commented May 10, 2021

release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914

Conversation

knz commented May 8, 2021

cockroach-teamcity commented May 8, 2021

bdarnell left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz commented May 10, 2021