-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-21.1: util/log: fix the long line splitting behavior with UTF-8 sequences #64914
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):
// TODO(knz): This could also be made configurable by the user. // // NB: the value of this variable might be mentioned in the format's
"Might be"? Is it or not?
If the limit is documented, it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)
Should the limit be specified in terms of bytes or runes? (either way, document which it is)
pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):
// If we see an opening redaction marker, remember this fact // so that we close/open it properly. if strings.HasPrefix(msg[i:], startRedactionMarker) {
This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a range
loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)
pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):
// Avoid breaking in the middle of UTF-8 sequences. _, width := utf8.DecodeRuneInString(msg[i:])
If we're calling DecodeRuneInString every byte, it's definitely going to be cheaper to just use a range loop over the string. But we could also avoid this check and only worry about multi-byte runes at the point where we decide to split the line.
pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):
☃☃☃☃☃☃☃☃☃☃☃☃ ---- I000101 00:00:00.000000 0 :0 ⋮ [-] ☃☃☃☃☃☃☃
I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am replying to your comments in this reviewable session, but I have implemented the changes in the other PR - #64900 - let's move the convo over there.
I will update this backport PR after we're happy about the main PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)
pkg/util/log/format_crdb_v2.go, line 329 at r1 (raw file):
"Might be"? Is it or not?
We have an agreement with the doc writers that they can edit the text above. So it may or may not be there any more. Hence the phrasing.
Should the limit be specified in terms of bytes or runes? (either way, document which it is)
Bytes, good idea to document it. Done.
it should also be documented that the effective limit is a little bit higher than this constant (by one fractional rune and an end redaction marker, if we want to expose the precise details)
I think we want this effective limit to be honored. Even though it currently happens not to be documented, we have requests to make it configurable (and thus become documented), and then it would likely turn into another security problem if folk truncate the data to fit in a buffer sized after the configured size.
Added a commit to enforce it (with tests).
pkg/util/log/format_crdb_v2.go, line 387 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This looks relatively expensive, although on godbolt.org it looks like the compiler is optimizing away the copy implied by a string slice when it's passed straight to HasPrefix. We're leaning pretty heavily on the compiler's smarts here. It might be better to use a
range
loop over the string (which iterates over runes) instead of iterating over bytes and sometimes doing multi-byte stuff. (or for a larger refactoring, use strings.IndexAny instead of a byte-by-byte or rune-by-rune loop)
IMHO this is premature optimization - the common case is when there are no UTF-8 sequences, and that is covered by the conditional just above.
pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):
That is what this condition above already is about:
if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }
pkg/util/log/testdata/crdb_v2_break_lines, line 61 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I'm not sure what's going on, but these characters are rendering differently for me in reviewable (a small illegible blob) than the other snowmen.
What do you have on your screen?
(I also checked in a hex dump that the utf-8 sequence is preserved.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)
pkg/util/log/format_crdb_v2.go, line 403 at r2 (raw file):
Previously, knz (kena) wrote…
That is what this condition above already is about:
if markersStartWithSingleByteRune && msg[i] < utf8.RuneSelf { continue }
(Although that condition was wrong, I just realized after replying to your comment. Fixed over in #64900.
Release note (cli change): Long lines in the `crdb-v2` output logging format are now split when they exceed 16K, instead of 1K as previously.
Release note (bug fix): The automatic long line splitting behavior of the new `crdb-v2` logging format now avoids splitting lines in the middle of UTF-8 sequences, and avoids unbalanced redaction markers.
This way if/when we ever make it configurable and document it, there won't be any surprises. Release note: None
Refreshed to reflect the latest changes in #64900. |
Backport 2/2 commits from #64900.
/cc @cockroachdb/release
Alleviates #61681.
Fixes #64896.
Release note (cli change): Long lines in the
crdb-v2
outputlogging format are now split when they exceed 16K, instead
of 1K as previously.
Release note (bug fix): The automatic long line splitting behavior
of the new
crdb-v2
logging format now avoids splitting lines in themiddle of UTF-8 sequences, and avoids unbalanced redaction markers.