Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid String.format usages in hot paths #15258

Merged
merged 10 commits into from
Dec 6, 2022

Conversation

pettyjamesm
Copy link
Member

@pettyjamesm pettyjamesm commented Nov 30, 2022

Description

Replaces usages of String.format in expected codepaths to improve performance. In particular time / timestamp related JSON serializers and casts to varchar improve dramatically with timestamp(3) casting throughput improving 2x, time(3) improving more than 30x, and time with time zone(3) improving ~40x (benchmark results: jmh.morethan.io).

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Nov 30, 2022
@pettyjamesm pettyjamesm requested a review from martint November 30, 2022 18:21
@pettyjamesm pettyjamesm force-pushed the reduce-string-format-usages branch 2 times, most recently from 95aa4e3 to 6644105 Compare November 30, 2022 22:22
@pettyjamesm pettyjamesm marked this pull request as ready for review December 1, 2022 02:49
@Fork(1)
@Warmup(iterations = 5, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = SECONDS)
public class BenchmarkCastTimestampToVarchar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not benchmark the cast methods directly instead of going through the expression compiler and processor machinery?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but initially I did it this way to make sure I was actually changing the correct functions that would be used during casting operations. After I was sure that I was, it turned out actually to be easier to do it this way since I could test all of the different types with a single benchmark instead of having to specialize benchmark methods for each type as well as short/long variants of timestamps which end up dispatching to different methods based on precision.

Comment on lines +118 to +122
for (int index = builder.length() - 1; index > 8; index--) {
long temp = scaledFraction / 10;
int digit = (int) (scaledFraction - (temp * 10));
scaledFraction = temp;
builder.setCharAt(index, (char) ('0' + digit));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we add similar code somewhere else recently? We may want to consider putting it somewhere we can reuse it, since it's not particularly trivial.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did indeed. That change was to SqlTimestamp in the SPI which improved performance just for the JSON serialization path. I think there are a few problems that I think has caused this code to proliferate:

  • SqlTimestamp and SqlTimestampWithTimeZone are SPI objects that seem to only exist in order to serialize values as JSON. They have similar, but slightly different logic from one another (the time zone part)
  • Casting to timestamp (and with time zone) types are currently implemented in trino-main and also formatted to strings (even though casting to varchar needs a UTF8 slice) because the formatting logic is being shared for long and short timestamp types both with and without time zones and for conversions and in parts of the JSON type handling paths. So it’s a 2x2 grid of timestamp types each targeting at least 2 output representations, and not accessible from the SPI - meanwhile the SPI can’t host the implementation because it depends on the packing / unpacking implementations in main.
  • Time with and without time zone is similar in Trino main, except that was already producing a slice directly (presumably for performance to avoid the intermediate string and since the format is simple enough to implement by hand) which means it also can’t live in the SPI because of the dependency on slice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants