Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leverage Block#mayHaveNull() in HashSemiJoinOperator #8639

Merged
merged 3 commits into from
Jul 28, 2021

Conversation

pettyjamesm
Copy link
Member

@pettyjamesm pettyjamesm commented Jul 22, 2021

Comparable changes to prestodb/presto#16459

Consults Block#mayHaveNull() outside of the main processing loop in HashSemiJoinOperator so that Block#isNull checks can be skipped inside the loop when none are necessary.

Also adds:

  • LazyBlock#mayHaveNull() implementation to make the Block#mayHaveNull() checks like this one more effective
  • Page#getLoadedPage(int... columns) helper methods to support simultaneous column extraction and LazyBlock unwrapping

@cla-bot cla-bot bot added the cla-signed label Jul 22, 2021
@pettyjamesm pettyjamesm requested a review from martint July 22, 2021 14:52
@raunaqmorarka
Copy link
Member

Could you please add JMH benchmark results before and after the change showing the improvement ?

@pettyjamesm
Copy link
Member Author

Could you please add JMH benchmark results before and after the change showing the improvement ?

It's going to take some time to put that together since no JMH benchmark exists for HashSemiJoinOperator. I tried using SqlSemiJoinInPredicateBenchmark but unfortunately that query is converted to an inner-join by the planner, so I modified that benchmark to use the query string: "SELECT orderkey FROM lineitem WHERE orderkey < 0 OR orderkey IN (SELECT orderkey FROM orders WHERE orderkey % 2 = 0)" and ran it locally with 100 warm-up iterations and 25 measured ones. Run to run variability is high and this benchmark is much less scientific than a JMH based approach would be, but the output looks something like this:

Before:

peak_memory:1410102,elapsed_millis:102,input_rows_per_second:735943,output_rows_per_second:294181,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:102147725,cpu_nanos:99034000,user_nanos:98101000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:127,input_rows_per_second:591470,output_rows_per_second:236430,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:127098501,cpu_nanos:109192000,user_nanos:107764000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:139,input_rows_per_second:538891,output_rows_per_second:215413,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:139499266,cpu_nanos:106468000,user_nanos:105621000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:95,input_rows_per_second:789199,output_rows_per_second:315469,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:95254783,cpu_nanos:92582000,user_nanos:91866000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:103,input_rows_per_second:730367,output_rows_per_second:291952,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:102927612,cpu_nanos:97680000,user_nanos:96275000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:114,input_rows_per_second:657643,output_rows_per_second:262882,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:114309571,cpu_nanos:106110000,user_nanos:104536000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
                    sql_semijoin_in ::  106.720 cpu ms :: 1.34MB peak memory :: in 75.2K,      0B,    704K/s,      0B/s :: out 30.1K,   264KB,    282K/s,  2.42MB/s

After:

peak_memory:1410102,elapsed_millis:78,input_rows_per_second:963263,output_rows_per_second:385049,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:78041971,cpu_nanos:77661000,user_nanos:77446000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:77,input_rows_per_second:970159,output_rows_per_second:387805,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:77487229,cpu_nanos:77326000,user_nanos:77172000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:86,input_rows_per_second:878736,output_rows_per_second:351260,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:85548977,cpu_nanos:84220000,user_nanos:83745000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:87,input_rows_per_second:864962,output_rows_per_second:345754,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:86911267,cpu_nanos:86610000,user_nanos:86282000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:81,input_rows_per_second:925078,output_rows_per_second:369785,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:81263382,cpu_nanos:81105000,user_nanos:80900000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
peak_memory:1410102,elapsed_millis:82,input_rows_per_second:911462,output_rows_per_second:364342,input_megabytes:0,input_megabytes_per_second:0,wall_nanos:82477310,cpu_nanos:82269000,user_nanos:81981000,input_rows:75175,input_bytes:0,output_rows:30050,output_bytes:270450
                    sql_semijoin_in ::   82.769 cpu ms :: 1.34MB peak memory :: in 75.2K,      0B,    908K/s,      0B/s :: out 30.1K,   264KB,    363K/s,  3.12MB/s

Logically, these changes here are extending the pattern of using mayHaveNull() to unswitch primary loop isNull checks like was done in #8437

@raunaqmorarka
Copy link
Member

I think it would be nice to add a JMH benchmark if one doesn't exist while making such optimizations. The linked PR doesn't have JMH numbers either.
Maybe we could go one step further and create 2 separate even more streamlined for loops for the never-null and may-have-null cases. Having a JMH benchmark would help to guide to what extent we should make optimizations like that.

@pettyjamesm
Copy link
Member Author

pettyjamesm commented Jul 22, 2021

Maybe we could go one step further and create 2 separate even more streamlined for loops for the never-null and may-have-null cases. Having a JMH benchmark would help to guide to what extent we should make optimizations like that.

I chose to leave the loop unswitching decision up to the C2 JIT instead of manually unswitching the loops here because a manually unswitched version would probably want variants for #mayHaveNull() and probeHashChannel >= 0 (and maybe channelSet.hasNull() and channelSet.isEmpty() too?) which would be at least 4 different loop variants. Having done that exercise, I can say the even with just 2 different patterns, the resulting code is very difficult to read and reason about. While so while C2 may be more fickle about unswitching on both loop invariants than a manual approach, it seemed like the right compromise in this context.

I think it would be nice to add a JMH benchmark if one doesn't exist while making such optimizations. The linked PR doesn't have JMH numbers either.

I agree, but on my end its been a question of available time and the effort required to produce meaningful benchmarks. In the case of the PartitionedOutputOperator PR, I chose not to attach numbers for the existing JMH benchmarks because it would have required a lot of changes to better simulate real-world conditions, and without those changes the existing benchmark numbers would be misleading at best. For example, the existing benchmarks don't exercise null channel or any-row replication and don't vary block or PartitionFunction implementation classes, which allows the JIT to optimistically inline things deeply in ways that a real worker JVM instance would not.


// update hashing strategy to use probe cursor
for (int position = 0; position < inputPage.getPositionCount(); position++) {
if (probeJoinPage.getBlock(0).isNull(position)) {
if (probeJoinNulls != null && probeJoinNulls.isNull(position)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a boolean variable like probeJoinCanBeNull instead of null in probeJoinNulls?
The code would be more explicit.

Copy link
Member Author

@pettyjamesm pettyjamesm Jul 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, but then you'd need a boolean probeJoinCanBeNull and Block probeJoinNulls which doesn't seem like an improvement in clarity.

@pettyjamesm pettyjamesm force-pushed the improve-hashsemijoinoperator branch from 4b90f8e to e1deae9 Compare July 26, 2021 14:25
@pettyjamesm pettyjamesm force-pushed the improve-hashsemijoinoperator branch from e1deae9 to f4cb08a Compare July 27, 2021 13:56
@pettyjamesm
Copy link
Member Author

Renamed getLoadedColumns to getLoadedPage and added an inline comment about why channelSet was added as a local variable instead of referring to the instance field directly based on review feedback on the PrestoDB PR. The changes should be ready for final review / merge.

@martint martint merged commit 41ab5e6 into trinodb:master Jul 28, 2021
@martint martint added this to the 360 milestone Jul 28, 2021
@martint martint mentioned this pull request Jul 28, 2021
11 tasks
@pettyjamesm pettyjamesm deleted the improve-hashsemijoinoperator branch July 28, 2021 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants